diff --git "a/data/2024/aaai/\"Allot?\" is \"A Lot!\" Towards Developing More Generalized Speech Recognition System for Accessible Communication" "b/data/2024/aaai/\"Allot?\" is \"A Lot!\" Towards Developing More Generalized Speech Recognition System for Accessible Communication" new file mode 100644 index 0000000000..324641b552 --- /dev/null +++ "b/data/2024/aaai/\"Allot?\" is \"A Lot!\" Towards Developing More Generalized Speech Recognition System for Accessible Communication" @@ -0,0 +1 @@ +The proliferation of Automatic Speech Recognition (ASR) systems has revolutionized translation and transcription. However, challenges persist in ensuring inclusive communication for non-native English speakers. This study quantifies the gap between accented and native English speech using Wav2Vec 2.0, a state-of-the-art transformer model. Notably, we found that accented speech exhibits significantly higher word error rates of 30-50%, in contrast to native speakers’ 2-8% (Baevski et al. 2020). Our exploration extends to leveraging accessible online datasets to highlight the potential of enhancing speech recognition by fine-tuning the Wav2Vec 2.0 model. Through experimentation and analysis, we highlight the challenges with training models on accented speech. By refining models and addressing data quality issues, our work presents a pipeline for future investigations aimed at developing an integrated system capable of effectively engaging with a broader range of individuals with diverse backgrounds. Accurate recognition of accented speech is a pivotal step toward democratizing AI-driven communication products. \ No newline at end of file diff --git a/data/2024/aaai/'Why Didn't You Allocate This Task to Them?' Negotiation-Aware Task Allocation and Contrastive Explanation Generation b/data/2024/aaai/'Why Didn't You Allocate This Task to Them?' Negotiation-Aware Task Allocation and Contrastive Explanation Generation new file mode 100644 index 0000000000..b11f828630 --- /dev/null +++ b/data/2024/aaai/'Why Didn't You Allocate This Task to Them?' Negotiation-Aware Task Allocation and Contrastive Explanation Generation @@ -0,0 +1 @@ +In this work, we design an Artificially Intelligent Task Allocator (AITA) that proposes a task allocation for a team of humans. A key property of this allocation is that when an agent with imperfect knowledge (about their teammate's costs and/or the team's performance metric) contests the allocation with a counterfactual, a contrastive explanation can always be provided to showcase why the proposed allocation is better than the proposed counterfactual. For this, we consider a negotiation process that produces a negotiation-aware task allocation and, when contested, leverages a negotiation tree to provide a contrastive explanation. With human subject studies, we show that the proposed allocation indeed appears fair to a majority of participants and, when not, the explanations generated are judged as convincing and easy to comprehend. 
\ No newline at end of file diff --git a/data/2024/aaai/1 2-Approximate MMS Allocation for Separable Piecewise Linear Concave Valuations b/data/2024/aaai/1 2-Approximate MMS Allocation for Separable Piecewise Linear Concave Valuations new file mode 100644 index 0000000000..ba1f6fe236 --- /dev/null +++ b/data/2024/aaai/1 2-Approximate MMS Allocation for Separable Piecewise Linear Concave Valuations @@ -0,0 +1,6 @@ +We study fair distribution of a collection of m indivisible goods among a group of n agents, using the widely recognized fairness principles of Maximin Share (MMS) and Any Price Share (APS). These principles have undergone thorough investigation within the context of additive valuations. We explore these notions for valuations that extend beyond additivity. + +First, we study approximate MMS under the separable (piecewise-linear) concave (SPLC) valuations, an important class generalizing additive, where the best known factor was 1/3-MMS. We show that a 1/2-MMS allocation exists and can be computed in polynomial time, significantly improving the state-of-the-art. +We note that SPLC valuations introduce an elevated level of intricacy in contrast to additive. For instance, the MMS value of an agent can be as high as her value for the entire set of items. We use a relax-and-round paradigm that goes through competitive equilibrium and LP relaxation. Our result extends to give (symmetric) 1/2-APS, a stronger guarantee than MMS. + +APS is a stronger notion that generalizes MMS by allowing agents with arbitrary entitlements. We study the approximation of APS under submodular valuation functions. We design and analyze a simple greedy algorithm using concave extensions of submodular functions. We prove that the algorithm gives a 1/3-APS allocation which matches the best-known factor. Concave extensions are hard to compute in polynomial time and are, therefore, generally not used in approximation algorithms. Our approach shows a way to utilize them within the analysis (while bypassing their computation), and hence might be of independent interest. \ No newline at end of file diff --git a/data/2024/aaai/3D Visibility-Aware Generalizable Neural Radiance Fields for Interacting Hands b/data/2024/aaai/3D Visibility-Aware Generalizable Neural Radiance Fields for Interacting Hands new file mode 100644 index 0000000000..f423feec20 --- /dev/null +++ b/data/2024/aaai/3D Visibility-Aware Generalizable Neural Radiance Fields for Interacting Hands @@ -0,0 +1 @@ +Neural radiance fields (NeRFs) are promising 3D representations for scenes, objects, and humans. However, most existing methods require multi-view inputs and per-scene training, which limits their real-life applications. Moreover, current methods focus on single-subject cases, leaving scenes of interacting hands, which involve severe inter-hand occlusions and challenging view variations, unsolved. To tackle these issues, this paper proposes a generalizable visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically, given an image of interacting hands as input, our VA-NeRF first obtains a mesh-based representation of hands and extracts their corresponding geometric and textural features. Subsequently, a feature fusion module that exploits the visibility of query points and mesh vertices is introduced to adaptively merge features of both hands, enabling the recovery of features in unseen areas. Additionally, our VA-NeRF is optimized together with a novel discriminator within an adversarial learning paradigm.
In contrast to conventional discriminators that predict a single real/fake label for the synthesized image, the proposed discriminator generates a pixel-wise visibility map, providing fine-grained supervision for unseen areas and encouraging the VA-NeRF to improve the visual quality of synthesized images. Experiments on the Interhand2.6M dataset demonstrate that our proposed VA-NeRF outperforms conventional NeRFs significantly. Project Page: https://github.com/XuanHuang0/VANeRF. \ No newline at end of file diff --git a/data/2024/aaai/3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation b/data/2024/aaai/3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation new file mode 100644 index 0000000000..8fee1b9934 --- /dev/null +++ b/data/2024/aaai/3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation @@ -0,0 +1 @@ +In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions. However, this conventional paradigm encounters significant challenges, most notably in terms of the generation of lackluster initial proposals and a pronounced deceleration in inference speed. Recognizing these limitations, we introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights. One of the keystones of our model is the Superpoint-Text Matching (STM) mechanism. Unlike traditional methods that navigate through instance proposals, STM directly correlates linguistic indications with their respective superpoints, clusters of semantically related points. This architectural decision empowers our model to efficiently harness cross-modal semantic relationships, primarily leveraging densely annotated superpoint-text pairs, as opposed to the more sparse instance-text pairs. In pursuit of enhancing the role of text in guiding the segmentation process, we further incorporate the Dependency-Driven Interaction (DDI) module to deepen the network's semantic comprehension of referring expressions. Using the dependency trees as a beacon, this module discerns the intricate relationships between primary terms and their associated descriptors in expressions, thereby elevating both the localization and segmentation capacities. Comprehensive experiments on the ScanRefer benchmark reveal that our model not only sets new performance standards, registering an mIoU gain of 11.7 points but also achieves a staggering enhancement in inference speed, surpassing traditional methods by 95.7 times. The code and models are available at https://github.com/sosppxo/3D-STMN. \ No newline at end of file diff --git a/data/2024/aaai/A Brain-Inspired Way of Reducing the Network Complexity via Concept-Regularized Coding for Emotion Recognition b/data/2024/aaai/A Brain-Inspired Way of Reducing the Network Complexity via Concept-Regularized Coding for Emotion Recognition new file mode 100644 index 0000000000..841d1f34e2 --- /dev/null +++ b/data/2024/aaai/A Brain-Inspired Way of Reducing the Network Complexity via Concept-Regularized Coding for Emotion Recognition @@ -0,0 +1 @@ +The human brain can effortlessly and reliably perceive emotions, whereas existing facial emotion recognition (FER) methods suffer from drawbacks such as complex model structures, high storage requirements, and poor interpretability. 
Inspired by the role of emotion concepts in visual perception coding within the human brain, we propose a dual-pathway framework emulating the neural computation of emotion recognition. Specifically, these two pathways are designed to model the representation of emotion concepts in the brain and the visual perception process, respectively. For the former, we adopt a disentangled approach to extract emotion concepts from complex facial geometric attributes; for the latter, we employ an emotional confidence evaluation strategy to determine which concept is optimal for regularizing the perceptual coding. The proposed concept-regularized coding strategy endows the framework with flexibility and interpretability as well as good performance on several benchmark FER datasets. \ No newline at end of file diff --git a/data/2024/aaai/A Bregman Proximal Stochastic Gradient Method with Extrapolation for Nonconvex Nonsmooth Problems b/data/2024/aaai/A Bregman Proximal Stochastic Gradient Method with Extrapolation for Nonconvex Nonsmooth Problems new file mode 100644 index 0000000000..4ed4d1a961 --- /dev/null +++ b/data/2024/aaai/A Bregman Proximal Stochastic Gradient Method with Extrapolation for Nonconvex Nonsmooth Problems @@ -0,0 +1 @@ +In this paper, we explore a specific optimization problem that involves the combination of a differentiable nonconvex function and a nondifferentiable function. The differentiable component lacks a global Lipschitz continuous gradient, posing challenges for optimization. To address this issue and accelerate the convergence, we propose a Bregman proximal stochastic gradient method with extrapolation (BPSGE), which only requires smooth adaptivity of the differentiable part. Under the variance reduction framework, we not only analyze the subsequential and global convergence of the proposed algorithm under certain conditions, but also analyze the sublinear convergence rate of the subsequence, and the complexity of the algorithm, revealing that the BPSGE algorithm requires at most O(epsilon^(-2)) iterations in expectation to attain an epsilon-stationary point. To validate the effectiveness of our proposed algorithm, we conduct numerical experiments on three real-world applications: graph regularized nonnegative matrix factorization (NMF), matrix factorization with weakly-convex regularization, and NMF with nonconvex sparsity constraints. These experiments demonstrate that BPSGE is faster than the baselines without extrapolation. The code is available at: https://github.com/nothing2wang/BPSGE-Algorithm. \ No newline at end of file diff --git a/data/2024/aaai/A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science b/data/2024/aaai/A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science new file mode 100644 index 0000000000..d1b2ddf904 --- /dev/null +++ b/data/2024/aaai/A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science @@ -0,0 +1 @@ +This paper explores the use of large language models (LLMs) to score and explain short-answer assessments in K-12 science. While existing methods can score more structured math and computer science assessments, they often do not provide explanations for the scores. Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science, combining few-shot and active learning with chain-of-thought reasoning.
Using a human-in-the-loop approach, we successfully score and provide meaningful explanations for formative assessment responses. A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading for open-ended science assessments. \ No newline at end of file diff --git a/data/2024/aaai/A Class of Topological Pseudodistances for Fast Comparison of Persistence Diagrams b/data/2024/aaai/A Class of Topological Pseudodistances for Fast Comparison of Persistence Diagrams new file mode 100644 index 0000000000..3ff31a8352 --- /dev/null +++ b/data/2024/aaai/A Class of Topological Pseudodistances for Fast Comparison of Persistence Diagrams @@ -0,0 +1 @@ +Persistence diagrams (PDs) play a central role in topological data analysis, and are used in an ever-increasing variety of applications. The comparison of PD data requires computing distances among large sets of PDs, with metrics which are accurate, theoretically sound, and fast to compute. Especially for denser multi-dimensional PDs, such comparison metrics are lacking. While on the one hand, Wasserstein-type distances have high accuracy and theoretical guarantees, they incur high computational cost. On the other hand, distances between vectorizations such as Persistence Statistics (PSs) have lower computational cost, but lack the accuracy guarantees and theoretical properties of a true distance over PD space. In this work, we introduce a class of pseudodistances called Extended Topological Pseudodistances (ETDs), which have tunable complexity, and can approximate Sliced and classical Wasserstein distances at the high-complexity extreme, while being computationally lighter and close to Persistence Statistics at the lower complexity extreme, and thus allow users to interpolate between the two metrics. We build theoretical comparisons to show how to fit our new distances at an intermediate level between persistence vectorizations and Wasserstein distances. We also experimentally verify that ETDs outperform PSs in terms of accuracy and outperform Wasserstein and Sliced Wasserstein distances in terms of computational complexity. \ No newline at end of file diff --git a/data/2024/aaai/A Closer Look at Curriculum Adversarial Training: From an Online Perspective b/data/2024/aaai/A Closer Look at Curriculum Adversarial Training: From an Online Perspective new file mode 100644 index 0000000000..02a1d53d24 --- /dev/null +++ b/data/2024/aaai/A Closer Look at Curriculum Adversarial Training: From an Online Perspective @@ -0,0 +1 @@ +Curriculum adversarial training empirically finds that gradually increasing the hardness of adversarial examples can further improve the adversarial robustness of the trained model compared to conventional adversarial training. However, theoretical understanding of this strategy remains limited. In an attempt to bridge this gap, we analyze the adversarial training process from an online perspective. Specifically, we treat adversarial examples in different iterations as samples from different adversarial distributions. We then introduce the time series prediction framework and deduce novel generalization error bounds. Our theoretical results not only demonstrate the effectiveness of the conventional adversarial training algorithm but also explain why curriculum adversarial training methods can further improve adversarial generalization. We conduct comprehensive experiments to support our theory.
\ No newline at end of file diff --git a/data/2024/aaai/A Compiler for Weak Decomposable Negation Normal Form b/data/2024/aaai/A Compiler for Weak Decomposable Negation Normal Form new file mode 100644 index 0000000000..4550343659 --- /dev/null +++ b/data/2024/aaai/A Compiler for Weak Decomposable Negation Normal Form @@ -0,0 +1 @@ +This paper integrates weak decomposable negation normal form (wDNNF) circuits, introduced by Akshay et al. in 2018, into the knowledge compilation map. This circuit type generalises decomposable negation normal form (DNNF) circuits in such a way that they allow a restricted form of sharing variables among the inputs of a conjunction node. We show that wDNNF circuits have the same properties as DNNF circuits regarding the queries and transformations presented in the knowledge compilation map, whilst being strictly more succinct than DNNF circuits (that is, they can represent Boolean functions compactly). We also present and evaluate a knowledge compiler, called Bella, for converting CNF formulae into wDNNF circuits. Our experiments demonstrate that wDNNF circuits are suitable for configuration instances. \ No newline at end of file diff --git a/data/2024/aaai/A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators b/data/2024/aaai/A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators new file mode 100644 index 0000000000..82a62748b3 --- /dev/null +++ b/data/2024/aaai/A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators @@ -0,0 +1 @@ +Automatic evaluation is an integral aspect of dialogue system research. The traditional reference-based NLG metrics are generally found to be unsuitable for dialogue assessment. Consequently, recent studies have suggested various unique, reference-free neural metrics that better align with human evaluations. Notably among them, large language models (LLMs), particularly the instruction-tuned variants like ChatGPT, are shown to be promising substitutes for human judges. Yet, existing works on utilizing LLMs for automatic dialogue evaluation are limited in their scope in terms of the number of meta-evaluation datasets, mode of evaluation, coverage of LLMs, etc. Hence, it remains inconclusive how effective these LLMs are. To this end, we conduct a comprehensive study on the application of LLMs for automatic dialogue evaluation. Specifically, we analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels, using a comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels. Finally, we explore how model-level and dimension-level ensembles impact the evaluation performance. All resources are available at https://github.com/e0397123/comp-analysis. \ No newline at end of file diff --git a/data/2024/aaai/A Comprehensive Augmentation Framework for Anomaly Detection b/data/2024/aaai/A Comprehensive Augmentation Framework for Anomaly Detection new file mode 100644 index 0000000000..a6678645bb --- /dev/null +++ b/data/2024/aaai/A Comprehensive Augmentation Framework for Anomaly Detection @@ -0,0 +1,2 @@ +Data augmentation methods are commonly integrated into the training of anomaly detection models. 
+Previous approaches have primarily focused on replicating real-world anomalies or enhancing diversity, without considering that the standard of anomaly varies across different classes, potentially leading to a biased training distribution. This paper analyzes crucial traits of simulated anomalies that contribute to the training of reconstructive networks and condenses them into several methods, thus creating a comprehensive framework by selectively utilizing appropriate combinations. Furthermore, we integrate this framework with a reconstruction-based approach and concurrently propose a split training strategy that alleviates the overfitting issue while avoiding introducing interference to the reconstruction process. The evaluations conducted on the MVTec anomaly detection dataset demonstrate that our method outperforms the previous state-of-the-art approach, particularly in terms of object classes. We also generate a simulated dataset comprising anomalies with diverse characteristics, and experimental results demonstrate that our approach exhibits promising potential for generalizing effectively to various unseen anomalies encountered in real-world scenarios. \ No newline at end of file diff --git a/data/2024/aaai/A Computation-Aware Shape Loss Function for Point Cloud Completion b/data/2024/aaai/A Computation-Aware Shape Loss Function for Point Cloud Completion new file mode 100644 index 0000000000..535f1c2671 --- /dev/null +++ b/data/2024/aaai/A Computation-Aware Shape Loss Function for Point Cloud Completion @@ -0,0 +1,3 @@ +Learning-based point cloud completion tasks have shown potential in various critical tasks, such as object detection, assignment, and registration. However, accurately and efficiently quantifying the shape error between the predicted point clouds generated by networks and the ground truth remains challenging. While EMD-based loss functions excel in shape detail and perceived density distribution, their approach can only yield results with significant discrepancies from the actual EMD within a tolerable training time. +To address these challenges, we first propose the initial price based on the auction algorithm, reducing the number of iterations required for the algorithm while ensuring the correctness of the assignment results. We then introduce an algorithm to compute the initial price through a successive shortest path and the Euclidean information between its nodes. Finally, we adopt a series of optimization strategies to speed up the algorithm and offer an EMD approximation scheme for point cloud problems that balances time loss and computational accuracy based on point cloud data characteristics. +Our experimental results confirm that our algorithm achieves the smallest gap with the real EMD within an acceptable time range and yields the best results in end-to-end training. \ No newline at end of file diff --git a/data/2024/aaai/A Convolutional Neural Network Interpretable Framework for Human Ventral Visual Pathway Representation b/data/2024/aaai/A Convolutional Neural Network Interpretable Framework for Human Ventral Visual Pathway Representation new file mode 100644 index 0000000000..36e96a75ef --- /dev/null +++ b/data/2024/aaai/A Convolutional Neural Network Interpretable Framework for Human Ventral Visual Pathway Representation @@ -0,0 +1 @@ +Recently, convolutional neural networks (CNNs) have become the best quantitative encoding models for capturing neural activity and hierarchical structure in the ventral visual pathway. 
However, the weak interpretability of these black-box models hinders their ability to reveal visual representational encoding mechanisms. Here, we propose a convolutional neural network interpretable framework (CNN-IF) aimed at providing a transparent interpretable encoding model for the ventral visual pathway. First, we adapt the feature-weighted receptive field framework to train two high-performing ventral visual pathway encoding models using large-scale functional Magnetic Resonance Imaging (fMRI) in both goal-driven and data-driven approaches. We find that network layer-wise predictions align with the functional hierarchy of the ventral visual pathway. Then, we correspond feature units to voxel units in the brain and successfully quantify the alignment between voxel responses and visual concepts. Finally, we conduct Network Dissection along the ventral visual pathway including the fusiform face area (FFA), and discover variations related to the visual concept of 'person'. Our results demonstrate that the CNN-IF provides a new perspective for understanding encoding mechanisms in the human ventral visual pathway, and the combination of ante-hoc interpretable structure and post-hoc interpretable approaches can achieve fine-grained voxel-wise correspondence between model and brain. The source code is available at: https://github.com/BIT-YangLab/CNN-IF. \ No newline at end of file diff --git a/data/2024/aaai/A Cross-View Hierarchical Graph Learning Hypernetwork for Skill Demand-Supply Joint Prediction b/data/2024/aaai/A Cross-View Hierarchical Graph Learning Hypernetwork for Skill Demand-Supply Joint Prediction new file mode 100644 index 0000000000..a177ed3814 --- /dev/null +++ b/data/2024/aaai/A Cross-View Hierarchical Graph Learning Hypernetwork for Skill Demand-Supply Joint Prediction @@ -0,0 +1,2 @@ +The rapidly changing landscape of technology and industries leads to dynamic skill requirements, making it crucial for employees and employers to anticipate such shifts to maintain a competitive edge in the labor market. Existing efforts in this area either rely on domain-expert knowledge or regard the skill evolution as a simplified time series forecasting problem. However, both approaches overlook the sophisticated relationships among different skills and the inner-connection between skill demand and supply variations.
\ No newline at end of file diff --git a/data/2024/aaai/A Diffusion Model with State Estimation for Degradation-Blind Inverse Imaging b/data/2024/aaai/A Diffusion Model with State Estimation for Degradation-Blind Inverse Imaging new file mode 100644 index 0000000000..ba9f3064bb --- /dev/null +++ b/data/2024/aaai/A Diffusion Model with State Estimation for Degradation-Blind Inverse Imaging @@ -0,0 +1 @@ +Solving the task of inverse imaging problems can restore unknown clean images from input measurements that have incomplete information. Utilizing powerful generative models, such as denoising diffusion models, could better tackle the ill-posed issues of inverse problems with the distribution prior of the unknown clean images. We propose a learnable state-estimator-based diffusion model to incorporate the measurements into the reconstruction process. Our method makes efficient use of the pre-trained diffusion models with computational feasibility compared to the conditional diffusion models, which need to be trained from scratch. In addition, our pipeline does not require explicit knowledge of the image degradation operator or make the assumption of its form, unlike many other works that use the pre-trained diffusion models at the test time. The experiments on three typical inverse imaging problems (both linear and non-linear), inpainting, deblurring, and JPEG compression restoration, have comparable results with the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/A Diffusion-Based Framework for Multi-Class Anomaly Detection b/data/2024/aaai/A Diffusion-Based Framework for Multi-Class Anomaly Detection new file mode 100644 index 0000000000..94e4e689ae --- /dev/null +++ b/data/2024/aaai/A Diffusion-Based Framework for Multi-Class Anomaly Detection @@ -0,0 +1 @@ +Reconstruction-based approaches have achieved remarkable outcomes in anomaly detection. The exceptional image reconstruction capabilities of recently popular diffusion models have sparked research efforts to utilize them for enhanced reconstruction of anomalous images. Nonetheless, these methods might face challenges related to the preservation of image categories and pixel-wise structural integrity in the more practical multi-class setting. To solve the above problems, we propose a Difusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection, which consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion’s denoising network, and a feature-space pre-trained feature extractor. Firstly, The SG network is proposed for reconstructing anomalous regions while preserving the original image’s semantic information. Secondly, we introduce Spatial-aware Feature Fusion (SFF) block to maximize reconstruction accuracy when dealing with extensively reconstructed areas. Thirdly, the input and reconstructed images are processed by a pre-trained feature extractor to generate anomaly maps based on features extracted at different scales. Experiments on MVTec-AD and VisA datasets demonstrate the effectiveness of our approach which surpasses the state-of-the-art methods, e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and detection respectively on multi-class MVTec-AD dataset. Code will be available at https://lewandofskee.github.io/projects/diad. 
\ No newline at end of file diff --git a/data/2024/aaai/A Diffusion-Based Pre-training Framework for Crystal Property Prediction b/data/2024/aaai/A Diffusion-Based Pre-training Framework for Crystal Property Prediction new file mode 100644 index 0000000000..549c5309a2 --- /dev/null +++ b/data/2024/aaai/A Diffusion-Based Pre-training Framework for Crystal Property Prediction @@ -0,0 +1 @@ +Many significant problems involving crystal property prediction from 3D structures have limited labeled data due to expensive and time-consuming physical simulations or lab experiments. To overcome this challenge, we propose a pretrain-finetune framework named CrysDiff for the crystal property prediction task, based on diffusion models. In the pre-training phase, CrysDiff learns the latent marginal distribution of crystal structures via the reconstruction task. Subsequently, CrysDiff can be fine-tuned under the guidance of the new sparse labeled data, fitting the conditional distribution of the target property given the crystal structures. To better model the crystal geometry, CrysDiff notably captures the full symmetry properties of the crystals, including the invariance of reflection, rotation, and periodic translation. Extensive experiments demonstrate that CrysDiff can significantly improve the performance of the downstream crystal property prediction task on multiple target properties, outperforming all the SOTA pre-training models for crystals by good margins on the popular JARVIS-DFT dataset. \ No newline at end of file diff --git a/data/2024/aaai/A Dual Stealthy Backdoor: From Both Spatial and Frequency Perspectives b/data/2024/aaai/A Dual Stealthy Backdoor: From Both Spatial and Frequency Perspectives new file mode 100644 index 0000000000..f377013899 --- /dev/null +++ b/data/2024/aaai/A Dual Stealthy Backdoor: From Both Spatial and Frequency Perspectives @@ -0,0 +1 @@ +Backdoor attacks pose serious security threats to deep neural networks (DNNs). Backdoored models make arbitrarily (targeted) incorrect predictions on inputs containing well-designed triggers, while behaving normally on clean inputs. Prior studies have explored the invisibility of backdoor triggers to enhance attack stealthiness. However, most of them only focus on the invisibility in the spatial domain, neglecting the generation of invisible triggers in the frequency domain. This limitation renders the generated poisoned images easily detectable by recent defense methods. To address this issue, we propose a DUal stealthy BAckdoor attack method named DUBA, which simultaneously considers the invisibility of triggers in both the spatial and frequency domains, to achieve desirable attack performance, while ensuring strong stealthiness. Specifically, we first use Wavelet Transform to embed the high-frequency information of the trigger image into the clean image to ensure attack effectiveness. Then, to attain strong stealthiness, we incorporate Fourier Transform and Cosine Transform to mix the poisoned image and clean image in the frequency domain. Moreover, DUBA adopts a novel attack strategy, training the model with weak triggers and attacking with strong triggers to further enhance attack performance and stealthiness. DUBA is evaluated extensively on four datasets against popular image classifiers, showing significant superiority over state-of-the-art backdoor attacks in attack success rate and stealthiness.
\ No newline at end of file diff --git a/data/2024/aaai/A Dynamic GCN with Cross-Representation Distillation for Event-Based Learning b/data/2024/aaai/A Dynamic GCN with Cross-Representation Distillation for Event-Based Learning new file mode 100644 index 0000000000..6729953e2d --- /dev/null +++ b/data/2024/aaai/A Dynamic GCN with Cross-Representation Distillation for Event-Based Learning @@ -0,0 +1 @@ +Recent advances in event-based research prioritize sparsity and temporal precision. Approaches that learn sparse point-based representations through graph CNNs (GCNs) have become more popular. Yet, these graph techniques achieve lower performance than their frame-based counterparts due to two issues: (i) Biased graph structures that do not properly incorporate varied attributes (such as semantics, and spatial and temporal signals) for each vertex, resulting in inaccurate graph representations. (ii) A shortage of robust pretrained models. Here we solve the first problem by proposing a new event-based GCN (EDGCN), with a dynamic aggregation module to integrate all attributes of vertices adaptively. To address the second problem, we introduce a novel learning framework called cross-representation distillation (CRD), which leverages the dense representation of events as a cross-representation auxiliary to provide additional supervision and prior knowledge for the event graph. This frame-to-graph distillation allows us to benefit from the large-scale priors provided by CNNs while still retaining the advantages of graph-based models. Extensive experiments show that our model and learning framework are effective and generalize well across multiple vision tasks. \ No newline at end of file diff --git a/data/2024/aaai/A Dynamic Learning Method towards Realistic Compositional Zero-Shot Learning b/data/2024/aaai/A Dynamic Learning Method towards Realistic Compositional Zero-Shot Learning new file mode 100644 index 0000000000..a391025c76 --- /dev/null +++ b/data/2024/aaai/A Dynamic Learning Method towards Realistic Compositional Zero-Shot Learning @@ -0,0 +1 @@ +To tackle the challenge of recognizing images of unseen attribute-object compositions, Compositional Zero-Shot Learning (CZSL) methods have been previously proposed. However, test images in realistic scenarios may also incorporate other forms of unknown factors, such as novel semantic concepts or novel image styles. As previous CZSL works have overlooked this critical issue, in this research, we first propose the Realistic Compositional Zero-Shot Learning (RCZSL) task, which considers the various types of unknown factors in a unified experimental setting. To achieve this, we first conduct re-labelling on MIT-States and use the pre-trained generative models to obtain images of various domains. Then the entire dataset is split into a training set and a test set, with the latter containing images of unseen concepts, unseen compositions, unseen domains as well as their combinations. Following this, we show that the visual-semantic relationship changes on unseen images, leading us to construct two dynamic modulators to adapt the visual features and composition prototypes in accordance with the input image. We believe that such a dynamic learning method could effectively alleviate the domain shift problem caused by various types of unknown factors. We conduct extensive experiments on benchmark datasets for both the conventional CZSL setting and the proposed RCZSL setting.
Empirical results prove the effectiveness of our method, which significantly outperforms both our baseline method and state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/A Fast Exact Solver with Theoretical Analysis for the Maximum Edge-Weighted Clique Problem b/data/2024/aaai/A Fast Exact Solver with Theoretical Analysis for the Maximum Edge-Weighted Clique Problem new file mode 100644 index 0000000000..fc7cdcc55c --- /dev/null +++ b/data/2024/aaai/A Fast Exact Solver with Theoretical Analysis for the Maximum Edge-Weighted Clique Problem @@ -0,0 +1,5 @@ +The maximum vertex-weighted clique problem (MVWCP) and the maximum edge-weighted clique problem (MEWCP) are two natural extensions of the fundamental maximum clique problem. +In this paper, we systematically study MEWCP and make the following major contributions: +(1) We show that MEWCP is NP-hard even when the minimum degree of the graph is n-2, in contrast to MVWCP which is polynomial-time solvable when the minimum degree of the graph is at least n-3. This result distinguishes the complexity of the two problems for the first time. +(2) To address MEWCP, we develop an efficient branch-and-bound algorithm called MEWCat with both practical and theoretical performance guarantees. In practice, MEWCat utilizes a new upper bound tighter than existing ones, which allows for more efficient pruning of branches. In theory, we prove a running-time bound of O*(1.4423^n) for MEWCat, which breaks the trivial bound of O*(2^n) in the research line of practical exact MEWCP solvers for the first time. +(3) Empirically, we evaluate the performance of MEWCat on various benchmark instances. The experiments demonstrate that MEWCat outperforms state-of-the-art exact solvers significantly. For instance, on 16 DIMACS graphs that the state-of-the-art solver BBEWC fails to solve within 7200 seconds, MEWCat solves all of them with an average time of less than 1000 seconds. On real-world graphs, MEWCat achieves an average speedup of over 36x. \ No newline at end of file diff --git a/data/2024/aaai/A Fixed-Parameter Tractable Algorithm for Counting Markov Equivalence Classes with the Same Skeleton b/data/2024/aaai/A Fixed-Parameter Tractable Algorithm for Counting Markov Equivalence Classes with the Same Skeleton new file mode 100644 index 0000000000..af582724cd --- /dev/null +++ b/data/2024/aaai/A Fixed-Parameter Tractable Algorithm for Counting Markov Equivalence Classes with the Same Skeleton @@ -0,0 +1,11 @@ +Causal DAGs (also known as Bayesian networks) are a popular tool for encoding +conditional dependencies between random variables. In a causal DAG, the random +variables are modeled as vertices in the DAG, and it is stipulated that every +random variable is independent of its non-descendants conditioned on its parents. It +is possible, however, for two different causal DAGs on the same set of random +variables to encode exactly the same set of conditional dependencies. Such +causal DAGs are said to be Markov equivalent, and equivalence classes of +Markov equivalent DAGs are known as Markov Equivalent Classes (MECs).
+Beautiful combinatorial characterizations of MECs have been developed in the +past few decades, and it is known, in particular, that all DAGs in the same MEC +must have the same skeleton (underlying undirected graph) and v-structures (induced subgraph of the form a->b \ No newline at end of file diff --git a/data/2024/aaai/A Fixed-Point Approach to Unified Prompt-Based Counting b/data/2024/aaai/A Fixed-Point Approach to Unified Prompt-Based Counting new file mode 100644 index 0000000000..de12da1d4c --- /dev/null +++ b/data/2024/aaai/A Fixed-Point Approach to Unified Prompt-Based Counting @@ -0,0 +1 @@ +Existing class-agnostic counting models typically rely on a single type of prompt, e.g., box annotations. This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for concerned objects indicated by various prompt types, such as box, point, and text. To achieve this goal, we begin by converting prompts from different modalities into prompt masks without requiring training. These masks are then integrated into a class-agnostic counting methodology for predicting density maps. Furthermore, we introduce a fixed-point inference along with an associated loss function to improve counting accuracy, all without introducing new parameters. The effectiveness of this method is substantiated both theoretically and experimentally. Additionally, a contrastive training scheme is implemented to mitigate dataset bias inherent in current class-agnostic counting datasets, a strategy whose effectiveness is confirmed by our ablation study. Our model excels in prominent class-agnostic datasets and exhibits superior performance in cross-dataset adaptation tasks. \ No newline at end of file diff --git a/data/2024/aaai/A Framework for Approaching AI Education in Educator Preparation Programs b/data/2024/aaai/A Framework for Approaching AI Education in Educator Preparation Programs new file mode 100644 index 0000000000..185254a84a --- /dev/null +++ b/data/2024/aaai/A Framework for Approaching AI Education in Educator Preparation Programs @@ -0,0 +1 @@ +In recent years, the rapid advancement of artificial intelligence (AI) has fostered an urgent need to better prepare current and future educators to be able to integrate AI technologies in their teaching and to teach AI literacy to PreK-12 students. While many organizations have developed professional learning opportunities for inservice educators, a gap remains for resources specifically designed for those facilitating and enrolled in Educator Preparation Programs (EPPs). In response to this gap, the International Society for Technology in Education (ISTE) launched its first AI Explorations for EPPs Faculty Fellowship. As a result of the Faculty Fellows’ collaboration, this paper articulates a framework of seven critical strategies with the potential to address the urgent need EPPs have in preparing preservice teachers to effectively integrate AI-powered instructional tools and to teach this new area of content knowledge in PreK-12 classrooms. In addition, we provide a review of literature and an overview of the emerging needs for integrating AI education in EPPs. We demonstrate why support for preservice teachers’ critical examination and application of AI, including a focus on the issues of equity, ethics, and culturally responsive teaching, is essential to their later success in PreK-12 classrooms. 
Recommendations for further research and learning are also provided to promote community-wide initiatives for supporting the integration of AI in education through Educator Preparation Programs and beyond. \ No newline at end of file diff --git a/data/2024/aaai/A Framework for Data-Driven Explainability in Mathematical Optimization b/data/2024/aaai/A Framework for Data-Driven Explainability in Mathematical Optimization new file mode 100644 index 0000000000..fe7306d0ab --- /dev/null +++ b/data/2024/aaai/A Framework for Data-Driven Explainability in Mathematical Optimization @@ -0,0 +1 @@ +Advancements in mathematical programming have made it possible to efficiently tackle large-scale real-world problems that were deemed intractable just a few decades ago. However, provably optimal solutions may not be accepted due to the perception of optimization software as a black box. Although well understood by scientists, this lacks easy accessibility for practitioners. Hence, we advocate for introducing the explainability of a solution as another evaluation criterion, next to its objective value, which enables us to find trade-off solutions between these two criteria. Explainability is attained by comparing against (not necessarily optimal) solutions that were implemented in similar situations in the past. Thus, solutions are preferred that exhibit similar features. Although we prove that already in simple cases the explainable model is NP-hard, we characterize relevant polynomially solvable cases such as the explainable shortest path problem. Our numerical experiments on both artificial as well as real-world road networks show the resulting Pareto front. It turns out that the cost of enforcing explainability can be very small. \ No newline at end of file diff --git a/data/2024/aaai/A Framework for Mining Speech-to-Text Transcripts of the Customer for Automated Problem Remediation b/data/2024/aaai/A Framework for Mining Speech-to-Text Transcripts of the Customer for Automated Problem Remediation new file mode 100644 index 0000000000..9aacd43540 --- /dev/null +++ b/data/2024/aaai/A Framework for Mining Speech-to-Text Transcripts of the Customer for Automated Problem Remediation @@ -0,0 +1 @@ +Technical support services get several thousand voice calls every year. These calls vary across a range of technical issues or maintenance requests for a suite of hardware and software products. On receiving the call, a support agent creates a service request artifact that contains her interpretation of the customer’s problem. This service request goes through the life cycle of the problem remediation process with the resolution also being recorded as part of the service request. It has been empirically observed that the actual complaint voiced by the customer is often different from the recorded interpretation in the service request. The service request created by support agents runs the risk of missing key information elements present in the customer voice records. In this paper, we build a framework that taps into voice calls and uses unsupervised and supervised learning methods to enrich the service requests with additional information. The enriched data is then used for automated problem resolution.
\ No newline at end of file diff --git a/data/2024/aaai/A General Implicit Framework for Fast NeRF Composition and Rendering b/data/2024/aaai/A General Implicit Framework for Fast NeRF Composition and Rendering new file mode 100644 index 0000000000..d46e6b7844 --- /dev/null +++ b/data/2024/aaai/A General Implicit Framework for Fast NeRF Composition and Rendering @@ -0,0 +1 @@ +A variety of Neural Radiance Fields (NeRF) methods have recently achieved remarkable success in high render speed. However, current accelerating methods are specialized and incompatible with various implicit methods, preventing real-time composition over various types of NeRF works. Because NeRF relies on sampling along rays, it is possible to provide general guidance for acceleration. To that end, we propose a general implicit pipeline for composing NeRF objects quickly. Our method enables the casting of dynamic shadows within or between objects using analytical light sources while allowing multiple NeRF objects to be seamlessly placed and rendered together with any arbitrary rigid transformations. At its core, our work introduces a new surface representation known as Neural Depth Fields (NeDF) that quickly determines the spatial relationship between objects by allowing direct intersection computation between rays and implicit surfaces. It leverages an intersection neural network to query NeRF for acceleration instead of depending on an explicit spatial structure. Our proposed method is the first to enable both the progressive and interactive composition of NeRF objects. Additionally, it serves as a previewing plugin for a range of existing NeRF works. \ No newline at end of file diff --git a/data/2024/aaai/A General Model for Aggregating Annotations AcrossSimple, Complex, and Multi-object Annotation Tasks (Abstract Reprint) b/data/2024/aaai/A General Model for Aggregating Annotations AcrossSimple, Complex, and Multi-object Annotation Tasks (Abstract Reprint) new file mode 100644 index 0000000000..01f50d0c00 --- /dev/null +++ b/data/2024/aaai/A General Model for Aggregating Annotations AcrossSimple, Complex, and Multi-object Annotation Tasks (Abstract Reprint) @@ -0,0 +1,5 @@ +Human annotations are vital to supervised learning, yet annotators often disagree on the correct label, especially as annotation tasks increase in complexity. A common strategy to improve label quality is to ask multiple annotators to label the same item and then aggregate their labels. To date, many aggregation models have been proposed for simple categorical or numerical annotation tasks, but far less work has considered more complex annotation tasks, such as those involving open-ended, multivariate, or structured responses. Similarly, while a variety of bespoke models have been proposed for specific tasks, our work is the first we are aware of to introduce aggregation methods that generalize across many, diverse complex tasks, including sequence labeling, translation, syntactic parsing, ranking, bounding boxes, and keypoints. This generality is achieved by applying readily available task-specific distance functions, then devising a task-agnostic method to model these distances between labels, rather than the labels themselves. + +This article presents a unified treatment of our prior work on complex annotation modeling and extends that work with an investigation of three new research questions. First, how do complex annotation task and dataset properties impact aggregation accuracy?
Second, how should a task owner navigate the many modeling choices in order to maximize aggregation accuracy? Finally, what tests and diagnoses can verify that aggregation models are specified correctly for the given data? To understand how various factors impact accuracy and to inform model selection, we conduct large-scale simulation studies and broad experiments on real, complex datasets. Regarding testing, we introduce the concept of unit tests for aggregation models and present a suite of such tests to ensure that a given model is not mis-specified and exhibits expected behavior. + +Beyond investigating these research questions above, we discuss the foundational concept and nature of annotation complexity, present a new aggregation model as a conceptual bridge between traditional models and our own, and contribute a new general semisupervised learning method for complex label aggregation that outperforms prior work. \ No newline at end of file diff --git a/data/2024/aaai/A General Search-Based Framework for Generating Textual Counterfactual Explanations b/data/2024/aaai/A General Search-Based Framework for Generating Textual Counterfactual Explanations new file mode 100644 index 0000000000..55579d7c39 --- /dev/null +++ b/data/2024/aaai/A General Search-Based Framework for Generating Textual Counterfactual Explanations @@ -0,0 +1,5 @@ +One of the prominent methods for explaining the decision of a machine-learning classifier is by a counterfactual example. +Most current algorithms for generating such examples in the textual domain are based on generative language models. Generative models, however, are trained to minimize a specific loss function in order to fulfill certain requirements for the generated texts. Any change in the requirements may necessitate costly retraining, thus potentially limiting their applicability. +In this paper, we present a general search-based framework for generating counterfactual explanations in the textual domain. +Our framework is model-agnostic, domain-agnostic, anytime, and does not require retraining in order to adapt to changes in the user requirements. +We model the task as a search problem in a space where the initial state is the classified text, and the goal state is a text in a given target class. Our framework includes domain-independent modification operators, but can also exploit domain-specific knowledge through specialized operators. The search algorithm attempts to find a text from the target class with minimal user-specified distance from the original classified object. \ No newline at end of file diff --git a/data/2024/aaai/A General Theoretical Framework for Learning Smallest Interpretable Models b/data/2024/aaai/A General Theoretical Framework for Learning Smallest Interpretable Models new file mode 100644 index 0000000000..c82f42d2e8 --- /dev/null +++ b/data/2024/aaai/A General Theoretical Framework for Learning Smallest Interpretable Models @@ -0,0 +1 @@ +We develop a general algorithmic framework that allows us to obtain fixed-parameter tractability for computing smallest symbolic models that represent given data. Our framework applies to all ML model types that admit a certain extension property. By showing this extension property for decision trees, decision sets, decision lists, and binary decision diagrams, we obtain that minimizing these fundamental model types is fixed-parameter tractable. Our framework even applies to ensembles, which combine individual models by majority decision. 
\ No newline at end of file diff --git a/data/2024/aaai/A Generalizable Theory-Driven Agent-Based Framework to Study Conflict-Induced Forced Migration b/data/2024/aaai/A Generalizable Theory-Driven Agent-Based Framework to Study Conflict-Induced Forced Migration new file mode 100644 index 0000000000..aef6995d86 --- /dev/null +++ b/data/2024/aaai/A Generalizable Theory-Driven Agent-Based Framework to Study Conflict-Induced Forced Migration @@ -0,0 +1 @@ +Large-scale population displacements arising from conflict-induced forced migration generate uncertainty and introduce several policy challenges. Addressing these concerns requires an interdisciplinary approach that integrates knowledge from both computational modeling and social sciences. We propose a generalized computational agent-based modeling framework grounded by Theory of Planned Behavior to model conflict-induced migration outflows within Ukraine during the start of that conflict in 2022. Existing migration modeling frameworks that attempt to address policy implications primarily focus on destination while leaving absent a generalized computational framework grounded by social theory focused on the conflict-induced region. We propose an agent-based framework utilizing a spatiotemporal gravity model and a Bi-threshold model over a Graph Dynamical System to update the migration status of agents in conflict-induced regions at fine temporal and spatial granularity. This approach significantly outperforms previous work when examining the case of the Russian invasion of Ukraine. Policy implications of the proposed framework are demonstrated by modeling the migration behavior of Ukrainian civilians attempting to flee from regions encircled by Russian forces. We also showcase the generalizability of the model by simulating a past conflict in Burundi, an alternative conflict setting. Results demonstrate the utility of the framework for assessing conflict-induced migration in varied settings as well as identifying vulnerable civilian populations. \ No newline at end of file diff --git a/data/2024/aaai/A Generalized Neural Diffusion Framework on Graphs b/data/2024/aaai/A Generalized Neural Diffusion Framework on Graphs new file mode 100644 index 0000000000..c81ce2150d --- /dev/null +++ b/data/2024/aaai/A Generalized Neural Diffusion Framework on Graphs @@ -0,0 +1 @@ +Recent studies reveal the connection between GNNs and the diffusion process, which has motivated many diffusion-based GNNs to be proposed. However, since these two mechanisms are closely related, one fundamental question naturally arises: Is there a general diffusion framework that can formally unify these GNNs? The answer to this question can not only deepen our understanding of the learning process of GNNs, but may also open a new door to design a broad new class of GNNs. In this paper, we propose a general diffusion equation framework with the fidelity term, which formally establishes the relationship between the diffusion process and more GNNs. Meanwhile, with this framework, we identify one characteristic of graph diffusion networks, i.e., the current neural diffusion process only corresponds to the first-order diffusion equation. However, by an experimental investigation, we show that the labels of high-order neighbors actually exhibit the monophily property, which induces the similarity based on labels among high-order neighbors without requiring the similarity among first-order neighbors.
This discovery motivates us to design a new high-order neighbor-aware diffusion equation and to derive a new type of graph diffusion network (HiD-Net) based on the framework. With the high-order diffusion equation, HiD-Net is more robust against attacks and works on both homophily and heterophily graphs. We not only theoretically analyze the relation between HiD-Net and the high-order random walk, but also provide a theoretical convergence guarantee. Extensive experimental results demonstrate the effectiveness of HiD-Net over state-of-the-art graph diffusion networks. \ No newline at end of file diff --git a/data/2024/aaai/A Generalized Shuffle Framework for Privacy Amplification: Strengthening Privacy Guarantees and Enhancing Utility b/data/2024/aaai/A Generalized Shuffle Framework for Privacy Amplification: Strengthening Privacy Guarantees and Enhancing Utility new file mode 100644 index 0000000000..3d472df1b9 --- /dev/null +++ b/data/2024/aaai/A Generalized Shuffle Framework for Privacy Amplification: Strengthening Privacy Guarantees and Enhancing Utility @@ -0,0 +1,12 @@ +The shuffle model of local differential privacy is an advanced method of privacy amplification designed to enhance privacy protection with high utility. +It achieves this by randomly shuffling sensitive data, making it more challenging to link individual data points to specific individuals. +However, most existing studies have focused on the shuffle model based on +(ε0,0)-Locally Differentially Private (LDP) randomizers, with limited consideration for complex scenarios such as (ε0,δ0)-LDP or personalized LDP (PLDP). +This hinders a comprehensive understanding of the shuffle model's potential and limits its application in various settings. +To bridge this research gap, we propose a generalized shuffle framework that can be applied to the PLDP setting. This generalization allows for a broader exploration of the privacy-utility trade-off and facilitates the design of privacy-preserving analyses in diverse contexts. +We prove that the shuffled PLDP process approximately preserves μ-Gaussian Differential Privacy with +μ = O(1/√n). +This approach allows us to avoid the limitations and potential inaccuracies associated with inequality estimations. +To strengthen the privacy guarantee, we improve the lower bound by utilizing hypothesis testing instead of relying on rough estimations like the Chernoff bound or Hoeffding's inequality. +Furthermore, extensive comparative evaluations clearly show that our approach outperforms existing methods in achieving strong central privacy guarantees while preserving the utility of the global model. +We have also carefully designed corresponding algorithms for the average function, frequency estimation, and stochastic gradient descent. \ No newline at end of file diff --git a/data/2024/aaai/A Goal Interaction Graph Planning Framework for Conversational Recommendation b/data/2024/aaai/A Goal Interaction Graph Planning Framework for Conversational Recommendation new file mode 100644 index 0000000000..53d9783797 --- /dev/null +++ b/data/2024/aaai/A Goal Interaction Graph Planning Framework for Conversational Recommendation @@ -0,0 +1 @@ +Multi-goal conversational recommender systems (MG-CRS), which are more in line with realistic scenarios, have attracted a lot of attention. MG-CRS can dynamically capture the demands of users in conversation, continuously engage their interests, and make recommendations. 
The key to accomplishing these tasks is to plan a reasonable goal sequence that can naturally guide the user to accept the recommended goal. Previous works have demonstrated that mining the correlations of goals from the goal sequences in the dialogue corpus is helpful for recommending the goal that the user is interested in. However, they independently model correlations for each level of goal (i.e., goal type or entity) and neglect the order in which goals appear in the dialogue. In this paper, we propose a goal interaction graph planning framework which constructs a directed heterogeneous graph to flexibly model the correlations between goals at any level and retain the order of goals. We design a goal interaction graph learning module to model the goal correlations and propagate goal representations via directed edges, then use an encoder and a dual-way fusion decoder to extract the information most relevant to the current goal from the conversation and domain knowledge, making the next-goal prediction fully exploit the prior goal correlations and user feedback. Finally, we generate engaging responses based on the predicted goal sequence to complete the recommendation task. Experiments on two benchmark datasets show that our method achieves significant improvements in both the goal planning and response generation tasks. \ No newline at end of file diff --git a/data/2024/aaai/A Graph Dynamics Prior for Relational Inference b/data/2024/aaai/A Graph Dynamics Prior for Relational Inference new file mode 100644 index 0000000000..f1ca122114 --- /dev/null +++ b/data/2024/aaai/A Graph Dynamics Prior for Relational Inference @@ -0,0 +1 @@ +Relational inference aims to identify interactions between parts of a dynamical system from the observed dynamics. Current state-of-the-art methods fit the dynamics with a graph neural network (GNN) on a learnable graph. They use one-step message-passing GNNs---intuitively the right choice since non-locality of multi-step or spectral GNNs may confuse direct and indirect interactions. But the effective interaction graph depends on the sampling rate and it is rarely localized to direct neighbors, leading to poor local optima for the one-step model. In this work, we propose a graph dynamics prior (GDP) for relational inference. GDP constructively uses error amplification in non-local polynomial filters to steer the solution to the ground-truth graph. To deal with non-uniqueness, GDP simultaneously fits a ``shallow'' one-step model and a polynomial multi-step model with shared graph topology. Experiments show that GDP reconstructs graphs far more accurately than earlier methods, with remarkable robustness to under-sampling. Since appropriate sampling rates for unknown dynamical systems are not known a priori, this robustness makes GDP suitable for real applications in scientific machine learning. Reproducible code is available at https://github.com/DaDaCheng/GDP. \ No newline at end of file diff --git a/data/2024/aaai/A Hierarchical Network for Multimodal Document-Level Relation Extraction b/data/2024/aaai/A Hierarchical Network for Multimodal Document-Level Relation Extraction new file mode 100644 index 0000000000..86d3761680 --- /dev/null +++ b/data/2024/aaai/A Hierarchical Network for Multimodal Document-Level Relation Extraction @@ -0,0 +1 @@ +Document-level relation extraction aims to extract entity relations that span across multiple sentences. This task faces two critical issues: long dependency and mention selection. 
Prior works address the above problems from the textual perspective; however, it is hard to handle these problems solely based on text information. In this paper, we leverage video information to provide additional evidence for understanding long dependencies and offer a wider perspective for identifying relevant mentions, thus giving rise to a new task named Multimodal Document-level Relation Extraction (MDocRE). To tackle this new task, we construct a human-annotated dataset including documents and relevant videos, which, to the best of our knowledge, is the first document-level relation extraction dataset equipped with video clips. We also propose a hierarchical framework to learn interactions between different dependency levels and a textual-guided transformer architecture that incorporates both textual and video modalities. In addition, we utilize a mention gate module to address the mention-selection problem in both modalities. Experiments on our proposed dataset show that 1) incorporating video information greatly improves model performance; 2) our hierarchical framework achieves state-of-the-art results compared with both unimodal and multimodal baselines; 3) by leveraging video information, our model better solves the long-dependency and mention-selection problems. \ No newline at end of file diff --git a/data/2024/aaai/A Huber Loss Minimization Approach to Byzantine Robust Federated Learning b/data/2024/aaai/A Huber Loss Minimization Approach to Byzantine Robust Federated Learning new file mode 100644 index 0000000000..0d79405120 --- /dev/null +++ b/data/2024/aaai/A Huber Loss Minimization Approach to Byzantine Robust Federated Learning @@ -0,0 +1 @@ +Federated learning systems are susceptible to adversarial attacks. To combat this, we introduce a novel aggregator based on Huber loss minimization, and provide a comprehensive theoretical analysis. Under the independent and identically distributed (i.i.d.) assumption, our approach has several advantages compared to existing methods. Firstly, it has optimal dependence on epsilon, which stands for the ratio of attacked clients. Secondly, our approach does not need precise knowledge of epsilon. Thirdly, it allows different clients to have unequal data sizes. We then broaden our analysis to include non-i.i.d. data, such that clients have slightly different distributions. \ No newline at end of file diff --git a/data/2024/aaai/A Hybrid AI Framework for Sensor-Based Personal Health Monitoring towards Precision Health b/data/2024/aaai/A Hybrid AI Framework for Sensor-Based Personal Health Monitoring towards Precision Health new file mode 100644 index 0000000000..94dcaf71ed --- /dev/null +++ b/data/2024/aaai/A Hybrid AI Framework for Sensor-Based Personal Health Monitoring towards Precision Health @@ -0,0 +1 @@ +Non-communicable diseases are on the rise globally, resulting in accelerated efforts to develop personal health monitoring systems for early detection, prediction, and prevention of diseases. This is part of the vision of precision health, an emerging paradigm that focuses on preventing disease before it strikes by encouraging people to actively monitor and work towards improving their health. A key facilitator of this is the use of wearable sensors that can collect and measure physiological data. Although many sensor-based health monitoring systems have been proposed, interoperability of health data and processes, prediction of future health states, and uncertainty management remain open challenges. 
This research aims to alleviate these challenges through the development of a reusable framework integrating both data-driven and knowledge-driven AI within a hybrid AI architecture. \ No newline at end of file diff --git a/data/2024/aaai/A Hybrid Global-Local Perception Network for Lane Detection b/data/2024/aaai/A Hybrid Global-Local Perception Network for Lane Detection new file mode 100644 index 0000000000..489b325f2d --- /dev/null +++ b/data/2024/aaai/A Hybrid Global-Local Perception Network for Lane Detection @@ -0,0 +1 @@ +Lane detection is a critical task in autonomous driving, which requires accurately predicting the complex topology of lanes in various scenarios. While previous methods of lane detection have shown success, challenges still exist, especially in scenarios where lane markings are absent. In this paper, we analyze the role of global and local features in accurately detecting lanes and propose a Hybrid Global-Local Perception Network (HGLNet) to leverage them. Global and local features play distinct roles in lane detection by respectively aiding in the detection of lane instances and the localization of corresponding lanes. HGLNet extracts global semantic context by utilizing a global extraction head that aggregates information about adaptive sampling points around lanes, achieving an optimal trade-off between performance and efficiency. Moreover, we introduce a Multi-hierarchy feature aggregator (MFA) to capture feature hierarchies in both regional and local ranges, elevating the representation of local features. The proposed Hybrid architecture can simultaneously focus on global and local features at different depth levels and efficiently integrate them to sense the global presence of lanes and accurately regress their locations. Experimental results demonstrate that our proposed method improves detection accuracy in various challenging scenarios, outperforming the state-of-the-art lane detection methods. \ No newline at end of file diff --git a/data/2024/aaai/A Joint Framework with Heterogeneous-Relation-Aware Graph and Multi-Channel Label Enhancing Strategy for Event Causality Extraction b/data/2024/aaai/A Joint Framework with Heterogeneous-Relation-Aware Graph and Multi-Channel Label Enhancing Strategy for Event Causality Extraction new file mode 100644 index 0000000000..80ff306fff --- /dev/null +++ b/data/2024/aaai/A Joint Framework with Heterogeneous-Relation-Aware Graph and Multi-Channel Label Enhancing Strategy for Event Causality Extraction @@ -0,0 +1 @@ +Event Causality Extraction (ECE) aims to extract the cause-effect event pairs with their structured event information from plain texts. As far as we know, the existing ECE methods mainly focus on the correlation between arguments, without explicitly modeling the causal relationship between events, and usually design two independent frameworks to extract cause events and effect events, respectively, which cannot effectively capture the dependency between the subtasks. Therefore, we propose a joint multi-label extraction framework for ECE to alleviate the above limitations. In particular, 1) we design a heterogeneous-relation-aware graph module to learn the potential relationships between events and arguments, in which we construct the heterogeneous graph by taking the predefined event types and all the words in the sentence as nodes, and modeling three relationships of "event-event", "event-argument" and "argument-argument" as edges. 
2) We also design a multi-channel label enhancing module to better learn the distributed representation of each label in the multi-label extraction framework, and further enhance the interaction between the subtasks by considering the preliminary results of cause-effect type identification and event argument extraction. The experimental results on the benchmark dataset ECE-CCKS show that our approach outperforms previous state-of-the-art methods, and that our model also performs well on the complex samples with multiple cause-effect event pairs. \ No newline at end of file diff --git a/data/2024/aaai/A Label Disambiguation-Based Multimodal Massive Multiple Instance Learning Approach for Immune Repertoire Classification b/data/2024/aaai/A Label Disambiguation-Based Multimodal Massive Multiple Instance Learning Approach for Immune Repertoire Classification new file mode 100644 index 0000000000..45227d4b00 --- /dev/null +++ b/data/2024/aaai/A Label Disambiguation-Based Multimodal Massive Multiple Instance Learning Approach for Immune Repertoire Classification @@ -0,0 +1 @@ +One individual human’s immune repertoire consists of a huge set of adaptive immune receptors at a certain time point, representing the individual's adaptive immune state. Immune repertoire classification and associated receptor identification have the potential to make a transformative contribution to the development of novel vaccines and therapies. The vast number of instances and exceedingly low witness rate pose a great challenge to the immune repertoire classification, which can be formulated as a Massive Multiple Instance Learning (MMIL) problem. Traditional MIL methods, at both bag-level and instance-level, confront the issues of substantial computational burden or supervision ambiguity when handling massive instances. To address these issues, we propose a novel label disambiguation-based multimodal massive multiple instance learning approach (LaDM³IL) for immune repertoire classification. LaDM³IL adapts the instance-level MIL paradigm to deal with the issue of high computational cost and employs a specially-designed label disambiguation module for label correction, mitigating the impact of misleading supervision. To achieve a more comprehensive representation of each receptor, LaDM³IL leverages a multimodal fusion module with gating-based attention and tensor-fusion to integrate the information from gene segments and amino acid (AA) sequences of each immune receptor. Extensive experiments on the Cytomegalovirus (CMV) and Cancer datasets demonstrate the superior performance of the proposed LaDM³IL for both immune repertoire classification and associated receptor identification tasks. The code is publicly available at https://github.com/Josie-xufan/LaDM3IL. \ No newline at end of file diff --git a/data/2024/aaai/A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis b/data/2024/aaai/A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis new file mode 100644 index 0000000000..3ba90751cc --- /dev/null +++ b/data/2024/aaai/A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis @@ -0,0 +1 @@ +The actual collection of tabular data for sharing involves confidentiality and privacy constraints, leaving the potential risks of machine learning for interventional data analysis unsafely averted. Synthetic data has emerged recently as a privacy-protecting solution to address this challenge. 
However, existing approaches regard discrete and continuous modal features as separate entities, thus falling short in properly capturing their inherent correlations. In this paper, we propose a novel contrastive learning guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realistic multimodal tabular data for scientific research. Our approach introduces a transformer-based fusion module that seamlessly integrates multimodal features, permitting the mining of more informative latent representations. The attention within the fusion module directs the integrated output features to focus on critical components that facilitate the task of generating latent embeddings. Moreover, we formulate a contrastive learning strategy to implicitly constrain the embeddings from discrete features in the latent feature space by pulling similar discrete feature distributions closer while pushing dissimilar ones further away, in order to better enhance the representation of the latent embedding. Experimental results indicate that GTCoder is effective in generating photo-realistic synthetic data, with interactive interpretation of the latent embedding, and performs favorably against some baselines on most real-world and simulated datasets. \ No newline at end of file diff --git a/data/2024/aaai/A Local-Ascending-Global Learning Strategy for Brain-Computer Interface b/data/2024/aaai/A Local-Ascending-Global Learning Strategy for Brain-Computer Interface new file mode 100644 index 0000000000..bb736c4dd3 --- /dev/null +++ b/data/2024/aaai/A Local-Ascending-Global Learning Strategy for Brain-Computer Interface @@ -0,0 +1 @@ +Neuroscience research indicates that the interaction among different functional regions of the brain plays a crucial role in driving various cognitive tasks. Existing studies have primarily focused on constructing either local or global functional connectivity maps within the brain, often lacking an adaptive approach to fuse functional brain regions and explore latent relationships between localized regions during different cognitive tasks. This paper introduces a novel approach called the Local-Ascending-Global Learning Strategy (LAG) to uncover higher-level latent topological patterns among functional brain regions. The strategy initiates from the local connectivity of individual brain functional regions and develops a K-Level Self-Adaptive Ascending Network (SALK) to dynamically capture strong connectivity patterns among brain regions during different cognitive tasks. Through the step-by-step fusion of brain regions, this approach captures higher-level latent patterns, shedding light on the progressively adaptive fusion of various brain functional regions under different cognitive tasks. Notably, this study represents the first exploration of higher-level latent patterns through progressively adaptive fusion of diverse brain functional regions under different cognitive tasks. The proposed LAG strategy is validated using datasets related to fatigue (SEED-VIG), emotion (SEED-IV), and motor imagery (BCI_C_IV_2a). The results demonstrate the generalizability of LAG, achieving satisfactory outcomes in independent-subject experiments across all three datasets. This suggests that LAG effectively characterizes higher-level latent patterns associated with different cognitive tasks, presenting a novel approach to understanding brain patterns in varying cognitive contexts. 
\ No newline at end of file diff --git a/data/2024/aaai/A Model for Estimating the Economic Costs of Computer Vision Systems That Use Deep Learning b/data/2024/aaai/A Model for Estimating the Economic Costs of Computer Vision Systems That Use Deep Learning new file mode 100644 index 0000000000..24c2f6fe01 --- /dev/null +++ b/data/2024/aaai/A Model for Estimating the Economic Costs of Computer Vision Systems That Use Deep Learning @@ -0,0 +1 @@ +Deep learning, the most important subfield of machine learning and artificial intelligence (AI) over the last decade, is considered one of the fundamental technologies underpinning the Fourth Industrial Revolution. But despite its record-breaking history, deep learning’s enormous appetite for compute and data means that sometimes it can be too costly to practically use. In this paper, we connect technical insights from deep learning scaling laws and transfer learning with the economics of IT to propose a framework for estimating the cost of deep learning computer vision systems to achieve a desired level of accuracy. Our tool can be of practical use to AI practitioners in industry or academia to guide investment decisions. \ No newline at end of file diff --git a/data/2024/aaai/A New Benchmark and Model for Challenging Image Manipulation Detection b/data/2024/aaai/A New Benchmark and Model for Challenging Image Manipulation Detection new file mode 100644 index 0000000000..a866560af5 --- /dev/null +++ b/data/2024/aaai/A New Benchmark and Model for Challenging Image Manipulation Detection @@ -0,0 +1 @@ +The ability to detect manipulation in multimedia data is vital in digital forensics. Existing Image Manipulation Detection (IMD) methods are mainly based on detecting anomalous features arising from image editing or double compression artifacts. All existing IMD techniques encounter challenges when it comes to detecting small tampered regions in a large image. Moreover, compression-based IMD approaches face difficulties in cases of double compression with identical quality factors. To investigate the State-of-The-Art (SoTA) IMD methods in those challenging conditions, we introduce a new Challenging Image Manipulation Detection (CIMD) benchmark dataset, which consists of two subsets, for evaluating editing-based and compression-based IMD methods, respectively. The dataset images were manually captured and tampered with, and are accompanied by high-quality annotations. In addition, we propose a new two-branch network model based on HRNet that can better detect both the image-editing and compression artifacts in those challenging conditions. Extensive experiments on the CIMD benchmark show that our model significantly outperforms SoTA IMD methods on CIMD. The dataset is available at: https://github.com/ZhenfeiZ/CIMD. \ No newline at end of file diff --git a/data/2024/aaai/A New Mechanism for Eliminating Implicit Conflict in Graph Contrastive Learning b/data/2024/aaai/A New Mechanism for Eliminating Implicit Conflict in Graph Contrastive Learning new file mode 100644 index 0000000000..f15326b134 --- /dev/null +++ b/data/2024/aaai/A New Mechanism for Eliminating Implicit Conflict in Graph Contrastive Learning @@ -0,0 +1 @@ +Graph contrastive learning (GCL) has attracted considerable attention because it can extract low-dimensional representations of graph data in a self-supervised manner. 
The InfoNCE-based loss function is widely used in graph contrastive learning; it pulls the representations of positive pairs close to each other and pushes the representations of negative pairs away from each other. Recent works mainly focus on designing new augmentation methods or sampling strategies. However, we argue that the widely used InfoNCE-based methods may contain an implicit conflict which seriously confuses models when learning from negative pairs. This conflict is engendered by the encoder's message-passing mechanism and the InfoNCE loss function. As a result, the learned representations of negative samples cannot be pushed far away from each other, compromising the model performance. To the best of our knowledge, this is the first work to report and analyze this conflict in GCL. To address this problem, we propose a simple but effective method called Partial ignored Graph Contrastive Learning (PiGCL). Specifically, PiGCL first dynamically captures the conflicts during training by detecting the gradient of representation similarities. It then enables the loss function to ignore the conflict, allowing the encoder to adaptively learn the ignored information without self-supervised samples. Extensive experiments demonstrate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/A Non-parametric Graph Clustering Framework for Multi-View Data b/data/2024/aaai/A Non-parametric Graph Clustering Framework for Multi-View Data new file mode 100644 index 0000000000..e6ca2723f9 --- /dev/null +++ b/data/2024/aaai/A Non-parametric Graph Clustering Framework for Multi-View Data @@ -0,0 +1,3 @@ +Multi-view graph clustering (MVGC) derives encouraging grouping results by seamlessly integrating abundant information inside heterogeneous data, and has recently attracted surging attention. +Nevertheless, the majority of current MVGC works involve at least one hyper-parameter, which not only requires additional efforts for tuning, but also leads to a complicated solving procedure, +largely harming the flexibility and scalability of the corresponding algorithms. To this end, in this article we are devoted to getting rid of hyper-parameters, and devise a non-parametric graph clustering (NpGC) framework to more practically partition multi-view data. To be specific, we hold that hyper-parameters play a role in balancing the error term and the regularization term so as to form high-quality clustering representations. Therefore, without the assistance of hyper-parameters, how to acquire high-quality representations becomes the key. Inspired by this, we adopt two types of anchors, view-related and view-unrelated, to concurrently mine exclusive characteristics and common characteristics among views. Then, all anchors' information is gathered together via a consensus bipartite graph. In this way, NpGC extracts both complementary and consistent multi-view features, thereby obtaining superior clustering results. Also, its linear complexity enables it to handle datasets with over 120,000 samples. Numerous experiments reveal NpGC's strong points compared to many classical approaches. 
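As a rough, generic illustration of the anchor-plus-consensus-bipartite-graph idea mentioned above, the Python sketch below builds per-view sample-to-anchor affinity graphs, averages them into a consensus bipartite graph, and clusters its singular vectors. Note the caveats: the Gaussian bandwidth, the random anchor choice, and the plain averaging are assumptions made only for this demonstration, and unlike NpGC this sketch is not hyper-parameter-free.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_graph(X, anchors, sigma=1.0):
    # Sample-to-anchor affinities (rows sum to 1); sigma is an illustrative bandwidth.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.exp(-d2 / (2 * sigma ** 2))
    return Z / Z.sum(axis=1, keepdims=True)

def bipartite_spectral_clustering(Z, k):
    # Spectral clustering of the samples via the SVD of the column-normalized bipartite graph.
    d_col = Z.sum(axis=0)
    U, _, _ = np.linalg.svd(Z / np.sqrt(d_col + 1e-12), full_matrices=False)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U[:, :k])

# toy multi-view data: average the per-view anchor graphs into one consensus graph
rng = np.random.default_rng(0)
views = [rng.normal(size=(200, 5)), rng.normal(size=(200, 8))]
anchors = [v[rng.choice(len(v), 10, replace=False)] for v in views]
Z_consensus = np.mean([anchor_graph(v, a) for v, a in zip(views, anchors)], axis=0)
labels = bipartite_spectral_clustering(Z_consensus, k=3)
```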
\ No newline at end of file diff --git a/data/2024/aaai/A Novel Approach for Longitudinal Modeling of Aging Health and Predicting Mortality Rates b/data/2024/aaai/A Novel Approach for Longitudinal Modeling of Aging Health and Predicting Mortality Rates new file mode 100644 index 0000000000..6c8bdfb4be --- /dev/null +++ b/data/2024/aaai/A Novel Approach for Longitudinal Modeling of Aging Health and Predicting Mortality Rates @@ -0,0 +1 @@ +Aging is a complex stochastic process that affects healthy functioning through various pathways. In contrast to the more commonly used cross-sectional methods, our research focuses on longitudinal modeling of aging, a less explored but crucial area. We have developed a Stochastic Differential Equation (SDE) model, at the forefront of aging research, designed to accurately forecast the health trajectories and survival rates of individuals. This model adeptly delineates the connections between different health indicators and provides clear, interpretable results. Our approach utilizes the SDE framework to encapsulate the inherent uncertainty in the aging process. Moreover, it incorporates a Recurrent Neural Network (RNN) to integrate past health data into future health projections. We plan to train and test our model using a comprehensive dataset tailored for aging studies. This model is not only computationally cost-effective but also highly relevant in assessing health risks in older populations, particularly for those at high risk. It can serve as an essential tool in anticipating and preparing for challenges like infectious disease outbreaks. Overall, our research aims to improve health equity and global health security significantly, offering substantial benefits to public health and deepening our understanding of the aging process. \ No newline at end of file diff --git a/data/2024/aaai/A Novel Energy Based Model Mechanism for Multi-Modal Aspect-Based Sentiment Analysis b/data/2024/aaai/A Novel Energy Based Model Mechanism for Multi-Modal Aspect-Based Sentiment Analysis new file mode 100644 index 0000000000..e54a23068a --- /dev/null +++ b/data/2024/aaai/A Novel Energy Based Model Mechanism for Multi-Modal Aspect-Based Sentiment Analysis @@ -0,0 +1 @@ +Multi-modal aspect-based sentiment analysis (MABSA) has recently attracted increasing attention. The span-based extraction methods, such as FSUIE, demonstrate strong performance in sentiment analysis due to their joint modeling of input sequences and target labels. However, previous methods still have certain limitations: (i) They ignore the difference in the focus of visual information between different analysis targets (aspect or sentiment). (ii) Combining features from uni-modal encoders directly may not be sufficient to eliminate the modal gap and can cause difficulties in capturing the image-text pairwise relevance. (iii) Existing span-based methods for MABSA ignore the pairwise relevance of target span boundaries. To tackle these limitations, we propose a novel framework called DQPSA. Specifically, our model contains a Prompt as Dual Query (PDQ) module that uses the prompt as both a visual query and a language query to extract prompt-aware visual information and strengthen the pairwise relevance between visual information and the analysis target. Additionally, we introduce an Energy-based Pairwise Expert (EPE) module that models the boundaries pairing of the analysis target from the perspective of an Energy-based Model. This expert predicts aspect or sentiment span based on pairwise stability. 
Experiments on three widely used benchmarks demonstrate that DQPSA outperforms previous approaches and achieves a new state-of-the-art performance. The code will be released at https://github.com/pengts/DQPSA. \ No newline at end of file diff --git a/data/2024/aaai/A Novel Skip Orthogonal List for Dynamic Optimal Transport Problem b/data/2024/aaai/A Novel Skip Orthogonal List for Dynamic Optimal Transport Problem new file mode 100644 index 0000000000..61e1167e62 --- /dev/null +++ b/data/2024/aaai/A Novel Skip Orthogonal List for Dynamic Optimal Transport Problem @@ -0,0 +1 @@ +Optimal transport is a fundamental topic that has attracted a great amount of attention from the optimization community in the past decades. In this paper, we consider an interesting discrete dynamic optimal transport problem: can we efficiently update the optimal transport plan when the weights or the locations of the data points change? This problem is naturally motivated by several applications in machine learning. For example, we often need to compute the optimal transport cost between two different data sets; if some changes happen to a few data points, should we re-compute the high complexity cost function or update the cost by some efficient dynamic data structure? We are aware that several dynamic maximum flow algorithms have been proposed before, however, the research on dynamic minimum cost flow problem is still quite limited, to the best of our knowledge. We propose a novel 2D Skip Orthogonal List together with some dynamic tree techniques. Although our algorithm is based on the conventional simplex method, it can efficiently find the variable to pivot within expected O(1) time, and complete each pivoting operation within expected O(|V|) time where V is the set of all supply and demand nodes. Since dynamic modifications typically do not introduce significant changes, our algorithm requires only a few simplex iterations in practice. So our algorithm is more efficient than re-computing the optimal transport cost that needs at least one traversal over all |E|=O(|V|^2) variables, where |E| denotes the number of edges in the network. Our experiments demonstrate that our algorithm significantly outperforms existing algorithms in the dynamic scenarios. \ No newline at end of file diff --git a/data/2024/aaai/A PAC Learning Algorithm for LTL and Omega-Regular Objectives in MDPs b/data/2024/aaai/A PAC Learning Algorithm for LTL and Omega-Regular Objectives in MDPs new file mode 100644 index 0000000000..e2205166e5 --- /dev/null +++ b/data/2024/aaai/A PAC Learning Algorithm for LTL and Omega-Regular Objectives in MDPs @@ -0,0 +1 @@ +Linear temporal logic (LTL) and omega-regular objectives---a superset of LTL---have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory. 
\ No newline at end of file diff --git a/data/2024/aaai/A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning b/data/2024/aaai/A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning new file mode 100644 index 0000000000..56d9e2061a --- /dev/null +++ b/data/2024/aaai/A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning @@ -0,0 +1 @@ +Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between RL objective and pessimism, or the utilization of offline and online samples. In this paper, from a novel perspective, we systematically study the challenges that remain in O2O RL and identify that the reason behind the slow improvement of the performance and the instability of online finetuning lies in the inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that the estimation bias and the inaccurate rank of Q-value cause a misleading signal for the policy update, making the standard offline RL algorithms, such as CQL and TD3-BC, ineffective in the online finetuning. Based on this observation, we address the problem of Q-value estimation by two techniques: (1) perturbed value update and (2) increased frequency of Q-value updates. The first technique smooths out biased Q-value estimation with sharp peaks, preventing early-stage policy exploitation of sub-optimal actions. The second one alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoco and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues, and consistently improves the performance against the state-of-the-art methods by up to 83.1%. \ No newline at end of file diff --git a/data/2024/aaai/A Picture Is Worth a Thousand Words: Co-designing Text-to-Image Generation Learning Materials for K-12 with Educators b/data/2024/aaai/A Picture Is Worth a Thousand Words: Co-designing Text-to-Image Generation Learning Materials for K-12 with Educators new file mode 100644 index 0000000000..13e2659dda --- /dev/null +++ b/data/2024/aaai/A Picture Is Worth a Thousand Words: Co-designing Text-to-Image Generation Learning Materials for K-12 with Educators @@ -0,0 +1 @@ +Text-to-image generation (TTIG) technologies are Artificial Intelligence (AI) algorithms that use natural language algorithms in combination with visual generative algorithms. TTIG tools have gained popularity in recent months, garnering interest from non-AI experts, including educators and K-12 students. While they have exciting creative potential when used by K-12 learners and educators for creative learning, they are also accompanied by serious ethical implications, such as data privacy, spreading misinformation, and algorithmic bias. Given the potential learning applications, social implications, and ethical concerns, we designed 6-hour learning materials to teach K-12 teachers from diverse subject expertise about the technical implementation, classroom applications, and ethical implications of TTIG algorithms. We piloted the learning materials titled “Demystify text-to-image generative tools for K-12 educators" with 30 teachers across two workshops with the goal of preparing them to teach about and use TTIG tools in their classrooms. 
We found that teachers demonstrated a technical, applied and ethical understanding of TTIG algorithms and successfully designed prototypes of teaching materials for their classrooms. \ No newline at end of file diff --git a/data/2024/aaai/A Plug-and-Play Quaternion Message-Passing Module for Molecular Conformation Representation b/data/2024/aaai/A Plug-and-Play Quaternion Message-Passing Module for Molecular Conformation Representation new file mode 100644 index 0000000000..eccd22bf2f --- /dev/null +++ b/data/2024/aaai/A Plug-and-Play Quaternion Message-Passing Module for Molecular Conformation Representation @@ -0,0 +1,8 @@ +Graph neural networks have been widely used to represent 3D molecules, which capture molecular attributes and geometric information through various message-passing mechanisms. +This study proposes a novel quaternion message-passing (QMP) module that can be plugged into many existing 3D molecular representation models and enhance their power for distinguishing molecular conformations. +In particular, our QMP module represents the 3D rotations between one chemical bond and its neighbor bonds as a quaternion sequence. +Then, it aggregates the rotations by the chained Hamilton product of the quaternions. +The real part of the output quaternion is invariant to the global 3D rotations of molecules but sensitive to the local torsions caused by twisting bonds, providing discriminative information for training molecular conformation representation models. +In theory, we prove that considering these features enables invariant GNNs to distinguish the conformations caused by bond torsions. +We encapsulate the QMP module with acceleration, so combining existing models with the QMP requires merely one-line code and little computational cost. +Experiments on various molecular datasets show that plugging our QMP module into existing invariant GNNs leads to consistent and significant improvements in molecular conformation representation and downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/A Positive-Unlabeled Metric Learning Framework for Document-Level Relation Extraction with Incomplete Labeling b/data/2024/aaai/A Positive-Unlabeled Metric Learning Framework for Document-Level Relation Extraction with Incomplete Labeling new file mode 100644 index 0000000000..0c64f565be --- /dev/null +++ b/data/2024/aaai/A Positive-Unlabeled Metric Learning Framework for Document-Level Relation Extraction with Incomplete Labeling @@ -0,0 +1 @@ +The goal of document-level relation extraction (RE) is to identify relations between entities that span multiple sentences. Recently, incomplete labeling in document-level RE has received increasing attention, and some studies have used methods such as positive-unlabeled learning to tackle this issue, but there is still a lot of room for improvement. Motivated by this, we propose a positive-augmentation and positive-mixup positive-unlabeled metric learning framework (P3M). Specifically, we formulate document-level RE as a metric learning problem. We aim to pull the distance closer between entity pair embedding and their corresponding relation embedding, while pushing it farther away from the none-class relation embedding. Additionally, we adapt the positive-unlabeled learning to this loss objective. In order to improve the generalizability of the model, we use dropout to augment positive samples and propose a positive-none-class mixup method. 
Extensive experiments show that P3M improves the F1 score by approximately 4-10 points in document-level RE with incomplete labeling, and achieves state-of-the-art results in fully labeled scenarios. Furthermore, P3M has also demonstrated robustness to prior estimation bias in incomplete labeled scenarios. \ No newline at end of file diff --git a/data/2024/aaai/A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields b/data/2024/aaai/A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields new file mode 100644 index 0000000000..b64c701edd --- /dev/null +++ b/data/2024/aaai/A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields @@ -0,0 +1 @@ +Recent advances in implicit neural representation have demonstrated the ability to recover detailed geometry and material from multi-view images. However, the use of simplified lighting models such as environment maps to represent non-distant illumination, or using a network to fit indirect light modeling without a solid basis, can lead to an undesirable decomposition between lighting and material. To address this, we propose a fully differentiable framework named Neural Illumination Fields (NeIF) that uses radiance fields as a lighting model to handle complex lighting in a physically based way. Together with integral lobe encoding for roughness-adaptive specular lobe and leveraging the pre-convolved background for accurate decomposition, the proposed method represents a significant step towards integrating physically based rendering into the NeRF representation. The experiments demonstrate the superior performance of novel-view rendering compared to previous works, and the capability to re-render objects under arbitrary NeRF-style environments opens up exciting possibilities for bridging the gap between virtual and real-world scenes. \ No newline at end of file diff --git a/data/2024/aaai/A Primal-Dual Algorithm for Hybrid Federated Learning b/data/2024/aaai/A Primal-Dual Algorithm for Hybrid Federated Learning new file mode 100644 index 0000000000..fea9c7cc1f --- /dev/null +++ b/data/2024/aaai/A Primal-Dual Algorithm for Hybrid Federated Learning @@ -0,0 +1 @@ +Very few methods for hybrid federated learning, where clients only hold subsets of both features and samples, exist. Yet, this scenario is very important in practical settings. We provide a fast, robust algorithm for hybrid federated learning that hinges on Fenchel Duality. We prove the convergence of the algorithm to the same solution as if the model was trained centrally in a variety of practical regimes. Furthermore, we provide experimental results that demonstrate the performance improvements of the algorithm over a commonly used method in federated learning, FedAvg, and an existing hybrid FL algorithm, HyFEM. We also provide privacy considerations and necessary steps to protect client data. \ No newline at end of file diff --git a/data/2024/aaai/A Privacy Preserving Federated Learning (PPFL) Based Cognitive Digital Twin (CDT) Framework for Smart Cities b/data/2024/aaai/A Privacy Preserving Federated Learning (PPFL) Based Cognitive Digital Twin (CDT) Framework for Smart Cities new file mode 100644 index 0000000000..7227747514 --- /dev/null +++ b/data/2024/aaai/A Privacy Preserving Federated Learning (PPFL) Based Cognitive Digital Twin (CDT) Framework for Smart Cities @@ -0,0 +1 @@ +A Smart City is one that makes better use of city data to make our communities better places to live. 
Typically, this has 3 components: sensing (data collection), analysis and actuation. Privacy, particularly as it relates to citizen's data, is a cross-cutting theme. A Digital Twin (DT) is a virtual replica of a real-world physical entity. Cognitive Digital Twins (CDT) are DTs enhanced with cognitive AI capabilities. Both DTs and CDTs have seen adoption in the manufacturing and industrial sectors however cities are slow to adopt these because of privacy concerns. This work attempts to address these concerns by proposing a Privacy Preserving Federated Learning (PPFL) based Cognitive Digital Twin framework for Smart Cities. \ No newline at end of file diff --git a/data/2024/aaai/A Provably Accurate Randomized Sampling Algorithm for Logistic Regression b/data/2024/aaai/A Provably Accurate Randomized Sampling Algorithm for Logistic Regression new file mode 100644 index 0000000000..f463041ef6 --- /dev/null +++ b/data/2024/aaai/A Provably Accurate Randomized Sampling Algorithm for Logistic Regression @@ -0,0 +1 @@ +In statistics and machine learning, logistic regression is a widely-used supervised learning technique primarily employed for binary classification tasks. When the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of using randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets. \ No newline at end of file diff --git a/data/2024/aaai/A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation b/data/2024/aaai/A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation new file mode 100644 index 0000000000..81dd6e64b4 --- /dev/null +++ b/data/2024/aaai/A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation @@ -0,0 +1 @@ +Column generation (CG) is one of the most successful approaches for solving large-scale linear programming (LP) problems. Given an LP with a prohibitively large number of variables (i.e., columns), the idea of CG is to explicitly consider only a subset of columns and iteratively add potential columns to improve the objective value. While adding the column with the most negative reduced cost can guarantee the convergence of CG, it has been shown that adding multiple columns per iteration rather than a single column can lead to faster convergence. However, it remains a challenge to design a multiple-column selection strategy to select the most promising columns from a large number of candidate columns. In this paper, we propose a novel reinforcement-learning-based (RL) multiple-column selection strategy. 
To the best of our knowledge, it is the first RL-based multiple-column selection strategy for CG. The effectiveness of our approach is evaluated on two sets of problems: the cutting stock problem and the graph coloring problem. Compared to several widely used single-column and multiple-column selection strategies, our RL-based multiple-column selection strategy leads to faster convergence and achieves remarkable reductions in the number of CG iterations and runtime. \ No newline at end of file diff --git a/data/2024/aaai/A Robust Mutual-Reinforcing Framework for 3D Multi-Modal Medical Image Fusion Based on Visual-Semantic Consistency b/data/2024/aaai/A Robust Mutual-Reinforcing Framework for 3D Multi-Modal Medical Image Fusion Based on Visual-Semantic Consistency new file mode 100644 index 0000000000..8b1cee3816 --- /dev/null +++ b/data/2024/aaai/A Robust Mutual-Reinforcing Framework for 3D Multi-Modal Medical Image Fusion Based on Visual-Semantic Consistency @@ -0,0 +1 @@ +This work proposes a robust 3D medical image fusion framework to establish a mutual-reinforcing mechanism between visual fusion and lesion segmentation, achieving their double improvement. Specifically, we explore the consistency between vision and semantics by sharing feature fusion modules. Through the coupled optimization of the visual fusion loss and the lesion segmentation loss, visual-related and semantic-related features will be pulled into the same domain, effectively promoting accuracy improvement in a mutual-reinforcing manner. Further, we establish the robustness guarantees by constructing a two-level refinement constraint in the process of feature extraction and reconstruction. Benefiting from full consideration for common degradations in medical images, our framework can not only provide clear visual fusion results for doctor's observation, but also enhance the defense ability of lesion segmentation against these negatives. Extensive evaluations of visual fusion and lesion segmentation scenarios demonstrate the advantages of our method in terms of accuracy and robustness. Moreover, our proposed framework is generic, which can be well-compatible with existing lesion segmentation algorithms and improve their performance. The code is publicly available at https://github.com/HaoZhang1018/RMR-Fusion. \ No newline at end of file diff --git a/data/2024/aaai/A SAT + Computer Algebra System Verification of the Ramsey Problem R(3, 8) (Student Abstract) b/data/2024/aaai/A SAT + Computer Algebra System Verification of the Ramsey Problem R(3, 8) (Student Abstract) new file mode 100644 index 0000000000..d27590f476 --- /dev/null +++ b/data/2024/aaai/A SAT + Computer Algebra System Verification of the Ramsey Problem R(3, 8) (Student Abstract) @@ -0,0 +1 @@ +The Ramsey problem R(3,8) asks for the smallest n such that every red/blue coloring of the complete graph on n vertices must contain either a blue triangle or a red 8-clique. We provide the first certifiable proof that R(3,8) = 28, automatically generated by a combination of Boolean satisfiability (SAT) solver and a computer algebra system (CAS). This SAT+CAS combination is significantly faster than a SAT-only approach. While the R(3,8) problem was first computationally solved by McKay and Min in 1992, it was not a verifiable proof. The SAT+CAS method that we use for our proof is very general and can be applied to a wide variety of combinatorial problems. 
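To make the flavor of such Ramsey statements concrete, here is a small, self-contained Python sketch that brute-forces the classical fact R(3, 3) = 6 by enumerating all two-colorings; this is purely an illustration of the combinatorial statement, not the SAT+CAS pipeline, and enumeration of this kind is hopeless at the scale of R(3, 8).

```python
from itertools import combinations, product

def has_mono_triangle(coloring, n):
    # coloring maps each edge (i, j) with i < j to 0 (red) or 1 (blue)
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_coloring_has_mono_triangle(n):
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(dict(zip(edges, colors)), n)
        for colors in product((0, 1), repeat=len(edges))
    )

print(every_coloring_has_mono_triangle(5))  # False: K5 admits a triangle-free 2-coloring
print(every_coloring_has_mono_triangle(6))  # True: every 2-coloring of K6 has one, so R(3, 3) = 6
```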
\ No newline at end of file diff --git a/data/2024/aaai/A SAT Solver and Computer Algebra Attack on the Minimum Kochen-Specker Problem (Student Abstract) b/data/2024/aaai/A SAT Solver and Computer Algebra Attack on the Minimum Kochen-Specker Problem (Student Abstract) new file mode 100644 index 0000000000..f5bd9bc099 --- /dev/null +++ b/data/2024/aaai/A SAT Solver and Computer Algebra Attack on the Minimum Kochen-Specker Problem (Student Abstract) @@ -0,0 +1 @@ +The problem of finding the minimum three-dimensional Kochen–Specker (KS) vector system, an important problem in quantum foundations, has remained open for over 55 years. We present a new method to address this problem based on a combination of a Boolean satisfiability (SAT) solver and a computer algebra system (CAS). Our approach improved the lower bound on the size of a KS system from 22 to 24. More importantly, we provide the first computer-verifiable proof certificate of a lower bound to the KS problem with a proof size of 41.6 TiB for order 23. The efficiency is due to the powerful combination of SAT solvers and CAS-based orderly generation. \ No newline at end of file diff --git a/data/2024/aaai/A Score-Based Deterministic Diffusion Algorithm with Smooth Scores for General Distributions b/data/2024/aaai/A Score-Based Deterministic Diffusion Algorithm with Smooth Scores for General Distributions new file mode 100644 index 0000000000..69b8abfbf5 --- /dev/null +++ b/data/2024/aaai/A Score-Based Deterministic Diffusion Algorithm with Smooth Scores for General Distributions @@ -0,0 +1 @@ +Score-matching-based diffusion has been shown to achieve state-of-the-art results in generative modeling. In the original score-matching-based diffusion algorithm, the forward equation is a differential equation for which the probability density evolves according to a linear partial differential equation, the Fokker-Planck equation. A drawback of this approach is that one needs the data distribution to have a Lipschitz logarithmic gradient. This excludes a large class of data distributions that have a compact support. We present a deterministic diffusion process for which the vector fields are always Lipschitz and hence the score does not explode for probability measures with compact support. This deterministic diffusion process can be seen as a regularization of the porous media equation, which enables one to guarantee long-term convergence of the forward process to the noise distribution. Though the porous media equation is itself not always guaranteed to have a Lipschitz vector field, it can be used to understand the closeness of the output of the algorithm to the data distribution as a function of the time horizon and the score matching error. This analysis enables us to show that the algorithm has better dependence on the score matching error than approaches based on stochastic diffusions. Using numerical experiments, we verify our theoretical results on example one- and two-dimensional data distributions which are compactly supported. Additionally, we validate the approach on a modified MNIST data set for which the distribution is concentrated on a compact set. In each of the experiments, the approach using deterministic diffusion performs better than the diffusion algorithm with a stochastic forward process, when considering the FID scores of the generated samples. 
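For reference, the textbook objects invoked in the preceding abstract (this is standard background, not notation taken from the paper): for a forward SDE the marginal density follows the linear Fokker-Planck equation, and the score is its logarithmic gradient, which blows up at the boundary of a compactly supported distribution.

```latex
% For the SDE  dX_t = f(X_t, t)\,dt + g(t)\,dW_t,  the marginal density p_t satisfies
\frac{\partial p_t(x)}{\partial t}
  = -\nabla \cdot \bigl(f(x, t)\, p_t(x)\bigr) + \tfrac{1}{2}\, g(t)^2 \, \Delta p_t(x),
\qquad
s(x, t) = \nabla_x \log p_t(x).
% If p_0 is supported on a compact set, \log p_0 \to -\infty at the boundary of the support,
% so the score s(\cdot, 0) is unbounded there -- the "Lipschitz logarithmic gradient"
% obstruction mentioned above.
```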
\ No newline at end of file diff --git a/data/2024/aaai/A Separation and Alignment Framework for Black-Box Domain Adaptation b/data/2024/aaai/A Separation and Alignment Framework for Black-Box Domain Adaptation new file mode 100644 index 0000000000..d2e516dbf9 --- /dev/null +++ b/data/2024/aaai/A Separation and Alignment Framework for Black-Box Domain Adaptation @@ -0,0 +1 @@ +Black-box domain adaptation (BDA) aims to learn a classifier on an unsupervised target domain while assuming only access to black-box predictors trained from unseen source data. Although a few BDA approaches have demonstrated promise by manipulating the transferred labels, they largely overlook the rich underlying structure in the target domain. To address this problem, we introduce a novel separation and alignment framework for BDA. Firstly, we locate those well-adapted samples via loss ranking and a flexible confidence-thresholding procedure. Then, we introduce a novel graph contrastive learning objective that aligns under-adapted samples to their local neighbors and well-adapted samples. Lastly, the adaptation is achieved by a nearest-centroid-augmented objective that exploits the clustering effect in the feature space. Extensive experiments demonstrate that our proposed method outperforms the best baselines on benchmark datasets, e.g., improving the average per-class accuracy by 4.1% on the VisDA dataset. The source code is available at: https://github.com/MingxuanXia/SEAL. \ No newline at end of file diff --git a/data/2024/aaai/A Sequentially Fair Mechanism for Multiple Sensitive Attributes b/data/2024/aaai/A Sequentially Fair Mechanism for Multiple Sensitive Attributes new file mode 100644 index 0000000000..3a7824663f --- /dev/null +++ b/data/2024/aaai/A Sequentially Fair Mechanism for Multiple Sensitive Attributes @@ -0,0 +1 @@ +In the standard use case of Algorithmic Fairness, the goal is to eliminate the relationship between a sensitive variable and a corresponding score. Throughout recent years, the scientific community has developed a host of definitions and tools to solve this task, which work well in many practical applications. However, the applicability and effectiveness of these tools and definitions become less straightforward in the case of multiple sensitive attributes. To tackle this issue, we propose a sequential framework, which allows us to progressively achieve fairness across a set of sensitive features. We accomplish this by leveraging multi-marginal Wasserstein barycenters, which extend the standard notion of Strong Demographic Parity to the case with multiple sensitive characteristics. This method also provides a closed-form solution for the optimal, sequentially fair predictor, permitting a clear interpretation of inter-sensitive feature correlations. Our approach seamlessly extends to approximate fairness, yielding a framework that accommodates the trade-off between risk and unfairness. This extension permits a targeted prioritization of fairness improvements for a specific attribute within a set of sensitive attributes, allowing for case-specific adaptation. A data-driven estimation procedure for the derived solution is developed, and comprehensive numerical experiments are conducted on both synthetic and real datasets. Our empirical findings decisively underscore the practical efficacy of our post-processing approach in fostering fair decision-making. 
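The closed-form barycenter solution mentioned above reduces, for a single discrete sensitive attribute and a one-dimensional score, to a weighted averaging of group quantile functions. The numpy sketch below is a hedged illustration of that single-attribute building block only; the naive one-attribute-at-a-time loop in the usage example is an assumption for demonstration and is not the paper's multi-marginal, sequentially fair construction.

```python
import numpy as np

def dp_repair(scores, groups):
    """Map scores so that every group of one sensitive attribute shares
    (approximately) the same distribution: the 1-D Wasserstein-barycenter repair."""
    scores, groups = np.asarray(scores, float), np.asarray(groups)
    values, counts = np.unique(groups, return_counts=True)
    weights = counts / counts.sum()
    repaired = np.empty_like(scores)
    for g in values:
        mask = groups == g
        # empirical CDF rank of each score within its own group (midpoint convention)
        ranks = (np.argsort(np.argsort(scores[mask])) + 0.5) / mask.sum()
        # barycenter quantile function = weighted average of the group quantile functions
        repaired[mask] = sum(
            w * np.quantile(scores[groups == h], ranks) for h, w in zip(values, weights)
        )
    return repaired

# toy usage with two hypothetical binary attributes, repaired one after the other
rng = np.random.default_rng(0)
s1, s2 = rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)
raw = rng.normal(size=1000) + 0.8 * s1 + 0.4 * s2
step1 = dp_repair(raw, s1)
step2 = dp_repair(step1, s2)
```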
\ No newline at end of file diff --git a/data/2024/aaai/A Simple and Yet Fairly Effective Defense for Graph Neural Networks b/data/2024/aaai/A Simple and Yet Fairly Effective Defense for Graph Neural Networks new file mode 100644 index 0000000000..fff13b35f1 --- /dev/null +++ b/data/2024/aaai/A Simple and Yet Fairly Effective Defense for Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have emerged as the dominant approach for machine learning on graph-structured data. However, concerns have arisen regarding the vulnerability of GNNs to small adversarial perturbations. Existing defense methods against such perturbations suffer from high time complexity and can negatively impact the model's performance on clean graphs. To address these challenges, this paper introduces NoisyGNNs, a novel defense method that incorporates noise into the underlying model's architecture. We establish a theoretical connection between noise injection and the enhancement of GNN robustness, highlighting the effectiveness of our approach. We further conduct extensive empirical evaluations on the node classification task to validate our theoretical findings, focusing on two popular GNNs: the GCN and GIN. The results demonstrate that NoisyGNN achieves superior or comparable defense performance to existing methods while minimizing added time complexity. The NoisyGNN approach is model-agnostic, allowing it to be integrated with different GNN architectures. Successful combinations of our NoisyGNN approach with existing defense techniques demonstrate even further improved adversarial defense results. Our code is publicly available at: https://github.com/Sennadir/NoisyGNN. \ No newline at end of file diff --git a/data/2024/aaai/A Submodular Optimization Approach to Accountable Loan Approval b/data/2024/aaai/A Submodular Optimization Approach to Accountable Loan Approval new file mode 100644 index 0000000000..2a09c55d0a --- /dev/null +++ b/data/2024/aaai/A Submodular Optimization Approach to Accountable Loan Approval @@ -0,0 +1,3 @@ +In the field of finance, the underwriting process is an essential step in evaluating every loan application. During this stage, the borrowers' creditworthiness and ability to repay the loan are assessed to ultimately decide whether to approve the loan application. One of the core components of underwriting is credit scoring, in which the probability of default is estimated. +As such, there has been significant progress in enhancing the predictive accuracy of credit scoring models through the use of machine learning, but there still exists a need to ultimately construct an approval rule that takes into consideration additional criteria beyond the score itself. This construction process is traditionally done manually to ensure that the approval rule remains interpretable to humans. +In this paper, we outline an automated system for optimizing a rule-based system for approving loan applications, which has been deployed at Hyundai Capital Services (HCS). The main challenge lay in creating a high-quality rule base that is simultaneously simple enough to be interpretable by risk analysts as well as customers, since the approval decision should be accountable. We addressed this challenge through principled submodular optimization. The deployment of our system has led to a 14% annual growth in the volume of loan services at HCS, while maintaining the target bad rate, and has resulted in the approval of customers who might have otherwise been rejected. 
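The abstract does not disclose the exact objective deployed at HCS, so the sketch below only illustrates the kind of principled submodular optimization it mentions: greedily selecting a small, interpretable set of approval rules to maximize a monotone submodular coverage objective. The applicant data, the rule predicates, and the absence of any bad-rate constraint are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical applicants and a pool of simple one-feature threshold rules;
# a real underwriting system's rule language and risk constraints are richer.
applicants = rng.normal(size=(5000, 4))          # e.g. income, tenure, utilization, score
good = rng.random(5000) < 0.8                    # hypothetical "repays the loan" labels
candidate_rules = [(lambda X, j=j, t=t: X[:, j] > t)
                   for j in range(4) for t in (-0.5, 0.0, 0.5, 1.0)]

def coverage(selected):
    """Good applicants approved by at least one selected rule. Coverage of a
    union of sets is monotone submodular, so greedy has a (1 - 1/e) guarantee."""
    if not selected:
        return 0
    approved = np.any([rule(applicants) for rule in selected], axis=0)
    return int(np.sum(approved & good))

def greedy_select(rules, budget):
    chosen, remaining = [], list(rules)
    for _ in range(budget):
        base = coverage(chosen)
        gains = [coverage(chosen + [r]) - base for r in remaining]
        best = int(np.argmax(gains))
        if gains[best] <= 0:            # no rule adds marginal value
            break
        chosen.append(remaining.pop(best))
    return chosen

rule_set = greedy_select(candidate_rules, budget=3)
print(len(rule_set), coverage(rule_set))
```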
\ No newline at end of file diff --git a/data/2024/aaai/A Surprisingly Simple Continuous-Action POMDP Solver: Lazy Cross-Entropy Search Over Policy Trees b/data/2024/aaai/A Surprisingly Simple Continuous-Action POMDP Solver: Lazy Cross-Entropy Search Over Policy Trees new file mode 100644 index 0000000000..9c80f98ad6 --- /dev/null +++ b/data/2024/aaai/A Surprisingly Simple Continuous-Action POMDP Solver: Lazy Cross-Entropy Search Over Policy Trees @@ -0,0 +1 @@ +The Partially Observable Markov Decision Process (POMDP) provides a principled framework for decision making in stochastic partially observable environments. However, computing good solutions for problems with continuous action spaces remains challenging. To ease this challenge, we propose a simple online POMDP solver, called Lazy Cross-Entropy Search Over Policy Trees (LCEOPT). At each planning step, our method uses a novel lazy Cross-Entropy method to search the space of policy trees, which provide a simple policy representation. Specifically, we maintain a distribution on promising finite-horizon policy trees. The distribution is iteratively updated by sampling policies, evaluating them via Monte Carlo simulation, and refitting them to the top-performing ones. Our method is lazy in the sense that it exploits the policy tree representation to avoid redundant computations in policy sampling, evaluation, and distribution update. This leads to computational savings of up to two orders of magnitude. Our LCEOPT is surprisingly simple as compared to existing state-of-the-art methods, yet empirically outperforms them on several continuous-action POMDP problems, particularly for problems with higher-dimensional action spaces. \ No newline at end of file diff --git a/data/2024/aaai/A Survey of Learning Criteria Going beyond the Usual Risk (Abstract Reprint) b/data/2024/aaai/A Survey of Learning Criteria Going beyond the Usual Risk (Abstract Reprint) new file mode 100644 index 0000000000..ec86761448 --- /dev/null +++ b/data/2024/aaai/A Survey of Learning Criteria Going beyond the Usual Risk (Abstract Reprint) @@ -0,0 +1 @@ +Virtually all machine learning tasks are characterized using some form of loss function, and “good performance” is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of “what makes for a desirable loss distribution?” in place of tacit use of the expected loss. \ No newline at end of file diff --git a/data/2024/aaai/A Theory of Non-acyclic Generative Flow Networks b/data/2024/aaai/A Theory of Non-acyclic Generative Flow Networks new file mode 100644 index 0000000000..ae5422ab2f --- /dev/null +++ b/data/2024/aaai/A Theory of Non-acyclic Generative Flow Networks @@ -0,0 +1 @@ +GFlowNets is a novel flow-based method for learning a stochastic policy to generate objects via a sequence of actions and with probability proportional to a given positive reward. We contribute to relaxing hypotheses limiting the application range of GFlowNets, in particular: acyclicity (or lack thereof). 
To this end, we extend the theory of GFlowNets to measurable spaces, which include continuous state spaces without cycle restrictions, and provide a generalization of cycles in this generalized context. We show that the losses used so far push flows to get stuck in cycles, and we define a family of losses that solves this issue. Experiments on graphs and continuous tasks validate these principles. \ No newline at end of file diff --git a/data/2024/aaai/A Toolbox for Modelling Engagement with Educational Videos b/data/2024/aaai/A Toolbox for Modelling Engagement with Educational Videos new file mode 100644 index 0000000000..35210ed8f5 --- /dev/null +++ b/data/2024/aaai/A Toolbox for Modelling Engagement with Educational Videos @@ -0,0 +1 @@ +With the advancement and utility of Artificial Intelligence (AI), personalising education to a global population could be a cornerstone of new educational systems in the future. This work presents the PEEKC dataset and the TrueLearn Python library, which together provide a dataset and a series of online learner state models that are essential for facilitating research on learner engagement modelling. The TrueLearn family of models was designed following the "open learner" concept, using humanly intuitive user representations. This family of scalable, online models also helps end-users visualise the learner models, which may in the future facilitate user interaction with their models/recommenders. The extensive documentation and coding examples make the library highly accessible to both machine learning developers and educational data mining and learning analytics practitioners. The experiments show the utility of both the dataset and the library, with predictive performance significantly exceeding comparative baseline models. The dataset contains a large number of AI-related educational videos, which are of interest for building and validating AI-specific educational recommenders. \ No newline at end of file diff --git a/data/2024/aaai/A Transfer Approach Using Graph Neural Networks in Deep Reinforcement Learning b/data/2024/aaai/A Transfer Approach Using Graph Neural Networks in Deep Reinforcement Learning new file mode 100644 index 0000000000..bdf57df275 --- /dev/null +++ b/data/2024/aaai/A Transfer Approach Using Graph Neural Networks in Deep Reinforcement Learning @@ -0,0 +1 @@ +Transfer learning (TL) has shown great potential to improve Reinforcement Learning (RL) efficiency by leveraging prior knowledge in new tasks. However, much of the existing TL research focuses on transferring knowledge between tasks that share the same state-action spaces. Further, transfer from multiple source tasks that have different state-action spaces is more challenging and urgently needs to be addressed to improve the generalization and practicality of the method in real-world scenarios. This paper proposes TURRET (Transfer Using gRaph neuRal nETworks), which utilizes the generalization capabilities of Graph Neural Networks (GNNs) to facilitate efficient and effective multi-source policy transfer learning in the state-action mismatch setting. TURRET learns a semantic representation by accounting for the intrinsic property of the agent through GNNs, which leads to a unified state embedding space for all tasks. As a result, TURRET achieves more efficient transfer with strong generalization ability between different tasks and can be easily combined with existing Deep RL algorithms.
Experimental results show that TURRET significantly outperforms other TL methods on multiple continuous-action control tasks, successfully transferring across robots with different state-action spaces. \ No newline at end of file diff --git a/data/2024/aaai/A Twist for Graph Classification: Optimizing Causal Information Flow in Graph Neural Networks b/data/2024/aaai/A Twist for Graph Classification: Optimizing Causal Information Flow in Graph Neural Networks new file mode 100644 index 0000000000..58c23bf57f --- /dev/null +++ b/data/2024/aaai/A Twist for Graph Classification: Optimizing Causal Information Flow in Graph Neural Networks @@ -0,0 +1 @@ +Graph neural networks (GNNs) have achieved state-of-the-art results on many graph representation learning tasks by exploiting statistical correlations. However, numerous observations have shown that such correlations may not reflect the true causal mechanisms underlying the data and thus may hamper the ability of the model to generalize beyond the observed distribution. To address this problem, we propose an Information-based Causal Learning (ICL) framework that combines information theory and causality to analyze and improve graph representation learning, transforming informational relevance into causal dependence. Specifically, we first introduce a multi-objective mutual information optimization objective, derived from information-theoretic analysis and causal learning principles, to simultaneously extract invariant and interpretable causal information and reduce reliance on non-causal information in correlations. To optimize these multiple objectives, we introduce a causal disentanglement layer that effectively decouples the causal and non-causal information in the graph representations. Moreover, due to the intractability of mutual information estimation, we derive variational bounds that enable us to transform the above objective into a tractable loss function. To balance the multiple information objectives and avoid optimization conflicts, we leverage multi-objective gradient descent to achieve a stable and efficient transformation from informational correlation to causal dependency. Our approach provides important insights into modulating the information flow in GNNs to enhance their reliability and generalization. Extensive experiments demonstrate that our approach significantly improves the robustness and interpretability of GNNs across different distribution shifts. Visual analysis demonstrates how our method converts informative dependencies in representations into causal dependencies. \ No newline at end of file diff --git a/data/2024/aaai/A Two-Stage Information Extraction Network for Incomplete Multi-View Multi-Label Classification b/data/2024/aaai/A Two-Stage Information Extraction Network for Incomplete Multi-View Multi-Label Classification new file mode 100644 index 0000000000..9d29f121ba --- /dev/null +++ b/data/2024/aaai/A Two-Stage Information Extraction Network for Incomplete Multi-View Multi-Label Classification @@ -0,0 +1 @@ +Recently, multi-view multi-label classification (MvMLC) has received a significant amount of research interest, and many methods have been proposed based on the assumptions of view completeness and label completeness. However, in real-world scenarios, multi-view multi-label data tends to be incomplete due to various uncertainties involved in data collection and manual annotation. As a result, conventional MvMLC methods fail.
In this paper, we propose a new two-stage MvMLC network to solve this incomplete MvMLC problem with partially missing views and missing labels. Different from existing works, our method leverages the diverse information in the partially missing data based on information theory. Specifically, our method aims to minimize task-irrelevant information while maximizing task-relevant information through the principles of information bottleneck theory and mutual information extraction. The first stage of our network involves training view-specific classifiers to concentrate the task-relevant information. Subsequently, in the second stage, the hidden states of these classifiers serve as input for an alignment model, an autoencoder-based mutual information extraction framework, and a weighted fusion classifier to make the final prediction. Extensive experiments performed on five datasets validate that our method outperforms other state-of-the-art methods. Code is available at https://github.com/KevinTan10/TSIEN. \ No newline at end of file diff --git a/data/2024/aaai/A Unified Environmental Network for Pedestrian Trajectory Prediction b/data/2024/aaai/A Unified Environmental Network for Pedestrian Trajectory Prediction new file mode 100644 index 0000000000..6b35e34ab0 --- /dev/null +++ b/data/2024/aaai/A Unified Environmental Network for Pedestrian Trajectory Prediction @@ -0,0 +1 @@ +Accurately predicting pedestrian movements in complex environments is challenging due to social interactions, scene constraints, and pedestrians' multimodal behaviors. Sequential models like long short-term memory fail to effectively integrate scene features to make predicted trajectories comply with scene constraints, due to the disparate feature modalities of scene and trajectory. Though existing convolutional neural network (CNN) models can extract scene features, they are ineffective in mapping these features into scene constraints for pedestrians and struggle to model pedestrian interactions due to the loss of target pedestrian information. To address these issues, we propose a unified environmental network based on CNNs for pedestrian trajectory prediction. We introduce a polar-based method to reflect the distance and direction relationship between any position in the environment and the target pedestrian. This enables us to simultaneously model scene constraints and pedestrian social interactions in the form of feature maps. Additionally, we capture essential local features in the feature map, characterizing potential multimodal movements of pedestrians at each time step to prevent redundant predicted trajectories. We verify the performance of our proposed model on four trajectory prediction datasets, encompassing both short-term and long-term predictions. The experimental results demonstrate the superiority of our approach over existing methods. \ No newline at end of file diff --git a/data/2024/aaai/A Unified Knowledge Transfer Network for Generalized Category Discovery b/data/2024/aaai/A Unified Knowledge Transfer Network for Generalized Category Discovery new file mode 100644 index 0000000000..f6092b9827 --- /dev/null +++ b/data/2024/aaai/A Unified Knowledge Transfer Network for Generalized Category Discovery @@ -0,0 +1 @@ +Generalized Category Discovery (GCD) aims to recognize both known and novel categories in an unlabeled dataset by leveraging another labeled dataset with only known categories.
Without considering knowledge transfer from known to novel categories, current methods usually perform poorly on novel categories due to the lack of corresponding supervision. To mitigate this issue, we propose a unified Knowledge Transfer Network (KTN), which solves two obstacles to knowledge transfer in GCD. First, the mixture of known and novel categories in unlabeled data makes it difficult to identify transfer candidates (i.e., samples with novel categories). For this, we propose an entropy-based method that leverages knowledge in the pre-trained classifier to differentiate known and novel categories without requiring extra data or parameters. Second, the lack of prior knowledge of novel categories presents challenges in quantifying semantic relationships between categories to decide the transfer weights. For this, we model different categories with prototypes and treat their similarities as transfer weights to measure the semantic similarities between categories. On the basis of two treatments, we transfer knowledge from known to novel categories by conducting pre-adjustment of logits and post-adjustment of labels for transfer candidates based on the transfer weights between different categories. With the weighted adjustment, KTN can generate more accurate pseudo-labels for unlabeled data, which helps to learn more discriminative features and boost model performance on novel categories. Extensive experiments show that our method outperforms state-of-the-art models on all evaluation metrics across multiple benchmark datasets. Furthermore, different from previous clustering-based methods that can only work offline with abundant data, KTN can be deployed online conveniently with faster inference speed. Code and data are available at https://github.com/yibai-shi/KTN. \ No newline at end of file diff --git a/data/2024/aaai/A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis b/data/2024/aaai/A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis new file mode 100644 index 0000000000..c549667a03 --- /dev/null +++ b/data/2024/aaai/A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis @@ -0,0 +1 @@ +The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains comparable or better performance than the state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset while achieving state-of-the-art results in motion inbetweening on the LaFAN1 dataset for long transition periods. 
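As a rough illustration of the "patchified skeletons" and masked-reconstruction ideas mentioned above, the sketch below groups joints into body-part tokens and hides a random subset of (frame, part) tokens, which is the input-side step of a masked autoencoder. The 24-joint skeleton, the grouping, and the 50% mask ratio are hypothetical; the actual UNIMASK-M tokenization and its task-specific masking patterns may differ.

```python
import torch

# Hypothetical grouping of a 24-joint skeleton into six four-joint body parts.
BODY_PARTS = [list(range(i, i + 4)) for i in range(0, 24, 4)]

def patchify(motion):
    """motion: (T, 24, 3) joint positions -> (T, 6, 12) body-part tokens."""
    return torch.stack([motion[:, idx, :].flatten(1) for idx in BODY_PARTS], dim=1)

def mask_tokens(tokens, ratio=0.5, mask_value=0.0):
    """Randomly hide a fraction of (frame, body-part) tokens, as in masked
    autoencoding; the boolean mask records which tokens must be reconstructed."""
    T, P, _ = tokens.shape
    hide = torch.rand(T, P) < ratio
    masked = tokens.clone()
    masked[hide] = mask_value
    return masked, hide

motion = torch.randn(60, 24, 3)                  # 60 frames of a toy skeleton
tokens = patchify(motion)                        # (60, 6, 12)
masked, hide = mask_tokens(tokens, ratio=0.5)
print(tokens.shape, hide.float().mean().item())  # roughly half the tokens hidden
```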
\ No newline at end of file diff --git a/data/2024/aaai/A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities b/data/2024/aaai/A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities new file mode 100644 index 0000000000..4f3b0ff84b --- /dev/null +++ b/data/2024/aaai/A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities @@ -0,0 +1 @@ +Multimodal Sentiment Analysis (MSA) has attracted widespread research attention recently. Most MSA studies are based on the assumption of modality completeness. However, many inevitable factors in real-world scenarios lead to uncertain missing modalities, which invalidate the fixed multimodal fusion approaches. To this end, we propose a Unified multimodal Missing modality self-Distillation Framework (UMDF) to handle the problem of uncertain missing modalities in MSA. Specifically, a unified self-distillation mechanism in UMDF drives a single network to automatically learn robust inherent representations from the consistent distribution of multimodal data. Moreover, we present a multi-grained crossmodal interaction module to deeply mine the complementary semantics among modalities through coarse- and fine-grained crossmodal attention. Eventually, a dynamic feature integration module is introduced to enhance the beneficial semantics in incomplete modalities while filtering the redundant information therein to obtain a refined and robust multimodal representation. Comprehensive experiments on three datasets demonstrate that our framework significantly improves MSA performance under both uncertain missing-modality and complete-modality testing conditions. \ No newline at end of file diff --git a/data/2024/aaai/A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis b/data/2024/aaai/A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis new file mode 100644 index 0000000000..47d083d84e --- /dev/null +++ b/data/2024/aaai/A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis @@ -0,0 +1 @@ +Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images. Although existing prompt engineering methods can provide high-level guidance, it is challenging for novice users to achieve the desired results by manually entering prompts due to a discrepancy between novice-user-input prompts and the model-preferred prompts. To bridge the distribution gap between user input behavior and model training datasets, we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and propose a novel User-Friendly Fine-Grained Text Generation framework (UF-FGTG) for automated prompt optimization. For CFP, we construct a novel dataset for text-to-image tasks that combines coarse and fine-grained prompts to facilitate the development of automated prompt generation methods. For UF-FGTG, we propose a novel framework that automatically translates user-input prompts into model-preferred prompts. Specifically, we propose a prompt refiner that continually rewrites prompts to empower users to select results that align with their unique needs. Meanwhile, we integrate image-related loss functions from the text-to-image model into the training process of text generation to generate model-preferred prompts. 
Additionally, we propose an adaptive feature extraction module to ensure diversity in the generated results. Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics. Data and code are available at https://github.com/Naylenv/UF-FGTG. \ No newline at end of file diff --git a/data/2024/aaai/A Variational Autoencoder for Neural Temporal Point Processes with Dynamic Latent Graphs b/data/2024/aaai/A Variational Autoencoder for Neural Temporal Point Processes with Dynamic Latent Graphs new file mode 100644 index 0000000000..2adfacedb7 --- /dev/null +++ b/data/2024/aaai/A Variational Autoencoder for Neural Temporal Point Processes with Dynamic Latent Graphs @@ -0,0 +1 @@ +Continuously observed event occurrences often exhibit self- and mutually exciting effects, which can be well modeled using temporal point processes. Beyond that, these event dynamics may also change over time, with certain periodic trends. We propose a novel variational autoencoder to capture such a mixture of temporal dynamics. More specifically, the whole time interval of the input sequence is partitioned into a set of subintervals. The event dynamics are assumed to be stationary within each subinterval, but may change across subintervals. In particular, we use a sequential latent variable model to learn a dependency graph between the observed dimensions for each subinterval. The model predicts future event times by using the learned dependency graph to remove the non-contributing influences of past events. By doing so, the proposed model achieves higher accuracy in predicting inter-event times and event types on several real-world event sequences, compared with existing state-of-the-art neural point processes. \ No newline at end of file diff --git a/data/2024/aaai/A Virtual Driving Instructor That Generates Personalized Driving Lessons Based on Student Skill Level b/data/2024/aaai/A Virtual Driving Instructor That Generates Personalized Driving Lessons Based on Student Skill Level new file mode 100644 index 0000000000..8a58eea496 --- /dev/null +++ b/data/2024/aaai/A Virtual Driving Instructor That Generates Personalized Driving Lessons Based on Student Skill Level @@ -0,0 +1 @@ +Currently, students acquire driving skills by practicing in actual traffic conditions and through direct interactions with an instructor. While one-on-one interactions can be tailored to a student’s learning style and skill level, making them effective for learning, they are also inefficient, potentially costly, and not standardized, with limitations on which traffic situations can be safely taught. For these exact reasons, Way AS has developed and commercially deployed a virtual driving instructor that educates students in high-fidelity simulators. In this paper, we present a module, the Lesson generator, that extends the virtual driving instructor to generate personalized lessons for individual students, with the goal of practicing, in a focused and deliberate fashion, the skills that students need to become proficient drivers. A case study is presented, and the path to deployment is discussed.
\ No newline at end of file diff --git a/data/2024/aaai/A Wireframe-Based Approach for Classifying and Acquiring Proficiency in the American Sign Language (Student Abstract) b/data/2024/aaai/A Wireframe-Based Approach for Classifying and Acquiring Proficiency in the American Sign Language (Student Abstract) new file mode 100644 index 0000000000..f25ef285db --- /dev/null +++ b/data/2024/aaai/A Wireframe-Based Approach for Classifying and Acquiring Proficiency in the American Sign Language (Student Abstract) @@ -0,0 +1 @@ +We describe our methodology for classifying ASL (American Sign Language) gestures. Rather than operate directly on raw images of hand gestures, we extract coordinates and render wireframes from individual images to construct a curated training dataset. This dataset is then used in a classifier that is memory efficient and provides effective performance (94% accuracy). Because we construct wireframes that contain information about several angles in the joints that comprise hands, our methodology is amenable to training those interested in learning ASL by identifying targeted errors in their hand gestures. \ No newline at end of file diff --git a/data/2024/aaai/AACP: Aesthetics Assessment of Children's Paintings Based on Self-Supervised Learning b/data/2024/aaai/AACP: Aesthetics Assessment of Children's Paintings Based on Self-Supervised Learning new file mode 100644 index 0000000000..eb92a4accf --- /dev/null +++ b/data/2024/aaai/AACP: Aesthetics Assessment of Children's Paintings Based on Self-Supervised Learning @@ -0,0 +1 @@ +The Aesthetics Assessment of Children's Paintings (AACP) is an important branch of image aesthetics assessment (IAA), playing a significant role in children's education. This task presents unique challenges, such as limited available data and the requirement for evaluation metrics from multiple perspectives. However, previous approaches have relied on training on large datasets and subsequently assigning an aesthetics score to the image, which is not applicable to AACP. To solve this problem, we construct an aesthetics assessment dataset of children's paintings and a model based on self-supervised learning. 1) We build a novel dataset composed of two parts: the first part contains more than 20k unlabeled images of children's paintings; the second part contains 1.2k images of children's paintings, and each image contains eight attributes labeled by multiple design experts. 2) We design a pipeline that includes a feature extraction module, perception modules, and a disentangled evaluation module. 3) We conduct both qualitative and quantitative experiments to compare our model's performance with five other methods using the AACP dataset. Our experiments reveal that our method can accurately capture aesthetic features and achieve state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/ACAMDA: Improving Data Efficiency in Reinforcement Learning through Guided Counterfactual Data Augmentation b/data/2024/aaai/ACAMDA: Improving Data Efficiency in Reinforcement Learning through Guided Counterfactual Data Augmentation new file mode 100644 index 0000000000..19df27fb3f --- /dev/null +++ b/data/2024/aaai/ACAMDA: Improving Data Efficiency in Reinforcement Learning through Guided Counterfactual Data Augmentation @@ -0,0 +1 @@ +Data augmentation plays a crucial role in improving the data efficiency of reinforcement learning (RL). However, the generation of high-quality augmented data remains a significant challenge.
To overcome this, we introduce ACAMDA (Adversarial Causal Modeling for Data Augmentation), a novel framework that integrates two causality-based tasks: causal structure recovery and counterfactual estimation. The unique aspect of ACAMDA lies in its ability to recover temporal causal relationships from limited non-expert datasets. The identification of the sequential cause-and-effect allows the creation of realistic yet unobserved scenarios. We utilize this characteristic to generate guided counterfactual datasets, which, in turn, substantially reduces the need for extensive data collection. By simulating various state-action pairs under hypothetical actions, ACAMDA enriches the training dataset for diverse and heterogeneous conditions. Our experimental evaluation shows that ACAMDA outperforms existing methods, particularly when applied to novel and unseen domains. \ No newline at end of file diff --git a/data/2024/aaai/ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning b/data/2024/aaai/ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning new file mode 100644 index 0000000000..bebf51c639 --- /dev/null +++ b/data/2024/aaai/ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning @@ -0,0 +1 @@ +Decision Transformer (DT), which employs expressive sequence modeling techniques to perform action generation, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, which is known to bear some weaknesses such as the susceptibility to environmental stochasticity. To overcome DT's weaknesses, we propose to empower DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximated value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context with estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which are suitable for different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. Finally, during testing, ACT generates actions conditioned on a desired advantage. Our evaluation results validate that, by leveraging the power of dynamic programming, ACT demonstrates effective trajectory stitching and robust action generation in spite of the environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we conduct an in-depth analysis of ACT's various design choices through ablation studies. Our code is available at https://github.com/LAMDA-RL/ACT. \ No newline at end of file diff --git a/data/2024/aaai/ADA-GAD: Anomaly-Denoised Autoencoders for Graph Anomaly Detection b/data/2024/aaai/ADA-GAD: Anomaly-Denoised Autoencoders for Graph Anomaly Detection new file mode 100644 index 0000000000..8013ed5490 --- /dev/null +++ b/data/2024/aaai/ADA-GAD: Anomaly-Denoised Autoencoders for Graph Anomaly Detection @@ -0,0 +1 @@ +Graph anomaly detection is crucial for identifying nodes that deviate from regular behavior within graphs, benefiting various domains such as fraud detection and social network. 
Although existing reconstruction-based methods have achieved considerable success, they may face the Anomaly Overfitting and Homophily Trap problems caused by the abnormal patterns in the graph, breaking the assumption that normal nodes are often better reconstructed than abnormal ones. Our observations indicate that models trained on graphs with fewer anomalies exhibit higher detection performance. Based on this insight, we introduce a novel two-stage framework called Anomaly-Denoised Autoencoders for Graph Anomaly Detection (ADA-GAD). In the first stage, we design a learning-free anomaly-denoised augmentation method to generate graphs with reduced anomaly levels. We pretrain graph autoencoders on these augmented graphs at multiple levels, which enables the graph autoencoders to capture normal patterns. In the next stage, the decoders are retrained for detection on the original graph, benefiting from the multi-level representations learned in the previous stage. Meanwhile, we propose the node anomaly distribution regularization to further alleviate Anomaly Overfitting. We validate the effectiveness of our approach through extensive experiments on both synthetic and real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis b/data/2024/aaai/AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis new file mode 100644 index 0000000000..f141f71fca --- /dev/null +++ b/data/2024/aaai/AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis @@ -0,0 +1 @@ +Audio-driven talking head synthesis is a promising topic with wide applications in digital humans, filmmaking, and virtual reality. Recent NeRF-based approaches have shown superiority in quality and fidelity compared to previous studies. However, when it comes to few-shot talking head generation, a practical scenario where only a few seconds of talking video are available for one identity, two limitations emerge: 1) they either have no base model, which serves as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., the mouth is audio-related, while the ears are audio-independent. In this paper, we present the Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle the above issues, which can generate realistic portraits of a new speaker from a few-shot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weight is determined by the similarity of the audio between the reference and target images. Then, an Audio-Aligned Face Generation strategy is proposed to model the audio-related and audio-independent regions respectively, with a dual-NeRF framework. Extensive experiments have shown that AE-NeRF surpasses the state-of-the-art in image fidelity, audio-lip synchronization, and generalization ability, even with a limited training set or limited training iterations.
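A minimal sketch of the audio-similarity-weighted fusion idea described for the Audio Aware Aggregation module: reference-frame features are combined with weights derived from how close each reference's audio embedding is to the target audio. The feature dimensions, the cosine-similarity/softmax form, and the temperature are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def audio_aware_aggregate(ref_feats, ref_audio, tgt_audio, temperature=0.1):
    """Fuse per-reference image features with weights given by the cosine
    similarity between each reference's audio embedding and the target audio.
    ref_feats: (N, C), ref_audio: (N, A), tgt_audio: (A,)."""
    sim = F.cosine_similarity(ref_audio, tgt_audio.unsqueeze(0), dim=-1)  # (N,)
    weights = torch.softmax(sim / temperature, dim=0)                     # sums to 1
    return (weights.unsqueeze(-1) * ref_feats).sum(dim=0)                 # (C,)

ref_feats = torch.randn(4, 256)   # features from 4 hypothetical reference frames
ref_audio = torch.randn(4, 64)    # their audio embeddings
tgt_audio = torch.randn(64)       # audio embedding at the target time step
fused = audio_aware_aggregate(ref_feats, ref_audio, tgt_audio)
print(fused.shape)                # torch.Size([256])
```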
\ No newline at end of file diff --git a/data/2024/aaai/AGS: Affordable and Generalizable Substitute Training for Transferable Adversarial Attack b/data/2024/aaai/AGS: Affordable and Generalizable Substitute Training for Transferable Adversarial Attack new file mode 100644 index 0000000000..c1357e9b6c --- /dev/null +++ b/data/2024/aaai/AGS: Affordable and Generalizable Substitute Training for Transferable Adversarial Attack @@ -0,0 +1 @@ +In practical black-box attack scenarios, most of the existing transfer-based attacks employ pretrained models (e.g., ResNet50) as the substitute models. Unfortunately, these substitute models are not always appropriate for transfer-based attacks. Firstly, these models are usually trained on a large-scale annotated dataset, which is extremely expensive and time-consuming to construct. Secondly, the primary goal of these models is to perform a specific task, such as image classification; they are not developed for adversarial attacks. To tackle the above issues, i.e., high cost and overfitting to task-specific models, we propose an Affordable and Generalizable Substitute (AGS) training framework tailored for transfer-based adversarial attacks. Specifically, we train the substitute model from scratch via our proposed adversary-centric contrastive learning. This learning mechanism introduces another sample with slight adversarial perturbations as an additional positive view of the input image, and then encourages the adversarial view and the two benign views to interact comprehensively with each other. To further boost the generalizability of the substitute model, we propose adversarial invariant learning to keep the representations of adversarial examples invariant under augmentations of various strengths. Our AGS model can be trained solely with unlabeled and out-of-domain data and avoids overfitting to any task-specific model, because of its inherently self-supervised nature. Extensive experiments demonstrate that our AGS achieves comparable or superior performance to substitute models pretrained on the complete ImageNet training set, when executing attacks across a diverse range of target models, including ViTs, robustly trained models, and object detection and segmentation models. Our source code is available at https://github.com/lwmming/AGS. \ No newline at end of file diff --git a/data/2024/aaai/AI Evaluation Authorities: A Case Study Mapping Model Audits to Persistent Standards b/data/2024/aaai/AI Evaluation Authorities: A Case Study Mapping Model Audits to Persistent Standards new file mode 100644 index 0000000000..b657dd6be7 --- /dev/null +++ b/data/2024/aaai/AI Evaluation Authorities: A Case Study Mapping Model Audits to Persistent Standards @@ -0,0 +1 @@ +Intelligent system audits are labor-intensive assurance activities that are typically performed once and discarded, along with the opportunity to programmatically test all similar products for the market. This study illustrates how several incidents (i.e., harms) involving Named Entity Recognition (NER) can be prevented by scaling up a previously performed audit of NER systems. The audit instrument's diagnostic capacity is maintained through a security model that protects the underlying data (i.e., addresses Goodhart's Law). An open-source evaluation infrastructure is released along with an example derived from a real-world audit that reports aggregated findings without exposing the underlying data.
\ No newline at end of file diff --git a/data/2024/aaai/AI Risk Profiles: A Standards Proposal for Pre-deployment AI Risk Disclosures b/data/2024/aaai/AI Risk Profiles: A Standards Proposal for Pre-deployment AI Risk Disclosures new file mode 100644 index 0000000000..e2912bbcf5 --- /dev/null +++ b/data/2024/aaai/AI Risk Profiles: A Standards Proposal for Pre-deployment AI Risk Disclosures @@ -0,0 +1 @@ +As AI systems’ sophistication and proliferation have increased, awareness of the risks has grown proportionally. The AI industry is increasingly emphasizing the need for transparency, with proposals ranging from standardizing use of technical disclosures, like model cards, to regulatory licensing regimes. Since the AI value chain is complicated, with actors bringing varied expertise, perspectives, and values, it is crucial that consumers of transparency disclosures be able to understand the risks of the AI system in question. In this paper we propose a risk profiling standard which can guide downstream decision-making, including triaging further risk assessment, informing procurement and deployment, and directing regulatory frameworks. The standard is built on our proposed taxonomy of AI risks, which distills the wide variety of risks proposed in the literature into a high-level categorization. We outline the myriad data sources needed to construct informative Risk Profiles and propose a template and methodology for collating risk information into a standard, yet flexible, structure. We apply this methodology to a number of prominent AI systems using publicly available information. To conclude, we discuss design decisions for the profiles and future work. \ No newline at end of file diff --git a/data/2024/aaai/AI, Ethics, and Education: The Pioneering Path of Sidekick Academy b/data/2024/aaai/AI, Ethics, and Education: The Pioneering Path of Sidekick Academy new file mode 100644 index 0000000000..e8fa9c173e --- /dev/null +++ b/data/2024/aaai/AI, Ethics, and Education: The Pioneering Path of Sidekick Academy @@ -0,0 +1 @@ +Generative artificial intelligence (AI) is swiftly cementing its role as an indispensable tool for students transitioning from K-12 to higher education and professional spheres. Yet, harnessing its full potential requires more than mere familiarity. Students must be equipped with the skills to engage with AI both productively and ethically. Left unchecked, AI usage can pose risks, especially if students lack proper guidance or understanding of their actions. Moreover, effective interaction with AI necessitates skills in prompt engineering to yield desired outcomes. Sidekick Academy is a digital online platform where students can safely experiment with and learn about AI. This article delves into the genesis of Sidekick Academy, offering a glimpse into its lessons on how to use AI and complex debate on ethical use. It also sheds light on the academy's "sandbox" - a secure space for students to explore AI without jeopardizing their safety or privacy. \ No newline at end of file diff --git a/data/2024/aaai/AI-Assisted Human Teamwork b/data/2024/aaai/AI-Assisted Human Teamwork new file mode 100644 index 0000000000..689c95a38c --- /dev/null +++ b/data/2024/aaai/AI-Assisted Human Teamwork @@ -0,0 +1 @@ +Effective teamwork translates to fewer preventable errors and higher task performance in collaborative tasks. However, in time-critical tasks, successful teamwork becomes highly challenging to attain. 
In such settings, often, team members have partial observability of their surroundings, incur high cost of communication, and have trouble estimating the state and intent of their teammates. To assist a team in improving teamwork at task time, my doctoral research proposes an automated task-time team intervention system. Grounded in the notion of shared mental models, the system first detects whether the team is on the same page or not. It then generates effective interventions to improve teamwork. Additionally, by leveraging past demonstrations to learn a model of team behavior, this system minimizes the need for domain experts to specify teamwork models and rules. \ No newline at end of file diff --git a/data/2024/aaai/AI-Based Energy Transportation Safety: Pipeline Radial Threat Estimation Using Intelligent Sensing System b/data/2024/aaai/AI-Based Energy Transportation Safety: Pipeline Radial Threat Estimation Using Intelligent Sensing System new file mode 100644 index 0000000000..656980a164 --- /dev/null +++ b/data/2024/aaai/AI-Based Energy Transportation Safety: Pipeline Radial Threat Estimation Using Intelligent Sensing System @@ -0,0 +1 @@ +The application of artificial intelligence technology has greatly enhanced and fortified the safety of energy pipelines, particularly in safeguarding against external threats. The predominant methods involve the integration of intelligent sensors to detect external vibration, enabling the identification of event types and locations, thereby replacing manual detection methods. However, practical implementation has exposed a limitation in current methods - their constrained ability to accurately discern the spatial dimensions of external signals, which complicates the authentication of threat events. Our research endeavors to overcome the above issues by harnessing deep learning techniques to achieve a more fine-grained recognition and localization process. This refinement is crucial in effectively identifying genuine threats to pipelines, thus enhancing the safety of energy transportation. This paper proposes a radial threat estimation method for energy pipelines based on distributed optical fiber sensing technology. Specifically, we introduce a continuous multi-view and multi-domain feature fusion methodology to extract comprehensive signal features and construct a threat estimation and recognition network. The utilization of collected acoustic signal data is optimized, and the underlying principle is elucidated. Moreover, we incorporate the concept of transfer learning through a pre-trained model, enhancing both recognition accuracy and training efficiency. Empirical evidence gathered from real-world scenarios underscores the efficacy of our method, notably in its substantial reduction of false alarms and remarkable gains in recognition accuracy. More generally, our method exhibits versatility and can be extrapolated to a broader spectrum of recognition tasks and scenarios. 
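As a minimal sketch of the transfer-learning ingredient mentioned above (not the paper's full multi-view, multi-domain fusion network), one can start from an ImageNet-pretrained backbone, freeze it, and train only a new head for the threat classes. Treating the fiber-sensing signals as 3-channel spectrogram "images" and the choice of ResNet-18 are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # hypothetical number of radial threat categories

# Load an ImageNet-pretrained backbone and freeze its feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)  # new trainable head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy mini-batch of "spectrogram" inputs.
spectrograms = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(backbone(spectrograms), labels)
loss.backward()
optimizer.step()
print(loss.item())
```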
\ No newline at end of file diff --git a/data/2024/aaai/AI-Enhanced Art Appreciation: Generating Text from Artwork to Promote Inclusivity b/data/2024/aaai/AI-Enhanced Art Appreciation: Generating Text from Artwork to Promote Inclusivity new file mode 100644 index 0000000000..8b897d25f2 --- /dev/null +++ b/data/2024/aaai/AI-Enhanced Art Appreciation: Generating Text from Artwork to Promote Inclusivity @@ -0,0 +1 @@ +Visual art facilitates expression, communication, and connection, yet it remains inaccessible to those who are visually-impaired and those who lack the resources to understand the techniques and history of art. In this work, I propose the development of a generative AI model that generates a description and interpretation of a given artwork. Such research can make art more accessible, support art education, and improve the ability of AI to understand and translate between creative media. Development will begin with a formative study to assess the needs and preferences of blind and low vision people and art experts. Following the formative study, the basic approach is to train the model on a database of artworks and their accompanying descriptions, predict sentiments from extracted visual data, and generate a paragraph closely resembling training textual data and incorporating sentiment analysis. The model will then be evaluated quantitatively through metrics like METEOR and qualitatively through Turing tests in an iterative process. \ No newline at end of file diff --git a/data/2024/aaai/ALISON: Fast and Effective Stylometric Authorship Obfuscation b/data/2024/aaai/ALISON: Fast and Effective Stylometric Authorship Obfuscation new file mode 100644 index 0000000000..245c9963c7 --- /dev/null +++ b/data/2024/aaai/ALISON: Fast and Effective Stylometric Authorship Obfuscation @@ -0,0 +1,3 @@ +Authorship Attribution (AA) and Authorship Obfuscation (AO) are two competing tasks of increasing importance in privacy research. Modern AA leverages an author's consistent writing style to match a text to its author using an AA classifier. AO is the corresponding adversarial task, aiming to modify a text in such a way that its semantics are preserved, yet an AA model cannot correctly infer its authorship. To address privacy concerns raised by state-of-the-art (SOTA) AA methods, +new AO methods have been proposed but remain largely impractical to use due to their prohibitively slow training and obfuscation speed, often taking hours. +To this challenge, we propose a practical AO method, ALISON, that (1) dramatically reduces training/obfuscation time, demonstrating more than 10x faster obfuscation than SOTA AO methods, (2) achieves better obfuscation success through attacking three transformer-based AA methods on two benchmark datasets, typically performing 15% better than competing methods, (3) does not require direct signals from a target AA classifier during obfuscation, and (4) utilizes unique stylometric features, allowing sound model interpretation for explainable obfuscation. We also demonstrate that ALISON can effectively prevent four SOTA AA methods from accurately determining the authorship of ChatGPT-generated texts, all while minimally changing the original text semantics. To ensure the reproducibility of our findings, our code and data are available at: https://github.com/EricX003/ALISON. 
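For context on what stylometric authorship attribution typically looks like, the sketch below builds a generic character n-gram baseline with scikit-learn; ALISON's actual feature set and obfuscation procedure are more involved, and the texts and labels here are toy examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: a few short texts with hypothetical author labels.
texts = ["I reckon the results will hold up nicely.",
         "The results, as demonstrated herein, are robust.",
         "honestly the results r fine imo",
         "We posit that the findings remain stable."]
authors = ["a", "b", "c", "b"]

# Character n-grams carry style (punctuation habits, function words, casing),
# which is exactly the signal authorship obfuscation tries to disguise.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, authors)
print(clf.predict(["The findings, as shown herein, hold."]))
```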
\ No newline at end of file diff --git a/data/2024/aaai/AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion b/data/2024/aaai/AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion new file mode 100644 index 0000000000..17b22b15fb --- /dev/null +++ b/data/2024/aaai/AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion @@ -0,0 +1 @@ +Generating realistic human motion sequences from text descriptions is a challenging task that requires capturing the rich expressiveness of both natural language and human motion. Recent advances in diffusion models have enabled significant progress in human motion synthesis. However, existing methods struggle to handle text inputs that describe complex or long motions. In this paper, we propose the Adaptable Motion Diffusion (AMD) model, which leverages a Large Language Model (LLM) to parse the input text into a sequence of concise and interpretable anatomical scripts that correspond to the target motion. This process exploits the LLM’s ability to provide anatomical guidance for complex motion synthesis. We then devise a two-branch fusion scheme that balances the influence of the input text and the anatomical scripts on the inverse diffusion process, which adaptively ensures the semantic fidelity and diversity of the synthesized motion. Our method can effectively handle texts with complex or long motion descriptions, where existing methods often fail. Experiments on datasets with relatively more complex motions, such as CLCD1 and CLCD2, demonstrate that our AMD significantly outperforms existing state-of-the-art models. \ No newline at end of file diff --git a/data/2024/aaai/AMD: Autoregressive Motion Diffusion b/data/2024/aaai/AMD: Autoregressive Motion Diffusion new file mode 100644 index 0000000000..5fed5817fb --- /dev/null +++ b/data/2024/aaai/AMD: Autoregressive Motion Diffusion @@ -0,0 +1,4 @@ +Human motion generation aims to produce plausible human motion sequences according to various conditional inputs, such as text or audio. Despite the feasibility of existing methods in generating motion based on short prompts and simple motion patterns, they encounter difficulties when dealing with long prompts or complex motions. +The challenges are two-fold: 1) the scarcity of human motion-captured data for long prompts and complex motions. 2) the high diversity of human motions in the temporal domain and the substantial divergence of distributions from conditional modalities, leading to a many-to-many mapping problem when generating motion with complex and long texts. +In this work, we address these gaps by 1) elaborating the first dataset pairing long textual descriptions and 3D complex motions (HumanLong3D), and 2) proposing an autoregressive motion diffusion model (AMD). Specifically, AMD integrates the text prompt at the current timestep with the text prompt and action sequences at the previous timestep as conditional information to predict the current action sequences in an iterative manner. +Furthermore, we present its generalization for X-to-Motion with “No Modality Left Behind”, enabling for the first time the generation of high-definition and high-fidelity human motions based on user-defined modality input. 
\ No newline at end of file diff --git a/data/2024/aaai/AMSP-UOD: When Vortex Convolution and Stochastic Perturbation Meet Underwater Object Detection b/data/2024/aaai/AMSP-UOD: When Vortex Convolution and Stochastic Perturbation Meet Underwater Object Detection new file mode 100644 index 0000000000..e8bba779f4 --- /dev/null +++ b/data/2024/aaai/AMSP-UOD: When Vortex Convolution and Stochastic Perturbation Meet Underwater Object Detection @@ -0,0 +1 @@ +In this paper, we present a novel Amplitude-Modulated Stochastic Perturbation and Vortex Convolutional Network, AMSP-UOD, designed for underwater object detection. AMSP-UOD specifically addresses the impact of non-ideal imaging factors on detection accuracy in complex underwater environments. To mitigate the influence of noise on object detection performance, we propose AMSP Vortex Convolution (AMSP-VConv) to disrupt the noise distribution, enhance feature extraction capabilities, effectively reduce parameters, and improve network robustness. We design the Feature Association Decoupling Cross Stage Partial (FAD-CSP) module, which strengthens the association of long- and short-range features, improving the network performance in complex underwater environments. Additionally, our sophisticated post-processing method, based on non-maximum suppression with aspect-ratio similarity thresholds, optimizes detection in dense scenes, such as waterweed and schools of fish, improving object detection accuracy. Extensive experiments on the URPC and RUOD datasets demonstrate that our method outperforms existing state-of-the-art methods in terms of accuracy and noise immunity. AMSP-UOD proposes an innovative solution with the potential for real-world applications. Our code is available at https://github.com/zhoujingchun03/AMSP-UOD. \ No newline at end of file diff --git a/data/2024/aaai/ANEDL: Adaptive Negative Evidential Deep Learning for Open-Set Semi-supervised Learning b/data/2024/aaai/ANEDL: Adaptive Negative Evidential Deep Learning for Open-Set Semi-supervised Learning new file mode 100644 index 0000000000..4bcf5fbeab --- /dev/null +++ b/data/2024/aaai/ANEDL: Adaptive Negative Evidential Deep Learning for Open-Set Semi-supervised Learning @@ -0,0 +1 @@ +Semi-supervised learning (SSL) methods assume that labeled data, unlabeled data and test data are from the same distribution. Open-set semi-supervised learning (Open-set SSL) considers a more practical scenario, where unlabeled data and test data contain new categories (outliers) not observed in labeled data (inliers). Most previous works focused on outlier detection via binary classifiers, which suffer from insufficient scalability and an inability to distinguish different types of uncertainty. In this paper, we propose a novel framework, Adaptive Negative Evidential Deep Learning (ANEDL), to tackle these limitations. Concretely, we first introduce evidential deep learning (EDL) as an outlier detector to quantify different types of uncertainty, and design different uncertainty metrics for self-training and inference. Furthermore, we propose a novel adaptive negative optimization strategy, making EDL more tailored to the unlabeled dataset containing both inliers and outliers. As demonstrated empirically, our proposed method outperforms existing state-of-the-art methods across four datasets.
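The evidential-deep-learning outlier detector mentioned above builds on standard subjective-logic quantities: non-negative evidence is mapped to Dirichlet parameters, and low total evidence yields high uncertainty. The sketch below computes those standard quantities from raw logits; ANEDL's adaptive negative optimization and its specific uncertainty metrics are not shown, and the softplus evidence function is an assumption.

```python
import torch
import torch.nn.functional as F

def edl_uncertainty(logits):
    """Standard evidential-deep-learning quantities from raw logits:
    evidence -> Dirichlet parameters alpha -> per-class belief masses and a
    single vacuity-style uncertainty u = K / sum(alpha)."""
    evidence = F.softplus(logits)             # non-negative evidence per class
    alpha = evidence + 1.0                    # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    belief = evidence / strength              # subjective-logic belief masses
    uncertainty = logits.size(-1) / strength  # high when total evidence is low
    prob = alpha / strength                   # expected class probabilities
    return belief, uncertainty.squeeze(-1), prob

logits = torch.tensor([[8.0, 0.1, 0.2],       # confident, inlier-like output
                       [0.1, 0.2, 0.1]])      # low evidence: likely outlier
_, u, p = edl_uncertainty(logits)
print(u)   # the second sample has much higher uncertainty
```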
\ No newline at end of file diff --git a/data/2024/aaai/AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries b/data/2024/aaai/AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries new file mode 100644 index 0000000000..4bb2eee1f1 --- /dev/null +++ b/data/2024/aaai/AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries @@ -0,0 +1 @@ +DEtection TRansformer (DETR)-based models have achieved remarkable performance. However, they are accompanied by a large computation overhead, which significantly hinders their application on resource-limited devices. Prior arts attempt to reduce the computational burden of DETR using low-bit quantization, but these methods suffer a severe performance drop under weight-activation-attention low-bit quantization. We observe that the number of matching queries and positive samples strongly affects the representation capacity of queries in DETR, and quantizing the queries of DETR further reduces this representational capacity, leading to a severe performance drop. We introduce a new quantization strategy based on Auxiliary Queries for DETR (AQ-DETR), aiming to enhance the capacity of quantized queries. In addition, a layer-by-layer distillation is proposed to reduce the quantization error between quantized attention and its full-precision counterpart. Through our extensive experiments on large-scale open datasets, the performance of the 4-bit quantization of DETR and Deformable DETR models is comparable to their full-precision counterparts. \ No newline at end of file diff --git a/data/2024/aaai/ASWT-SGNN: Adaptive Spectral Wavelet Transform-Based Self-Supervised Graph Neural Network b/data/2024/aaai/ASWT-SGNN: Adaptive Spectral Wavelet Transform-Based Self-Supervised Graph Neural Network new file mode 100644 index 0000000000..a316f43054 --- /dev/null +++ b/data/2024/aaai/ASWT-SGNN: Adaptive Spectral Wavelet Transform-Based Self-Supervised Graph Neural Network @@ -0,0 +1 @@ +Graph Contrastive Learning (GCL) is a self-supervised method that combines the advantages of Graph Convolutional Networks (GCNs) and contrastive learning, making it promising for learning node representations. However, the GCN encoders used in these methods rely on the Fourier transform to learn fixed graph representations, which is inherently limited by the uncertainty principle involving spatial and spectral localization trade-offs. To overcome the inflexibility of existing methods and the computationally expensive eigen-decomposition and dense matrix multiplication, this paper proposes an Adaptive Spectral Wavelet Transform-based Self-Supervised Graph Neural Network (ASWT-SGNN). The proposed method employs spectral adaptive polynomials to approximate the filter function and optimizes the wavelet using a contrastive loss. This design enables the creation of local filters in both spectral and spatial domains, allowing flexible aggregation of neighborhood information at various scales and facilitating controlled transformation between local and global information. Compared to existing methods, the proposed approach reduces computational complexity and addresses the limitation of graph convolutional neural networks, which are constrained by graph size and lack flexible control over the neighborhood aspect. Extensive experiments on eight benchmark datasets demonstrate that ASWT-SGNN accurately approximates the filter function in high-density spectral regions, avoiding costly eigen-decomposition.
Furthermore, ASWT-SGNN achieves comparable performance to state-of-the-art models in node classification tasks. \ No newline at end of file diff --git a/data/2024/aaai/AT4CTR: Auxiliary Match Tasks for Enhancing Click-Through Rate Prediction b/data/2024/aaai/AT4CTR: Auxiliary Match Tasks for Enhancing Click-Through Rate Prediction new file mode 100644 index 0000000000..d7044b8a82 --- /dev/null +++ b/data/2024/aaai/AT4CTR: Auxiliary Match Tasks for Enhancing Click-Through Rate Prediction @@ -0,0 +1 @@ +Click-through rate (CTR) prediction is a vital task in industrial recommendation systems. Most existing methods focus on the network architecture design of the CTR model for better accuracy and suffer from the data sparsity problem. Especially in industrial recommendation systems, the widely applied negative sample down-sampling technique due to resource limitation worsens the problem, resulting in a decline in performance. In this paper, we propose Auxiliary Match Tasks for enhancing Click-Through Rate (AT4CTR) prediction accuracy by alleviating the data sparsity problem. Specifically, we design two match tasks inspired by collaborative filtering to enhance the relevance modeling between user and item. As the "click" action is a strong signal which indicates the user's preference towards the item directly, we make the first match task aim at pulling closer the representation between the user and the item regarding the positive samples. Since the user's past click behaviors can also be treated as the user him/herself, we apply the next item prediction as the second match task. For both the match tasks, we choose the InfoNCE as their loss function. The two match tasks can provide meaningful training signals to speed up the model's convergence and alleviate the data sparsity. We conduct extensive experiments on one public dataset and one large-scale industrial recommendation dataset. The result demonstrates the effectiveness of the proposed auxiliary match tasks. AT4CTR has been deployed in the real industrial advertising system and has gained remarkable revenue. \ No newline at end of file diff --git a/data/2024/aaai/Abstract Action Scheduling for Optimal Temporal Planning via OMT b/data/2024/aaai/Abstract Action Scheduling for Optimal Temporal Planning via OMT new file mode 100644 index 0000000000..817242d687 --- /dev/null +++ b/data/2024/aaai/Abstract Action Scheduling for Optimal Temporal Planning via OMT @@ -0,0 +1,2 @@ +Given the model of a system with explicit temporal constraints, optimal temporal planning is the problem of finding a schedule of actions that achieves a certain goal while optimizing an objective function. Recent approaches for optimal planning reduce the problem to a series of queries to an Optimization Modulo Theory (OMT) solver: each query encodes a bounded version of the problem, with additional abstract actions representing an over-approximation of the plans beyond the bound. This technique suffers from performance issues, mainly due to the looseness of the over-approximation, which can include many non-executable plans. +In this paper, we propose a refined abstraction for solving optimal temporal planning via OMT by introducing abstract scheduling constraints, which have a double purpose. First, they enforce a partial ordering of abstract actions based on mutual dependencies between them, which leads to a better makespan estimation and allows to prove optimality sooner. 
Second, they implicitly forbid circular self-enabling of abstract actions, which is a common cause of spurious models that severely affects performance in existing approaches. We prove the soundness and completeness of the resulting approach and empirically demonstrate its superiority with respect to the state of the art. \ No newline at end of file diff --git a/data/2024/aaai/Abstract and Explore: A Novel Behavioral Metric with Cyclic Dynamics in Reinforcement Learning b/data/2024/aaai/Abstract and Explore: A Novel Behavioral Metric with Cyclic Dynamics in Reinforcement Learning new file mode 100644 index 0000000000..e889d2fde2 --- /dev/null +++ b/data/2024/aaai/Abstract and Explore: A Novel Behavioral Metric with Cyclic Dynamics in Reinforcement Learning @@ -0,0 +1 @@ +Intrinsic motivation lies at the heart of the exploration of reinforcement learning, which is primarily driven by the agent's inherent satisfaction rather than external feedback from the environment. However, in recent more challenging procedurally-generated environments with high stochasticity and uninformative extrinsic rewards, we identify two significant issues of applying intrinsic motivation. (1) State representation collapse: In existing methods, the learned representations within intrinsic motivation have a high probability to neglect the distinction among different states and be distracted by the task-irrelevant information brought by the stochasticity. (2) Insufficient interrelation among dynamics: Unsuccessful guidance provided by the uninformative extrinsic reward makes the dynamics learning in intrinsic motivation less effective. In light of the above observations, a novel Behavioral metric with Cyclic Dynamics (BCD) is proposed, which considers both cumulative and immediate effects and facilitates the abstraction and exploration of the agent. For the behavioral metric, the successor feature is utilized to reveal the expected future rewards and alleviate the heavy reliance of previous methods on extrinsic rewards. Moreover, the latent variable and vector quantization techniques are employed to enable an accurate measurement of the transition function in a discrete and interpretable manner. In addition, cyclic dynamics is established to capture the interrelations between state and action, thereby providing a thorough awareness of environmental dynamics. Extensive experiments conducted on procedurally-generated environments demonstrate the state-of-the-art performance of our proposed BCD. \ No newline at end of file diff --git a/data/2024/aaai/Abstraction of Situation Calculus Concurrent Game Structures b/data/2024/aaai/Abstraction of Situation Calculus Concurrent Game Structures new file mode 100644 index 0000000000..504ce122ce --- /dev/null +++ b/data/2024/aaai/Abstraction of Situation Calculus Concurrent Game Structures @@ -0,0 +1 @@ +We present a general framework for abstracting agent behavior in multi-agent synchronous games in the situation calculus, which provides a first-order representation of the state and allows us to model how plays depend on the data and objects involved. We represent such games as action theories of a special form called situation calculus synchronous game structures (SCSGSs), in which we have a single action "tick" whose effects depend on the combination of moves selected by the players. 
In our framework, one specifies both an abstract SCSGS and a concrete SCSGS, as well as a refinement mapping that specifies how each abstract move is implemented by a Golog program defined over the concrete SCSGS. We define notions of sound and complete abstraction with respect to a mapping over such SCSGS. To express strategic properties on the abstract and concrete games we adopt a first-order variant of alternating-time mu-calculus mu-ATL-FO. We show that we can exploit abstraction in verifying mu-ATL-FO properties of SCSGSs under the assumption that agents can always execute abstract moves to completion even if not fully controlling their outcomes. \ No newline at end of file diff --git a/data/2024/aaai/Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning b/data/2024/aaai/Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning new file mode 100644 index 0000000000..67ed1d5cf8 --- /dev/null +++ b/data/2024/aaai/Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning @@ -0,0 +1 @@ +Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames –games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to some previously visited states where they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which is a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm such as MAPPO. Experiments in the particle-world environment and Google Research Football environment show SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-neurips. \ No newline at end of file diff --git a/data/2024/aaai/Accelerating Adversarially Robust Model Selection for Deep Neural Networks via Racing b/data/2024/aaai/Accelerating Adversarially Robust Model Selection for Deep Neural Networks via Racing new file mode 100644 index 0000000000..10bd5d96c8 --- /dev/null +++ b/data/2024/aaai/Accelerating Adversarially Robust Model Selection for Deep Neural Networks via Racing @@ -0,0 +1,2 @@ +Recent research has introduced several approaches to formally verify the robustness of neural network models against perturbations in their inputs, such as the ones that occur in adversarial attacks. At the same time, this particular verification task is known to be computationally challenging. More specifically, assessing the robustness of a neural network against input perturbations can easily take several hours of compute time per input vector, even when using state-of-the-art verification approaches. 
In light of this, it becomes challenging to select from a given set of neural network models the one that is best in terms of robust accuracy, i.e., the fraction of instances for which the model is known to be robust against adversarial perturbations, especially when given limited computing resources. +To tackle this problem, we propose a racing method specifically adapted to the domain of robustness verification. This racing method utilises Delta-values, which can be seen as an efficiently computable proxy for the distance of a given input to the decision boundary of a neural network model. We present statistical evidence indicating significant differences in the empirical cumulative distribution between robust and non-robust inputs as a function of Delta-values. Using this information, we show that it is possible to reliably expose vulnerabilities in the model with relatively few input iterations. Overall, when applied to selecting the most robust network from sets of 31 MNIST and 27 CIFAR-10 networks, our proposed method achieves speedups of a factor of 108 and 42, respectively, in terms of cumulative running time compared to standard local robustness verification on the complete testing sets. \ No newline at end of file diff --git a/data/2024/aaai/Accelerating Cutting-Plane Algorithms via Reinforcement Learning Surrogates b/data/2024/aaai/Accelerating Cutting-Plane Algorithms via Reinforcement Learning Surrogates new file mode 100644 index 0000000000..e53a5f03ec --- /dev/null +++ b/data/2024/aaai/Accelerating Cutting-Plane Algorithms via Reinforcement Learning Surrogates @@ -0,0 +1,22 @@ +Discrete optimization belongs to the set of NP-hard +problems, spanning fields such as mixed-integer +programming and combinatorial optimization. A current +standard approach to solving convex discrete optimization +problems is the use of cutting-plane algorithms, which +reach optimal solutions by iteratively adding inequalities +known as cuts to refine a feasible set. Despite the existence +of a number of general-purpose cut-generating algorithms, +large-scale discrete optimization problems continue to suffer +from intractability. In this work, we propose a method for +accelerating cutting-plane algorithms via reinforcement +learning. Our approach uses learned policies as surrogates +for NP-hard elements of the cut generating procedure +in a way that (i) accelerates convergence, and (ii) retains +guarantees of optimality. We apply our method on two types +of problems where cutting-plane algorithms are commonly +used: stochastic optimization, and mixed-integer quadratic +programming. We observe the benefits of our method when +applied to Benders decomposition (stochastic optimization) +and iterative loss approximation (quadratic programming), +achieving up to 45% faster average convergence when +compared to modern alternative algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference b/data/2024/aaai/Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference new file mode 100644 index 0000000000..3ee0b3e315 --- /dev/null +++ b/data/2024/aaai/Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference @@ -0,0 +1,3 @@ +Due to the recent success of diffusion models, text-to-image generation is becoming increasingly popular and achieves a wide range of applications.
Among them, text-to-image editing, or continuous text-to-image generation, attracts lots of attention and can potentially improve the quality of generated images. It's common to see that users may want to slightly edit the generated image by making minor modifications to their input textual descriptions for several rounds of diffusion inference. However, such an image editing process suffers from the low inference efficiency of many existing diffusion models even when using GPU accelerators. + +To solve this problem, we introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for efficient text-to-image editing. The key intuition behind our approach is to utilize the semantic mapping between the minor modifications on the input text and the affected regions on the output image. For each text editing step, FISEdit can 1) automatically identify the affected image regions and 2) utilize the cached feature map of the unchanged regions to accelerate the inference process. For the former, we measure the differences between cached and ad hoc feature maps given the modified textual description, extract the region with significant differences, and capture the affected region by masks. For the latter, we develop an efficient sparse diffusion inference engine that only computes the feature maps for the affected region while reusing the cached statistics for the rest of the image. Finally, extensive empirical results show that FISEdit can be 3.4 times and 4.4 times faster than existing methods on NVIDIA TITAN RTX and A100 GPUs respectively, and even generates more satisfactory images. \ No newline at end of file diff --git a/data/2024/aaai/Accelerating the Global Aggregation of Local Explanations b/data/2024/aaai/Accelerating the Global Aggregation of Local Explanations new file mode 100644 index 0000000000..d20e33c3c1 --- /dev/null +++ b/data/2024/aaai/Accelerating the Global Aggregation of Local Explanations @@ -0,0 +1,7 @@ +Local explanation methods highlight the input tokens that have a considerable impact on the outcome of classifying the document at hand. For example, the Anchor algorithm applies a statistical analysis of the sensitivity of the classifier to changes in the token. Aggregating local explanations over a dataset provides a global explanation of the model. +Such aggregation aims to detect words with the most impact, giving valuable insights about the model, like what it has learned in training and which adversarial examples expose its weaknesses. +However, standard aggregation methods bear a high computational cost: +a naive implementation applies a costly algorithm to each token of each document, and hence, it is infeasible for a typical user operating within a short analysis session. + +We devise techniques for accelerating the global aggregation of the Anchor algorithm. Specifically, our goal is to compute a set of top-k words with the highest global impact according to different aggregation functions. Some of our techniques are lossless and some are lossy. +We show that for a very mild loss of quality, we are able to accelerate the computation by up to 30 times, reducing the computation from hours to minutes. We also devise and study a probabilistic model that accounts for noise in the Anchor algorithm and diminishes the bias toward words that are frequent yet low in impact.
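To make the aggregation step in the "Accelerating the Global Aggregation of Local Explanations" abstract above concrete, here is a naive, hypothetical sketch of global aggregation: each document contributes a token-to-impact map (as an Anchor-style explainer might produce), and the global explanation is the top-k tokens under a chosen aggregation function. The paper's acceleration techniques and probabilistic noise model are not shown; all names are illustrative.

```python
from collections import defaultdict
from heapq import nlargest

def aggregate_global_explanation(local_explanations, k=3, agg="mean"):
    """Naive global aggregation: combine per-document token impact scores into one
    top-k list. `local_explanations` is a list of dicts mapping token -> local score."""
    sums, counts = defaultdict(float), defaultdict(int)
    for doc_scores in local_explanations:
        for token, score in doc_scores.items():
            sums[token] += score
            counts[token] += 1
    if agg == "mean":                      # average impact per occurrence
        global_scores = {t: sums[t] / counts[t] for t in sums}
    else:                                  # total impact over the corpus
        global_scores = dict(sums)
    return nlargest(k, global_scores.items(), key=lambda kv: kv[1])

# Toy corpus of three local explanations; the costly part in practice is producing these.
docs = [
    {"refund": 0.9, "great": 0.1, "slow": 0.4},
    {"refund": 0.8, "slow": 0.5},
    {"great": 0.7, "fast": 0.6},
]
print(aggregate_global_explanation(docs, k=2))
```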
\ No newline at end of file diff --git a/data/2024/aaai/Accurate Parameter Estimation for Safety-Critical Systems with Unmodeled Dynamics (Abstract Reprint) b/data/2024/aaai/Accurate Parameter Estimation for Safety-Critical Systems with Unmodeled Dynamics (Abstract Reprint) new file mode 100644 index 0000000000..a8c0ad28ac --- /dev/null +++ b/data/2024/aaai/Accurate Parameter Estimation for Safety-Critical Systems with Unmodeled Dynamics (Abstract Reprint) @@ -0,0 +1 @@ +Analysis and synthesis of safety-critical autonomous systems are carried out using models which are often dynamic. Two central features of these dynamic systems are parameters and unmodeled dynamics. Much of feedback control design is parametric in nature and as such, accurate and fast estimation of the parameters in the modeled part of the dynamic system is a crucial property for designing risk-aware autonomous systems. This paper addresses the use of a spectral lines-based approach for estimating parameters of the dynamic model of an autonomous system. Existing literature has treated all unmodeled components of the dynamic system as sub-Gaussian noise and proposed parameter estimation using Gaussian noise-based exogenous signals. In contrast, we allow the unmodeled part to have deterministic unmodeled dynamics, which are almost always present in physical systems, in addition to sub-Gaussian noise. In addition, we propose a deterministic construction of the exogenous signal in order to carry out parameter estimation. We introduce a new tool kit which employs the theory of spectral lines, retains the stochastic setting, and leads to non-asymptotic bounds on the parameter estimation error. Unlike the existing stochastic approach, these bounds are tunable through an optimal choice of the spectrum of the exogenous signal leading to accurate parameter estimation. We also show that this estimation is robust to unmodeled dynamics, a property that is not assured by the existing approach. Finally, we show that under ideal conditions with no deterministic unmodeled dynamics, the proposed approach can ensure a Õ(√t) Regret, matching existing literature. Experiments are provided to support all theoretical derivations, which show that the spectral lines-based approach outperforms the Gaussian noise-based method when unmodeled dynamics are present, in terms of both parameter estimation error and Regret obtained using the parameter estimates with a Linear Quadratic Regulator in feedback. \ No newline at end of file diff --git a/data/2024/aaai/Active Learning Guided by Efficient Surrogate Learners b/data/2024/aaai/Active Learning Guided by Efficient Surrogate Learners new file mode 100644 index 0000000000..40c6884fee --- /dev/null +++ b/data/2024/aaai/Active Learning Guided by Efficient Surrogate Learners @@ -0,0 +1 @@ +Re-training a deep learning model each time a single data point receives a new label is impractical due to the inherent complexity of the training process. Consequently, existing active learning (AL) algorithms tend to adopt a batch-based approach where, during each AL iteration, a set of data points is collectively chosen for annotation. However, this strategy frequently leads to redundant sampling, ultimately eroding the efficacy of the labeling procedure. In this paper, we introduce a new AL algorithm that harnesses the power of a Gaussian process surrogate in conjunction with the neural network principal learner. 
Our proposed model adeptly updates the surrogate learner for every new data instance, enabling it to emulate and capitalize on the continuous learning dynamics of the neural network without necessitating a complete retraining of the principal model for each individual label. Experiments on four benchmark datasets demonstrate that this approach yields significant enhancements, either rivaling or aligning with the performance of state-of-the-art techniques. \ No newline at end of file diff --git a/data/2024/aaai/Active Reinforcement Learning for Robust Building Control b/data/2024/aaai/Active Reinforcement Learning for Robust Building Control new file mode 100644 index 0000000000..0609cbe82b --- /dev/null +++ b/data/2024/aaai/Active Reinforcement Learning for Robust Building Control @@ -0,0 +1 @@ +Reinforcement learning (RL) is a powerful tool for optimal control that has found great success in Atari games, the game of Go, robotic control, and building optimization. RL is also very brittle; agents often overfit to their training environment and fail to generalize to new settings. Unsupervised environment design (UED) has been proposed as a solution to this problem, in which the agent trains in environments that have been specially selected to help it learn. Previous UED algorithms focus on trying to train an RL agent that generalizes across a large distribution of environments. This is not necessarily desirable when we wish to prioritize performance in one environment over others. In this work, we will be examining the setting of robust RL building control, where we wish to train an RL agent that prioritizes performing well in normal weather while still being robust to extreme weather conditions. We demonstrate a novel UED algorithm, ActivePLR, that uses uncertainty-aware neural network architectures to generate new training environments at the limit of the RL agent's ability while being able to prioritize performance in a desired base environment. We show that ActivePLR is able to outperform state-of-the-art UED algorithms in minimizing energy usage while maximizing occupant comfort in the setting of building control. \ No newline at end of file diff --git a/data/2024/aaai/Actor Prioritized Experience Replay (Abstract Reprint) b/data/2024/aaai/Actor Prioritized Experience Replay (Abstract Reprint) new file mode 100644 index 0000000000..3706841fd1 --- /dev/null +++ b/data/2024/aaai/Actor Prioritized Experience Replay (Abstract Reprint) @@ -0,0 +1 @@ +A widely-studied deep reinforcement learning (RL) technique known as Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error. Although it has been shown that PER is one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it considerably underperforms off-policy actor-critic algorithms. We theoretically show that actor networks cannot be effectively trained with transitions that have large TD errors. As a result, the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a novel experience replay sampling framework for actor-critic methods, which also regards issues with stability and recent findings behind the poor empirical performance of PER. 
The introduced algorithm suggests a new branch of improvements to PER and schedules effective and efficient training for both actor and critic networks. An extensive set of experiments verifies our theoretical findings, showing that our method outperforms competing approaches and achieves state-of-the-art results over the standard off-policy actor-critic algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations b/data/2024/aaai/Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations new file mode 100644 index 0000000000..d4dc4db5d7 --- /dev/null +++ b/data/2024/aaai/Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations @@ -0,0 +1 @@ +Retrieval models aim at selecting a small set of item candidates which match the preference of a given user. They play a vital role in large-scale recommender systems since subsequent models such as rankers highly depend on the quality of item candidates. However, most existing retrieval models employ a single-round inference paradigm, which may not adequately capture the dynamic nature of user preferences and can get stuck in one area of the item space. In this paper, we propose Ada-Retrieval, an adaptive multi-round retrieval paradigm for recommender systems that iteratively refines user representations to better capture potential candidates in the full item space. Ada-Retrieval comprises two key modules: the item representation adapter and the user representation adapter, designed to inject context information into items' and users' representations. The framework maintains a model-agnostic design, allowing seamless integration with various backbone models such as RNNs or Transformers. We perform experiments on three widely used public datasets, incorporating five powerful sequential recommenders as backbone models. Our results demonstrate that Ada-Retrieval significantly enhances the performance of various base models, with consistent improvements observed across different datasets. Our code and data are publicly available at: https://github.com/ll0ruc/Ada-Retrieval. \ No newline at end of file diff --git a/data/2024/aaai/AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection b/data/2024/aaai/AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection new file mode 100644 index 0000000000..ca638f42e9 --- /dev/null +++ b/data/2024/aaai/AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection @@ -0,0 +1 @@ +Code Clone Detection, which aims to retrieve functionally similar programs from large code bases, has been attracting increasing attention. Modern software often involves a diverse range of programming languages. However, current code clone detection methods are generally limited to only a few popular programming languages due to insufficient annotated data as well as their own model design constraints. To address these issues, we present AdaCCD, a novel cross-lingual adaptation method that can detect cloned code in a new language without annotations in that language. AdaCCD leverages language-agnostic code representations from pre-trained programming language models and proposes an Adaptively Refined Contrastive Learning framework to transfer knowledge from resource-rich languages to resource-poor languages.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages. AdaCCD achieves significant improvements over other baselines, and achieve comparable performance to supervised fine-tuning. \ No newline at end of file diff --git a/data/2024/aaai/AdaFormer: Efficient Transformer with Adaptive Token Sparsification for Image Super-resolution b/data/2024/aaai/AdaFormer: Efficient Transformer with Adaptive Token Sparsification for Image Super-resolution new file mode 100644 index 0000000000..8b78070942 --- /dev/null +++ b/data/2024/aaai/AdaFormer: Efficient Transformer with Adaptive Token Sparsification for Image Super-resolution @@ -0,0 +1 @@ +Efficient transformer-based models have made remarkable progress in image super-resolution (SR). Most of these works mainly design elaborate structures to accelerate the inference of the transformer, where all feature tokens are propagated equally. However, they ignore the underlying characteristic of image content, i.e., various image regions have distinct restoration difficulties, especially for large images (2K-8K), failing to achieve adaptive inference. In this work, we propose an adaptive token sparsification transformer (AdaFormer) to speed up the model inference for image SR. Specifically, a texture-relevant sparse attention block with parallel global and local branches is introduced, aiming to integrate informative tokens from the global view instead of only in fixed local windows. Then, an early-exit strategy is designed to progressively halt tokens according to the token importance. To estimate the plausibility of each token, we adopt a lightweight confidence estimator, which is constrained by an uncertainty-guided loss to obtain a binary halting mask about the tokens. Experiments on large images have illustrated that our proposal reduces nearly 90% latency against SwinIR on Test8K, while maintaining a comparable performance. \ No newline at end of file diff --git a/data/2024/aaai/AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing b/data/2024/aaai/AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing new file mode 100644 index 0000000000..b8a9a7a713 --- /dev/null +++ b/data/2024/aaai/AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing @@ -0,0 +1 @@ +With the great success of text-conditioned diffusion models in creative text-to-image generation, various text-driven image editing approaches have attracted the attentions of many researchers. However, previous works mainly focus on discreteness-sensitive instructions such as adding, removing or replacing specific objects, background elements or global styles (i.e., “hard editing”), while generally ignoring subject-binding but semantically fine-changing continuity-sensitive instructions such as actions, poses or adjectives, and so on (i.e., “soft editing”), which hampers generative AI from generating user-customized visual contents. To mitigate this predicament, we propose a spatio-temporal guided adaptive editing algorithm AdapEdit, which realizes adaptive image editing by introducing a soft-attention strategy to dynamically vary the guiding degree from the editing conditions to visual pixels from both temporal and spatial perspectives. 
Note our approach has a significant advantage in preserving model priors and does not require model training, fine-tuning, extra data, or optimization. We present our results over a wide variety of raw images and editing instructions, demonstrating competitive performance and showing it significantly outperforms the previous approaches. Code is available: https://github.com/AnonymousPony/adap-edit. \ No newline at end of file diff --git a/data/2024/aaai/Adapted Weighted Aggregation in Federated Learning b/data/2024/aaai/Adapted Weighted Aggregation in Federated Learning new file mode 100644 index 0000000000..37a24e350e --- /dev/null +++ b/data/2024/aaai/Adapted Weighted Aggregation in Federated Learning @@ -0,0 +1,2 @@ +This study introduces FedAW, a novel federated learning algorithm that uses a weighted aggregation mechanism sensitive to the quality of client datasets, leading to better model +performance and faster convergence on diverse datasets, validated using Colored MNIST. \ No newline at end of file diff --git a/data/2024/aaai/AdapterGNN: Parameter-Efficient Fine-Tuning Improves Generalization in GNNs b/data/2024/aaai/AdapterGNN: Parameter-Efficient Fine-Tuning Improves Generalization in GNNs new file mode 100644 index 0000000000..95d0fedc1c --- /dev/null +++ b/data/2024/aaai/AdapterGNN: Parameter-Efficient Fine-Tuning Improves Generalization in GNNs @@ -0,0 +1 @@ +Fine-tuning pre-trained models has recently yielded remarkable performance gains in graph neural networks (GNNs). In addition to pre-training techniques, inspired by the latest work in the natural language fields, more recent work has shifted towards applying effective fine-tuning approaches, such as parameter-efficient fine-tuning (PEFT). However, given the substantial differences between GNNs and transformer-based models, applying such approaches directly to GNNs proved to be less effective. In this paper, we present a comprehensive comparison of PEFT techniques for GNNs and propose a novel PEFT method specifically designed for GNNs, called AdapterGNN. AdapterGNN preserves the knowledge of the large pre-trained model and leverages highly expressive adapters for GNNs, which can adapt to downstream tasks effectively with only a few parameters, while also improving the model's generalization ability. Extensive experiments show that AdapterGNN achieves higher performance than other PEFT methods and is the only one consistently surpassing full fine-tuning (outperforming it by 1.6% and 5.7% in the chemistry and biology domains respectively, with only 5% and 4% of its parameters tuned) with lower generalization gaps. Moreover, we empirically show that a larger GNN model can have a worse generalization ability, which differs from the trend observed in large transformer-based models. Building upon this, we provide a theoretical justification for PEFT can improve generalization of GNNs by applying generalization bounds. Our code is available at https://github.com/Lucius-lsr/AdapterGNN. 
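The AdapterGNN abstract above describes inserting lightweight adapters into a frozen pre-trained GNN so that only a few parameters are tuned downstream. The sketch below shows the generic bottleneck-adapter pattern (down-projection, nonlinearity, up-projection, residual) wrapped around a stand-in message-passing layer; the exact adapter design and placement used by AdapterGNN may differ, so treat the names and shapes as illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class Adapter:
    """Generic bottleneck adapter: down-project -> ReLU -> up-project, plus a residual.
    This is the standard PEFT adapter pattern, not necessarily AdapterGNN's exact module."""
    def __init__(self, dim, bottleneck):
        self.W_down = rng.standard_normal((dim, bottleneck)) * 0.01
        self.W_up = np.zeros((bottleneck, dim))   # zero-init so the adapter starts as identity

    def __call__(self, h):
        return h + np.maximum(h @ self.W_down, 0.0) @ self.W_up

def frozen_gnn_layer(h, adj):
    # Stand-in for a pre-trained (frozen) message-passing layer: mean-aggregate neighbours.
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    return (adj @ h) / deg

# Toy graph with 4 nodes and 8-dim features; only the adapters would be trained downstream.
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
h = rng.standard_normal((4, 8))
adapters = [Adapter(8, 2) for _ in range(2)]
for adapter in adapters:                 # an adapter is inserted after each frozen GNN layer
    h = adapter(frozen_gnn_layer(h, adj))
print(h.shape)  # (4, 8)
```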
\ No newline at end of file diff --git a/data/2024/aaai/Adapting Animal Models to Assess Sufficiency of Fluid Resuscitation in Humans (Student Abstract) b/data/2024/aaai/Adapting Animal Models to Assess Sufficiency of Fluid Resuscitation in Humans (Student Abstract) new file mode 100644 index 0000000000..7d7f24ba3d --- /dev/null +++ b/data/2024/aaai/Adapting Animal Models to Assess Sufficiency of Fluid Resuscitation in Humans (Student Abstract) @@ -0,0 +1 @@ +Fluid resuscitation is an initial treatment frequently employed to treat shock, restore lost blood, protect tissues from injury, and prevent organ dysfunction in critically ill patients. However, it is not without risk (e.g., overly aggressive resuscitation may cause organ damage and even death). We leverage machine learning models trained to assess sufficiency of resuscitation in laboratory animals subjected to induced hemorrhage and transfer them to use with human trauma patients. Our key takeaway is that animal experiments and models can inform human healthcare, especially when human data is limited or when collecting relevant human data via potentially harmful protocols is unfeasible. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Discovering and Merging for Incremental Novel Class Discovery b/data/2024/aaai/Adaptive Discovering and Merging for Incremental Novel Class Discovery new file mode 100644 index 0000000000..467e1d58a0 --- /dev/null +++ b/data/2024/aaai/Adaptive Discovering and Merging for Incremental Novel Class Discovery @@ -0,0 +1 @@ +One important desideratum of lifelong learning aims to discover novel classes from unlabelled data in a continuous manner. The central challenge is twofold: discovering and learning novel classes while mitigating the issue of catastrophic forgetting of established knowledge. To this end, we introduce a new paradigm called Adaptive Discovering and Merging (ADM) to discover novel categories adaptively in the incremental stage and integrate novel knowledge into the model without affecting the original knowledge. To discover novel classes adaptively, we decouple representation learning and novel class discovery, and use Triple Comparison (TC) and Probability Regularization (PR) to constrain the probability discrepancy and diversity for adaptive category assignment. To merge the learned novel knowledge adaptively, we propose a hybrid structure with base and novel branches named Adaptive Model Merging (AMM), which reduces the interference of the novel branch on the old classes to preserve the previous knowledge, and merges the novel branch to the base model without performance loss and parameter growth. Extensive experiments on several datasets show that ADM significantly outperforms existing class-incremental Novel Class Discovery (class-iNCD) approaches. Moreover, our AMM also benefits the class-incremental Learning (class-IL) task by alleviating the catastrophic forgetting problem. The source code is included in the supplementary materials. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive FSS: A Novel Few-Shot Segmentation Framework via Prototype Enhancement b/data/2024/aaai/Adaptive FSS: A Novel Few-Shot Segmentation Framework via Prototype Enhancement new file mode 100644 index 0000000000..6c0b918004 --- /dev/null +++ b/data/2024/aaai/Adaptive FSS: A Novel Few-Shot Segmentation Framework via Prototype Enhancement @@ -0,0 +1 @@ +The Few-Shot Segmentation (FSS) aims to accomplish the novel class segmentation task with a few annotated images. 
Current FSS research based on meta-learning focuses on designing a complex interaction mechanism between the query and support feature. However, unlike humans who can rapidly learn new things from limited samples, the existing approach relies solely on fixed feature matching to tackle new tasks, lacking adaptability. In this paper, we propose a novel framework based on the adapter mechanism, namely Adaptive FSS, which can efficiently adapt the existing FSS model to the novel classes. In detail, we design the Prototype Adaptive Module (PAM), which utilizes accurate category information provided by the support set to derive class prototypes, enhancing class-specific information in the multi-stage representation. In addition, our approach is compatible with diverse FSS methods with different backbones by simply inserting PAM between the layers of the encoder. Experiments demonstrate that our method effectively improves the performance of the FSS models (e.g., MSANet, HDMNet, FPTrans, and DCAMA) and achieves new state-of-the-art (SOTA) results (i.e., 72.4% and 79.1% mIoU on PASCAL-5i 1-shot and 5-shot settings, 52.7% and 60.0% mIoU on COCO-20i 1-shot and 5-shot settings). Our code is available at https://github.com/jingw193/AdaptiveFSS. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Feature Imputation with Latent Graph for Deep Incomplete Multi-View Clustering b/data/2024/aaai/Adaptive Feature Imputation with Latent Graph for Deep Incomplete Multi-View Clustering new file mode 100644 index 0000000000..4965569c7d --- /dev/null +++ b/data/2024/aaai/Adaptive Feature Imputation with Latent Graph for Deep Incomplete Multi-View Clustering @@ -0,0 +1 @@ +In recent years, incomplete multi-view clustering (IMVC), which studies the challenging multi-view clustering problem on missing views, has received growing research interests. Previous IMVC methods suffer from the following issues: (1) the inaccurate imputation for missing data, which leads to suboptimal clustering performance, and (2) most existing IMVC models merely consider the explicit presence of graph structure in data, ignoring the fact that latent graphs of different views also provide valuable information for the clustering task. To overcome such challenges, we present a novel method, termed Adaptive feature imputation with latent graph for incomplete multi-view clustering (AGDIMC). Specifically, it captures the embbedded features of each view by incorporating the view-specific deep encoders. Then, we construct partial latent graphs on complete data, which can consolidate the intrinsic relationships within each view while preserving the topological information. With the aim of estimating the missing sample based on the available information, we utilize an adaptive imputation layer to impute the embedded feature of missing data by using cross-view soft cluster assignments and global cluster centroids. As the imputation progresses, the portion of complete data increases, contributing to enhancing the discriminative information contained in global pseudo-labels. Meanwhile, to alleviate the negative impact caused by inferior impute samples and the discrepancy of cluster structures, we further design an adaptive imputation strategy based on the global pseudo-label and the local cluster assignment. Experimental results on multiple real-world datasets demonstrate the effectiveness of our method over existing approaches. 
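The AGDIMC abstract above imputes a missing view's embedded feature from cross-view soft cluster assignments and global cluster centroids. A minimal, hypothetical sketch of that idea is given below: the soft assignment computed in an observed view is used to mix the missing view's centroids into an imputed feature. The paper's actual adaptive imputation layer and strategy are more involved, and all names here are illustrative.

```python
import numpy as np

def soft_assign(z, centroids, temperature=1.0):
    """Soft cluster assignment from distances to global centroids (softmax over -distance)."""
    d = ((z[None, :] - centroids) ** 2).sum(axis=1)
    logits = -d / temperature
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def impute_missing_view(z_observed, centroids_observed, centroids_missing):
    """Use the assignment from an observed view to mix the missing view's centroids."""
    q = soft_assign(z_observed, centroids_observed)
    return q @ centroids_missing          # convex combination of the missing view's centroids

rng = np.random.default_rng(0)
centroids_v1 = rng.standard_normal((3, 4))   # 3 shared clusters, 4-dim embeddings in view 1
centroids_v2 = rng.standard_normal((3, 6))   # same clusters, 6-dim embeddings in view 2
z_v1 = centroids_v1[1] + 0.05 * rng.standard_normal(4)   # sample observed only in view 1
print(impute_missing_view(z_v1, centroids_v1, centroids_v2).shape)  # (6,) imputed view-2 feature
```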
\ No newline at end of file diff --git a/data/2024/aaai/Adaptive Graph Learning for Multimodal Conversational Emotion Detection b/data/2024/aaai/Adaptive Graph Learning for Multimodal Conversational Emotion Detection new file mode 100644 index 0000000000..5dc3dec37c --- /dev/null +++ b/data/2024/aaai/Adaptive Graph Learning for Multimodal Conversational Emotion Detection @@ -0,0 +1 @@ +Multimodal Emotion Recognition in Conversations (ERC) aims to identify the emotions conveyed by each utterance in a conversational video. Current efforts encounter challenges in balancing intra- and inter-speaker context dependencies when tackling intra-modal interactions. This balance is vital as it encompasses modeling self-dependency (emotional inertia) where speakers' own emotions affect them and modeling interpersonal dependencies (empathy) where counterparts' emotions influence a speaker. Furthermore, challenges arise in addressing cross-modal interactions that involve content with conflicting emotions across different modalities. To address this issue, we introduce an adaptive interactive graph network (IGN) called AdaIGN that employs the Gumbel Softmax trick to adaptively select nodes and edges, enhancing intra- and cross-modal interactions. Unlike undirected graphs, we use a directed IGN to prevent future utterances from impacting the current one. Next, we propose Node- and Edge-level Selection Policies (NESP) to guide node and edge selection, along with a Graph-Level Selection Policy (GSP) to integrate the utterance representation from original IGN and NESP-enhanced IGN. Moreover, we design a task-specific loss function that prioritizes text modality and intra-speaker context selection. To reduce computational complexity, we use pre-defined pseudo labels through self-supervised methods to mask unnecessary utterance nodes for selection. Experimental results show that AdaIGN outperforms state-of-the-art methods on two popular datasets. Our code will be available at https://github.com/TuGengs/AdaIGN. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Hardness Negative Sampling for Collaborative Filtering b/data/2024/aaai/Adaptive Hardness Negative Sampling for Collaborative Filtering new file mode 100644 index 0000000000..f2c101c4b6 --- /dev/null +++ b/data/2024/aaai/Adaptive Hardness Negative Sampling for Collaborative Filtering @@ -0,0 +1 @@ +Negative sampling is essential for implicit collaborative filtering to provide proper negative training signals so as to achieve desirable performance. We experimentally unveil a common limitation of all existing negative sampling methods that they can only select negative samples of a fixed hardness level, leading to the false positive problem (FPP) and false negative problem (FNP). We then propose a new paradigm called adaptive hardness negative sampling (AHNS) and discuss its three key criteria. By adaptively selecting negative samples with appropriate hardnesses during the training process, AHNS can well mitigate the impacts of FPP and FNP. 
Next, we present a concrete instantiation of AHNS called AHNS_{p \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Integration of Partial Label Learning and Negative Learning for Enhanced Noisy Label Learning b/data/2024/aaai/Adaptive Integration of Partial Label Learning and Negative Learning for Enhanced Noisy Label Learning new file mode 100644 index 0000000000..a5a59f2052 --- /dev/null +++ b/data/2024/aaai/Adaptive Integration of Partial Label Learning and Negative Learning for Enhanced Noisy Label Learning @@ -0,0 +1 @@ +There has been significant attention devoted to the effectiveness of various domains, such as semi-supervised learning, contrastive learning, and meta-learning, in enhancing the performance of methods for noisy label learning (NLL) tasks. However, most existing methods still depend on prior assumptions regarding clean samples amidst different sources of noise (e.g., a pre-defined drop rate or a small subset of clean samples). In this paper, we propose a simple yet powerful idea called NPN, which revolutionizes Noisy label learning by integrating Partial label learning (PLL) and Negative learning (NL). Toward this goal, we initially decompose the given label space adaptively into the candidate and complementary labels, thereby establishing the conditions for PLL and NL. We propose two adaptive data-driven paradigms of label disambiguation for PLL: hard disambiguation and soft disambiguation. Furthermore, we generate reliable complementary labels using all non-candidate labels for NL to enhance model robustness through indirect supervision. To maintain label reliability during the later stage of model training, we introduce a consistency regularization term that encourages agreement between the outputs of multiple augmentations. Experiments conducted on both synthetically corrupted and real-world noisy datasets demonstrate the superiority of NPN compared to other state-of-the-art (SOTA) methods. The source code has been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/NPN. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Meta-Learning Probabilistic Inference Framework for Long Sequence Prediction b/data/2024/aaai/Adaptive Meta-Learning Probabilistic Inference Framework for Long Sequence Prediction new file mode 100644 index 0000000000..ccbb47b2ca --- /dev/null +++ b/data/2024/aaai/Adaptive Meta-Learning Probabilistic Inference Framework for Long Sequence Prediction @@ -0,0 +1 @@ +Long sequence prediction has broad and significant application value in fields such as finance, wind power, and weather. However, the complex long-term dependencies of long sequence data and the potential domain shift problems limit the effectiveness of traditional models in practical scenarios. To this end, we propose an Adaptive Meta-Learning Probabilistic Inference Framework (AMPIF) based on sequence decomposition, which can effectively enhance the long sequence prediction ability of various basic models. Specifically, first, we decouple complex sequences into seasonal and trend components through a frequency domain decomposition module. Then, we design an adaptive meta-learning task construction strategy, which divides the seasonal and trend components into different tasks through a clustering-matching approach. 
Finally, we design a dual-stream amortized network (ST-DAN) to capture shared information between seasonal-trend tasks and use the support set to generate task-specific parameters for rapid generalization learning on the query set. We conducted extensive experiments on six datasets, including wind power and finance scenarios, and the results show that our method significantly outperforms baseline methods in prediction accuracy, interpretability, and algorithm stability and can effectively enhance the long sequence prediction capabilities of base models. The source code is publicly available at https://github.com/Zhu-JP/AMPIF. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Prompt Routing for Arbitrary Text Style Transfer with Pre-trained Language Models b/data/2024/aaai/Adaptive Prompt Routing for Arbitrary Text Style Transfer with Pre-trained Language Models new file mode 100644 index 0000000000..e31b7b4e36 --- /dev/null +++ b/data/2024/aaai/Adaptive Prompt Routing for Arbitrary Text Style Transfer with Pre-trained Language Models @@ -0,0 +1 @@ +Recently, arbitrary text style transfer (TST) has made significant progress with the paradigm of prompt learning. In this paradigm, researchers often design or search for a fixed prompt for any input. However, existing evidence shows that large language models (LLMs) are prompt-sensitive and it is sub-optimal to apply the same prompt to any input for downstream TST tasks. Besides, the prompts obtained by searching are often unreadable and unexplainable to humans. To address these issues, we propose an Adaptive Prompt Routing (APR) framework to adaptively route prompts from a human-readable prompt set for various input texts and given styles. Specifically, we first construct a candidate prompt set of diverse and human-readable prompts for the target style. This set consists of several seed prompts and their variants paraphrased by an LLM. Subsequently, we train a prompt routing model to select the optimal prompts efficiently according to inputs. The adaptively selected prompt can guide the LLMs to perform a precise style transfer for each input sentence while maintaining readability for humans. Extensive experiments on 4 public TST benchmarks over 3 popular LLMs (with parameter sizes ranging from 1.5B to 175B) demonstrate that our APR achieves superior style transfer performances, compared to the state-of-the-art prompt-based and fine-tuning methods. The source code is available at https://github.com/DwyaneLQY/APR \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Reactive Synthesis for LTL and LTLf Modulo Theories b/data/2024/aaai/Adaptive Reactive Synthesis for LTL and LTLf Modulo Theories new file mode 100644 index 0000000000..0a52ec1721 --- /dev/null +++ b/data/2024/aaai/Adaptive Reactive Synthesis for LTL and LTLf Modulo Theories @@ -0,0 +1 @@ +Reactive synthesis is the process of generating correct controllers from temporal logic specifications. Typically, synthesis is restricted to Boolean specifications in LTL. Recently, a Boolean abstraction technique has made it possible to translate LTLT specifications that contain literals in theories into equi-realizable LTL specifications, but no full synthesis procedure exists yet. In synthesis modulo theories, the system receives valuations of environment variables (from a first-order theory T) and outputs valuations of system variables from T.
In this paper, we address how to synthesize a full controller by combining the static Boolean controller obtained from the Booleanized LTL specification with on-the-fly queries to a solver that produces models of satisfiable existential T formulae. This is the first synthesis method for LTL modulo theories. Additionally, our method can produce adaptive responses, which increases explainability and can improve runtime properties like performance. Our approach is applicable to both LTL modulo theories and LTLf modulo theories. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Shortcut Debiasing for Online Continual Learning b/data/2024/aaai/Adaptive Shortcut Debiasing for Online Continual Learning new file mode 100644 index 0000000000..1103f558e5 --- /dev/null +++ b/data/2024/aaai/Adaptive Shortcut Debiasing for Online Continual Learning @@ -0,0 +1 @@ +We propose a novel framework DropTop that suppresses the shortcut bias in online continual learning (OCL) while being adaptive to the varying degree of the shortcut bias incurred by a continuously changing environment. Based on the observed high-attention property of the shortcut bias, highly-activated features are considered candidates for debiasing. More importantly, resolving the limitation of the online environment where prior knowledge and auxiliary data are not readily available, two novel techniques---feature map fusion and adaptive intensity shifting---enable us to automatically determine the appropriate level and proportion of the candidate shortcut features to be dropped. Extensive experiments on five benchmark datasets demonstrate that, when combined with various OCL algorithms, DropTop increases the average accuracy by up to 10.4% and decreases the forgetting by up to 63.2%. \ No newline at end of file diff --git a/data/2024/aaai/Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval b/data/2024/aaai/Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval new file mode 100644 index 0000000000..73c45905df --- /dev/null +++ b/data/2024/aaai/Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval @@ -0,0 +1 @@ +Text-based person retrieval aims at retrieving a specific pedestrian image from a gallery based on textual descriptions. The primary challenge is how to overcome the inherent heterogeneous modality gap in the situation of significant intra-class variation and minimal inter-class variation. Existing approaches commonly employ vision-language pre-training or attention mechanisms to learn appropriate cross-modal alignments from noisy inputs. Despite commendable progress, current methods inevitably suffer from two defects: 1) Matching ambiguity, which mainly derives from unreliable matching pairs; 2) One-sided cross-modal alignments, stemming from the absence of exploring one-to-many correspondence, i.e., coarse-grained semantic alignment. These critical issues significantly deteriorate retrieval performance. To this end, we propose a novel framework termed Adaptive Uncertainty-based Learning (AUL) for text-based person retrieval from the uncertainty perspective.
Specifically, our AUL framework consists of three key components: 1) Uncertainty-aware Matching Filtration that leverages Subjective Logic to effectively mitigate the disturbance of unreliable matching pairs and select high-confidence cross-modal matches for training; 2) Uncertainty-based Alignment Refinement, which not only simulates coarse-grained alignments by constructing uncertainty representations but also performs progressive learning to incorporate coarse- and fine-grained alignments properly; 3) Cross-modal Masked Modeling that aims at exploring more comprehensive relations between vision and language. Extensive experiments demonstrate that our AUL method consistently achieves state-of-the-art performance on three benchmark datasets in supervised, weakly supervised, and domain generalization settings. Our code is available at https://github.com/CFM-MSG/Code-AUL. \ No newline at end of file diff --git a/data/2024/aaai/Addressing Digital and AI Skills Gaps in European Living Areas: A Comparative Analysis of Small and Large Communities b/data/2024/aaai/Addressing Digital and AI Skills Gaps in European Living Areas: A Comparative Analysis of Small and Large Communities new file mode 100644 index 0000000000..1743a0fa99 --- /dev/null +++ b/data/2024/aaai/Addressing Digital and AI Skills Gaps in European Living Areas: A Comparative Analysis of Small and Large Communities @@ -0,0 +1 @@ +As Artificial Intelligence (AI) continues to permeate various aspects of societies, understanding the disparities in AI knowledge and skills across different living areas becomes imperative. Small living areas have emerged as significant contributors to Europe's economy, offering an alternative to the bustling environment of larger cities for those seeking an improved quality of life. Nonetheless, they often encounter challenges related to digital infrastructure, access to financial resources, and digital skills gaps, limiting their economic and social growth prospects. This study investigates the digital and AI skills gaps in the context of small and large European living areas, shedding light on the potential hindrances to unleashing the full economic and social potentials of these regions in an AI-enabled economy. Drawing from a comprehensive dataset encompassing 4,006 respondents across eight EU countries, this research examines the current perceptions and understandings of AI and digital skills within two distinct population groups: residents of smaller living areas and their counterparts in larger communities. Through bivariate analysis, notable insights are revealed concerning trust in AI solutions and entities, self-assessed digital skills, AI Awareness, AI Attitudes, and demographic variables in both population groups. These insights point to the significance of addressing digital and AI skills gaps in fostering growth and preparedness for the AI-driven future. As AI becomes increasingly integral to various aspects of society, targeted interventions and policies are essential to bridge these gaps and enable individuals and communities to harness the transformative potential of AI-enabled economies.
\ No newline at end of file diff --git a/data/2024/aaai/Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model b/data/2024/aaai/Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model new file mode 100644 index 0000000000..969d7cd8f6 --- /dev/null +++ b/data/2024/aaai/Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model @@ -0,0 +1 @@ +Adversarial attacks involve adding perturbations to the source image to cause misclassification by the target model, which demonstrates the potential of attacking face recognition models. Existing adversarial face image generation methods still cannot achieve satisfactory performance because of low transferability and high detectability. In this paper, we propose a unified framework Adv-Diffusion that can generate imperceptible adversarial identity perturbations in the latent space but not the raw pixel space, which utilizes strong inpainting capabilities of the latent diffusion model to generate realistic adversarial images. Specifically, we propose the identity-sensitive conditioned diffusion generative model to generate semantic perturbations in the surroundings. The designed adaptive strength-based adversarial perturbation algorithm can ensure both attack transferability and stealthiness. Extensive qualitative and quantitative experiments on the public FFHQ and CelebA-HQ datasets prove the proposed method achieves superior performance compared with the state-of-the-art methods without an extra generative model training process. The source code is available at https://github.com/kopper-xdu/Adv-Diffusion. \ No newline at end of file diff --git a/data/2024/aaai/AdvST: Revisiting Data Augmentations for Single Domain Generalization b/data/2024/aaai/AdvST: Revisiting Data Augmentations for Single Domain Generalization new file mode 100644 index 0000000000..4a2308b34c --- /dev/null +++ b/data/2024/aaai/AdvST: Revisiting Data Augmentations for Single Domain Generalization @@ -0,0 +1 @@ +Single domain generalization (SDG) aims to train a robust model against unknown target domain shifts using data from a single source domain. Data augmentation has been proven an effective approach to SDG. However, the utility of standard augmentations, such as translation or inversion, has not been fully exploited in SDG; practically, these augmentations are used as a part of a data preprocessing procedure. Although it is intuitive to use many such augmentations to boost the robustness of a model to out-of-distribution domain shifts, we lack a principled approach to harvest the benefit brought by multiple such augmentations. Here, we conceptualize standard data augmentations with learnable parameters as semantics transformations that can manipulate certain semantics of a sample, such as the geometry or color of an image. Then, we propose Adversarial learning with Semantics Transformations (AdvST) that augments the source domain data with semantics transformations and learns a robust model with the augmented data. We theoretically show that AdvST essentially optimizes a distributionally robust optimization objective defined on a set of semantics distributions induced by the parameters of semantics transformations. We demonstrate that AdvST can produce samples that expand the coverage on target domain data.
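(The following is a schematic sketch of the adversarial-augmentation pattern just described, written in PyTorch style; the transformation interface, loss, and step sizes are illustrative assumptions rather than AdvST's actual implementation.)

    # Schematic min-max sketch: adapt learnable augmentation parameters to be
    # adversarial, then train the model on the augmented batch. Illustrative only.
    import torch

    def adversarial_augment(model, x, y, transform, omega, ascent_steps=5, lr_omega=0.1):
        # Inner maximization over the semantics-transformation parameters omega.
        omega = omega.clone().requires_grad_(True)
        for _ in range(ascent_steps):
            loss = torch.nn.functional.cross_entropy(model(transform(x, omega)), y)
            (grad,) = torch.autograd.grad(loss, omega)
            omega = (omega + lr_omega * grad).detach().requires_grad_(True)
        return transform(x, omega).detach()

    def training_step(model, optimizer, x, y, transform, omega_init):
        # Outer minimization: fit the model on the adversarially augmented samples.
        x_aug = adversarial_augment(model, x, y, transform, omega_init)
        loss = torch.nn.functional.cross_entropy(model(x_aug), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()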
Compared with the state-of-the-art methods, AdvST, despite being a simple method, is surprisingly competitive and achieves the best average SDG performance on the Digits, PACS, and DomainNet datasets. Our code is available at https://github.com/gtzheng/AdvST. \ No newline at end of file diff --git a/data/2024/aaai/Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark b/data/2024/aaai/Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark new file mode 100644 index 0000000000..a46bb3fa7a --- /dev/null +++ b/data/2024/aaai/Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark @@ -0,0 +1 @@ +Artificial intelligence (AI) has made remarkable progress across various domains, with large language models like ChatGPT gaining substantial attention for their human-like text-generation capabilities. Despite these achievements, improving spatial reasoning remains a significant challenge for these models. Benchmarks like StepGame evaluate AI spatial reasoning, where ChatGPT has shown unsatisfactory performance. However, the presence of template errors in the benchmark has an impact on the evaluation results. Thus there is potential for ChatGPT to perform better if these template errors are addressed, leading to more accurate assessments of its spatial reasoning capabilities. In this study, we refine the StepGame benchmark, providing a more accurate dataset for model evaluation. We analyze GPT’s spatial reasoning performance on the rectified benchmark, identifying proficiency in mapping natural language text to spatial relations but limitations in multi-hop reasoning. We provide a flawless solution to the benchmark by combining template-to-relation mapping with logic-based reasoning. This combination demonstrates proficiency in performing qualitative reasoning on StepGame without encountering any errors. We then address the limitations of GPT models in spatial reasoning. To improve spatial reasoning, we deploy Chain-of-Thought and Tree-of-thoughts prompting strategies, offering insights into GPT’s cognitive process. Our investigation not only sheds light on model deficiencies but also proposes enhancements, contributing to the advancement of AI with more robust spatial reasoning capabilities. \ No newline at end of file diff --git a/data/2024/aaai/Advancing Video Synchronization with Fractional Frame Analysis: Introducing a Novel Dataset and Model b/data/2024/aaai/Advancing Video Synchronization with Fractional Frame Analysis: Introducing a Novel Dataset and Model new file mode 100644 index 0000000000..3735dfc773 --- /dev/null +++ b/data/2024/aaai/Advancing Video Synchronization with Fractional Frame Analysis: Introducing a Novel Dataset and Model @@ -0,0 +1 @@ +Multiple views play a vital role in 3D pose estimation tasks. Ideally, multi-view 3D pose estimation tasks should directly utilize naturally collected videos for pose estimation. However, due to the constraints of video synchronization, existing methods often use expensive hardware devices to synchronize the initiation of cameras, which restricts most 3D pose collection scenarios to indoor settings. Some recent works learn deep neural networks to align desynchronized datasets derived from synchronized cameras and can only produce frame-level accuracy. 
For fractional frame video synchronization, this work proposes an Inter-Frame and Intra-Frame Desynchronized Dataset (IFID), which labels fractional time intervals between two video clips. IFID is the first dataset that annotates inter-frame and intra-frame intervals, with a total of 382,500 video clips annotated, making it the largest dataset to date. We also develop a novel model based on the Transformer architecture, named InSynFormer, for inter-frame and intra-frame synchronization. Extensive experimental evaluations demonstrate its promising performance. The dataset and source code of the model are available at https://github.com/yuxuan-cser/InSynFormer. \ No newline at end of file diff --git a/data/2024/aaai/Adversarial Attacks on Federated-Learned Adaptive Bitrate Algorithms b/data/2024/aaai/Adversarial Attacks on Federated-Learned Adaptive Bitrate Algorithms new file mode 100644 index 0000000000..3bd9139b2f --- /dev/null +++ b/data/2024/aaai/Adversarial Attacks on Federated-Learned Adaptive Bitrate Algorithms @@ -0,0 +1 @@ +Learning-based adaptive bitrate (ABR) algorithms have revolutionized video streaming solutions. With the growing demand for data privacy and the rapid development of mobile devices, federated learning (FL) has emerged as a popular training method for neural ABR algorithms in both academia and industry. However, we have discovered that FL-based ABR models are vulnerable to model-poisoning attacks as local updates remain unseen during global aggregation. In response, we propose MAFL (Malicious ABR model based on Federated Learning) to prove that backdooring the learning-based ABR model via FL is practical. Instead of attacking the global policy, MAFL only targets a single "target client". Moreover, the unique challenges brought by deep reinforcement learning (DRL) make the attack even more challenging. To address these challenges, MAFL is designed with a two-stage attacking mechanism. Using two representative attack cases with real-world traces, we show that MAFL significantly degrades the model performance on the target client (i.e., increasing rebuffering penalty by 2x and 5x) with a minimal negative impact on benign clients. \ No newline at end of file diff --git a/data/2024/aaai/Adversarial Attacks on the Interpretation of Neuron Activation Maximization b/data/2024/aaai/Adversarial Attacks on the Interpretation of Neuron Activation Maximization new file mode 100644 index 0000000000..6256e049ad --- /dev/null +++ b/data/2024/aaai/Adversarial Attacks on the Interpretation of Neuron Activation Maximization @@ -0,0 +1 @@ +Feature visualization is one of the most popular techniques used to interpret the internal behavior of individual units of trained deep neural networks. Based on activation maximization, it consists of finding synthetic or natural inputs that maximize neuron activations. This paper introduces an optimization framework that aims to deceive feature visualization through adversarial model manipulation. It consists of finetuning a pre-trained model with a specifically introduced loss that aims to maintain model performance, while also significantly changing feature visualization. We provide evidence of the success of this manipulation on several pre-trained models for the classification task with ImageNet.
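(One plausible instantiation of such a combined fine-tuning objective is sketched below; it is illustrative and not the paper's exact loss. Here unit_activation and ref_img are assumed helpers: the scalar activation of the targeted unit, and a fixed image that originally maximized it, e.g., its feature visualization.)

    # Illustrative combined objective: preserve task accuracy while suppressing the
    # targeted unit's response to its original maximizing image. Not the paper's loss.
    import torch

    def manipulation_loss(model, x, y, ref_img, unit_activation, lam=1.0):
        task_loss = torch.nn.functional.cross_entropy(model(x), y)   # keep classification performance
        vis_term = unit_activation(model, ref_img)                   # activation on the old visualization image
        return task_loss + lam * vis_term                            # trade off utility preservation vs. manipulation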
\ No newline at end of file diff --git a/data/2024/aaai/Adversarial Fairness Network b/data/2024/aaai/Adversarial Fairness Network new file mode 100644 index 0000000000..3f59ced2e8 --- /dev/null +++ b/data/2024/aaai/Adversarial Fairness Network @@ -0,0 +1 @@ +Fairness is a rising concern in machine learning. Recent research has discovered that state-of-the-art models are amplifying social bias by making biased predictions toward some population groups (characterized by sensitive features like race or gender). Such unfair predictions across groups raise trust issues and ethical concerns in machine learning, especially for sensitive fields such as employment, criminal justice, and trust score assessment. In this paper, we introduce a new framework to improve machine learning fairness. The goal of our model is to minimize the influence of the sensitive feature from the perspectives of both the data input and the predictive model. To achieve this goal, we reformulate the data input by eliminating the sensitive information and strengthen model fairness by minimizing the marginal contribution of the sensitive feature. We propose to learn the sensitive-irrelevant input via sampling among features and design an adversarial network to minimize the dependence between the reformulated input and the sensitive information. Empirical results validate that our model achieves comparable or better results than related state-of-the-art methods w.r.t. both fairness metrics and prediction performance. \ No newline at end of file diff --git a/data/2024/aaai/Adversarial Initialization with Universal Adversarial Perturbation: A New Approach to Fast Adversarial Training b/data/2024/aaai/Adversarial Initialization with Universal Adversarial Perturbation: A New Approach to Fast Adversarial Training new file mode 100644 index 0000000000..7f989495a9 --- /dev/null +++ b/data/2024/aaai/Adversarial Initialization with Universal Adversarial Perturbation: A New Approach to Fast Adversarial Training @@ -0,0 +1 @@ +Traditional adversarial training, while effective at improving machine learning model robustness, is computationally intensive. Fast Adversarial Training (FAT) addresses this by using a single-step attack to generate adversarial examples more efficiently. Nonetheless, FAT is susceptible to a phenomenon known as catastrophic overfitting, wherein the model's adversarial robustness abruptly collapses to zero during the training phase. To address this challenge, recent studies have suggested adopting adversarial initialization with Fast Gradient Sign Method Adversarial Training (FGSM-AT), which recycles adversarial perturbations from prior epochs by computing gradient momentum. However, our research has uncovered a flaw in this approach. Given that data augmentation is employed during the training phase, the samples in each epoch are not identical. Consequently, the method essentially yields not the adversarial perturbation of a singular sample, but rather the Universal Adversarial Perturbation (UAP) of a sample and its data augmentation. This insight has led us to explore the potential of using UAPs for adversarial initialization within the context of FGSM-AT. We have devised various strategies for adversarial initialization utilizing UAPs, including single, class-based, and feature-based UAPs. Experiments conducted on three distinct datasets demonstrate that our method achieves an improved trade-off among robustness, computational cost, and memory footprint. Code is available at https://github.com/fzjcdt/fgsm-uap.
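(A minimal single-UAP variant of this idea is sketched below; the paper's exact update rule and its class-based and feature-based variants are not reproduced, and the step sizes are illustrative assumptions.)

    # Minimal sketch of UAP-initialized single-step adversarial training. The running
    # universal perturbation uap (shape 1 x C x H x W) initializes each FGSM step and
    # is refreshed with momentum from the latest perturbation. Illustrative only.
    import torch

    def fgsm_uap_step(model, optimizer, x, y, uap, eps=8/255, momentum=0.9):
        delta = uap.clone().detach().requires_grad_(True)             # adversarial initialization with the UAP
        loss = torch.nn.functional.cross_entropy(model(x + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        delta = torch.clamp(delta + eps * grad.sign(), -eps, eps)     # one FGSM step
        adv_loss = torch.nn.functional.cross_entropy(model((x + delta).detach()), y)
        optimizer.zero_grad()
        adv_loss.backward()
        optimizer.step()
        new_uap = torch.clamp(momentum * uap + (1 - momentum) * delta.detach(), -eps, eps)
        return new_uap, adv_loss.item()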
\ No newline at end of file diff --git a/data/2024/aaai/Adversarial Purification with the Manifold Hypothesis b/data/2024/aaai/Adversarial Purification with the Manifold Hypothesis new file mode 100644 index 0000000000..3a662335f5 --- /dev/null +++ b/data/2024/aaai/Adversarial Purification with the Manifold Hypothesis @@ -0,0 +1 @@ +In this work, we formulate a novel framework for adversarial robustness using the manifold hypothesis. This framework provides sufficient conditions for defending against adversarial examples. We develop an adversarial purification method with this framework. Our method combines manifold learning with variational inference to provide adversarial robustness without the need for expensive adversarial training. Experimentally, our approach can provide adversarial robustness even if attackers are aware of the existence of the defense. In addition, our method can also serve as a test-time defense mechanism for variational autoencoders. \ No newline at end of file diff --git a/data/2024/aaai/Adversarial Socialbots Modeling Based on Structural Information Principles b/data/2024/aaai/Adversarial Socialbots Modeling Based on Structural Information Principles new file mode 100644 index 0000000000..cfa36a9139 --- /dev/null +++ b/data/2024/aaai/Adversarial Socialbots Modeling Based on Structural Information Principles @@ -0,0 +1 @@ +The importance of effective detection is underscored by the fact that socialbots imitate human behavior to propagate misinformation, leading to an ongoing competition between socialbots and detectors. Despite the rapid advancement of reactive detectors, the exploration of adversarial socialbot modeling remains incomplete, significantly hindering the development of proactive detectors. To address this issue, we propose a mathematical Structural Information principles-based Adversarial Socialbots Modeling framework, namely SIASM, to enable more accurate and effective modeling of adversarial behaviors. First, a heterogeneous graph is presented to integrate various users and rich activities in the original social network and measure its dynamic uncertainty as structural entropy. By minimizing the high-dimensional structural entropy, a hierarchical community structure of the social network is generated and referred to as the optimal encoding tree. Secondly, a novel method is designed to quantify influence by utilizing the assigned structural entropy, which helps reduce the computational cost of SIASM by filtering out uninfluential users. Besides, a new conditional structural entropy is defined between the socialbot and other users to guide the follower selection for network influence maximization. Extensive and comparative experiments on both homogeneous and heterogeneous social networks demonstrate that, compared with state-of-the-art baselines, the proposed SIASM framework yields substantial performance improvements in terms of network influence (up to 16.32%) and sustainable stealthiness (up to 16.29%) when evaluated against a robust detector with 90% accuracy. 
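(For reference, the structural entropy referred to above is presumably the standard quantity from structural information theory; with an encoding tree \mathcal{T} of a graph G whose root is \lambda, one common definition is

    \[
    H^{\mathcal{T}}(G) \;=\; -\sum_{\alpha \in \mathcal{T},\ \alpha \neq \lambda}
    \frac{g_\alpha}{\mathrm{vol}(G)} \log_2 \frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)},
    \]

where \alpha^- is the parent of node \alpha, g_\alpha is the total weight of edges with exactly one endpoint inside the module of \alpha, and \mathrm{vol}(\cdot) sums vertex degrees. Minimizing H^{\mathcal{T}}(G) over encoding trees then yields the hierarchical community structure, i.e., the optimal encoding tree mentioned above; the notation here is ours, not necessarily the paper's.)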
\ No newline at end of file diff --git a/data/2024/aaai/Adversarially Balanced Representation for Continuous Treatment Effect Estimation b/data/2024/aaai/Adversarially Balanced Representation for Continuous Treatment Effect Estimation new file mode 100644 index 0000000000..3e6a84e42f --- /dev/null +++ b/data/2024/aaai/Adversarially Balanced Representation for Continuous Treatment Effect Estimation @@ -0,0 +1,3 @@ +Individual treatment effect (ITE) estimation requires adjusting for the covariate shift between populations with different treatments, and deep representation learning has shown great promise in learning a balanced representation of covariates. However, the existing methods mostly consider the scenario of binary treatments. In this paper, we consider the more practical and challenging scenario in which the treatment is a continuous variable (e.g., dosage of a medication), and we address the two main challenges of this setup. We propose the adversarial counterfactual regression network (ACFR) that adversarially minimizes the representation imbalance in terms of KL divergence, and also maintains the impact of the treatment value on the outcome prediction by leveraging an attention mechanism. +Theoretically, we demonstrate that the ACFR objective function is grounded in an upper bound on the counterfactual outcome prediction error. +Our experimental evaluation on semi-synthetic datasets demonstrates the empirical superiority of ACFR over a range of state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/AesFA: An Aesthetic Feature-Aware Arbitrary Neural Style Transfer b/data/2024/aaai/AesFA: An Aesthetic Feature-Aware Arbitrary Neural Style Transfer new file mode 100644 index 0000000000..b946ed0f58 --- /dev/null +++ b/data/2024/aaai/AesFA: An Aesthetic Feature-Aware Arbitrary Neural Style Transfer @@ -0,0 +1 @@ +Neural style transfer (NST) has evolved significantly in recent years. Yet, despite its rapid progress and advancement, existing NST methods either struggle to transfer aesthetic information from a style effectively or suffer from high computational costs and inefficiencies in feature disentanglement due to using pre-trained models. This work proposes a lightweight but effective model, AesFA---Aesthetic Feature-Aware NST. The primary idea is to decompose the image via its frequencies to better disentangle aesthetic styles from the reference image while training the entire model in an end-to-end manner to exclude pre-trained models at inference completely. To improve the network's ability to extract more distinct representations and further enhance the stylization quality, this work introduces a new aesthetic feature: contrastive loss. Extensive experiments and ablations show the approach not only outperforms recent NST methods in terms of stylization quality, but it also achieves faster inference. Codes are available at https://github.com/Sooyyoungg/AesFA. \ No newline at end of file diff --git a/data/2024/aaai/Agile Multi-Source-Free Domain Adaptation b/data/2024/aaai/Agile Multi-Source-Free Domain Adaptation new file mode 100644 index 0000000000..3c0611173e --- /dev/null +++ b/data/2024/aaai/Agile Multi-Source-Free Domain Adaptation @@ -0,0 +1 @@ +Efficiently utilizing rich knowledge in pretrained models has become a critical topic in the era of large models. This work focuses on adaptively transferring knowledge from multiple source-pretrained models to an unlabeled target domain without accessing the source data.
Despite being a practically useful setting, existing methods require extensive parameter tuning over each source model, which is computationally expensive when facing abundant source domains or larger source models. To address this challenge, we propose a novel approach which is free of the parameter tuning over source backbones. Our technical contribution lies in the Bi-level ATtention ENsemble (Bi-ATEN) module, which learns both intra-domain weights and inter-domain ensemble weights to achieve a fine balance between instance specificity and domain consistency. By slightly tuning source bottlenecks, we achieve comparable or even superior performance on the challenging DomainNet benchmark with less than 3% of the trained parameters and 8 times the throughput of the SOTA method. Furthermore, with minor modifications, the proposed module can be easily incorporated into existing methods to gain a performance boost of more than 4%. Code is available at https://github.com/TL-UESTC/Bi-ATEN. \ No newline at end of file diff --git a/data/2024/aaai/Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound b/data/2024/aaai/Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound new file mode 100644 index 0000000000..947818a2e7 --- /dev/null +++ b/data/2024/aaai/Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound @@ -0,0 +1 @@ +In this paper, we study the mistake bound of online kernel learning on a budget. We propose a new budgeted online kernel learning model, called Ahpatron, which significantly improves the mistake bound of previous work and resolves an open problem related to upper bounds of hypothesis space constraints. We first present an aggressive variant of Perceptron, named AVP, a model without a budget, which uses an active updating rule. Then we design a new budget maintenance mechanism, which removes half of the examples and projects the removed examples onto a hypothesis space spanned by the remaining examples. Ahpatron adopts the above mechanism to approximate AVP. Theoretical analyses prove that Ahpatron has tighter mistake bounds, and experimental results show that Ahpatron outperforms the state-of-the-art algorithms on the same or a smaller budget. \ No newline at end of file diff --git a/data/2024/aaai/Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption b/data/2024/aaai/Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption new file mode 100644 index 0000000000..4e7e59054d --- /dev/null +++ b/data/2024/aaai/Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption @@ -0,0 +1 @@ +The standard Neural Radiance Fields (NeRF) paradigm employs a viewer-centered methodology, entangling the aspects of illumination and material reflectance into emission solely from 3D points. This simplified rendering approach presents challenges in accurately modeling images captured under adverse lighting conditions, such as low light or over-exposure. Motivated by the ancient Greek emission theory that posits visual perception as a result of rays emanating from the eyes, we slightly refine the conventional NeRF framework to train NeRF under challenging light conditions and generate normal-light-condition novel views in an unsupervised manner. We introduce the concept of a "Concealing Field," which assigns transmittance values to the surrounding air to account for illumination effects.
In dark scenarios, we assume that object emissions maintain a standard lighting level but are attenuated as they traverse the air during the rendering process. The Concealing Field thus compels NeRF to learn reasonable density and colour estimations for objects even in dimly lit situations. Similarly, the Concealing Field can mitigate over-exposed emissions during the rendering stage. Furthermore, we present a comprehensive multi-view dataset captured under challenging illumination conditions for evaluation. Our code and proposed dataset are available at https://github.com/cuiziteng/Aleth-NeRF. \ No newline at end of file diff --git a/data/2024/aaai/Algorithmic Foundation of Federated Learning with Sequential Data b/data/2024/aaai/Algorithmic Foundation of Federated Learning with Sequential Data new file mode 100644 index 0000000000..cb8b073e24 --- /dev/null +++ b/data/2024/aaai/Algorithmic Foundation of Federated Learning with Sequential Data @@ -0,0 +1,4 @@ +The current analysis of federated optimization algorithms for training deep neural networks assumes that the data is non-sequential (e.g., images), which incurs a smooth loss objective. In contrast, edge devices generate lots of sequential data every day, where these sequences exhibit significant sequential correlation at different time stamps (e.g., text messages). In order to learn from such sequential data, people typically use a class of neural networks that is inherently nonsmooth, with a potentially unbounded smoothness parameter. Examples include recurrent neural networks, long short-term memory networks, and transformers. It remains unclear how to design provably efficient algorithms for training these neural networks to learn from sequential data. My goal is to lay the algorithmic foundation of federated learning with sequential data, which contributes novel algorithms for learning from a range of real-world sequential data (e.g., natural language, electronic health records, transportation, time series, etc.) using state-of-the-art deep neural networks. + + +In this talk, I will first motivate the problem by showing that the transformer, which is widely used for sequential data learning, has a loss landscape with unbounded smoothness. Then, I will introduce provably efficient federated deep learning algorithms in the presence of unbounded smoothness. In particular, I will introduce a few efficient algorithms for various settings of federated learning, including homogeneous data, heterogeneous data, and partial client participation. The main result is twofold. First, we show that the designed algorithms provably achieve small computational and communication complexities. Second, we establish fundamental hardness results in the unbounded smoothness setting. Ultimately, I will discuss the future challenges of extending our research framework from small-scale neural networks to large language models.
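(A common formalization of the "unbounded smoothness" referred to in this talk abstract is the relaxed (L_0, L_1)-smoothness condition; the talk may use a variant, so this is given only for orientation:

    \[
    \|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \,\|\nabla f(x)\|,
    \]

i.e., the local smoothness constant may grow with the gradient norm instead of being bounded by a fixed L, which is the regime in which standard analyses of federated optimization break down.)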
\ No newline at end of file diff --git "a/data/2024/aaai/Aligner\302\262: Enhancing Joint Multiple Intent Detection and Slot Filling via Adjustive and Forced Cross-Task Alignment" "b/data/2024/aaai/Aligner\302\262: Enhancing Joint Multiple Intent Detection and Slot Filling via Adjustive and Forced Cross-Task Alignment" new file mode 100644 index 0000000000..105effe8e7 --- /dev/null +++ "b/data/2024/aaai/Aligner\302\262: Enhancing Joint Multiple Intent Detection and Slot Filling via Adjustive and Forced Cross-Task Alignment" @@ -0,0 +1 @@ +Multi-intent spoken language understanding (SLU) has garnered growing attention due to its ability to handle multiple intent utterances, which closely mirrors practical scenarios. Unlike traditional SLU, each intent in multi-intent SLU corresponds to its designated scope for slots, which occurs in certain fragments within the utterance. As a result, establishing precise scope alignment to mitigate noise impact emerges as a key challenge in multi-intent SLU. More seriously, existing methods lack alignment between the predictions of the two sub-tasks due to task-independent decoding, resulting in a limitation on the overall performance. To address these challenges, we propose a novel framework termed Aligner² for multi-intent SLU, which contains an Adjustive Cross-task Aligner (ACA) and a Forced Cross-task Aligner (FCA). ACA utilizes the information conveyed by joint label embeddings to accurately align the scope of intent and corresponding slots, before the interaction of the two subtasks. FCA introduces reinforcement learning to enforce the alignment of the task-specific hidden states after the interaction, which is explicitly guided by the prediction. Extensive experiments on two public multi-intent SLU datasets demonstrate the superiority of our Aligner² over state-of-the-art methods. More encouragingly, the proposed method Aligner² can be easily integrated into existing multi-intent SLU frameworks to further boost performance. \ No newline at end of file diff --git a/data/2024/aaai/Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination b/data/2024/aaai/Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination new file mode 100644 index 0000000000..43f0866b79 --- /dev/null +++ b/data/2024/aaai/Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination @@ -0,0 +1 @@ +Cross-view geo-localization holds significant potential for various applications, but drastic differences in viewpoints and visual appearances between cross-view images make this task extremely challenging. Recent works have made notable progress in cross-view geo-localization. However, existing methods either ignore the correspondence between geometric spatial layout in cross-view images or require high costs or strict constraints to achieve such alignment. In response to these challenges, we propose a Feature Recombination Module (FRM) that explicitly establishes the geometric spatial layout correspondences between two views. Unlike existing methods, FRM aligns geometric spatial layout by directly recombining features, avoiding image preprocessing, and introducing no additional computational and parameter costs. This effectively reduces ambiguities caused by geometric misalignments between ground-level and aerial-level images. Furthermore, it is not sensitive to frameworks and applies to both CNN-based and Transformer-based architectures.
Additionally, as part of the training procedure, we also introduce a novel weighted (B+1)-tuple loss (WBL) as the optimization objective. Compared to the widely used weighted soft margin ranking loss, this innovative loss enhances convergence speed and final performance. Based on the two core components (FRM and WBL), we develop an end-to-end network architecture (FRGeo) to address these limitations from a different perspective. Extensive experiments show that our proposed FRGeo not only achieves state-of-the-art performance on cross-view geo-localization benchmarks, including CVUSA, CVACT, and VIGOR, but is also significantly superior or competitive in terms of computational complexity and trainable parameters. Our project homepage is at https://zqwlearning.github.io/FRGeo. \ No newline at end of file diff --git a/data/2024/aaai/All Beings Are Equal in Open Set Recognition b/data/2024/aaai/All Beings Are Equal in Open Set Recognition new file mode 100644 index 0000000000..668591463f --- /dev/null +++ b/data/2024/aaai/All Beings Are Equal in Open Set Recognition @@ -0,0 +1 @@ +In open-set recognition (OSR), a promising strategy is exploiting pseudo-unknown data outside the given K known classes as an additional K+1-th class to explicitly model potential open space. However, treating unknown classes without distinction is unequal for them relative to known classes due to the category-agnostic and scale-agnostic nature of the unknowns. This inevitably not only disrupts the inherent distributions of unknown classes but also incurs both class-wise and instance-wise imbalances between known and unknown classes. Ideally, the OSR problem should model the whole class space as K+∞, but enumerating all unknowns is impractical. Since the core of OSR is to effectively model the boundaries of known classes, this means just focusing on the unknowns nearing the boundaries of targeted known classes seems sufficient. Thus, as a compromise, we convert the open classes from infinite to K, with a novel concept, Target-Aware Universum (TAU), and propose a simple yet effective framework Dual Contrastive Learning with Target-Aware Universum (DCTAU). In detail, guided by the targeted known classes, TAU automatically expands the unknown classes from the previous 1 to K, effectively alleviating the distribution disruption and the imbalance issues mentioned above. Then, a novel Dual Contrastive (DC) loss is designed, where all instances, whether known or TAU, are considered positives to contrast with their respective negatives. Experimental results indicate DCTAU sets a new state-of-the-art. \ No newline at end of file diff --git a/data/2024/aaai/All Should Be Equal in the Eyes of LMs: Counterfactually Aware Fair Text Generation b/data/2024/aaai/All Should Be Equal in the Eyes of LMs: Counterfactually Aware Fair Text Generation new file mode 100644 index 0000000000..625569a95b --- /dev/null +++ b/data/2024/aaai/All Should Be Equal in the Eyes of LMs: Counterfactually Aware Fair Text Generation @@ -0,0 +1 @@ +Fairness in Language Models (LMs) remains a long-standing challenge, given the inherent biases in training data that can be perpetuated by models and affect the downstream tasks. Recent methods employ expensive retraining or attempt debiasing during inference by constraining model outputs to contrast from a reference set of biased templates/exemplars. Regardless, they do not address the primary goal of fairness to maintain equitability across different demographic groups.
In this work, we posit that, for an LM to generate unbiased output for one demographic under a given context, it must be aware of its outputs for other demographics under the same context. To this end, we propose Counterfactually Aware Fair InferencE (CAFIE), a framework that dynamically compares the model’s understanding of diverse demographics to generate more equitable sentences. We conduct an extensive empirical evaluation using base LMs of varying sizes and across three diverse datasets and find that CAFIE outperforms strong baselines. CAFIE produces fairer text and strikes the best balance between fairness and language modeling capability. \ No newline at end of file diff --git a/data/2024/aaai/All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models b/data/2024/aaai/All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models new file mode 100644 index 0000000000..6a757c1de8 --- /dev/null +++ b/data/2024/aaai/All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models @@ -0,0 +1 @@ +Text-to-Image models such as Stable Diffusion have shown impressive image synthesis, thanks to the utilization of large-scale datasets. However, these datasets may contain sexually explicit, copyrighted, or undesirable content, which the model can then generate directly. Given that retraining these large models on individual concept deletion requests is infeasible, fine-tuning algorithms have been developed to tackle concept erasing in diffusion models. While these algorithms yield good concept erasure, they all present one of the following issues: 1) the corrupted feature space yields synthesis of disintegrated objects, 2) the initially synthesized content undergoes a divergence in both spatial structure and semantics in the generated images, and 3) sub-optimal training updates heighten the model's susceptibility to utility harm. These issues severely degrade the original utility of generative models. In this work, we present a new approach that solves all of these challenges. We take inspiration from the concept of classifier guidance and propose a surgical update on the classifier guidance term while constraining the drift of the unconditional score term. Furthermore, our algorithm empowers the user to select an alternative to the erasing concept, allowing for more controllability. Our experimental results show that our algorithm not only erases the target concept effectively but also preserves the model’s generation capability. \ No newline at end of file diff --git a/data/2024/aaai/AltDiffusion: A Multilingual Text-to-Image Diffusion Model b/data/2024/aaai/AltDiffusion: A Multilingual Text-to-Image Diffusion Model new file mode 100644 index 0000000000..6db0f0466e --- /dev/null +++ b/data/2024/aaai/AltDiffusion: A Multilingual Text-to-Image Diffusion Model @@ -0,0 +1 @@ +Large Text-to-Image (T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on knowledge distillation.
Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, including concept alignment and quality improvement stages on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes Multilingual-General-18 (MG-18) and Multilingual-Cultural-18 (MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints can be found at https://github.com/superhero-7/AltDiffuson. \ No newline at end of file diff --git a/data/2024/aaai/AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization b/data/2024/aaai/AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization new file mode 100644 index 0000000000..4218f36eac --- /dev/null +++ b/data/2024/aaai/AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF---a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistency-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality. \ No newline at end of file diff --git a/data/2024/aaai/Amalgamating Multi-Task Models with Heterogeneous Architectures b/data/2024/aaai/Amalgamating Multi-Task Models with Heterogeneous Architectures new file mode 100644 index 0000000000..49f204c5a8 --- /dev/null +++ b/data/2024/aaai/Amalgamating Multi-Task Models with Heterogeneous Architectures @@ -0,0 +1 @@ +Multi-task learning (MTL) is essential for real-world applications that handle multiple tasks simultaneously, such as self-driving cars. MTL methods improve the performance of all tasks by utilizing information across tasks to learn a robust shared representation. However, acquiring sufficient labeled data tends to be extremely expensive, especially when having to support many tasks. Recently, Knowledge Amalgamation (KA) has emerged as an effective strategy for addressing the lack of labels by instead learning directly from pretrained models (teachers).
KA learns one unified multi-task student that masters all tasks across all teachers. Existing KA works for MTL are limited to teachers with identical architectures, and thus propose layer-to-layer based approaches. Unfortunately, in practice, teachers may have heterogeneous architectures; their layers may not be aligned and their dimensionalities or scales may be incompatible. Amalgamating multi-task teachers with heterogeneous architectures remains an open problem. For this, we design Versatile Common Feature Consolidator (VENUS), the first solution to this problem. VENUS fuses knowledge from the shared representations of each teacher into one unified generalized representation for all tasks. Specifically, we design the Feature Consolidator network that leverages an array of teacher-specific trainable adaptors. These adaptors enable the student to learn from multiple teachers, even if they have incompatible learned representations. We demonstrate that VENUS outperforms five alternative methods on numerous benchmark datasets across a broad spectrum of experiments. \ No newline at end of file diff --git a/data/2024/aaai/Amodal Scene Analysis via Holistic Occlusion Relation Inference and Generative Mask Completion b/data/2024/aaai/Amodal Scene Analysis via Holistic Occlusion Relation Inference and Generative Mask Completion new file mode 100644 index 0000000000..f3c7df6a2b --- /dev/null +++ b/data/2024/aaai/Amodal Scene Analysis via Holistic Occlusion Relation Inference and Generative Mask Completion @@ -0,0 +1,4 @@ +Amodal scene analysis entails interpreting the occlusion relationship among scene elements and inferring the possible shapes of the invisible parts. Existing methods typically frame this task as an extended instance segmentation or a pair-wise object de-occlusion problem. In this work, we propose a new framework, which comprises a Holistic Occlusion Relation Inference (HORI) module followed by an instance-level Generative Mask Completion (GMC) module. + Unlike previous approaches, which rely on mask completion results for occlusion reasoning, our HORI module directly predicts an occlusion relation matrix in a single pass. This approach is much more efficient than the pair-wise de-occlusion process and it naturally handles mutual occlusion, a common but often neglected situation. + Moreover, we formulate the mask completion task as a generative process and use a diffusion-based GMC module for instance-level mask completion. This improves mask completion quality and provides multiple plausible solutions. + We further introduce a large-scale amodal segmentation dataset with high-quality human annotations, including mutual occlusions. Experiments on our dataset and two public benchmarks demonstrate the advantages of our method. Code is publicly available at https://github.com/zbwxp/Amodal-AAAI.
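(The abstract does not specify the HORI architecture; purely as an illustration of what predicting an N x N occlusion relation matrix in a single pass can look like, a generic pairwise relation head is sketched below.)

    # Generic single-pass pairwise occlusion-relation head (illustrative assumption,
    # not the actual HORI module): entry (i, j) is the logit that instance i occludes j.
    import torch

    class PairwiseOcclusionHead(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(2 * dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, 1)
            )

        def forward(self, feats):                      # feats: (N, dim) per-instance embeddings
            n = feats.size(0)
            a = feats.unsqueeze(1).expand(n, n, -1)    # occluder candidates (rows)
            b = feats.unsqueeze(0).expand(n, n, -1)    # occludee candidates (columns)
            return self.mlp(torch.cat([a, b], dim=-1)).squeeze(-1)   # (N, N) relation logits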
\ No newline at end of file diff --git a/data/2024/aaai/Amplifying Diversity and Quality in Commonsense Knowledge Graph Completion (Student Abstract) b/data/2024/aaai/Amplifying Diversity and Quality in Commonsense Knowledge Graph Completion (Student Abstract) new file mode 100644 index 0000000000..5fd1580479 --- /dev/null +++ b/data/2024/aaai/Amplifying Diversity and Quality in Commonsense Knowledge Graph Completion (Student Abstract) @@ -0,0 +1 @@ +Conventional commonsense knowledge graph completion (CKGC) methods provide inadequate sequences in the fine-tuning and generation stages and incorporate full fine-tuning, both of which fail to align with the autoregressive model's pre-training patterns and offer insufficient parameter efficiency. Moreover, decoding through beam or greedy search produces low diversity and high similarity in generated tail entities. Hence, we resort to prefix-tuning and propose a lightweight, effective pipeline to enhance the quality and diversity of extracted commonsense knowledge. Precisely, we measure head entity similarity to retrieve the top-k tuples and concatenate them before each target tuple for prefix-tuning the source LM, thereby improving the efficiency and speed for pretrained models; then, we design a penalty-tailored diverse beam search (p-DBS) for decoding tail entities, producing a greater quantity and diversity of generated commonsense tuples; besides, a filter strategy is utilized to filter out invalid commonsense knowledge. Through extensive automatic evaluations, including ChatGPT scoring, our method can extract diverse, novel, and accurate commonsense knowledge (CK). \ No newline at end of file diff --git a/data/2024/aaai/An Approximate Skolem Function Counter b/data/2024/aaai/An Approximate Skolem Function Counter new file mode 100644 index 0000000000..087b1db786 --- /dev/null +++ b/data/2024/aaai/An Approximate Skolem Function Counter @@ -0,0 +1,5 @@ +One approach to probabilistic inference involves counting the number of models of a given Boolean formula. Here, we are interested in inferences involving higher-order objects, i.e., functions. We study the following task: Given a Boolean specification between a set of inputs and outputs, count the number of functions of inputs such that the specification is met. Such functions are called Skolem functions. + +We are motivated by the recent development of scalable approaches to Boolean function synthesis. This stands in relation to our problem analogously to the relationship between Boolean satisfiability and the model counting problem. Yet, counting Skolem functions poses considerable new challenges. From the complexity-theoretic standpoint, counting Skolem functions is not only #P-hard; it is quite unlikely to have an FPRAS (Fully Polynomial Randomized Approximation Scheme) as the problem of synthesizing a Skolem function remains challenging, even given access to an NP oracle. + +The primary contribution of this work is the first algorithm, SkolemFC, that computes the number of Skolem functions. SkolemFC relies on technical connections between counting functions and propositional model counting: our algorithm makes a linear number of calls to an approximate model counter and computes an estimate of the number of Skolem functions with theoretical guarantees. Our prototype displays impressive scalability, handling benchmarks comparably to state-of-the-art Skolem function synthesis engines, even though counting all such functions ostensibly poses a greater challenge than synthesizing a single function.
\ No newline at end of file diff --git a/data/2024/aaai/An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention b/data/2024/aaai/An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention new file mode 100644 index 0000000000..995d88f6c6 --- /dev/null +++ b/data/2024/aaai/An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention @@ -0,0 +1 @@ +Sequential recommendation (SR) models based on Transformers have achieved remarkable successes. The self-attention mechanism of Transformers for computer vision and natural language processing suffers from the oversmoothing problem, i.e., the hidden representations of different tokens becoming similar to one another. In the SR domain, we, for the first time, show that the same problem occurs. We present pioneering investigations that reveal the low-pass filtering nature of self-attention in SR, which causes oversmoothing. To this end, we propose a novel method called Beyond Self-Attention for Sequential Recommendation (BSARec), which leverages the Fourier transform to i) inject an inductive bias by considering fine-grained sequential patterns and ii) integrate low- and high-frequency information to mitigate oversmoothing. Our discovery shows significant advancements in the SR domain and is expected to bridge the gap for existing Transformer-based SR models. We test our proposed approach through extensive experiments on 6 benchmark datasets. The experimental results demonstrate that our model outperforms 7 baseline methods in terms of recommendation performance. Our code is available at https://github.com/yehjin-shin/BSARec. \ No newline at end of file diff --git a/data/2024/aaai/An Autoregressive Text-to-Graph Framework for Joint Entity and Relation Extraction b/data/2024/aaai/An Autoregressive Text-to-Graph Framework for Joint Entity and Relation Extraction new file mode 100644 index 0000000000..b56e31117d --- /dev/null +++ b/data/2024/aaai/An Autoregressive Text-to-Graph Framework for Joint Entity and Relation Extraction @@ -0,0 +1 @@ +In this paper, we propose a novel method for joint entity and relation extraction from unstructured text by framing it as a conditional sequence generation problem. In contrast to conventional generative information extraction models that are left-to-right token-level generators, our approach is span-based. It generates a linearized graph where nodes represent text spans and edges represent relation triplets. Our method employs a transformer encoder-decoder architecture with a pointing mechanism on a dynamic vocabulary of spans and relation types. Our model can capture the structural characteristics and boundaries of entities and relations through span representations while simultaneously grounding the generated output in the original text thanks to the pointing mechanism. Evaluation on benchmark datasets validates the effectiveness of our approach, demonstrating competitive results. Code is available at https://github.com/urchade/ATG.
\ No newline at end of file diff --git a/data/2024/aaai/An Eager Satisfiability Modulo Theories Solver for Algebraic Datatypes b/data/2024/aaai/An Eager Satisfiability Modulo Theories Solver for Algebraic Datatypes new file mode 100644 index 0000000000..e6bb65dc18 --- /dev/null +++ b/data/2024/aaai/An Eager Satisfiability Modulo Theories Solver for Algebraic Datatypes @@ -0,0 +1 @@ +Algebraic data types (ADTs) are a construct classically found in functional programming languages that capture data structures like enumerated types, lists, and trees. In recent years, interest in ADTs has increased. For example, popular programming languages, like Python, have added support for ADTs. Automated reasoning about ADTs can be done using satisfiability modulo theories (SMT) solving, an extension of the Boolean satisfiability problem with first-order logic and associated background theories. Unfortunately, SMT solvers that support ADTs do not scale as state-of-the-art approaches all use variations of the same lazy approach. In this paper, we present an SMT solver that takes a fundamentally different approach, an eager approach. Specifically, our solver reduces ADT queries to a simpler logical theory, uninterpreted functions (UF), and then uses an existing solver on the reduced query. We prove the soundness and completeness of our approach and demonstrate that it outperforms the state of the art on existing benchmarks, as well as a new, more challenging benchmark set from the planning domain. \ No newline at end of file diff --git a/data/2024/aaai/An Effective Augmented Lagrangian Method for Fine-Grained Multi-View Optimization b/data/2024/aaai/An Effective Augmented Lagrangian Method for Fine-Grained Multi-View Optimization new file mode 100644 index 0000000000..0489f7f758 --- /dev/null +++ b/data/2024/aaai/An Effective Augmented Lagrangian Method for Fine-Grained Multi-View Optimization @@ -0,0 +1,2 @@ +The significance of multi-view learning in effectively mitigating the intricate intricacies entrenched within heterogeneous data has garnered substantial attention in recent years. Notwithstanding the favorable achievements showcased by recent strides in this area, a confluence of noteworthy challenges endures. To be specific, a majority of extant methodologies unceremoniously assign weights to data points view-wisely. This ineluctably disregards the intrinsic reality that disparate views confer diverse contributions to each individual sample, consequently neglecting the rich wellspring of sample-level structural insights harbored within the dataset. In this paper, we proposed an effective Augmented Lagrangian MethOd for fiNe-graineD (ALMOND) multi-view optimization. +This innovative approach scrutinizes the interplay among multiple views at the granularity of individual samples, thereby fostering the enhanced preservation of local structural coherence. The Augmented Lagrangian Method (ALM) is elaborately incorporated into our framework, which enables us to achieve an optimal solution without involving an inexplicable intermediate variable as previous methods do. Empirical experiments on multi-view clustering tasks across heterogeneous datasets serve to incontrovertibly showcase the effectiveness of our proposed methodology, corroborating its preeminence over incumbent state-of-the-art alternatives. 
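(For reference, augmented Lagrangian methods such as the ALM used above optimize, for an equality-constrained problem min f(x) s.t. h(x) = 0, the generic objective

    \[
    \mathcal{L}_\rho(x, \lambda) \;=\; f(x) + \lambda^{\top} h(x) + \frac{\rho}{2}\,\|h(x)\|_2^2,
    \]

alternating minimization over x with the multiplier update \lambda \leftarrow \lambda + \rho\, h(x); how the paper's fine-grained multi-view objective instantiates f and h is not specified in the abstract.)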
\ No newline at end of file diff --git a/data/2024/aaai/An Effective Polynomial Technique for Compiling Conditional Effects Away b/data/2024/aaai/An Effective Polynomial Technique for Compiling Conditional Effects Away new file mode 100644 index 0000000000..818a7da189 --- /dev/null +++ b/data/2024/aaai/An Effective Polynomial Technique for Compiling Conditional Effects Away @@ -0,0 +1 @@ +The paper introduces a novel polynomial compilation technique for the sound and complete removal of conditional effects in classical planning problems. Similar to Nebel's polynomial compilation of conditional effects, our solution also decomposes each action with conditional effects into several simpler actions. However, it does so more effectively by exploiting the actual structure of the given conditional effects. We characterise such a structure using a directed graph and leverage it to significantly reduce the number of additional atoms required, thereby shortening the size of valid plans. Our experimental analysis indicates that this approach enables the effective use of polynomial compilations, offering benefits in terms of modularity and reusability of existing planners. It also demonstrates that a compilation-based approach can be more efficient, either independently or in synergy with state-of-the-art optimal planners that directly support conditional effects. \ No newline at end of file diff --git a/data/2024/aaai/An Effectiveness Study of Teacher-Led AI Literacy Curriculum in K-12 Classrooms b/data/2024/aaai/An Effectiveness Study of Teacher-Led AI Literacy Curriculum in K-12 Classrooms new file mode 100644 index 0000000000..457c2f0dd9 --- /dev/null +++ b/data/2024/aaai/An Effectiveness Study of Teacher-Led AI Literacy Curriculum in K-12 Classrooms @@ -0,0 +1 @@ +Artificial intelligence (AI) has rapidly pervaded and reshaped almost all walks of life, but efforts to promote AI literacy in K-12 schools remain limited. There is a knowledge gap in how to prepare teachers to teach AI literacy in inclusive classrooms and how teacher-led classroom implementations can impact students. This paper reports a comparison study to investigate the effectiveness of an AI literacy curriculum when taught by classroom teachers. The experimental group included 89 middle school students who learned an AI literacy curriculum during regular school hours. The comparison group consisted of 69 students who did not learn the curriculum. Both groups completed the same pre and post-test. The results show that students in the experimental group developed a deeper understanding of AI concepts and more positive attitudes toward AI and its impact on future careers after the curriculum than those in the comparison group. This shows that the teacher-led classroom implementation successfully equipped students with a conceptual understanding of AI. Students achieved significant gains in recognizing how AI is relevant to their lives and felt empowered to thrive in the age of AI. Overall this study confirms the potential of preparing K-12 classroom teachers to offer AI education in classrooms in order to reach learners of diverse backgrounds and broaden participation in AI literacy education among young learners. 
\ No newline at end of file diff --git a/data/2024/aaai/An Efficient Knowledge Transfer Strategy for Spiking Neural Networks from Static to Event Domain b/data/2024/aaai/An Efficient Knowledge Transfer Strategy for Spiking Neural Networks from Static to Event Domain new file mode 100644 index 0000000000..106a9dc482 --- /dev/null +++ b/data/2024/aaai/An Efficient Knowledge Transfer Strategy for Spiking Neural Networks from Static to Event Domain @@ -0,0 +1,2 @@ +Spiking neural networks (SNNs) are rich in spatio-temporal dynamics and are suitable for processing event-based neuromorphic data. However, event-based datasets are usually less annotated than static datasets. This small data scale makes SNNs prone to overfitting and limits their performance. In order to improve the generalization ability of SNNs on event-based datasets, we use static images to assist SNN training on event data. In this paper, we first discuss the domain mismatch problem encountered when directly transferring networks trained on static datasets to event data. We argue that the inconsistency of feature distributions becomes a major factor hindering the effective transfer of knowledge from static images to event data. To address this problem, we propose solutions in terms of two aspects: feature distribution and training strategy. Firstly, we propose a knowledge transfer loss, which consists of domain alignment loss and spatio-temporal regularization. The domain alignment loss learns domain-invariant spatial features by reducing the marginal distribution distance between the static image and the event data. Spatio-temporal regularization provides dynamically learnable coefficients for domain alignment loss by using the output features of the event data at each time step as a regularization term. In addition, we propose a sliding training strategy, which gradually replaces static image inputs probabilistically with event data, resulting in smoother and more stable training for the network. We validate our method on neuromorphic datasets, including N-Caltech101, CEP-DVS, and N-Omniglot. The experimental results show that our proposed method achieves better performance on all datasets compared to the current state-of-the-art methods. +Code is available at https://github.com/Brain-Cog-Lab/Transfer-for-DVS. \ No newline at end of file diff --git a/data/2024/aaai/An Efficient Subgraph-Inferring Framework for Large-Scale Heterogeneous Graphs b/data/2024/aaai/An Efficient Subgraph-Inferring Framework for Large-Scale Heterogeneous Graphs new file mode 100644 index 0000000000..d6db6590a2 --- /dev/null +++ b/data/2024/aaai/An Efficient Subgraph-Inferring Framework for Large-Scale Heterogeneous Graphs @@ -0,0 +1 @@ +Heterogeneous Graph Neural Networks (HGNNs) play a vital role in advancing the field of graph representation learning by addressing the complexities arising from diverse data types and interconnected relationships in real-world scenarios. However, traditional HGNNs face challenges when applied to large-scale graphs due to the necessity of training or inferring on the entire graph. As the size of the heterogeneous graphs increases, the time and memory overhead required by these models escalates rapidly, even reaching unacceptable levels. To address this issue, in this paper, we present a novel framework named SubInfer, which conducts training and inference on subgraphs instead of the entire graph, hence efficiently handling large-scale heterogeneous graphs.
The proposed framework comprises three main steps: 1) partitioning the heterogeneous graph from multiple perspectives to preserve various semantic information, 2) completing the subgraphs to improve the convergence speed of subgraph training and the performance of subgraph inference, and 3) training and running inference with the HGNN model on distributed clusters to further reduce the time overhead. The framework is applicable to the vast majority of HGNN models. Experiments on five benchmark datasets demonstrate that SubInfer effectively optimizes the training and inference phases, delivering comparable performance to traditional HGNN models while significantly reducing time and memory overhead. \ No newline at end of file diff --git a/data/2024/aaai/An Embedding-Unleashing Video Polyp Segmentation Framework via Region Linking and Scale Alignment b/data/2024/aaai/An Embedding-Unleashing Video Polyp Segmentation Framework via Region Linking and Scale Alignment new file mode 100644 index 0000000000..309c908bfd --- /dev/null +++ b/data/2024/aaai/An Embedding-Unleashing Video Polyp Segmentation Framework via Region Linking and Scale Alignment @@ -0,0 +1 @@ +Automatic polyp segmentation from colonoscopy videos is a critical task for the development of computer-aided screening and diagnosis systems. However, accurate and real-time video polyp segmentation (VPS) is a very challenging task due to low contrast between background and polyps and dramatic frame-to-frame variations in colonoscopy videos. We propose a novel embedding-unleashing framework consisting of a proposal-generative network (PGN) and an appearance-embedding network (AEN) to comprehensively address these challenges. Our framework, for the first time, models VPS as an appearance-level semantic embedding process to facilitate generating more global information to counteract background disturbances and dramatic variations. Specifically, PGN is a video segmentation network to obtain segmentation mask proposals, while AEN is a network we specially designed to produce appearance-level embedding semantics for PGN, thereby unleashing the capability of PGN in VPS. Our AEN consists of a cross-scale region linking (CRL) module and a cross-wise scale alignment (CSA) module. The former screens reliable background information against background disturbances by constructing links among region semantics, while the latter performs the scale alignment to resist dramatic variations by modeling the center-perceived motion dependence in a cross-wise manner. We further introduce a parameter-free semantic interaction to embed the semantics of AEN into PGN to obtain the segmentation results. Extensive experiments on CVC-612 and SUN-SEG demonstrate that our approach achieves better performance than other state-of-the-art methods. Codes are available at https://github.com/zhixue-fang/EUVPS. \ No newline at end of file diff --git a/data/2024/aaai/An Empirical Study of CLIP for Text-Based Person Search b/data/2024/aaai/An Empirical Study of CLIP for Text-Based Person Search new file mode 100644 index 0000000000..c260ecdf69 --- /dev/null +++ b/data/2024/aaai/An Empirical Study of CLIP for Text-Based Person Search @@ -0,0 +1 @@ +Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions.
Recently, Contrastive Language Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably on various cross-modal downstream tasks due to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, is also seeing a rise in research on CLIP-based TBPS. In order to explore the potential of the visual-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contribute a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss functions. The model, with the aforementioned designs and practical training tricks, can attain satisfactory performance without any sophisticated modules. Also, we conduct probing experiments on TBPS-CLIP in terms of model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and highlight future CLIP-based TBPS research. \ No newline at end of file diff --git a/data/2024/aaai/An Empirical Study of Distributed Deep Learning Training on Edge (Student Abstract) b/data/2024/aaai/An Empirical Study of Distributed Deep Learning Training on Edge (Student Abstract) new file mode 100644 index 0000000000..c04ecf0a47 --- /dev/null +++ b/data/2024/aaai/An Empirical Study of Distributed Deep Learning Training on Edge (Student Abstract) @@ -0,0 +1 @@ +Deep learning (DL), despite its success in various fields, remains expensive and inaccessible to many due to its need for powerful supercomputing and high-end GPUs. This study explores alternative computing infrastructure and methods for distributed DL on low-energy, low-cost devices. We experiment on Raspberry Pi 4 devices with ARM Cortex-A72 processors and train a ResNet-18 model on the CIFAR-10 dataset. Our findings reveal limitations and opportunities for future optimizations, paving the way for a DL toolset for low-energy edge devices. \ No newline at end of file diff --git a/data/2024/aaai/An Exercise in Tournament Design: When Some Matches Must Be Scheduled b/data/2024/aaai/An Exercise in Tournament Design: When Some Matches Must Be Scheduled new file mode 100644 index 0000000000..03f1012425 --- /dev/null +++ b/data/2024/aaai/An Exercise in Tournament Design: When Some Matches Must Be Scheduled @@ -0,0 +1 @@ +Single-elimination (SE) tournaments are a popular format used in competitive environments and decision making. Algorithms for SE tournament manipulation have been an active topic of research in recent years. In this paper, we initiate the algorithmic study of a novel variant of SE tournament manipulation that aims to model the fact that certain matchups are highly desired in a sporting context, incentivizing an organizer to manipulate the bracket to make such matchups take place. We obtain both hardness and tractability results. We show that while the problem of computing a bracket enforcing a given set of matches in an SE tournament is NP-hard, there are natural restrictions that lead to polynomial-time solvability. In particular, we show polynomial-time solvability if there is a linear ordering on the ability of players with only a constant number of exceptions where a player with lower ability beats a player with higher ability.
\ No newline at end of file diff --git a/data/2024/aaai/An Implicit Trust Region Approach to Behavior Regularized Offline Reinforcement Learning b/data/2024/aaai/An Implicit Trust Region Approach to Behavior Regularized Offline Reinforcement Learning new file mode 100644 index 0000000000..651b32ba0f --- /dev/null +++ b/data/2024/aaai/An Implicit Trust Region Approach to Behavior Regularized Offline Reinforcement Learning @@ -0,0 +1 @@ +We revisit behavior regularization, a popular approach to mitigate the extrapolation error in offline reinforcement learning (RL), showing that current behavior regularization may suffer from unstable learning and hinder policy improvement. Motivated by this, a novel reward shaping-based behavior regularization method is proposed, where the log-probability ratio between the learned policy and the behavior policy is monitored during learning. We show that this is equivalent to an implicit but computationally lightweight trust region mechanism, which is beneficial to mitigate the influence of estimation errors of the value function, leading to more stable performance improvement. Empirical results on the popular D4RL benchmark verify the effectiveness of the presented method with promising performance compared with some state-of-the-art offline RL algorithms. \ No newline at end of file diff --git a/data/2024/aaai/An Information-Flow Perspective on Algorithmic Fairness b/data/2024/aaai/An Information-Flow Perspective on Algorithmic Fairness new file mode 100644 index 0000000000..6d62ca7778 --- /dev/null +++ b/data/2024/aaai/An Information-Flow Perspective on Algorithmic Fairness @@ -0,0 +1,5 @@ +This work presents insights gained by investigating the relationship between algorithmic fairness and the concept of secure information flow. The problem of enforcing secure information flow is well-studied in the context of information security: If secret information may "flow" through an algorithm or program in such a way that it can influence the program’s output, then that is considered insecure information flow as attackers could potentially observe (parts of) the secret. + +There is a strong correspondence between secure information flow and algorithmic fairness: if protected attributes such as race, gender, or age are treated as secret program inputs, then secure information flow means that these "secret" attributes cannot influence the result of a program. While most research in algorithmic fairness evaluation concentrates on studying the impact of algorithms (often treating the algorithm as a black-box), the concepts derived from information flow can be used both for the analysis of disparate treatment as well as disparate impact w.r.t. a structural causal model. + +In this paper, we examine the relationship between quantitative as well as qualitative information-flow properties and fairness. Moreover, based on this duality, we derive a new quantitative notion of fairness called fairness spread, which can be easily analyzed using quantitative information flow and which strongly relates to counterfactual fairness. We demonstrate that off-the-shelf tools for information-flow properties can be used in order to formally analyze a program's algorithmic fairness properties, including the new notion of fairness spread as well as established notions such as demographic parity. 
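The implicit trust region abstract above describes monitoring the log-probability ratio between the learned policy and the behavior policy as a reward-shaping term. As a rough, hedged sketch of that general idea only (the shaping coefficient alpha, the clipping range, and the behavior-policy estimate below are assumptions for illustration, not the paper's exact formulation):

import numpy as np

def shaped_rewards(rewards, logp_pi, logp_behavior, alpha=0.1, ratio_clip=5.0):
    # rewards:       environment rewards from the offline dataset
    # logp_pi:       log pi_theta(a|s) under the learned policy
    # logp_behavior: log pi_beta(a|s) under (an estimate of) the behavior policy
    # Penalizing the log-ratio discourages actions the behavior policy would
    # rarely take; clipping it acts as a crude, implicit trust region.
    log_ratio = np.clip(logp_pi - logp_behavior, -ratio_clip, ratio_clip)
    return rewards - alpha * log_ratio

# Example: the further an action lies outside the data distribution,
# the more its shaped reward is reduced.
r = np.array([1.0, 1.0, 1.0])
print(shaped_rewards(r, logp_pi=np.array([-0.5, -0.5, -0.5]),
                     logp_behavior=np.array([-0.6, -2.0, -8.0])))
# -> [0.99 0.85 0.5 ]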
\ No newline at end of file diff --git a/data/2024/aaai/An Interpretable Approach to the Solutions of High-Dimensional Partial Differential Equations b/data/2024/aaai/An Interpretable Approach to the Solutions of High-Dimensional Partial Differential Equations new file mode 100644 index 0000000000..69a7acaad5 --- /dev/null +++ b/data/2024/aaai/An Interpretable Approach to the Solutions of High-Dimensional Partial Differential Equations @@ -0,0 +1 @@ +In recent years, machine learning algorithms, especially deep learning, have shown promising prospects in solving Partial Differential Equations (PDEs). However, as the dimension increases, the relationship and interaction between variables become more complex, and existing methods struggle to provide fast and interpretable solutions for high-dimensional PDEs. To address this issue, we propose a genetic programming symbolic regression algorithm based on transfer learning and automatic differentiation to solve PDEs. This method uses genetic programming to search for a mathematically understandable expression and combines automatic differentiation to determine whether the search result satisfies the PDE and boundary conditions to be solved. To overcome the problem of slow solution speed caused by the large search space, we propose a transfer learning mechanism that transfers the structure of one-dimensional PDE analytical solutions to the form of high-dimensional PDE solutions. We tested three representative types of PDEs, and the results showed that our proposed method can obtain reliable and human-understandable real solutions or algebraically equivalent solutions of PDEs, and converges faster than the compared methods. Code of this project is at https://github.com/grassdeerdeer/HD-TLGP. \ No newline at end of file diff --git a/data/2024/aaai/An Optimal Transport View for Subspace Clustering and Spectral Clustering b/data/2024/aaai/An Optimal Transport View for Subspace Clustering and Spectral Clustering new file mode 100644 index 0000000000..07e78a938d --- /dev/null +++ b/data/2024/aaai/An Optimal Transport View for Subspace Clustering and Spectral Clustering @@ -0,0 +1 @@ +Clustering is one of the most fundamental problems in machine learning and data mining, and many algorithms have been proposed in the past decades. Among them, subspace clustering and spectral clustering are the most famous approaches. In this paper, we provide an explanation for subspace clustering and spectral clustering from the perspective of optimal transport. Optimal transport studies how to move samples from one distribution to another distribution with minimal transport cost, and has shown a powerful ability to extract geometric information. By considering a self optimal transport model with only one group of samples, we observe that both subspace clustering and spectral clustering can be explained in the framework of optimal transport, and the optimal transport matrix bridges the spaces of features and spectral embeddings. Inspired by this connection, we propose a spectral optimal transport barycenter model, which learns spectral embeddings by solving a barycenter problem equipped with an optimal transport discrepancy and guidance of data. Based on our proposed model, we take advantage of optimal transport to exploit both feature and metric information involved in data for learning coupled spectral embeddings and affinity matrix in a unified model.
We develop an alternating optimization algorithm to solve the resultant problems, and conduct experiments in different settings to evaluate the performance of our proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/Analysis of Differentially Private Synthetic Data: A Measurement Error Approach b/data/2024/aaai/Analysis of Differentially Private Synthetic Data: A Measurement Error Approach new file mode 100644 index 0000000000..7cd1ee22cd --- /dev/null +++ b/data/2024/aaai/Analysis of Differentially Private Synthetic Data: A Measurement Error Approach @@ -0,0 +1 @@ +Differentially private (DP) synthetic datasets have been receiving significant attention from academia, industry, and government. However, little is known about how to perform statistical inference using DP synthetic datasets. Naive approaches that do not take into account the induced uncertainty due to the DP mechanism will result in biased estimators and invalid inferences. In this paper, we present a class of maximum likelihood estimator (MLE)-based easy-to-implement bias-corrected DP estimators with valid asymptotic confidence intervals (CI) for parameters in regression settings, by establishing the connection between additive DP mechanisms and measurement error models. Our simulation shows that our estimator has comparable performance to the widely used sufficient statistic perturbation (SSP) algorithm in some scenarios but with the advantage of releasing a synthetic dataset and obtaining statistically valid asymptotic CIs, which can achieve better coverage when compared to the naive CIs obtained by ignoring the DP mechanism. \ No newline at end of file diff --git a/data/2024/aaai/Analytically Tractable Models for Decision Making under Present Bias b/data/2024/aaai/Analytically Tractable Models for Decision Making under Present Bias new file mode 100644 index 0000000000..94364aeaf7 --- /dev/null +++ b/data/2024/aaai/Analytically Tractable Models for Decision Making under Present Bias @@ -0,0 +1 @@ +Time-inconsistency is a characteristic of human behavior in which people plan for long-term benefits but take actions that differ from the plan due to conflicts with short-term benefits. Such time-inconsistent behavior is believed to be caused by present bias, a tendency to overestimate immediate rewards and underestimate future rewards. It is essential in behavioral economics to investigate the relationship between present bias and time-inconsistency. In this paper, we propose a model for analyzing agent behavior with present bias in tasks to make progress toward a goal over a specific period. Unlike previous models, the state sequence of the agent can be described analytically in our model. Based on this property, we analyze three crucial problems related to agents under present bias: task abandonment, optimal goal setting, and optimal reward scheduling. Extensive analysis reveals how present bias affects the condition under which task abandonment occurs and optimal intervention strategies. Our findings are meaningful for preventing task abandonment and intervening through incentives in the real world. 
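The present-bias abstract above turns on agents overweighting immediate rewards. A common textbook way to illustrate the resulting time inconsistency is the quasi-hyperbolic (beta-delta) discounting model; the sketch below uses that standard form as an assumption for illustration, not necessarily the paper's own analytical model.

def present_biased_value(stream, beta=0.5, delta=1.0):
    # stream[t] is the reward received t steps from *now*; the immediate reward is
    # undiscounted, while every future reward is scaled by beta (present bias)
    # and by delta**t (standard exponential discounting).
    return stream[0] + sum(beta * (delta ** t) * r
                           for t, r in enumerate(stream[1:], start=1))

# Plan made at time 0: pay an effort cost of 6 tomorrow to earn 10 the day after.
print(present_biased_value([0, -6, 10]))   # 2.0  -> the agent plans to do the task
# Re-evaluated at time 1, the cost has become immediate:
print(present_biased_value([-6, 10]))      # -1.0 -> the agent abandons the task

The flip from a positive to a negative evaluation of the same plan is exactly the task-abandonment phenomenon the abstract analyzes.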
\ No newline at end of file diff --git a/data/2024/aaai/Analyzing Generalization in Policy Networks: A Case Study with the Double-Integrator System b/data/2024/aaai/Analyzing Generalization in Policy Networks: A Case Study with the Double-Integrator System new file mode 100644 index 0000000000..d7edacbddd --- /dev/null +++ b/data/2024/aaai/Analyzing Generalization in Policy Networks: A Case Study with the Double-Integrator System @@ -0,0 +1 @@ +Extensive utilization of deep reinforcement learning (DRL) policy networks in diverse continuous control tasks has raised questions regarding performance degradation in expansive state spaces where the input state norm is larger than that in the training environment. This paper aims to uncover the underlying factors contributing to such performance deterioration when dealing with expanded state spaces, using a novel analysis technique known as state division. In contrast to prior approaches that employ state division merely as a post-hoc explanatory tool, our methodology delves into the intrinsic characteristics of DRL policy networks. Specifically, we demonstrate that the expansion of the state space induces the activation function $\tanh$ to exhibit saturation, resulting in the transformation of the state division boundary from nonlinear to linear. Our analysis centers on the paradigm of the double-integrator system, revealing that this gradual shift towards linearity imparts a control behavior reminiscent of bang-bang control. However, the inherent linearity of the division boundary prevents the attainment of an ideal bang-bang control, thereby introducing unavoidable overshooting. Our experimental investigations, employing diverse RL algorithms, establish that this performance phenomenon stems from inherent attributes of the DRL policy network, remaining consistent across various optimization algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Anchoring Path for Inductive Relation Prediction in Knowledge Graphs b/data/2024/aaai/Anchoring Path for Inductive Relation Prediction in Knowledge Graphs new file mode 100644 index 0000000000..0e87c5eac3 --- /dev/null +++ b/data/2024/aaai/Anchoring Path for Inductive Relation Prediction in Knowledge Graphs @@ -0,0 +1 @@ +Aiming to accurately predict missing edges representing relations between entities, which are pervasive in real-world Knowledge Graphs (KGs), relation prediction plays a critical role in enhancing the comprehensiveness and utility of KGs. Recent research focuses on path-based methods due to their inductive and explainable properties. However, these methods face a great challenge when many reasoning paths do not form Closed Paths (CPs) in the KG. To address this challenge, we propose Anchoring Path Sentence Transformer (APST) by introducing Anchoring Paths (APs) to alleviate the reliance on CPs. Specifically, we develop a search-based description retrieval method to enrich entity descriptions and an assessment mechanism to evaluate the rationality of APs. APST takes both APs and CPs as the inputs of a unified Sentence Transformer architecture, enabling comprehensive predictions and high-quality explanations. We evaluate APST on three public datasets and achieve state-of-the-art (SOTA) performance in 30 of 36 transductive, inductive, and few-shot experimental settings.
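To make the Closed Path (CP) terminology in the APST abstract above concrete, the toy sketch below enumerates relation paths between two entities in a hand-made knowledge graph; when no such closed path exists, purely path-based reasoning has nothing to work with, which is the gap Anchoring Paths are meant to fill. The triples and helper function are invented for illustration and are not part of APST.

from collections import defaultdict

# Toy knowledge graph as (head, relation, tail) triples.
triples = [("alice", "works_at", "acme"),
           ("acme", "located_in", "paris"),
           ("bob", "born_in", "paris")]

adj = defaultdict(list)
for h, r, t in triples:
    adj[h].append((r, t))

def closed_paths(head, tail, max_len=3):
    # Enumerate relation paths (closed paths) from head to tail, up to max_len hops.
    paths, stack = [], [(head, [])]
    while stack:
        node, rels = stack.pop()
        if node == tail and rels:
            paths.append(rels)
            continue
        if len(rels) < max_len:
            for r, nxt in adj[node]:
                stack.append((nxt, rels + [r]))
    return paths

print(closed_paths("alice", "paris"))  # [['works_at', 'located_in']]
print(closed_paths("alice", "bob"))    # [] -- no closed path; the case APs are designed to handle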
\ No newline at end of file diff --git a/data/2024/aaai/Angle Robustness Unmanned Aerial Vehicle Navigation in GNSS-Denied Scenarios b/data/2024/aaai/Angle Robustness Unmanned Aerial Vehicle Navigation in GNSS-Denied Scenarios new file mode 100644 index 0000000000..ccc364ee4f --- /dev/null +++ b/data/2024/aaai/Angle Robustness Unmanned Aerial Vehicle Navigation in GNSS-Denied Scenarios @@ -0,0 +1 @@ +Due to the inability to receive signals from the Global Navigation Satellite System (GNSS) in extreme conditions, achieving accurate and robust navigation for Unmanned Aerial Vehicles (UAVs) is a challenging task. Recently, vision-based navigation has emerged as a promising and feasible alternative to GNSS-based navigation. However, existing vision-based techniques are inadequate in addressing flight deviation caused by environmental disturbances and inaccurate position predictions in practical settings. In this paper, we present a novel angle robustness navigation paradigm to deal with flight deviation in point-to-point navigation tasks. Additionally, we propose a model that includes the Adaptive Feature Enhance Module, Cross-knowledge Attention-guided Module and Robust Task-oriented Head Module to accurately predict direction angles for high-precision navigation. To evaluate the vision-based navigation methods, we collect a new dataset termed UAV_AR368. Furthermore, we design the Simulation Flight Testing Instrument (SFTI) using Google Earth to simulate different flight environments, thereby reducing the expenses associated with real flight testing. Experimental results demonstrate that the proposed model outperforms the state-of-the-art by achieving improvements of 26.0% and 45.6% in the success rate of arrival under ideal and disturbed circumstances, respectively. \ No newline at end of file diff --git a/data/2024/aaai/AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model b/data/2024/aaai/AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model new file mode 100644 index 0000000000..7f8125dda4 --- /dev/null +++ b/data/2024/aaai/AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model @@ -0,0 +1 @@ +Anomaly inspection plays an important role in industrial manufacturing. Existing anomaly inspection methods are limited in their performance due to insufficient anomaly data. Although anomaly generation methods have been proposed to augment the anomaly data, they either suffer from poor generation authenticity or inaccurate alignment between the generated anomalies and masks. To address the above problems, we propose AnomalyDiffusion, a novel diffusion-based few-shot anomaly generation model, which utilizes the strong prior information of a latent diffusion model learned from a large-scale dataset to enhance the generation authenticity under few-shot training data. Firstly, we propose Spatial Anomaly Embedding, which consists of a learnable anomaly embedding and a spatial embedding encoded from an anomaly mask, disentangling the anomaly information into anomaly appearance and location information. Moreover, to improve the alignment between the generated anomalies and the anomaly masks, we introduce a novel Adaptive Attention Re-weighting Mechanism. Based on the disparities between the generated anomaly image and normal sample, it dynamically guides the model to focus more on the areas with less noticeable generated anomalies, enabling generation of accurately-matched anomalous image-mask pairs.
Extensive experiments demonstrate that our model significantly outperforms the state-of-the-art methods in generation authenticity and diversity, and effectively improves the performance of downstream anomaly inspection tasks. The code and data are available at https://github.com/sjtuplayer/anomalydiffusion. \ No newline at end of file diff --git a/data/2024/aaai/AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models b/data/2024/aaai/AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models new file mode 100644 index 0000000000..60772a7744 --- /dev/null +++ b/data/2024/aaai/AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models @@ -0,0 +1 @@ +Large Vision-Language Models (LVLMs) such as MiniGPT-4 and LLaVA have demonstrated the capability of understanding images and achieved remarkable performance in various visual tasks. Despite their strong abilities in recognizing common objects due to extensive training datasets, they lack specific domain knowledge and have a weaker understanding of localized details within objects, which hinders their effectiveness in the Industrial Anomaly Detection (IAD) task. On the other hand, most existing IAD methods only provide anomaly scores and necessitate the manual setting of thresholds to distinguish between normal and abnormal samples, which restricts their practical implementation. In this paper, we explore the utilization of LVLM to address the IAD problem and propose AnomalyGPT, a novel IAD approach based on LVLM. We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image. We also employ an image decoder to provide fine-grained semantics and design a prompt learner to fine-tune the LVLM using prompt embeddings. Our AnomalyGPT eliminates the need for manual threshold adjustments and directly assesses the presence and locations of anomalies. Additionally, AnomalyGPT supports multi-turn dialogues and exhibits impressive few-shot in-context learning capabilities. With only one normal shot, AnomalyGPT achieves state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset. \ No newline at end of file diff --git a/data/2024/aaai/Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding b/data/2024/aaai/Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding new file mode 100644 index 0000000000..73477de6fc --- /dev/null +++ b/data/2024/aaai/Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding @@ -0,0 +1 @@ +While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC) which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. As seen, CLIC expands the receptive field into the entire image for intra-cluster feature aggregation. Afterward, features are reordered to their original spatial positions to pass through the local attention units for inter-cluster embedding. Additionally, we introduce the Guided Post-Quantization Filtering (GuidedPQF) into CLIC, effectively mitigating the propagation and accumulation of quantization errors at the initial decoding stage.
Extensive experiments demonstrate the superior performance of CLIC over state-of-the-art works: when optimized using MSE, it outperforms VVC by about 10% BD-Rate in three widely-used benchmark datasets; when optimized using MS-SSIM, it saves more than 50% BD-Rate over VVC. Our CLIC offers a new way to generate compact representations for image compression, which also provides a novel direction along the line of LIC development. \ No newline at end of file diff --git a/data/2024/aaai/Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images b/data/2024/aaai/Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images new file mode 100644 index 0000000000..9143bd65dc --- /dev/null +++ b/data/2024/aaai/Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images @@ -0,0 +1 @@ +Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is unfeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed HD images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2X compared to the traditional tiled algorithm. The source code is available at https://github.com/ProAirVerse/Any-Size-Diffusion. \ No newline at end of file diff --git a/data/2024/aaai/Any-Stereo: Arbitrary Scale Disparity Estimation for Iterative Stereo Matching b/data/2024/aaai/Any-Stereo: Arbitrary Scale Disparity Estimation for Iterative Stereo Matching new file mode 100644 index 0000000000..f8daacf3d9 --- /dev/null +++ b/data/2024/aaai/Any-Stereo: Arbitrary Scale Disparity Estimation for Iterative Stereo Matching @@ -0,0 +1 @@ +Due to unaffordable computational costs, the regularized disparity in iterative stereo matching is typically maintained at a lower resolution than the input. To regress the full resolution disparity, most stereo methods resort to convolutions to decode a fixed-scale output. However, they are inadequate for recovering vital high-frequency information lost during downsampling, limiting their performance on full-resolution prediction. In this paper, we introduce AnyStereo, an accurate and efficient disparity upsampling module with implicit neural representation for the iterative stereo pipeline. 
By modeling the disparity as a continuous representation over 2D spatial coordinates, subtle details can emerge from the latent space at arbitrary resolution. To further complement the missing information and details in the latent code, we propose two strategies: intra-scale similarity unfolding and cross-scale feature alignment. The former unfolds the neighbor relationships, while the latter introduces the context in high-resolution feature maps. The proposed AnyStereo can seamlessly replace the upsampling module in most iterative stereo models, improving their ability to capture fine details and generate arbitrary-scale disparities even with fewer parameters. With our method, the iterative stereo pipeline establishes new state-of-the-art performance. The code is available at https://github.com/Zhaohuai-L/Any-Stereo. \ No newline at end of file diff --git a/data/2024/aaai/Any-Way Meta Learning b/data/2024/aaai/Any-Way Meta Learning new file mode 100644 index 0000000000..1ff0fc0a08 --- /dev/null +++ b/data/2024/aaai/Any-Way Meta Learning @@ -0,0 +1,8 @@ +Although meta-learning shows promising performance in the realm of rapid adaptability, it is constrained by +fixed cardinality. When faced with tasks of varying cardinalities that were unseen during training, +the model fails to adapt. In this paper, we address and resolve this challenge +by harnessing `label equivalence', which emerges from stochastic numeric label assignments during episodic task sampling. Questioning what defines ``true" meta-learning, we introduce the ``any-way" learning paradigm, an innovative model training approach that liberates the model from +fixed cardinality constraints. Surprisingly, this model not only matches but often outperforms traditional fixed-way models in terms of performance, convergence speed, and stability. This disrupts established notions +about domain generalization. Furthermore, we argue that the inherent +label equivalence naturally lacks semantic information. To bridge this +semantic information gap arising from label equivalence, we further propose a mechanism for infusing semantic class information into the model. This would enhance the model's comprehension and functionality. Experiments conducted on renowned architectures like MAML and ProtoNet affirm the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/Approval-Based Committee Voting in Practice: A Case Study of (over-)Representation in the Polkadot Blockchain b/data/2024/aaai/Approval-Based Committee Voting in Practice: A Case Study of (over-)Representation in the Polkadot Blockchain new file mode 100644 index 0000000000..09e76d1112 --- /dev/null +++ b/data/2024/aaai/Approval-Based Committee Voting in Practice: A Case Study of (over-)Representation in the Polkadot Blockchain @@ -0,0 +1 @@ +We provide the first large-scale data collection of real-world approval-based committee elections. These elections have been conducted on the Polkadot blockchain as part of their Nominated Proof-of-Stake mechanism and contain around one thousand candidates and tens of thousands of (weighted) voters each. We conduct an in-depth study of application-relevant questions, including a quantitative and qualitative analysis of the outcomes returned by different voting rules. Besides considering proportionality measures that are standard in the multiwinner voting literature, we pay particular attention to less-studied measures of overrepresentation, as these are closely related to the security of the Polkadot network.
We also analyze how different design decisions such as the committee size affect the examined measures. \ No newline at end of file diff --git a/data/2024/aaai/Approximate Distance Oracle for Fault-Tolerant Geometric Spanners b/data/2024/aaai/Approximate Distance Oracle for Fault-Tolerant Geometric Spanners new file mode 100644 index 0000000000..3dce889761 --- /dev/null +++ b/data/2024/aaai/Approximate Distance Oracle for Fault-Tolerant Geometric Spanners @@ -0,0 +1,7 @@ +In this paper, we present approximate distance and shortest-path oracles for fault-tolerant Euclidean spanners motivated by the routing problem in real-world road networks. +A fault-tolerant Euclidean spanner for a set of points in Euclidean space is a graph +in which, despite the deletion of a small number of points, the distance between any two points in the damaged graph is an approximation of their Euclidean distance. +Given a fault-tolerant Euclidean spanner and a small approximation factor, +our data structure allows us to compute an approximate distance between two points in the damaged spanner in constant time when a query involves any two points and a small set of failed points. +Additionally, by incorporating auxiliary data structures, we can return a path itself in time almost linear in the length of the returned path. +Both data structures require near-linear space. \ No newline at end of file diff --git a/data/2024/aaai/Approximate Integer Solution Counts over Linear Arithmetic Constraints b/data/2024/aaai/Approximate Integer Solution Counts over Linear Arithmetic Constraints new file mode 100644 index 0000000000..a38430cfb1 --- /dev/null +++ b/data/2024/aaai/Approximate Integer Solution Counts over Linear Arithmetic Constraints @@ -0,0 +1 @@ +Counting integer solutions of linear constraints has found interesting applications in various fields. It is equivalent to the problem of counting lattice points inside a polytope. However, state-of-the-art algorithms for this problem become too slow for even a modest number of variables. In this paper, we propose a new framework to approximate the lattice counts inside a polytope with a new random-walk sampling method. The counts computed by our approach have been proven to be approximately bounded by an (epsilon, delta)-bound. Experiments on extensive benchmarks show that our algorithm can solve polytopes with dozens of dimensions, which significantly outperforms state-of-the-art counters. \ No newline at end of file diff --git a/data/2024/aaai/Approximation Algorithms for Preference Aggregation Using CP-Nets b/data/2024/aaai/Approximation Algorithms for Preference Aggregation Using CP-Nets new file mode 100644 index 0000000000..42e6f220ac --- /dev/null +++ b/data/2024/aaai/Approximation Algorithms for Preference Aggregation Using CP-Nets @@ -0,0 +1 @@ +This paper studies the design and analysis of approximation algorithms for aggregating preferences over combinatorial domains, represented using Conditional Preference Networks (CP-nets). Its focus is on aggregating preferences over so-called swaps, for which optimal solutions in general are already known to be of exponential size. We first analyze a trivial 2-approximation algorithm that simply outputs the best of the given input preferences, and establish a structural condition under which the approximation ratio of this algorithm is improved to 4/3. We then propose a polynomial-time approximation algorithm whose outputs are provably no worse than those of the trivial algorithm, but often substantially better.
A family of problem instances is presented for which our improved algorithm produces optimal solutions, while, for any ε, the trivial algorithm cannot attain a (2- ε)-approximation. These results may lead to the first polynomial-time approximation algorithm that solves the CP-net aggregation problem for swaps with an approximation ratio substantially better than 2. \ No newline at end of file diff --git a/data/2024/aaai/Approximation Scheme for Weighted Metric Clustering via Sherali-Adams b/data/2024/aaai/Approximation Scheme for Weighted Metric Clustering via Sherali-Adams new file mode 100644 index 0000000000..de5c7fb911 --- /dev/null +++ b/data/2024/aaai/Approximation Scheme for Weighted Metric Clustering via Sherali-Adams @@ -0,0 +1,3 @@ +Motivated by applications to classification problems on metric data, we study the Weighted Metric Clustering problem: given a metric d over n points and a k x k symmetric matrix A with non-negative entries, the goal is to find a k-partition of these points into clusters C1,...,Ck, while minimizing the sum of A[i,j] * d(u,v) over all pairs of clusters Ci and Cj and all pairs of points u from Ci and v from Cj. Specific choices of A lead to Weighted Metric Clustering capturing well-studied graph partitioning problems in metric spaces, such as Min-Uncut, Min-k-Sum, Min-k-Cut, and more. + +Our main result is that Weighted Metric Clustering admits a polynomial-time approximation scheme (PTAS). Our algorithm handles all the above problems using the Sherali-Adams linear programming relaxation. This subsumes several prior works, unifies many of the techniques for various metric clustering objectives, and yields a PTAS for several new problems, including metric clustering on manifolds and a new family of hierarchical clustering objectives. Our experiments on the hierarchical clustering objective show that it better captures the ground-truth structural information compared to Dasgupta's popular objective. \ No newline at end of file diff --git a/data/2024/aaai/Arbitrariness and Social Prediction: The Confounding Role of Variance in Fair Classification b/data/2024/aaai/Arbitrariness and Social Prediction: The Confounding Role of Variance in Fair Classification new file mode 100644 index 0000000000..4416010711 --- /dev/null +++ b/data/2024/aaai/Arbitrariness and Social Prediction: The Confounding Role of Variance in Fair Classification @@ -0,0 +1 @@ +Variance in predictions across different trained models is a significant, under-explored source of error in fair binary classification. In practice, the variance on some data examples is so large that decisions can be effectively arbitrary. To investigate this problem, we take an experimental approach and make four overarching contributions. We: 1) Define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; 2) Develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; 3) Conduct the largest to-date empirical study of the role of variance (vis-a-vis self-consistency and arbitrariness) in fair binary classification; and, 4) Release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily usable for future research. Altogether, our experiments reveal shocking insights about the reliability of conclusions on benchmark datasets.
Most fair binary classification benchmarks are close-to-fair when taking into account the amount of arbitrariness present in predictions -- before we even try to apply any fairness interventions. This finding calls into question the practical utility of common algorithmic fairness methods, and in turn suggests that we should reconsider how we choose to measure fairness in binary classification. \ No newline at end of file diff --git a/data/2024/aaai/Arbitrary-Scale Point Cloud Upsampling by Voxel-Based Network with Latent Geometric-Consistent Learning b/data/2024/aaai/Arbitrary-Scale Point Cloud Upsampling by Voxel-Based Network with Latent Geometric-Consistent Learning new file mode 100644 index 0000000000..e62dcf39a0 --- /dev/null +++ b/data/2024/aaai/Arbitrary-Scale Point Cloud Upsampling by Voxel-Based Network with Latent Geometric-Consistent Learning @@ -0,0 +1 @@ +Recently, arbitrary-scale point cloud upsampling has become increasingly popular due to its efficiency and convenience for practical applications. To achieve this, most previous approaches formulate it as a problem of surface approximation and employ point-based networks to learn surface representations. However, learning surfaces from sparse point clouds is more challenging, and thus they often suffer from low-fidelity geometry approximation. To address it, we propose an arbitrary-scale Point cloud Upsampling framework using Voxel-based Network (PU-VoxelNet). Thanks to the completeness and regularity inherited from the voxel representation, voxel-based networks are capable of providing a predefined grid space to approximate the 3D surface, and an arbitrary number of points can be reconstructed according to the predicted density distribution within each grid cell. However, we observe inaccurate grid sampling caused by imprecise density predictions. To address this issue, a density-guided grid resampling method is developed to generate high-fidelity points while effectively avoiding sampling outliers. Further, to improve the fine-grained details, we present an auxiliary training supervision to enforce the latent geometric consistency among local surface patches. Extensive experiments indicate the proposed approach outperforms the state-of-the-art approaches not only in terms of fixed upsampling rates but also for arbitrary-scale upsampling. The code is available at https://github.com/hikvision-research/3DVision \ No newline at end of file diff --git a/data/2024/aaai/Arbitrary-Scale Video Super-resolution Guided by Dynamic Context b/data/2024/aaai/Arbitrary-Scale Video Super-resolution Guided by Dynamic Context new file mode 100644 index 0000000000..b8ab134b54 --- /dev/null +++ b/data/2024/aaai/Arbitrary-Scale Video Super-resolution Guided by Dynamic Context @@ -0,0 +1 @@ +We propose a Dynamic Context-Guided Upsampling (DCGU) module for video super-resolution (VSR) that leverages temporal context guidance to achieve efficient and effective arbitrary-scale VSR. While most VSR research focuses on backbone design, the importance of the upsampling part is often overlooked. Existing methods rely on pixelshuffle-based upsampling, which has limited capabilities in handling arbitrary upsampling scales. Recent attempts to replace pixelshuffle-based modules with implicit neural function-based and filter-based approaches suffer from slow inference speeds and limited representation capacity, respectively.
To overcome these limitations, our DCGU module predicts non-local sampling locations and content-dependent filter weights, enabling efficient and effective arbitrary-scale VSR. Our proposed multi-granularity location search module efficiently identifies non-local sampling locations across the entire low-resolution grid, and the temporal bilateral filter modulation module integrates content information with the filter weight to enhance textual details. Extensive experiments demonstrate the superiority of our method in terms of performance and speed on arbitrary-scale VSR. \ No newline at end of file diff --git a/data/2024/aaai/Are You Concerned about Limited Function Evaluations: Data-Augmented Pareto Set Learning for Expensive Multi-Objective Optimization b/data/2024/aaai/Are You Concerned about Limited Function Evaluations: Data-Augmented Pareto Set Learning for Expensive Multi-Objective Optimization new file mode 100644 index 0000000000..05dcc3f391 --- /dev/null +++ b/data/2024/aaai/Are You Concerned about Limited Function Evaluations: Data-Augmented Pareto Set Learning for Expensive Multi-Objective Optimization @@ -0,0 +1 @@ +Optimizing multiple conflicting black-box objectives simultaneously is a prevalent occurrence in many real-world applications, such as neural architecture search, and machine learning. These problems are known as expensive multi-objective optimization problems (EMOPs) when the function evaluations are computationally or financially costly. Multi-objective Bayesian optimization (MOBO) offers an efficient approach to discovering a set of Pareto optimal solutions. However, the data deficiency issue caused by limited function evaluations has posed a great challenge to current optimization methods. Moreover, most current methods tend to prioritize the quality of candidate solutions, while ignoring the quantity of promising samples. In order to tackle these issues, our paper proposes a novel multi-objective Bayesian optimization algorithm with a data augmentation strategy that provides ample high-quality samples for Pareto set learning (PSL). Specifically, it utilizes Generative Adversarial Networks (GANs) to enrich data and a dominance prediction model to screen out high-quality samples, mitigating the predicament of limited function evaluations in EMOPs. Additionally, we adopt the regularity model to expensive multi-objective Bayesian optimization for PSL. Experimental results on both synthetic and real-world problems demonstrate that our algorithm outperforms several state-of-the-art and classical algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning b/data/2024/aaai/Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning new file mode 100644 index 0000000000..9a611f22e7 --- /dev/null +++ b/data/2024/aaai/Arithmetic Feature Interaction Is Necessary for Deep Tabular Learning @@ -0,0 +1 @@ +Until recently, the question of the effective inductive bias of deep models on tabular data has remained unanswered. This paper investigates the hypothesis that arithmetic feature interaction is necessary for deep tabular learning. To test this point, we create a synthetic tabular dataset with a mild feature interaction assumption and examine a modified transformer architecture enabling arithmetical feature interactions, referred to as AMFormer. 
Results show that AMFormer outperforms strong counterparts in fine-grained tabular data modeling, data efficiency in training, and generalization. This is attributed to its parallel additive and multiplicative attention operators and prompt-based optimization, which facilitate the separation of tabular samples in an extended space with arithmetically-engineered features. Our extensive experiments on real-world data also validate the consistent effectiveness, efficiency, and rationale of AMFormer, suggesting it has established a strong inductive bias for deep learning on tabular data. Code is available at https://github.com/aigc-apps/AMFormer. \ No newline at end of file diff --git a/data/2024/aaai/ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank b/data/2024/aaai/ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank new file mode 100644 index 0000000000..b777917bea --- /dev/null +++ b/data/2024/aaai/ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank @@ -0,0 +1 @@ +Artistic style transfer aims to repaint the content image with the learned artistic style. Existing artistic style transfer methods can be divided into two categories: small model-based approaches and pre-trained large-scale model-based approaches. Small model-based approaches can preserve the content structure, but fail to produce highly realistic stylized images and introduce artifacts and disharmonious patterns; pre-trained large-scale model-based approaches can generate highly realistic stylized images but struggle with preserving the content structure. To address the above issues, we propose ArtBank, a novel artistic style transfer framework, to generate highly realistic stylized images while preserving the content structure of the content images. Specifically, to fully exploit the knowledge embedded in pre-trained large-scale models, an Implicit Style Prompt Bank (ISPB), a set of trainable parameter matrices, is designed to learn and store knowledge from the collection of artworks and behave as a visual prompt to guide pre-trained large-scale models to generate highly realistic stylized images while preserving content structure. Besides, to accelerate the training of the above ISPB, we propose a novel Spatial-Statistical-based self-Attention Module (SSAM). The qualitative and quantitative experiments demonstrate the superiority of our proposed method over state-of-the-art artistic style transfer methods. Code is available at https://github.com/Jamie-Cheung/ArtBank. \ No newline at end of file diff --git a/data/2024/aaai/Artificial Intelligence in the CS2023 Undergraduate Computer Science Curriculum: Rationale and Challenges b/data/2024/aaai/Artificial Intelligence in the CS2023 Undergraduate Computer Science Curriculum: Rationale and Challenges new file mode 100644 index 0000000000..2438252604 --- /dev/null +++ b/data/2024/aaai/Artificial Intelligence in the CS2023 Undergraduate Computer Science Curriculum: Rationale and Challenges @@ -0,0 +1 @@ +Roughly every decade, the ACM and IEEE professional organizations have produced recommendations for the education of undergraduate computer science students. These guidelines are used worldwide by research universities, liberal arts colleges, and community colleges.
For the latest 2023 revision of the curriculum, AAAI has collaborated with ACM and IEEE to integrate artificial intelligence more broadly into this new curriculum and to address the issues it raises for students, instructors, practitioners, policy makers, and the general public. This paper describes the development process and rationale that underlie the artificial intelligence components of the CS2023 curriculum, discusses the challenges in curriculum design for such a rapidly advancing field, and examines lessons learned during this three-year process. \ No newline at end of file diff --git a/data/2024/aaai/Aspect-Based Sentiment Analysis with Explicit Sentiment Augmentations b/data/2024/aaai/Aspect-Based Sentiment Analysis with Explicit Sentiment Augmentations new file mode 100644 index 0000000000..2c6035ea7c --- /dev/null +++ b/data/2024/aaai/Aspect-Based Sentiment Analysis with Explicit Sentiment Augmentations @@ -0,0 +1 @@ +Aspect-based sentiment analysis (ABSA), a fine-grained sentiment classification task, has received much attention recently. Many works investigate sentiment information through opinion words, such as "good'' and "bad''. However, implicit sentiment data widely exists in the ABSA dataset, whose sentiment polarity is hard to determine due to the lack of distinct opinion words. To deal with implicit sentiment, this paper proposes an ABSA method that integrates explicit sentiment augmentations (ABSA-ESA) to add more sentiment clues. We propose an ABSA-specific explicit sentiment generation method to create such augmentations. Specifically, we post-train T5 by rule-based data and employ three strategies to constrain the sentiment polarity and aspect term of the generated augmentations. We employ Syntax Distance Weighting and Unlikelihood Contrastive Regularization in the training procedure to guide the model to generate the explicit opinion words with the same polarity as the input sentence. Meanwhile, we utilize the Constrained Beam Search to ensure the augmentations are aspect-related. We test ABSA-ESA on two ABSA benchmarks. The results show that ABSA-ESA outperforms the SOTA baselines on implicit and explicit sentiment accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Assume-Guarantee Reinforcement Learning b/data/2024/aaai/Assume-Guarantee Reinforcement Learning new file mode 100644 index 0000000000..5ce95fc823 --- /dev/null +++ b/data/2024/aaai/Assume-Guarantee Reinforcement Learning @@ -0,0 +1 @@ +We present a modular approach to reinforcement learning (RL) in environments consisting of simpler components evolving in parallel. A monolithic view of such modular environments may be prohibitively large to learn, or may require unrealizable communication between the components in the form of a centralized controller. Our proposed approach is based on the assume-guarantee paradigm where the optimal control for the individual components is synthesized in isolation by making assumptions about the behaviors of neighboring components, and providing guarantees about their own behavior. We express these assume-guarantee contracts as regular languages and provide automatic translations to scalar rewards to be used in RL. By combining local probabilities of satisfaction for each component, we provide a lower bound on the probability of satisfaction of the complete system. By solving a Markov game for each component, RL can produce a controller for each component that maximizes this lower bound. 
The controller utilizes the information it receives through communication, observations, and any knowledge of a coarse model of other agents. We experimentally demonstrate the efficiency of the proposed approach on a variety of case studies. \ No newline at end of file diff --git a/data/2024/aaai/Asymmetric Mutual Alignment for Unsupervised Zero-Shot Sketch-Based Image Retrieval b/data/2024/aaai/Asymmetric Mutual Alignment for Unsupervised Zero-Shot Sketch-Based Image Retrieval new file mode 100644 index 0000000000..366657e047 --- /dev/null +++ b/data/2024/aaai/Asymmetric Mutual Alignment for Unsupervised Zero-Shot Sketch-Based Image Retrieval @@ -0,0 +1 @@ +In recent years, many methods have been proposed to address the zero-shot sketch-based image retrieval (ZS-SBIR) task, which is a practical problem in many applications. However, in real-world scenarios, on the one hand, we cannot obtain training data with the same distribution as the test data, and on the other hand, the labels of the training data are usually unavailable. To tackle this issue, we focus on a new problem, namely unsupervised zero-shot sketch-based image retrieval (UZS-SBIR), where the available training data does not have labels while the training and testing categories are not overlapping. In this paper, we introduce a new asymmetric mutual alignment method (AMA) that includes a self-distillation module and a cross-modality mutual alignment module. First, we conduct self-distillation to extract the feature embeddings from unlabeled data. Due to the lack of label information in the unsupervised setting, we employ the cross-modality mutual alignment module to further excavate underlying intra-modality and inter-modality relationships from unlabeled data, and take full advantage of these correlations to align the feature embeddings in image and sketch domains. Meanwhile, the feature representations are enhanced by the intra-modality clustering relations, leading to better generalization ability to unseen classes. Moreover, we adopt an asymmetric strategy to update the teacher and student networks. Extensive experimental results on several benchmark datasets demonstrate the superiority of our method. \ No newline at end of file diff --git a/data/2024/aaai/Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation b/data/2024/aaai/Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation new file mode 100644 index 0000000000..e9ce1d1e6c --- /dev/null +++ b/data/2024/aaai/Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation @@ -0,0 +1 @@ +Existing generative adversarial network (GAN) based conditional image generative models typically produce fixed output for the same conditional input, which is unreasonable for highly subjective tasks, such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generative methods require retraining/fine-tuning the network or designing complex noise injection functions, which is computationally expensive or task-specific, and such methods often struggle to generate high-quality results. Given that many deterministic conditional image generative models have been able to produce high-quality yet fixed results, we raise an intriguing question: is it possible for pre-trained deterministic conditional image generative models to generate diverse results without changing network structures or parameters?
To answer this question, we re-examine the conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in projected gradient descent (PGD)-like method for diverse and controllable image generation. The key idea is to attack the pre-trained deterministic generative models by adding a micro perturbation to the input condition. In this way, diverse results can be generated without any adjustment of network structures or fine-tuning of the pre-trained models. In addition, we can also control the diverse results to be generated by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attacks to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Attacking CNNs in Histopathology with SNAP: Sporadic and Naturalistic Adversarial Patches (Student Abstract) b/data/2024/aaai/Attacking CNNs in Histopathology with SNAP: Sporadic and Naturalistic Adversarial Patches (Student Abstract) new file mode 100644 index 0000000000..78b358b8ca --- /dev/null +++ b/data/2024/aaai/Attacking CNNs in Histopathology with SNAP: Sporadic and Naturalistic Adversarial Patches (Student Abstract) @@ -0,0 +1,11 @@ +Convolutional neural networks (CNNs) are being increasingly +adopted in medical imaging. However, in the race for +developing accurate models, their robustness is often overlooked. +This elicits a significant concern given the safety-critical +nature of the healthcare system. Here, we highlight +the vulnerability of CNNs against a sporadic and naturalistic +adversarial patch attack (SNAP). We train SNAP to mislead +the ResNet50 model predicting metastasis in histopathological +scans of lymph node sections, lowering the accuracy by +27%. This work emphasizes the need for defense strategies +before deploying CNNs in critical healthcare settings. \ No newline at end of file diff --git a/data/2024/aaai/Attacking Transformers with Feature Diversity Adversarial Perturbation b/data/2024/aaai/Attacking Transformers with Feature Diversity Adversarial Perturbation new file mode 100644 index 0000000000..fcb82c5748 --- /dev/null +++ b/data/2024/aaai/Attacking Transformers with Feature Diversity Adversarial Perturbation @@ -0,0 +1 @@ +Understanding the mechanisms behind Vision Transformer (ViT), particularly its vulnerability to adversarial perturbations, is crucial for addressing challenges in its real-world applications. Existing ViT adversarial attackers rely on labels to calculate the gradient for perturbation, and exhibit low transferability to other structures and tasks. In this paper, we present a label-free white-box attack approach for ViT-based models that exhibits strong transferability to various black-box models, including most ViT variants, CNNs, and MLPs, even for models developed for other modalities. Our inspiration comes from the feature collapse phenomenon in ViTs, where the critical attention mechanism overly depends on the low-frequency component of features, causing the features in middle-to-end layers to become increasingly similar and eventually collapse. We propose the feature diversity attacker to naturally accelerate this process and achieve remarkable performance and transferability.
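The abstract does not spell out the attack objective, but the idea of accelerating feature collapse can be pictured with a short, hedged sketch: a PGD-style loop that perturbs the input so as to reduce the diversity of intermediate ViT token features. The variance-based diversity loss, the feature_fn hook, and all hyperparameters below are illustrative assumptions rather than the authors' implementation.

import torch

def feature_diversity(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, num_tokens, dim) features from a middle transformer block
    centered = tokens - tokens.mean(dim=1, keepdim=True)
    return centered.pow(2).mean()  # higher value = more diverse token features

def feature_collapse_attack(x, feature_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    # Label-free, PGD-like attack: minimize feature diversity w.r.t. the input.
    # x: clean images in [0, 1]; feature_fn: maps images -> token features,
    # e.g. a forward hook on a ViT block (an assumed setup, not prescribed here).
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = feature_diversity(feature_fn(x_adv))
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                    # push features toward collapse
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project to the L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

Under this sketch the attack needs no labels; transferability would then be assessed by feeding x_adv to unseen black-box models.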
\ No newline at end of file diff --git a/data/2024/aaai/Attacks on Continual Semantic Segmentation by Perturbing Incremental Samples b/data/2024/aaai/Attacks on Continual Semantic Segmentation by Perturbing Incremental Samples new file mode 100644 index 0000000000..31b87bad39 --- /dev/null +++ b/data/2024/aaai/Attacks on Continual Semantic Segmentation by Perturbing Incremental Samples @@ -0,0 +1 @@ +As an essential computer vision task, Continual Semantic Segmentation (CSS) has received a lot of attention. However, security issues regarding this task have not been fully studied. To bridge this gap, we study the problem of attacks in CSS in this paper. We first propose a new task, namely, attacks on incremental samples in CSS, and reveal that the attacks on incremental samples corrupt the performance of CSS in both old and new classes. Moreover, we present an adversarial sample generation method based on class shift, namely Class Shift Attack (CS-Attack), which is an offline and easy-to-implement approach for CSS. CS-Attack is able to significantly degrade the performance of models on both old and new classes without knowledge of the incremental learning approach, which undermines the original purpose of the incremental learning, i.e., learning new classes while retaining old knowledge. Experiments show that on the popular datasets Pascal VOC, ADE20k, and Cityscapes, our approach easily degrades the performance of currently popular CSS methods, which reveals the importance of security in CSS. \ No newline at end of file diff --git a/data/2024/aaai/Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention b/data/2024/aaai/Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention new file mode 100644 index 0000000000..3f9b993acd --- /dev/null +++ b/data/2024/aaai/Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention @@ -0,0 +1 @@ +Vision Transformer (ViT) is one of the most widely used models in computer vision owing to its strong performance on various tasks. In order to fully utilize ViT-based architectures in various applications, proper visualization methods with decent localization performance are necessary, but the methods employed for CNN-based models are not directly applicable to ViT due to its unique structure. In this work, we propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision. Our method selectively aggregates the gradients directly propagated from the classification output to each self-attention, collecting the contribution of image features extracted from each location of the input image. These gradients are additionally guided by the normalized self-attention scores, which are the pairwise patch correlation scores. They are used to supplement the gradients on the patch-level context information efficiently detected by the self-attention mechanism. This approach provides elaborate high-level semantic explanations with great localization performance using only the class labels. As a result, our method outperforms the previous leading explainability methods of ViT in the weakly-supervised localization task and presents great capability in capturing the full instances of the target class object. Meanwhile, our method provides a visualization that faithfully explains the model, which is demonstrated in the perturbation comparison test.
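The aggregation described here can be illustrated with a small, hedged sketch. The recipe below (ReLU of gradient times attention, averaged over heads and summed over layers, read off the CLS row) is a common way to combine class-output gradients with normalized self-attention scores, assumed for illustration rather than taken from the paper.

import torch

def attention_guided_cam(attentions, attention_grads, grid_size):
    # attentions / attention_grads: lists of (1, heads, tokens, tokens) tensors,
    # collected per layer with forward/backward hooks for one image and one class.
    # Assumes token 0 is the [CLS] token and the remaining tokens are image patches.
    cam = None
    for attn, grad in zip(attentions, attention_grads):
        weighted = torch.relu(grad * attn).mean(dim=1)  # head-averaged relevance per layer
        cls_to_patches = weighted[0, 0, 1:]             # [CLS] row, patch columns
        cam = cls_to_patches if cam is None else cam + cls_to_patches
    cam = cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
    return cam.reshape(grid_size, grid_size)            # e.g. 14 x 14 for ViT-B/16

Upsampling the returned map to the input resolution gives the kind of heatmap used for weakly-supervised localization.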
\ No newline at end of file diff --git a/data/2024/aaai/Attention-Based Models for Snow-Water Equivalent Prediction b/data/2024/aaai/Attention-Based Models for Snow-Water Equivalent Prediction new file mode 100644 index 0000000000..769456f379 --- /dev/null +++ b/data/2024/aaai/Attention-Based Models for Snow-Water Equivalent Prediction @@ -0,0 +1 @@ +Snow Water-Equivalent (SWE)—the amount of water available if snowpack is melted—is a key decision variable used by water management agencies to make irrigation, flood control, power generation, and drought management decisions. SWE values vary spatiotemporally—affected by weather, topography, and other environmental factors. While daily SWE can be measured by Snow Telemetry (SNOTEL) stations with requisite instrumentation, such stations are spatially sparse, requiring interpolation techniques to create spatiotemporally complete data. While recent efforts have explored machine learning (ML) for SWE prediction, a number of recent ML advances have yet to be considered. The main contribution of this paper is to explore one such ML advance, attention mechanisms, for SWE prediction. Our hypothesis is that attention has a unique ability to capture and exploit correlations that may exist across locations or the temporal spectrum (or both). We present a generic attention-based modeling framework for SWE prediction and adapt it to capture spatial attention and temporal attention. Our experimental results on 323 SNOTEL stations in the Western U.S. demonstrate that our attention-based models outperform other machine-learning approaches. We also provide key results highlighting the differences between spatial and temporal attention in this context and a roadmap toward deployment for generating spatially-complete SWE maps. \ No newline at end of file diff --git a/data/2024/aaai/Attention-Induced Embedding Imputation for Incomplete Multi-View Partial Multi-Label Classification b/data/2024/aaai/Attention-Induced Embedding Imputation for Incomplete Multi-View Partial Multi-Label Classification new file mode 100644 index 0000000000..e5cb6f20e1 --- /dev/null +++ b/data/2024/aaai/Attention-Induced Embedding Imputation for Incomplete Multi-View Partial Multi-Label Classification @@ -0,0 +1 @@ +As a combination of emerging multi-view learning methods and traditional multi-label classification tasks, multi-view multi-label classification has shown broad application prospects. The diverse semantic information contained in heterogeneous data effectively enables the further development of multi-label classification. However, the widespread incompleteness problem on multi-view features and labels greatly hinders the practical application of multi-view multi-label classification. Therefore, in this paper, we propose an attention-induced missing instances imputation technique to enhance the generalization ability of the model. Different from existing incomplete multi-view completion methods, we attempt to approximate the latent features of missing instances in embedding space according to cross-view joint attention, instead of recovering missing views in kernel space or original feature space. Accordingly, multi-view completed features are dynamically weighted by the confidence derived from joint attention in the late fusion phase.
In addition, we propose a multi-view multi-label classification framework based on label-semantic feature learning, utilizing the statistical weak label correlation matrix and graph attention network to guide the learning process of label-specific features. Finally, our model is compatible with missing multi-view and partial multi-label data simultaneously and extensive experiments on five datasets confirm the advancement and effectiveness of our embedding imputation method and multi-view multi-label classification model. \ No newline at end of file diff --git a/data/2024/aaai/Attribute-Missing Graph Clustering Network b/data/2024/aaai/Attribute-Missing Graph Clustering Network new file mode 100644 index 0000000000..0f01a2d7d8 --- /dev/null +++ b/data/2024/aaai/Attribute-Missing Graph Clustering Network @@ -0,0 +1 @@ +Deep clustering with attribute-missing graphs, where only a subset of nodes possesses complete attributes while those of others are missing, is an important yet challenging topic in various practical applications. It has become a prevalent learning paradigm in existing studies to perform data imputation first and subsequently conduct clustering using the imputed information. However, these ``two-stage" methods disconnect the clustering and imputation processes, preventing the model from effectively learning clustering-friendly graph embedding. Furthermore, they are not tailored for clustering tasks, leading to inferior clustering results. To solve these issues, we propose a novel Attribute-Missing Graph Clustering (AMGC) method to alternately promote clustering and imputation in a unified framework, where we iteratively produce the clustering-enhanced nearest neighbor information to conduct the data imputation process and utilize the imputed information to implicitly refine the clustering distribution through model optimization. Specifically, in the imputation step, we take the learned clustering information as imputation prompts to help each attribute-missing sample gather highly correlated features within its clusters for data completion, such that the intra-class compactness can be improved. Moreover, to support reliable clustering, we maximize inter-class separability by conducting cost-efficient dual non-contrastive learning over the imputed latent features, which in turn promotes greater graph encoding capability for clustering sub-network. Extensive experiments on five datasets have verified the superiority of AMGC against competitors. \ No newline at end of file diff --git a/data/2024/aaai/Audio Generation with Multiple Conditional Diffusion Model b/data/2024/aaai/Audio Generation with Multiple Conditional Diffusion Model new file mode 100644 index 0000000000..707f36e0ab --- /dev/null +++ b/data/2024/aaai/Audio Generation with Multiple Conditional Diffusion Model @@ -0,0 +1 @@ +Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. 
To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. \ No newline at end of file diff --git a/data/2024/aaai/Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification b/data/2024/aaai/Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification new file mode 100644 index 0000000000..9307370366 --- /dev/null +++ b/data/2024/aaai/Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification @@ -0,0 +1 @@ +With the rapid growth of audio data, there's a pressing need for automatic audio classification. As a type of time-series data, audio exhibits waveform fluctuations in both the time and frequency domains that evolve over time, with similar instances sharing consistent patterns. This study introduces the Audio Scanning Network (ASNet), designed to leverage abundant information for achieving stable and effective audio classification. ASNet captures real-time changes in audio waveforms across both time and frequency domains through reservoir computing, supported by Reservoir Kernel Canonical Correlation Analysis (RKCCA) to explore correlations between time-domain and frequency-domain waveform fluctuations. This innovative approach empowers ASNet to comprehensively capture the changes and inherent correlations within the audio waveform, and without the need for time-consuming iterative training. Instead of converting audio into spectrograms, ASNet directly utilizes audio feature sequences to uncover associations between time and frequency fluctuations. Experiments on environmental sound and music genre classification tasks demonstrate ASNet's comparable performance to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head b/data/2024/aaai/AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head new file mode 100644 index 0000000000..6724ab9392 --- /dev/null +++ b/data/2024/aaai/AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head @@ -0,0 +1 @@ +Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. 
With an increasing demand to evaluate multi-modal LLMs on human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving 16 AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Code can be found at https://github.com/AIGC-Audio/AudioGPT \ No newline at end of file diff --git a/data/2024/aaai/Auditable Algorithms for Approximate Model Counting b/data/2024/aaai/Auditable Algorithms for Approximate Model Counting new file mode 100644 index 0000000000..6aea2b0c4f --- /dev/null +++ b/data/2024/aaai/Auditable Algorithms for Approximate Model Counting @@ -0,0 +1,5 @@ +The problem of model counting, i.e., counting satisfying assignments of a Boolean formula, is a fundamental problem in computer science, with diverse applications. Given #P-hardness of the problem, many algorithms have been developed over the years to provide an approximate model count. Recently, building on the practical success of SAT-solvers used as NP oracles, the focus has shifted from theory to practical implementations of such algorithms. This has brought new challenges into focus. In this paper, we consider one such challenge – that of auditable deterministic approximate model counters wherein a counter should also generate a certificate, which allows a user (often with limited computational power) to independently audit whether the count returned by an invocation of the algorithm is indeed within the promised bounds. + +We start by examining a celebrated approximate model counting algorithm due to Stockmeyer that uses polynomially many calls to a \Sigma^2_P oracle, and show that it can be audited via a \Pi^2_P formula on (n^2 log^2 n) variables, where n is the number of variables in the original formula. Since n is often large (10’s to 100’s of thousands) for typical instances, we ask if the count of variables in the certificate formula can be reduced – a critical question towards potential implementation. We show that this improvement in certification can be achieved with a tradeoff in the counting algorithm’s complexity. Specifically, we develop new deterministic approximate model counting algorithms that invoke a \Sigma^3_P oracle, but can be certified using a \Pi^2_P formula on fewer variables: our final algorithm uses just (n log n) variables. + +Our study demonstrates that one can simplify certificate checking significantly if we allow the counting algorithm to access a slightly more powerful oracle. We believe this shows for the first time how the audit complexity can be traded for the complexity of approximate counting. \ No newline at end of file diff --git a/data/2024/aaai/Augmented Commonsense Knowledge for Remote Object Grounding b/data/2024/aaai/Augmented Commonsense Knowledge for Remote Object Grounding new file mode 100644 index 0000000000..21d18fe150 --- /dev/null +++ b/data/2024/aaai/Augmented Commonsense Knowledge for Remote Object Grounding @@ -0,0 +1 @@ +The vision-and-language navigation (VLN) task requires an agent to perceive the surroundings, follow natural language instructions, and act in photo-realistic unseen environments. Most of the existing methods employ the entire image or object features to represent navigable viewpoints.
However, these representations are insufficient for proper action prediction, especially for the REVERIE task, which uses concise high-level instructions, such as “Bring me the blue cushion in the master bedroom”. To enhance these representations, we propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as a spatio-temporal knowledge graph for improving agent navigation. Specifically, the proposed approach involves constructing a knowledge base by retrieving commonsense information from ConceptNet, followed by a refinement module to remove noisy and irrelevant knowledge. We further present ACK which consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment by integrating visible objects, commonsense knowledge, and concept history, which includes object and knowledge temporal information. Moreover, we add a new pipeline for the commonsense-based decision-making process which leads to more accurate local action prediction. Experimental results demonstrate our proposed model noticeably outperforms the baseline and achieves state-of-the-art performance on the REVERIE benchmark. The source code is available at https://github.com/Bahram-Mohammadi/ACK. \ No newline at end of file diff --git a/data/2024/aaai/Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery b/data/2024/aaai/Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery new file mode 100644 index 0000000000..b5f09f3071 --- /dev/null +++ b/data/2024/aaai/Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery @@ -0,0 +1 @@ +The substantial success of Vision Transformer (ViT) in computer vision tasks is largely attributed to the architecture design. This underscores the necessity of efficient architecture search for designing better ViTs automatically. As training-based architecture search methods are computationally intensive, there’s a growing interest in training-free methods that use zero-cost proxies to score ViTs. However, existing training-free approaches require expert knowledge to manually design specific zero-cost proxies. Moreover, these zero-cost proxies exhibit limited ability to generalize across diverse domains. In this paper, we introduce Auto-Prox, an automatic proxy discovery framework, to address the problem. First, we build the ViT-Bench-101, which involves different ViT candidates and their actual performance on multiple datasets. Utilizing ViT-Bench-101, we can evaluate zero-cost proxies based on their score-accuracy correlation. Then, we represent zero-cost proxies with computation graphs and organize the zero-cost proxy search space with ViT statistics and primitive operations. To discover generic zero-cost proxies, we propose a joint correlation metric to evolve and mutate different zero-cost proxy candidates. We introduce an elitism-preserve strategy for search efficiency to achieve a better trade-off between exploitation and exploration. Based on the discovered zero-cost proxy, we conduct a ViT architecture search in a training-free manner. Extensive experiments demonstrate that our method generalizes well to different datasets and achieves state-of-the-art results both in ranking correlation and final accuracy. Codes can be found at https://github.com/lilujunai/Auto-Prox-AAAI24.
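The score-accuracy correlation used to judge proxies on ViT-Bench-101 is a standard ranking statistic; a hedged sketch of the evaluation loop, with a simple gradient-norm score standing in for a discovered proxy, might look as follows (the proxy itself is an assumption, not Auto-Prox's output).

import torch
from scipy.stats import kendalltau, spearmanr

def gradient_norm_proxy(model, images, labels):
    # Illustrative zero-cost proxy: total absolute gradient after one mini-batch.
    model.zero_grad()
    torch.nn.functional.cross_entropy(model(images), labels).backward()
    return sum(p.grad.abs().sum().item() for p in model.parameters() if p.grad is not None)

def proxy_quality(proxy_scores, accuracies):
    # Ranking correlation between proxy scores and measured accuracies;
    # higher Kendall tau / Spearman rho means a better zero-cost proxy.
    tau, _ = kendalltau(proxy_scores, accuracies)
    rho, _ = spearmanr(proxy_scores, accuracies)
    return tau, rho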
\ No newline at end of file diff --git a/data/2024/aaai/Auto311: A Confidence-Guided Automated System for Non-emergency Calls b/data/2024/aaai/Auto311: A Confidence-Guided Automated System for Non-emergency Calls new file mode 100644 index 0000000000..cd329dd4c0 --- /dev/null +++ b/data/2024/aaai/Auto311: A Confidence-Guided Automated System for Non-emergency Calls @@ -0,0 +1 @@ +Emergency and non-emergency response systems are essential services provided by local governments and critical to protecting lives, the environment, and property. The effective handling of (non-)emergency calls is critical for public safety and well-being. By reducing the burden created by non-emergency callers, residents in critical need of assistance through 911 will receive a fast and effective response. Collaborating with the Department of Emergency Communications (DEC) in Nashville, we analyzed 11,796 non-emergency call recordings and developed Auto311, the first automated system to handle 311 non-emergency calls, which (1) effectively and dynamically predicts ongoing non-emergency incident types to generate tailored case reports during the call; (2) itemizes essential information from dialogue contexts to complete the generated reports; and (3) strategically structures system-caller dialogues with optimized confidence. We used real-world data to evaluate the system's effectiveness and deployability. The experimental results indicate that the system effectively predicts incident type with an average F-1 score of 92.54%. Moreover, the system successfully itemizes critical information from relevant contexts to complete reports, evincing a 0.93 average consistency score compared to the ground truth. Additionally, emulations demonstrate that the system effectively decreases conversation turns as the utterance size gets more extensive and categorizes the ongoing call with 94.49% mean accuracy. \ No newline at end of file diff --git a/data/2024/aaai/AutoLTS: Automating Cycling Stress Assessment via Contrastive Learning and Spatial Post-processing b/data/2024/aaai/AutoLTS: Automating Cycling Stress Assessment via Contrastive Learning and Spatial Post-processing new file mode 100644 index 0000000000..6e2cbdd09b --- /dev/null +++ b/data/2024/aaai/AutoLTS: Automating Cycling Stress Assessment via Contrastive Learning and Spatial Post-processing @@ -0,0 +1 @@ +Cycling stress assessment, which quantifies cyclists' perceived stress imposed by the built environment and motor traffic, increasingly informs cycling infrastructure planning and cycling route recommendation. However, currently calculating cycling stress is slow and data-intensive, which hinders its broader application. In this paper, we propose a deep learning framework to support accurate, fast, and large-scale cycling stress assessments for urban road networks based on street-view images. Our framework features i) a contrastive learning approach that leverages the ordinal relationship among cycling stress labels, and ii) a post-processing technique that enforces spatial smoothness into our predictions. On a dataset of 39,153 road segments collected in Toronto, Canada, our results demonstrate the effectiveness of our deep learning framework and the value of using image data for cycling stress assessment in the absence of high-quality road geometry and motor traffic data.
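The spatial post-processing step can be pictured with a deliberately simple sketch: one pass of neighborhood voting over the road graph, so that each segment's predicted stress level is nudged toward agreement with adjacent segments. The actual AutoLTS technique is more principled; the voting rule and weights below are assumptions for illustration only.

from collections import Counter

def smooth_lts(pred, adjacency, keep_weight=2):
    # pred: {segment_id: predicted LTS class}; adjacency: {segment_id: [neighbor ids]}.
    # Each segment keeps `keep_weight` votes for its own prediction; each neighbor adds one vote.
    smoothed = {}
    for seg, label in pred.items():
        votes = Counter({label: keep_weight})
        votes.update(pred[n] for n in adjacency.get(seg, ()) if n in pred)
        smoothed[seg] = votes.most_common(1)[0][0]
    return smoothed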
\ No newline at end of file diff --git a/data/2024/aaai/AutoMixer for Improved Multivariate Time-Series Forecasting on Business and IT Observability Data b/data/2024/aaai/AutoMixer for Improved Multivariate Time-Series Forecasting on Business and IT Observability Data new file mode 100644 index 0000000000..7f48b65a34 --- /dev/null +++ b/data/2024/aaai/AutoMixer for Improved Multivariate Time-Series Forecasting on Business and IT Observability Data @@ -0,0 +1 @@ +The efficiency of business processes relies on business key performance indicators (Biz-KPIs), that can be negatively impacted by IT failures. Business and IT Observability (BizITObs) data fuses both Biz-KPIs and IT event channels together as multivariate time series data. Forecasting Biz-KPIs in advance can enhance efficiency and revenue through proactive corrective measures. However, BizITObs data generally exhibit both useful and noisy inter-channel interactions between Biz-KPIs and IT events that need to be effectively decoupled. This leads to suboptimal forecasting performance when existing multivariate forecasting models are employed. To address this, we introduce AutoMixer, a time-series Foundation Model (FM) approach, grounded on the novel technique of channel-compressed pretrain and finetune workflows. AutoMixer leverages an AutoEncoder for channel-compressed pretraining and integrates it with the advanced TSMixer model for multivariate time series forecasting. This fusion greatly enhances the potency of TSMixer for accurate forecasts and also generalizes well across several downstream tasks. Through detailed experiments and dashboard analytics, we show AutoMixer's capability to consistently improve the Biz-KPI's forecasting accuracy (by 11-15%) which directly translates to actionable business insights. \ No newline at end of file diff --git a/data/2024/aaai/Automated Assessment of Fidelity and Interpretability: An Evaluation Framework for Large Language Models' Explanations (Student Abstract) b/data/2024/aaai/Automated Assessment of Fidelity and Interpretability: An Evaluation Framework for Large Language Models' Explanations (Student Abstract) new file mode 100644 index 0000000000..201d5883af --- /dev/null +++ b/data/2024/aaai/Automated Assessment of Fidelity and Interpretability: An Evaluation Framework for Large Language Models' Explanations (Student Abstract) @@ -0,0 +1 @@ +As Large Language Models (LLMs) become more prevalent in various fields, it is crucial to rigorously assess the quality of their explanations. Our research introduces a task-agnostic framework for evaluating free-text rationales, drawing on insights from both linguistics and machine learning. We evaluate two dimensions of explainability: fidelity and interpretability. For fidelity, we propose methods suitable for proprietary LLMs where direct introspection of internal features is unattainable. For interpretability, we use language models instead of human evaluators, addressing concerns about subjectivity and scalability in evaluations. We apply our framework to evaluate GPT-3.5 and the impact of prompts on the quality of its explanations. In conclusion, our framework streamlines the evaluation of explanations from LLMs, promoting the development of safer models. 
\ No newline at end of file diff --git a/data/2024/aaai/Automated Defect Report Generation for Enhanced Industrial Quality Control b/data/2024/aaai/Automated Defect Report Generation for Enhanced Industrial Quality Control new file mode 100644 index 0000000000..3147059739 --- /dev/null +++ b/data/2024/aaai/Automated Defect Report Generation for Enhanced Industrial Quality Control @@ -0,0 +1 @@ +Defect detection is a pivotal aspect of ensuring product quality and production efficiency in industrial manufacturing. Existing studies on defect detection predominantly focus on locating defects through bounding boxes and classifying defect types. However, their methods can only provide limited information and fail to meet the requirements for further processing after detecting defects. To this end, we propose a novel task called defect detection report generation, which aims to provide more comprehensive and informative insights into detected defects in the form of text reports. For this task, we propose new datasets that cover 16 different materials, where each defect is accompanied by a detailed human-written report. In addition, we propose a knowledge-aware report generation model as a baseline for future research, which aims to incorporate additional knowledge to generate detailed analysis and subsequent processing related to defects in images. By constructing defect report datasets and proposing corresponding baselines, we chart new directions for future research and practical applications of this task. \ No newline at end of file diff --git a/data/2024/aaai/Automated Design of Affine Maximizer Mechanisms in Dynamic Settings b/data/2024/aaai/Automated Design of Affine Maximizer Mechanisms in Dynamic Settings new file mode 100644 index 0000000000..b4f09becb4 --- /dev/null +++ b/data/2024/aaai/Automated Design of Affine Maximizer Mechanisms in Dynamic Settings @@ -0,0 +1,4 @@ +Dynamic mechanism design is a challenging extension to ordinary mechanism design in which the mechanism designer must make a sequence of decisions over time in the face of possibly untruthful reports of participating agents. +Optimizing dynamic mechanisms for welfare is relatively well understood. However, there has been less work on optimizing for other goals (e.g., revenue), and without restrictive assumptions on valuations, it is remarkably challenging to characterize good mechanisms. Instead, we turn to automated mechanism design to find mechanisms with good performance in specific problem instances. +We extend the class of affine maximizer mechanisms to MDPs where agents may untruthfully report their rewards. This extension results in a challenging bilevel optimization problem in which the upper problem involves choosing optimal mechanism parameters, and the lower problem involves solving the resulting MDP. +Our approach can find truthful dynamic mechanisms that achieve strong performance on goals other than welfare, and can be applied to essentially any problem setting---without restrictions on valuations---for which RL can learn optimal policies.
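For readers unfamiliar with the term, an affine maximizer in the static setting is the standard weighted-VCG generalization written below; the paper learns analogous weights and boosts for the dynamic (MDP) setting, so the exact parameterization there may differ.

o^{*}(\hat{v}) \in \arg\max_{o \in O} \sum_{i} w_i \hat{v}_i(o) + \lambda(o),
\qquad
p_i(\hat{v}) = \frac{1}{w_i}\Big[ \max_{o \in O} \Big( \sum_{j \neq i} w_j \hat{v}_j(o) + \lambda(o) \Big) - \Big( \sum_{j \neq i} w_j \hat{v}_j(o^{*}) + \lambda(o^{*}) \Big) \Big]

where w_i > 0 are agent weights, \lambda(o) are outcome boosts, and \hat{v} are the (possibly untruthful) reports; with w_i = 1 and \lambda = 0 this reduces to the familiar VCG mechanism.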
\ No newline at end of file diff --git a/data/2024/aaai/Automated Natural Language Explanation of Deep Visual Neurons with Large Models (Student Abstract) b/data/2024/aaai/Automated Natural Language Explanation of Deep Visual Neurons with Large Models (Student Abstract) new file mode 100644 index 0000000000..64b4fc22c1 --- /dev/null +++ b/data/2024/aaai/Automated Natural Language Explanation of Deep Visual Neurons with Large Models (Student Abstract) @@ -0,0 +1 @@ +Interpreting deep neural networks through examining neurons offers distinct advantages when it comes to exploring the inner workings of Deep Neural Networks. Previous research has indicated that specific neurons within deep vision networks possess semantic meaning and play pivotal roles in model performance. Nonetheless, the current methods for generating neuron semantics heavily rely on human intervention, which hampers their scalability and applicability. To address this limitation, this paper proposes a novel post-hoc framework for generating semantic explanations of neurons with large foundation models, without requiring human intervention or prior knowledge. Experiments are conducted with both qualitative and quantitative analysis to verify the effectiveness of our proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Automated State Estimation for Summarizing the Dynamics of Complex Urban Systems Using Representation Learning b/data/2024/aaai/Automated State Estimation for Summarizing the Dynamics of Complex Urban Systems Using Representation Learning new file mode 100644 index 0000000000..8d731b1629 --- /dev/null +++ b/data/2024/aaai/Automated State Estimation for Summarizing the Dynamics of Complex Urban Systems Using Representation Learning @@ -0,0 +1,4 @@ +Complex urban systems can be difficult to monitor, diagnose and manage because the complete states of such systems are only partially observable with sensors. State estimation techniques can be used to determine the +underlying dynamic behavior of such complex systems with their highly non-linear processes and external time-variant influences. +States can be estimated by clustering observed sensor readings. However, +clustering performance degrades as the number of sensors and readings (i.e. feature dimension) increases. To address this problem, we propose a framework that learns a feature-centric lower dimensional representation of data for clustering to support analysis of system dynamics. We propose Unsupervised Feature Attention with Compact Representation (UFACR) to rank features contributing to a cluster assignment. These weighted features are then used to learn a reduced-dimension temporal representation of the data with a deep-learning model. The resulting low-dimensional representation can be effectively clustered into states. UFACR is evaluated on real-world and synthetic wastewater treatment plant data sets, and feature ranking outcomes were validated by Wastewater treatment domain experts. Our quantitative and qualitative experimental analyses demonstrate the effectiveness of UFACR for uncovering system dynamics in an automated and unsupervised manner to offer guidance to wastewater engineers to enhance industrial productivity and treatment efficiency. 
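As a rough illustration of the pipeline described above, the sketch below gates each sensor feature with a learnable attention weight, compresses the gated readings with a small autoencoder, and clusters the latent codes into states. The real UFACR model is temporal and considerably richer; the layer sizes, gating form, and clustering choice here are all assumptions.

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class GatedAutoencoder(nn.Module):
    def __init__(self, n_features, latent_dim=8):
        super().__init__()
        self.feature_logits = nn.Parameter(torch.zeros(n_features))  # per-feature attention scores
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))

    def forward(self, x):
        gate = torch.softmax(self.feature_logits, dim=0)  # normalized feature importances
        z = self.encoder(x * gate)
        return self.decoder(z), z

def estimate_states(readings, n_states=4, epochs=200, lr=1e-3):
    # readings: (num_samples, n_features) float tensor of sensor data
    model = GatedAutoencoder(readings.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, _ = model(readings)
        loss = nn.functional.mse_loss(recon, readings)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        _, z = model(readings)
    return KMeans(n_clusters=n_states, n_init=10).fit_predict(z.numpy())

Inspecting the softmax of feature_logits after training would give the kind of feature ranking the abstract refers to.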
\ No newline at end of file diff --git a/data/2024/aaai/Automatic Core-Guided Reformulation via Constraint Explanation and Condition Learning b/data/2024/aaai/Automatic Core-Guided Reformulation via Constraint Explanation and Condition Learning new file mode 100644 index 0000000000..59b67de4f7 --- /dev/null +++ b/data/2024/aaai/Automatic Core-Guided Reformulation via Constraint Explanation and Condition Learning @@ -0,0 +1,10 @@ +SAT and propagation solvers often underperform for optimisation models whose objective sums many single-variable terms. +MaxSAT solvers avoid this by detecting and exploiting cores: subsets of these terms that cannot collectively take their lower bounds. +Previous work has shown manual analysis of cores can help define model reformulations likely to speed up solving for many model instances. +This paper presents a method to automate this process. +For each selected core the method identifies the instance constraints that caused it; +infers the model constraints and parameters that explain how these instance constraints were formed; +and learns the conditions that made those model constraint instances generate cores, while others did not. +It then uses this information to reformulate the objective. +The empirical evaluation shows this method can produce useful reformulations. +Importantly, the method can be useful in many other situations that require explaining a set of constraints. \ No newline at end of file diff --git a/data/2024/aaai/Automatic Interpretation of Line Probe Assay Test for Tuberculosis b/data/2024/aaai/Automatic Interpretation of Line Probe Assay Test for Tuberculosis new file mode 100644 index 0000000000..6928e1e916 --- /dev/null +++ b/data/2024/aaai/Automatic Interpretation of Line Probe Assay Test for Tuberculosis @@ -0,0 +1 @@ +Line Probe Assay (LPA) is a widely used method for diagnosing drug-resistant tuberculosis (DRTB), but it is a time-consuming and labor-intensive process that requires expert interpretation. DRTB is a significant threat to global TB control efforts and its prompt diagnosis is critical for initiating appropriate treatment. In this paper, we present an automated LPA test interpretation solution that uses computer vision techniques to extract and analyze strips from LPA sheets and uses machine learning algorithms to produce drug sensitivity and resistivity outcomes with extremely high precision and recall. We also develop OCR models to eliminate manual data entry to further reduce the overall time. Our solution comprises a rejection module that flags ambiguous and novel samples that are then referred to experienced lab technicians. This results in increased trust in the solution. To evaluate our solution, we curate an extensive and diverse dataset of LPA strips annotated by multiple microbiologists across India. Our solution achieves more than 95% accuracy for all drugs on this dataset. The proposed solution has the potential to increase the efficiency, standardization of LPA test interpretation, and fast-tracking the dissemination of results to end-users via a designated Management Information System (MIS). 
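The rejection module mentioned above can be approximated, for intuition only, by a confidence-based triage rule: strips whose top class probability or top-2 margin falls below a threshold are referred to a technician instead of being auto-reported. The thresholds and the softmax-confidence criterion are assumptions; the deployed system may use a different ambiguity or novelty test.

import numpy as np

def triage(probabilities, min_confidence=0.9, min_margin=0.2):
    # probabilities: (num_strips, num_classes) softmax outputs of the strip classifier.
    # Returns indices of strips to auto-report and indices to refer to an expert.
    top2 = np.sort(probabilities, axis=1)[:, -2:]
    confident = (top2[:, 1] >= min_confidence) & (top2[:, 1] - top2[:, 0] >= min_margin)
    idx = np.arange(len(probabilities))
    return idx[confident], idx[~confident]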
\ No newline at end of file diff --git a/data/2024/aaai/Automatic Radiology Reports Generation via Memory Alignment Network b/data/2024/aaai/Automatic Radiology Reports Generation via Memory Alignment Network new file mode 100644 index 0000000000..a87e84227b --- /dev/null +++ b/data/2024/aaai/Automatic Radiology Reports Generation via Memory Alignment Network @@ -0,0 +1 @@ +The automatic generation of radiology reports is of great significance, which can reduce the workload of doctors and improve the accuracy and reliability of medical diagnosis and treatment, and has attracted wide attention in recent years. Cross-modal mapping between images and text, a key component of generating high-quality reports, is challenging due to the lack of corresponding annotations. Despite its importance, previous studies have often overlooked it or lacked adequate designs for this crucial component. In this paper, we propose a method with memory alignment embedding to assist the model in aligning visual and textual features to generate a coherent and informative report. Specifically, we first get the memory alignment embedding by querying the memory matrix, where the query is derived from a combination of the visual features and their corresponding positional embeddings. Then the alignment between the visual and textual features can be guided by the memory alignment embedding during the generation process. The comparison experiments with other alignment methods show that the proposed alignment method is less costly and more effective. The proposed approach achieves better performance than state-of-the-art approaches on two public datasets IU X-Ray and MIMIC-CXR, which further demonstrates the effectiveness of the proposed alignment method. \ No newline at end of file diff --git a/data/2024/aaai/Automatic Short Answer Grading for Finnish with ChatGPT b/data/2024/aaai/Automatic Short Answer Grading for Finnish with ChatGPT new file mode 100644 index 0000000000..70c1ec0176 --- /dev/null +++ b/data/2024/aaai/Automatic Short Answer Grading for Finnish with ChatGPT @@ -0,0 +1 @@ +Automatic short answer grading (ASAG) seeks to mitigate the burden on teachers by leveraging computational methods to evaluate student-constructed text responses. Large language models (LLMs) have recently gained prominence across diverse applications, with educational contexts being no exception. The sudden rise of ChatGPT has raised expectations that LLMs can handle numerous tasks, including ASAG. This paper aims to shed some light on this expectation by evaluating two LLM-based chatbots, namely ChatGPT built on GPT-3.5 and GPT-4, on scoring short-question answers under zero-shot and one-shot settings. Our data consists of 2000 student answers in Finnish from ten undergraduate courses. Multiple perspectives are taken into account during this assessment, encompassing those of grading system developers, teachers, and students. On our dataset, GPT-4 achieves a good QWK score (0.6+) in 44% of one-shot settings, clearly outperforming GPT-3.5 at 21%. We observe a negative association between student answer length and model performance, as well as a correlation between a smaller standard deviation among a set of predictions and lower performance. We conclude that while GPT-4 exhibits signs of being a capable grader, additional research is essential before considering its deployment as a reliable autograder. 
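The QWK figures quoted above are quadratic weighted kappa scores; for reference, they can be reproduced with scikit-learn as in the minimal sketch below (the grade vectors are made-up examples, not data from the paper).

from sklearn.metrics import cohen_kappa_score

teacher_grades = [0, 2, 3, 1, 4, 2, 0, 3]
model_grades = [0, 2, 2, 1, 4, 3, 1, 3]
qwk = cohen_kappa_score(teacher_grades, model_grades, weights="quadratic")
print(f"QWK = {qwk:.2f}")  # values above roughly 0.6 are often read as good agreement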
\ No newline at end of file diff --git a/data/2024/aaai/Automatically Testing Functional Properties of Code Translation Models b/data/2024/aaai/Automatically Testing Functional Properties of Code Translation Models new file mode 100644 index 0000000000..34862f3c52 --- /dev/null +++ b/data/2024/aaai/Automatically Testing Functional Properties of Code Translation Models @@ -0,0 +1 @@ +Large language models are becoming increasingly practical for translating code across programming languages, a process known as transpiling. Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations. \ No newline at end of file diff --git a/data/2024/aaai/Autonomous Policy Explanations for Effective Human-Machine Teaming b/data/2024/aaai/Autonomous Policy Explanations for Effective Human-Machine Teaming new file mode 100644 index 0000000000..3db650eede --- /dev/null +++ b/data/2024/aaai/Autonomous Policy Explanations for Effective Human-Machine Teaming @@ -0,0 +1 @@ +Policy explanation, a process for describing the behavior of an autonomous system, plays a crucial role in effectively conveying an agent's decision-making rationale to human collaborators and is essential for safe real-world deployments. It becomes even more critical in effective human-robot teaming, where good communication allows teams to adapt and improvise successfully during uncertain situations by enabling value alignment within the teams. This thesis proposal focuses on improving human-machine teaming by developing novel human-centered explainable AI (xAI) techniques that empower autonomous agents to communicate their capabilities and limitations via multiple modalities, teach and influence human teammates' behavior as decision-support systems, and effectively build and manage trust in HRI systems. 
\ No newline at end of file diff --git a/data/2024/aaai/Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation b/data/2024/aaai/Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation new file mode 100644 index 0000000000..9f41ae0635 --- /dev/null +++ b/data/2024/aaai/Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation @@ -0,0 +1 @@ +A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been an increasing interest in synthesising 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, for providing immersive experiences in various scenarios such as virtual reality. Yet, existing methods typically fall short in synthesizing intricate visual details or ensuring that the generated images align consistently with user-provided prompts. In this study, autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by outpainting an incomplete 360-degree image progressively with NFoV and text guidances jointly or individually. This autoregressive scheme not only allows for deriving finer-grained and text-consistent patterns by dynamically generating and adjusting the process but also offers users greater flexibility to edit their conditions throughout the generation process. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance in each autoregressive step. Text guidances, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention based transformers into a global stream and a local stream that feed into a conditioned generative backbone model. As AOG-Net can leverage large-scale models for the conditional encoder and the generative prior, it enables the generation to use extensive open-vocabulary text guidances. Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method. Our code is available at https://github.com/zhuqiangLu/AOG-NET-360. \ No newline at end of file diff --git a/data/2024/aaai/AvatarVerse: High-Quality & Stable 3D Avatar Creation from Text and Pose b/data/2024/aaai/AvatarVerse: High-Quality & Stable 3D Avatar Creation from Text and Pose new file mode 100644 index 0000000000..04d4388d99 --- /dev/null +++ b/data/2024/aaai/AvatarVerse: High-Quality & Stable 3D Avatar Creation from Text and Pose @@ -0,0 +1 @@ +Creating expressive, diverse and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task, due to the intricacy of modeling and texturing in 3D needed to ensure details and various styles (realistic, fictional, etc.). We present AvatarVerse, a stable pipeline for generating expressive high-quality 3D avatars from nothing but text descriptions and pose guidance. Specifically, we introduce a 2D diffusion model conditioned on DensePose signal to establish 3D pose control of avatars through 2D images, which enhances view consistency from partially observed scenarios. It addresses the infamous Janus Problem and significantly stabilizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy, which obtains substantial improvement over the quality of the created 3D avatars.
To this end, the proposed AvatarVerse pipeline achieves zero-shot 3D modeling of 3D avatars that are not only more expressive but also of higher quality and fidelity than previous works. Rigorous qualitative evaluations and user studies showcase AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, leading to a new standard in high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io/ . \ No newline at end of file diff --git a/data/2024/aaai/Axiomatic Aggregations of Abductive Explanations b/data/2024/aaai/Axiomatic Aggregations of Abductive Explanations new file mode 100644 index 0000000000..ebddbfeaf2 --- /dev/null +++ b/data/2024/aaai/Axiomatic Aggregations of Abductive Explanations @@ -0,0 +1 @@ +The recent criticisms of the robustness of post hoc model approximation explanation methods (like LIME and SHAP) have led to the rise of model-precise abductive explanations. For each data point, abductive explanations provide a minimal subset of features that are sufficient to generate the outcome. While theoretically sound and rigorous, abductive explanations suffer from a major issue --- there can be several valid abductive explanations for the same data point. In such cases, providing a single abductive explanation can be insufficient; on the other hand, providing all valid abductive explanations can be incomprehensible due to their size. In this work, we solve this issue by aggregating the many possible abductive explanations into feature importance scores. We propose three aggregation methods: two based on power indices from cooperative game theory and a third based on a well-known measure of causal strength. We characterize these three methods axiomatically, showing that each of them uniquely satisfies a set of desirable properties. We also evaluate them on multiple datasets and show that these explanations are robust to the attacks that fool SHAP and LIME. \ No newline at end of file diff --git a/data/2024/aaai/B-spine: Learning B-spline Curve Representation for Robust and Interpretable Spinal Curvature Estimation b/data/2024/aaai/B-spine: Learning B-spline Curve Representation for Robust and Interpretable Spinal Curvature Estimation new file mode 100644 index 0000000000..278d529de9 --- /dev/null +++ b/data/2024/aaai/B-spine: Learning B-spline Curve Representation for Robust and Interpretable Spinal Curvature Estimation @@ -0,0 +1 @@ +Spinal curvature estimation is important to the diagnosis and treatment of scoliosis. Existing methods face several issues such as the need for expensive annotations on the vertebral landmarks and being sensitive to the image quality. It is challenging to achieve robust estimation and obtain interpretable results, especially for low-quality images which are blurry and hazy. In this paper, we propose B-Spine, a novel deep learning pipeline to learn B-spline curve representation of the spine and estimate the Cobb angles for spinal curvature estimation from low-quality X-ray images. Given a low-quality input, a novel SegRefine network which employs the unpaired image-to-image translation is proposed to generate a high-quality spine mask from the initial segmentation result. Next, a novel mask-based B-spline prediction model is proposed to predict the B-spline curve for the spine centerline. Finally, the Cobb angles are estimated by a hybrid approach which combines the curve slope analysis and a curve-based regression model.
We conduct quantitative and qualitative comparisons with the representative and SOTA learning-based methods on the public AASCE2019 dataset and our new proposed JLU-CJUH dataset which contains more challenging low-quality images. The superior performance on both datasets shows our method can achieve both robustness and interpretability for spinal curvature estimation. \ No newline at end of file diff --git a/data/2024/aaai/BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving b/data/2024/aaai/BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving new file mode 100644 index 0000000000..d750b9c9fb --- /dev/null +++ b/data/2024/aaai/BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving @@ -0,0 +1,16 @@ +Artificial Intelligence for Theorem Proving (AITP) has given +rise to a plethora of benchmarks and methodologies, particularly in Interactive Theorem Proving (ITP). Research in the +area is fragmented, with a diverse set of approaches being +spread across several ITP systems. This presents a significant challenge to the comparison of methods, which are often +complex and difficult to replicate. +Addressing this, we present BAIT, a framework for the fair +and streamlined comparison of learning approaches in ITP. +We demonstrate BAIT’s capabilities with an in-depth comparison, across several ITP benchmarks, of state-of-the-art +architectures applied to the problem of formula embedding. +We find that Structure Aware Transformers perform particularly well, improving on techniques associated with the original problem sets. BAIT also allows us to assess the end-to-end proving performance of systems built on interactive +environments. This unified perspective reveals a novel end-to-end system that improves on prior work. We also provide +a qualitative analysis, illustrating that improved performance +is associated with more semantically-aware embeddings. By +streamlining the implementation and comparison of Machine +Learning algorithms in the ITP context, we anticipate BAIT +will be a springboard for future research. \ No newline at end of file diff --git a/data/2024/aaai/BAND: Biomedical Alert News Dataset b/data/2024/aaai/BAND: Biomedical Alert News Dataset new file mode 100644 index 0000000000..fe821f47a9 --- /dev/null +++ b/data/2024/aaai/BAND: Biomedical Alert News Dataset @@ -0,0 +1 @@ +Infectious disease outbreaks continue to pose a significant threat to human health and well-being. To improve disease surveillance and understanding of disease spread, several surveillance systems have been developed to monitor daily news alerts and social media. However, existing systems lack thorough epidemiological analysis in relation to corresponding alerts or news, largely due to the scarcity of well-annotated reports data. To address this gap, we introduce the Biomedical Alert News Dataset (BAND), which includes 1,508 samples from existing reported news articles, open emails, and alerts, as well as 30 epidemiology-related questions. These questions necessitate the model's expert reasoning abilities, thereby offering valuable insights into the outbreak of the disease. The BAND dataset brings new challenges to the NLP world, requiring better inference capability of the content and the ability to infer important information. 
We provide several benchmark tasks, including Named Entity Recognition (NER), Question Answering (QA), and Event Extraction (EE), to demonstrate existing models' capabilities and limitations in handling epidemiology-specific tasks. It is worth noting that some models may lack the human-like inference capability required to fully utilize the corpus. To the best of our knowledge, the BAND corpus is the largest corpus of well-annotated biomedical outbreak alert news with elaborately designed questions, making it a valuable resource for epidemiologists and NLP researchers alike. \ No newline at end of file diff --git a/data/2024/aaai/BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion b/data/2024/aaai/BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion new file mode 100644 index 0000000000..b284d48422 --- /dev/null +++ b/data/2024/aaai/BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion @@ -0,0 +1 @@ +Image editing approaches with diffusion models have been rapidly developed, yet their applicability is subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple conditions (e.g., mask, sketch, caption), and time-consuming fine-tuning of diffusion models. To alleviate these limitations and realize efficient real image editing, we propose a novel editing technique that only requires an input image and target text for various editing types, including non-rigid edits, without fine-tuning the diffusion model. Our method introduces three novelties: (I) A Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding to achieve fast image reconstruction without an image caption and to accelerate convergence. (II) A Progressive Transition Scheme applies progressive linear interpolation between the target text embedding and its fine-tuned version to generate a transition embedding that maintains non-rigid editing capability. (III) A Balanced Attention Module (BAM) balances the tradeoff between textual description and image semantics. By combining the self-attention map from the reconstruction process with the cross-attention map from the transition process, the guidance of target text embeddings in the diffusion process is optimized. To demonstrate the editing capability, effectiveness and efficiency of the proposed BARET, we conducted extensive qualitative and quantitative experiments. Moreover, results derived from a user study and an ablation study further demonstrate its superiority over other methods. \ No newline at end of file diff --git a/data/2024/aaai/BBScore: A Brownian Bridge Based Metric for Assessing Text Coherence b/data/2024/aaai/BBScore: A Brownian Bridge Based Metric for Assessing Text Coherence new file mode 100644 index 0000000000..f086dbe5c9 --- /dev/null +++ b/data/2024/aaai/BBScore: A Brownian Bridge Based Metric for Assessing Text Coherence @@ -0,0 +1,3 @@ +Measuring the coherence of text is a vital aspect of evaluating the quality of written content. Recent advancements in neural coherence modeling have demonstrated their efficacy in capturing entity coreference and discourse relations, thereby enhancing coherence evaluation. However, many existing methods heavily depend on static embeddings or focus narrowly on nearby context, constraining their capacity to measure the overarching coherence of long texts.
+In this paper, we posit that coherent texts inherently manifest a sequential and cohesive interplay among sentences, effectively conveying the central theme, purpose, or standpoint. To explore this abstract relationship, we introduce the "BB Score," a novel reference-free metric grounded in Brownian bridge theory for assessing text coherence. Our findings showcase that when synergized with a simple additional classification component, this metric attains a performance level comparable to state-of-the-art techniques on standard artificial discrimination tasks. +We also establish in downstream tasks that this metric effectively differentiates between human-written documents and text generated by large language models within specific domains. Furthermore, we illustrate the efficacy of this approach in detecting written styles attributed to various large language models, underscoring its potential for generalizability. In summary, we present a novel Brownian bridge coherence metric capable of measuring both local and global text coherence, while circumventing the need for end-to-end model training. This flexibility allows for its application in various downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning b/data/2024/aaai/BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning new file mode 100644 index 0000000000..9b72b62e10 --- /dev/null +++ b/data/2024/aaai/BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning @@ -0,0 +1 @@ +Correspondence pruning aims to establish reliable correspondences between two related images and recover relative camera motion. Existing approaches often employ a progressive strategy to handle the local and global contexts, with a prominent emphasis on transitioning from local to global, resulting in the neglect of interactions between different contexts. To tackle this issue, we propose a parallel context learning strategy that involves acquiring bilateral consensus for the two-view correspondence pruning task. In our approach, we design a distinctive self-attention block to capture global context and process it in parallel with the established local context learning module, which enables us to simultaneously capture both local and global consensuses. By combining these local and global consensuses, we derive the required bilateral consensus. We also design a recalibration block, reducing the influence of erroneous consensus information and enhancing the robustness of the model. The culmination of our efforts is the Bilateral Consensus Learning Network (BCLNet), which efficiently estimates camera pose and identifies inliers (true correspondences). Extensive experimental results demonstrate that our network not only surpasses state-of-the-art methods on benchmark datasets but also showcases robust generalization abilities across various feature extraction techniques. Notably, BCLNet obtains significant gains over the second-best method on the unknown outdoor dataset and markedly accelerates model training.
\ No newline at end of file diff --git a/data/2024/aaai/BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind b/data/2024/aaai/BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind new file mode 100644 index 0000000000..c438069355 --- /dev/null +++ b/data/2024/aaai/BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind @@ -0,0 +1 @@ +As a foundational component of cognitive intelligence, theory of mind (ToM) can make AI more closely resemble human thought processes, thereby enhancing its interaction and collaboration with humans. In particular, it can significantly improve a model's comprehension of videos in complex scenes. However, current video question answering (VideoQA) datasets focus on studying causal reasoning within events, with few of them genuinely incorporating human ToM. Consequently, there is a lack of development in ToM reasoning tasks within the area of VideoQA. This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM. BDIQA is inspired by the cognitive development of children's ToM and addresses the current deficiencies in machine ToM within datasets and tasks. Specifically, it offers tasks at two difficulty levels, assessing Belief, Desire and Intention (BDI) reasoning in both simple and complex scenarios. We conduct evaluations on several mainstream VideoQA methods and diagnose their capabilities with zero-shot, few-shot and supervised learning. We find that the performance of pre-trained models on cognitive reasoning tasks remains unsatisfactory. To counter this challenge, we undertake thorough analysis and experimentation, ultimately presenting two guidelines, derived from ablation analysis, for enhancing cognitive reasoning. \ No newline at end of file diff --git a/data/2024/aaai/BERTground: A Transformer-Based Model of Background Spectra on the ISS-Based NICER Space Telescope b/data/2024/aaai/BERTground: A Transformer-Based Model of Background Spectra on the ISS-Based NICER Space Telescope new file mode 100644 index 0000000000..b7de4b8773 --- /dev/null +++ b/data/2024/aaai/BERTground: A Transformer-Based Model of Background Spectra on the ISS-Based NICER Space Telescope @@ -0,0 +1 @@ +The Neutron star Interior Composition Explorer (NICER) is an International Space Station (ISS)-based space telescope developed by NASA and devoted to the study of high-energy X-ray sources in the universe, including but not limited to neutron stars, pulsars, and black holes in stellar systems and active galactic nuclei (AGN). One prominent problem with NICER observations is the highly variable background spectra, which obscure the actual signals of astrophysical sources and negatively affect scientific analysis of the targets. Therefore, obtaining accurate estimations of the background spectra is crucial for filtering the noise and facilitating better scientific discoveries of new astronomical objects. In this paper, we propose the very first Deep Neural Network architecture to model the NICER background spectra variation using information about the spacecraft and telescope associated with each observation. In particular, we develop a BERT-based architecture with tokenizers applied to different groups of features in our tabular dataset. We also introduce an adapted Tabular Deep Residual Network architecture as the predictor following the Transformer modules in our network.
We show that our model outperforms the current state-of-the-art background model developed by the NICER team on most evaluation metrics. Finally, we discuss pathways and future work for the deployment of this model in NASA’s next versions of the HEASARC software packages. \ No newline at end of file diff --git a/data/2024/aaai/BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios b/data/2024/aaai/BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios new file mode 100644 index 0000000000..ea788f9cc7 --- /dev/null +++ b/data/2024/aaai/BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios @@ -0,0 +1 @@ +Existing LiDAR-based 3D object detection methods for autonomous driving scenarios mainly adopt the training-from-scratch paradigm. Unfortunately, this paradigm heavily relies on large-scale labeled data, whose collection can be expensive and time-consuming. Self-supervised pre-training is an effective and desirable way to alleviate this dependence on extensive annotated data. In this work, we present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving. Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder to learn feature representations from a BEV perspective and avoid complex decoder design during pre-training. Furthermore, we introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder with fine-tuning for masked point cloud inputs. Based on the property of outdoor point clouds in autonomous driving scenarios, i.e., that the point clouds of distant objects are sparser, we propose point density prediction to enable the 3D encoder to learn location information, which is essential for object detection. Experimental results show that BEV-MAE surpasses prior state-of-the-art self-supervised methods and achieves favorable pre-training efficiency. Furthermore, based on TransFusion-L, BEV-MAE achieves new state-of-the-art LiDAR-based 3D object detection results, with 73.6 NDS and 69.6 mAP on the nuScenes benchmark. The source code will be released at https://github.com/VDIGPKU/BEV-MAE. \ No newline at end of file diff --git a/data/2024/aaai/BLADE: Box-Level Supervised Amodal Segmentation through Directed Expansion b/data/2024/aaai/BLADE: Box-Level Supervised Amodal Segmentation through Directed Expansion new file mode 100644 index 0000000000..ea0c782953 --- /dev/null +++ b/data/2024/aaai/BLADE: Box-Level Supervised Amodal Segmentation through Directed Expansion @@ -0,0 +1 @@ +Perceiving the complete shape of occluded objects is essential for human and machine intelligence. While the amodal segmentation task is to predict the complete mask of partially occluded objects, it is time-consuming and labor-intensive to annotate the pixel-level ground truth amodal masks. Box-level supervised amodal segmentation addresses this challenge by relying solely on ground truth bounding boxes and instance classes as supervision, thereby alleviating the need for exhaustive pixel-level annotations. Nevertheless, current box-level methodologies suffer from notable limitations, producing low-resolution masks and imprecise boundaries and thus failing to meet the demands of practical real-world applications.
We present a novel solution to tackle this problem by introducing a directed expansion approach from visible masks to corresponding amodal masks. Our approach involves a hybrid end-to-end network based on the overlapping region, i.e., the area where different instances intersect. Diverse segmentation strategies are applied to overlapping regions and non-overlapping regions according to their distinct characteristics. To guide the expansion of visible masks, we introduce an elaborately-designed connectivity loss for overlapping regions, which leverages correlations with visible masks and facilitates accurate amodal segmentation. Experiments are conducted on several challenging datasets and the results show that our proposed method can outperform existing state-of-the-art methods by large margins. \ No newline at end of file diff --git a/data/2024/aaai/BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions b/data/2024/aaai/BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions new file mode 100644 index 0000000000..85797206aa --- /dev/null +++ b/data/2024/aaai/BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions @@ -0,0 +1 @@ +Vision Language Models (VLMs), which extend Large Language Models (LLMs) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited by the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model in capturing intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76% on the OCR-VQA benchmark) and in undertaking general (not particularly text-rich) VQA benchmarks (up to 7.9% on the Visual Spatial Reasoning benchmark), and achieves a 17.72% overall improvement in a comprehensive multimodal LLM benchmark (MME), compared to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 11 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.
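A minimal PyTorch-style sketch of the dual visual prompt idea described in the BLIVA abstract above: learned query embeddings and linearly projected patch embeddings are concatenated into one soft prompt for the LLM. The module names, dimensions and single-linear-projection design here are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class DualVisualPrompt(nn.Module):
    # Combines Q-Former-style query embeddings with raw encoder patch embeddings
    # into a single soft prompt for an LLM (hypothetical dimensions).
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.query_proj = nn.Linear(vis_dim, llm_dim)  # projects query embeddings
        self.patch_proj = nn.Linear(vis_dim, llm_dim)  # projects patch embeddings

    def forward(self, query_embeds, patch_embeds):
        # query_embeds: (B, num_queries, vis_dim); patch_embeds: (B, num_patches, vis_dim)
        soft_prompt = torch.cat(
            [self.query_proj(query_embeds), self.patch_proj(patch_embeds)], dim=1
        )
        return soft_prompt  # prepended to the text token embeddings fed to the LLM

In such a design the patch branch retains fine-grained spatial detail that a fixed number of query tokens may drop, which is the motivation the abstract gives for the extra projection.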
\ No newline at end of file diff --git a/data/2024/aaai/BLiRF: Bandlimited Radiance Fields for Dynamic Scene Modeling b/data/2024/aaai/BLiRF: Bandlimited Radiance Fields for Dynamic Scene Modeling new file mode 100644 index 0000000000..2a2606d530 --- /dev/null +++ b/data/2024/aaai/BLiRF: Bandlimited Radiance Fields for Dynamic Scene Modeling @@ -0,0 +1 @@ +Inferring the 3D structure of a non-rigid dynamic scene from a single moving camera is an under-constrained problem. Inspired by the remarkable progress of neural radiance fields (NeRFs) in photo-realistic novel view synthesis of static scenes, they have also been extended to dynamic settings. Such methods heavily rely on implicit neural priors to regularize the problem. In this work, we take a step back and investigate how current implementations may entail deleterious effects, including limited expressiveness, entanglement of light and density fields, and sub-optimal motion localization. Further, we devise a factorisation-based framework that represents the scene as a composition of bandlimited, high-dimensional signals. We demonstrate compelling results across complex dynamic scenes that involve changes in lighting, texture and long-range dynamics. \ No newline at end of file diff --git a/data/2024/aaai/BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining b/data/2024/aaai/BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining new file mode 100644 index 0000000000..a4bf7a7324 --- /dev/null +++ b/data/2024/aaai/BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining @@ -0,0 +1 @@ +The current research direction in generative models, such as the recently developed GPT4, aims to find relevant knowledge information for multimodal and multilingual inputs to provide answers. Under these research circumstances, the demand for multilingual evaluation of visual question answering (VQA) tasks, a representative task of multimodal systems, has increased. Accordingly, we propose a bilingual outside-knowledge VQA (BOK-VQA) dataset in this study that can be extended to a multilingual setting. The proposed data include 17K images, 17K question-answer pairs for both Korean and English and 280K instances of knowledge information related to question-answer content. We also present a framework that can effectively inject knowledge information into a VQA system by pretraining the knowledge information of BOK-VQA data in the form of graph embeddings. Finally, through in-depth analysis, we demonstrate the actual effect of the knowledge information contained in the constructed training data on VQA. \ No newline at end of file diff --git a/data/2024/aaai/BVT-IMA: Binary Vision Transformer with Information-Modified Attention b/data/2024/aaai/BVT-IMA: Binary Vision Transformer with Information-Modified Attention new file mode 100644 index 0000000000..d6022ef1fd --- /dev/null +++ b/data/2024/aaai/BVT-IMA: Binary Vision Transformer with Information-Modified Attention @@ -0,0 +1 @@ +As a compression method that can significantly reduce computation and memory costs, model binarization has been extensively studied for convolutional neural networks. However, the recently popular vision transformer models pose new challenges to such a technique, in which the binarized models suffer from serious performance drops.
In this paper, an attention shift is observed in the binary multi-head self-attention module, which can influence the information fusion between tokens and thus hurt the model performance. From the perspective of information theory, we find a correlation between attention scores and the information quantity, further indicating that a reason for such a phenomenon may be the loss of information quantity induced by the constant moduli of binarized tokens. Finally, we reveal the information quantity hidden in the attention maps of binary vision transformers and propose a simple approach that modifies the attention values with look-up information tables so as to improve model performance. Extensive experiments on CIFAR-100/TinyImageNet/ImageNet-1k demonstrate the effectiveness of the proposed information-modified attention on binary vision transformers. \ No newline at end of file diff --git a/data/2024/aaai/BaCon: Boosting Imbalanced Semi-supervised Learning via Balanced Feature-Level Contrastive Learning b/data/2024/aaai/BaCon: Boosting Imbalanced Semi-supervised Learning via Balanced Feature-Level Contrastive Learning new file mode 100644 index 0000000000..c161e3e1e8 --- /dev/null +++ b/data/2024/aaai/BaCon: Boosting Imbalanced Semi-supervised Learning via Balanced Feature-Level Contrastive Learning @@ -0,0 +1 @@ +Semi-supervised Learning (SSL) reduces the need for extensive annotations in deep learning, but the more realistic challenge of imbalanced data distribution in SSL remains largely unexplored. In Class Imbalanced Semi-supervised Learning (CISSL), the bias introduced by unreliable pseudo-labels can be exacerbated by imbalanced data distributions. Most existing methods address this issue at the instance level through reweighting or resampling, but their performance is heavily limited by their reliance on biased backbone representations. Some other methods do perform feature-level adjustments, such as feature blending, but might introduce unfavorable noise. In this paper, we discuss the benefit of a more balanced feature distribution for the CISSL problem, and further propose a Balanced Feature-Level Contrastive Learning method (BaCon). Our method directly regularizes the distribution of instances' representations in a well-designed contrastive manner. Specifically, class-wise feature centers are computed as the positive anchors, while negative anchors are selected by a straightforward yet effective mechanism. A distribution-related temperature adjustment is leveraged to control the class-wise contrastive degrees dynamically. Our method demonstrates its effectiveness through comprehensive experiments on the CIFAR10-LT, CIFAR100-LT, STL10-LT, and SVHN-LT datasets across various settings. For example, BaCon surpasses the instance-level method FixMatch-based ABC on CIFAR10-LT with a 1.21% accuracy improvement, and outperforms the state-of-the-art feature-level method CoSSL on CIFAR100-LT with a 0.63% accuracy improvement. Under more extreme degrees of imbalance, BaCon also shows better robustness than other methods.
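A minimal sketch of a feature-level contrastive loss with class-wise centers as positive anchors and a distribution-related temperature, loosely following the BaCon abstract above; the exact loss form, negative-anchor selection and temperature schedule shown here are assumptions, not the paper's definitions.

import torch
import torch.nn.functional as F

def balanced_center_contrast(features, labels, num_classes, base_tau=0.1):
    # features: (B, D) backbone representations; labels: (B,) class indices.
    features = F.normalize(features, dim=1)
    centers = torch.stack([
        features[labels == c].mean(dim=0) if (labels == c).any()
        else features.new_zeros(features.size(1))
        for c in range(num_classes)
    ])
    centers = F.normalize(centers, dim=1)
    counts = torch.bincount(labels, minlength=num_classes).clamp(min=1).float()
    tau = base_tau * (counts.max() / counts).sqrt()  # rarer class -> softer temperature (assumed schedule)
    logits = features @ centers.t() / tau[labels].unsqueeze(1)
    return F.cross_entropy(logits, labels)  # pulls each sample toward its class center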
\ No newline at end of file diff --git a/data/2024/aaai/Backdoor Adjustment via Group Adaptation for Debiased Coupon Recommendations b/data/2024/aaai/Backdoor Adjustment via Group Adaptation for Debiased Coupon Recommendations new file mode 100644 index 0000000000..69edaac9c7 --- /dev/null +++ b/data/2024/aaai/Backdoor Adjustment via Group Adaptation for Debiased Coupon Recommendations @@ -0,0 +1,2 @@ +Accurate prediction of coupon usage is crucial for promoting user consumption through targeted coupon recommendations. However, in real-world coupon recommendations, the coupon allocation process is not solely determined by the model trained on the historical interaction data but is also interfered with by marketing tactics intended to fulfill specific commercial goals. This interference creates an imbalance in the interactions, which causes the data to deviate from the user's natural preferences. We refer to this deviation as the matching bias. Such biased interaction data affects the efficacy of the model, and thus it is necessary to employ debiasing techniques to prevent any negative impact. +We investigate the mitigation of matching bias in coupon recommendations from a causal-effect perspective. By treating the attributes of users and coupons associated with marketing tactics as confounders, we find that the confounders open a backdoor path between user-coupon matching and conversion, which introduces spurious correlations. To remove this harmful effect, we propose a novel training paradigm named Backdoor Adjustment via Group Adaptation (BAGA) for debiased coupon recommendations, which performs intervened training and inference, i.e., separately modeling each user-coupon group pair. However, modeling all possible group pairs greatly increases the computational complexity and cost. To address the efficiency challenge, we further present a simple but effective dual-tower multi-task framework and leverage the Customized Gate Control (CGC) model architecture, which models each user and coupon group with a separate expert module. We instantiate BAGA on five representative models: FM, DNN, NCF, MASKNET, and DEEPFM, and conduct comprehensive offline and online experiments to demonstrate the efficacy of our proposed paradigm. \ No newline at end of file diff --git a/data/2024/aaai/Backdoor Attacks via Machine Unlearning b/data/2024/aaai/Backdoor Attacks via Machine Unlearning new file mode 100644 index 0000000000..9f618107a9 --- /dev/null +++ b/data/2024/aaai/Backdoor Attacks via Machine Unlearning @@ -0,0 +1 @@ +As a new paradigm to erase data from a model and protect user privacy, machine unlearning has drawn significant attention. However, existing studies on machine unlearning mainly focus on its effectiveness and efficiency, neglecting the security challenges introduced by this technique. In this paper, we aim to bridge this gap and study the possibility of conducting malicious attacks leveraging machine unlearning. Specifically, we consider the backdoor attack via machine unlearning, where an attacker seeks to inject a backdoor into the unlearned model by submitting malicious unlearning requests, so that the prediction made by the unlearned model can be changed when a particular trigger is present. In our study, we propose two attack approaches. The first attack approach does not require the attacker to poison any training data of the model. The attacker can achieve the attack goal only by requesting to unlearn a small subset of their contributed training data.
The second approach allows the attacker to poison a few training instances with a pre-defined trigger upfront, and then activate the attack via submitting a malicious unlearning request. Both attack approaches are proposed with the goal of maximizing the attack utility while ensuring attack stealthiness. The effectiveness of the proposed attacks is demonstrated with different machine unlearning algorithms as well as different models on different datasets. \ No newline at end of file diff --git a/data/2024/aaai/Backpropagation Through Agents b/data/2024/aaai/Backpropagation Through Agents new file mode 100644 index 0000000000..9614d816dc --- /dev/null +++ b/data/2024/aaai/Backpropagation Through Agents @@ -0,0 +1 @@ +A fundamental challenge in multi-agent reinforcement learning (MARL) is to learn the joint policy in an extremely large search space, which grows exponentially with the number of agents. Moreover, fully decentralized policy factorization significantly restricts the search space, which may lead to sub-optimal policies. In contrast, the auto-regressive joint policy can represent a much richer class of joint policies by factorizing the joint policy into the product of a series of conditional individual policies. While such factorization introduces the action dependency among agents explicitly in sequential execution, it does not take full advantage of the dependency during learning. In particular, the subsequent agents do not give the preceding agents feedback about their decisions. In this paper, we propose a new framework Back-Propagation Through Agents (BPTA) that directly accounts for both agents' own policy updates and the learning of their dependent counterparts. This is achieved by propagating the feedback through action chains. With the proposed framework, our Bidirectional Proximal Policy Optimisation (BPPO) outperforms the state-of-the-art methods. Extensive experiments on matrix games, StarCraftII v2, Multi-agent MuJoCo, and Google Research Football demonstrate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Backward Responsibility in Transition Systems Using General Power Indices b/data/2024/aaai/Backward Responsibility in Transition Systems Using General Power Indices new file mode 100644 index 0000000000..cba62a5d7a --- /dev/null +++ b/data/2024/aaai/Backward Responsibility in Transition Systems Using General Power Indices @@ -0,0 +1,5 @@ +To improve reliability and the understanding of AI systems, there is increasing interest in the use of formal methods, e.g. model checking. Model checking tools produce a counterexample when a model does not satisfy a property. Understanding these counterexamples is critical for efficient debugging, as it allows the developer to focus on the parts of the program that caused the issue. + +To this end, we present a new technique that ascribes a responsibility value to each state in a transition system that does not satisfy a given safety property. The value is higher if the non-deterministic choices in a state have more power to change the outcome, given the behaviour observed in the counterexample. For this, we employ a concept from cooperative game theory – namely general power indices, such as the Shapley value – to compute the responsibility of the states. + +We present an optimistic and pessimistic version of responsibility that differ in how they treat the states that do not lie on the counterexample. 
We give a characterisation of optimistic responsibility that leads to an efficient algorithm for it and show computational hardness of the pessimistic version. We also present a tool to compute responsibility and show how a stochastic algorithm can be used to approximate responsibility in larger models. These methods can be deployed in the design phase, at runtime and at inspection time to gain insights into causal relations within the behavior of AI systems. \ No newline at end of file diff --git a/data/2024/aaai/Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection b/data/2024/aaai/Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection new file mode 100644 index 0000000000..0e7ca8da75 --- /dev/null +++ b/data/2024/aaai/Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection @@ -0,0 +1 @@ +Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT 3.5 could generally expose fake news and provide desirable multi-perspective rationales but still underperforms the basic SLM, fine-tuned BERT. Our subsequent analysis attributes such a gap to the LLM's inability to select and integrate rationales properly to reach a conclusion. Based on these findings, we propose that current LLMs may not substitute for fine-tuned SLMs in fake news detection but can be a good advisor for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLMs' rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which serves cost-sensitive scenarios without querying LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods, including SLM-based, LLM-based, and combinations of small and large language models.
In contrast to the previous methods that utilize sample-agnostic trigger patterns, BadRL dynamically generates distinct trigger patterns based on targeted state observations, thereby enhancing its effectiveness. Theoretical analysis shows that the targeted backdoor attack is always viable and remains stealthy under specific assumptions. Empirical results on various classic RL tasks illustrate that BadRL can substantially degrade the performance of a victim agent with minimal poisoning efforts (0.003% of total training steps) during training and infrequent attacks during testing. Code is available at: https://github.com/7777777cc/code. \ No newline at end of file diff --git a/data/2024/aaai/BadSAM: Exploring Security Vulnerabilities of SAM via Backdoor Attacks (Student Abstract) b/data/2024/aaai/BadSAM: Exploring Security Vulnerabilities of SAM via Backdoor Attacks (Student Abstract) new file mode 100644 index 0000000000..35464a15a4 --- /dev/null +++ b/data/2024/aaai/BadSAM: Exploring Security Vulnerabilities of SAM via Backdoor Attacks (Student Abstract) @@ -0,0 +1 @@ +Image segmentation is foundational to computer vision applications, and the Segment Anything Model (SAM) has become a leading base model for these tasks. However, SAM falters in specialized downstream challenges, leading to various customized SAM models. We introduce BadSAM, a backdoor attack tailored for SAM, revealing that customized models can harbor malicious behaviors. Using the CAMO dataset, we confirm BadSAM's efficacy and identify SAM vulnerabilities. This study paves the way for the development of more secure and customizable vision foundation models. \ No newline at end of file diff --git a/data/2024/aaai/Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation b/data/2024/aaai/Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation new file mode 100644 index 0000000000..bd640ed7b1 --- /dev/null +++ b/data/2024/aaai/Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation @@ -0,0 +1 @@ +Ensuring the safety of Reinforcement Learning (RL) is crucial for its deployment in real-world applications. Nevertheless, managing the trade-off between reward and safety during exploration presents a significant challenge. Improving reward performance through policy adjustments may adversely affect safety performance. In this study, we aim to address this conflicting relation by leveraging the theory of gradient manipulation. Initially, we analyze the conflict between reward and safety gradients. Subsequently, we tackle the balance between reward and safety optimization by proposing a soft switching policy optimization method, for which we provide convergence analysis. Based on our theoretical examination, we provide a safe RL framework to overcome the aforementioned challenge, and we develop a Safety-MuJoCo Benchmark to assess the performance of safe RL algorithms. Finally, we evaluate the effectiveness of our method on the Safety-MuJoCo Benchmark and a popular safe benchmark, Omnisafe. Experimental results demonstrate that our algorithms outperform several state-of-the-art baselines in terms of balancing reward and safety optimization. 
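To make the gradient-manipulation perspective of the safe-RL abstract above concrete, here is a generic conflict-handling rule: project the reward gradient off the safety gradient when their inner product is negative, and prioritise safety when the constraint is violated. This is an illustrative recipe only; the paper's soft switching scheme may differ.

import torch

def combine_reward_safety_grads(g_reward, g_safety, safety_violated):
    # g_reward, g_safety: flattened policy gradients for the reward and safety objectives.
    if safety_violated:
        return g_safety                      # constraint violated: focus on restoring safety
    dot = torch.dot(g_reward, g_safety)
    if dot < 0:                              # conflicting update directions
        # remove the component of the reward gradient that opposes the safety gradient
        g_reward = g_reward - dot / (g_safety.norm() ** 2 + 1e-12) * g_safety
    return g_reward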
\ No newline at end of file diff --git a/data/2024/aaai/Balancing Humans and Machines: A Study on Integration Scale and Its Impact on Collaborative Performance b/data/2024/aaai/Balancing Humans and Machines: A Study on Integration Scale and Its Impact on Collaborative Performance new file mode 100644 index 0000000000..947a403da9 --- /dev/null +++ b/data/2024/aaai/Balancing Humans and Machines: A Study on Integration Scale and Its Impact on Collaborative Performance @@ -0,0 +1 @@ +In the evolving artificial intelligence domain, hybrid human-machine systems have emerged as a transformative research area. While many studies have concentrated on individual human-machine interactions, there is a lack of focus on multi-human and multi-machine dynamics. This paper delves into these nuances by introducing a novel statistical framework that discerns integration accuracy in terms of precision and diversity. Empirical studies reveal that performance surges consistently with scale, either in human or machine settings. However, hybrid systems present complexities. Their performance is intricately tied to the human-to-machine ratio. Interestingly, as the scale expands, integration performance growth isn't limitless. It reaches a threshold influenced by model diversity. This introduces a pivotal `knee point', signifying the optimal balance between performance and scale. This knowledge is vital for resource allocation in practical applications. Grounded in rigorous evaluations using public datasets, our findings emphasize the framework's robustness in refining integrated systems. \ No newline at end of file diff --git a/data/2024/aaai/Barely Supervised Learning for Graph-Based Fraud Detection b/data/2024/aaai/Barely Supervised Learning for Graph-Based Fraud Detection new file mode 100644 index 0000000000..448ae0902e --- /dev/null +++ b/data/2024/aaai/Barely Supervised Learning for Graph-Based Fraud Detection @@ -0,0 +1 @@ +In recent years, graph-based fraud detection methods have garnered increasing attention for their superior ability to tackle the issue of camouflage in fraudulent scenarios. However, these methods often rely on a substantial proportion of samples as the training set, disregarding the reality of scarce annotated samples in real-life scenarios. As a theoretical framework within semi-supervised learning, the principle of consistency regularization posits that unlabeled samples should be classified into the same category as their own perturbations. Inspired by this principle, this study incorporates unlabeled samples as an auxiliary during model training, designing a novel barely supervised learning method to address the challenge of limited annotated samples in fraud detection. Specifically, to tackle the issue of camouflage in fraudulent scenarios, we employ disentangled representation learning based on edge information for a small subset of annotated nodes. This approach partitions node features into three distinct components representing different connected edges, providing a foundation for the subsequent augmentation of unlabeled samples. For the unlabeled nodes used in auxiliary training, we apply both strong and weak augmentation and design regularization losses to enhance the detection performance of the model in the context of extremely limited labeled samples. Across five publicly available datasets, the proposed model showcases its superior detection capability over baseline models. 
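For the consistency-regularization principle cited in the fraud-detection abstract above, a minimal FixMatch-style sketch: pseudo-label a weakly augmented view and train the strongly augmented view to match it. The confidence threshold and exact loss form are assumptions for illustration, not the paper's design.

import torch
import torch.nn.functional as F

def consistency_loss(model, x_weak, x_strong, threshold=0.95):
    # x_weak / x_strong: weakly and strongly augmented views of the same unlabeled samples.
    with torch.no_grad():
        probs = F.softmax(model(x_weak), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()   # keep only confident pseudo-labels
    logits_strong = model(x_strong)
    return (F.cross_entropy(logits_strong, pseudo, reduction="none") * mask).mean()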
\ No newline at end of file diff --git a/data/2024/aaai/Batch Normalization Is Blind to the First and Second Derivatives of the Loss b/data/2024/aaai/Batch Normalization Is Blind to the First and Second Derivatives of the Loss new file mode 100644 index 0000000000..7be3fe3eea --- /dev/null +++ b/data/2024/aaai/Batch Normalization Is Blind to the First and Second Derivatives of the Loss @@ -0,0 +1 @@ +We prove that, in the Taylor series expansion of the loss function, the BN operation blocks the influence of the first-order term and most of the influence of the second-order term of the loss. We also find that such a problem is caused by the standardization phase of the BN operation. We believe that proving the blocking of certain loss terms provides an analytic perspective on potential defects of a deep model with BN operations, although the blocking problem is not fully equivalent to significant damage in all tasks on benchmark datasets. Experiments show that the BN operation significantly affects feature representations in specific tasks. \ No newline at end of file diff --git a/data/2024/aaai/Bayesian Inference with Complex Knowledge Graph Evidence b/data/2024/aaai/Bayesian Inference with Complex Knowledge Graph Evidence new file mode 100644 index 0000000000..87e4b0617e --- /dev/null +++ b/data/2024/aaai/Bayesian Inference with Complex Knowledge Graph Evidence @@ -0,0 +1 @@ +Knowledge Graphs (KGs) provide a widely used format for representing entities and their relationships and have found use in diverse applications including question answering and recommendation. A majority of current research on KG inference has focused on reasoning with atomic facts (triples) and has disregarded the possibility of making complex evidential observations involving logical operators (negation, conjunction, disjunction) and quantifiers (existential, universal). Further, while the application of complex evidence has been explored in KG-based query answering (KGQA) research, in many practical online settings, observations are made sequentially. For example, in KGQA, additional context may be incrementally suggested to narrow down the answer. Or, in interactive recommendation, user critiques may be expressed sequentially in order to narrow down a set of preferred items. Both settings are indicative of information filtering or tracking tasks that are reminiscent of belief tracking in Bayesian inference. In fact, in this paper, we precisely cast the problem of belief tracking over unknown KG entities given incremental complex KG evidence as a Bayesian filtering problem. Specifically, we leverage Knowledge-based Model Construction (KBMC) over the logical KG evidence to instantiate a Markov Random Field (MRF) likelihood representation to perform closed-form Bayesian inference with complex KG evidence (BIKG). We experimentally evaluate BIKG in incremental KGQA and interactive recommendation tasks, demonstrating that it outperforms non-incremental methodologies and leads to better incorporation of conjunctive evidence vs. existing complex KGQA methods like CQD that leverage fuzzy T-norm operators. Overall, this work demonstrates a novel, efficient, and unified perspective of logic, KGs, and online inference through the lens of closed-form BIKG.
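For reference, the Batch Normalization abstract above appeals to the standard second-order Taylor expansion of the loss around the current features; in generic form,

L(f + \Delta f) \approx L(f) + \nabla_f L^{\top} \Delta f + \tfrac{1}{2} \Delta f^{\top} H \Delta f, \qquad H = \nabla_f^{2} L,

where the expansion itself is standard, while the claim that the standardization phase of BN blocks the influence of the first-order term and most of the second-order term is the paper's result.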
\ No newline at end of file diff --git a/data/2024/aaai/Behavioral Recognition of Skeletal Data Based on Targeted Dual Fusion Strategy b/data/2024/aaai/Behavioral Recognition of Skeletal Data Based on Targeted Dual Fusion Strategy new file mode 100644 index 0000000000..2874c05bb5 --- /dev/null +++ b/data/2024/aaai/Behavioral Recognition of Skeletal Data Based on Targeted Dual Fusion Strategy @@ -0,0 +1 @@ +The deployment of a multi-stream fusion strategy for behavioral recognition from skeletal data can extract complementary features from different information streams and improve recognition accuracy, but it suffers from high model complexity and a large number of parameters. Besides, existing multi-stream methods using a fixed adjacency matrix homogenize the model’s discrimination process across diverse actions, reducing the actual gain of the multi-stream model. Finally, attention mechanisms are commonly applied to the multi-dimensional features, including the spatial, temporal and channel dimensions. But their attention scores are typically fused in a concatenated manner, leading to the neglect of the interrelation between joints in complex actions. To alleviate these issues, the Front-Rear dual Fusion Graph Convolutional Network (FRF-GCN) is proposed to provide a lightweight model based on skeletal data. Targeted adjacency matrices are also designed for different front fusion streams, allowing the model to focus on actions of varying magnitudes. Simultaneously, the mechanism of Spatial-Temporal-Channel Parallel Attention (STC-P), which processes attention in parallel and places greater emphasis on useful information, is proposed to further improve the model’s performance. FRF-GCN demonstrates significant competitiveness compared to the current state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120 and Kinetics-Skeleton 400 datasets. Our code is available at: https://github.com/sunbeam-kkt/FRF-GCN-master. \ No newline at end of file diff --git a/data/2024/aaai/BeliefFlow: A Framework for Logic-Based Belief Diffusion via Iterated Belief Change b/data/2024/aaai/BeliefFlow: A Framework for Logic-Based Belief Diffusion via Iterated Belief Change new file mode 100644 index 0000000000..899e0994da --- /dev/null +++ b/data/2024/aaai/BeliefFlow: A Framework for Logic-Based Belief Diffusion via Iterated Belief Change @@ -0,0 +1 @@ +This paper presents BeliefFlow, a novel framework for representing how logical beliefs spread among interacting agents within a network. In a Belief Flow Network (BFN), agents communicate asynchronously. The agents' beliefs are represented using epistemic states, which encompass their current beliefs and conditional beliefs guiding future changes. When communication occurs between two connected agents, the receiving agent changes its epistemic state using an improvement operator, a well-known type of rational iterated belief change operator that generalizes belief revision operators. We show that BFNs satisfy appealing properties, leading to two significant outcomes. First, in any BFN with strong network connectivity, the beliefs of all agents converge towards a global consensus. Second, within any BFN, we show that it is possible to compute an optimal strategy for influencing the global beliefs. This strategy, which involves controlling the beliefs of a minimal number of agents through bribery, can be identified from the topology of the network and can be computed in polynomial time.
\ No newline at end of file diff --git a/data/2024/aaai/Benchmarking Large Language Models in Retrieval-Augmented Generation b/data/2024/aaai/Benchmarking Large Language Models in Retrieval-Augmented Generation new file mode 100644 index 0000000000..61927d7990 --- /dev/null +++ b/data/2024/aaai/Benchmarking Large Language Models in Retrieval-Augmented Generation @@ -0,0 +1 @@ +Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs. \ No newline at end of file diff --git a/data/2024/aaai/Benchmarking Large Language Models on Controllable Generation under Diversified Instructions b/data/2024/aaai/Benchmarking Large Language Models on Controllable Generation under Diversified Instructions new file mode 100644 index 0000000000..178b02fad8 --- /dev/null +++ b/data/2024/aaai/Benchmarking Large Language Models on Controllable Generation under Diversified Instructions @@ -0,0 +1 @@ +While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of LLMs. To fill this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraint-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate on the candidate task taxonomy with even finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further developments. Different from existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time.
We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints and showing that there is still a significant gap between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval. \ No newline at end of file diff --git a/data/2024/aaai/BertRLFuzzer: A BERT and Reinforcement Learning Based Fuzzer (Student Abstract) b/data/2024/aaai/BertRLFuzzer: A BERT and Reinforcement Learning Based Fuzzer (Student Abstract) new file mode 100644 index 0000000000..c404f012cc --- /dev/null +++ b/data/2024/aaai/BertRLFuzzer: A BERT and Reinforcement Learning Based Fuzzer (Student Abstract) @@ -0,0 +1 @@ +We present a novel tool, BertRLFuzzer, a BERT and Reinforcement Learning (RL) based fuzzer aimed at finding security vulnerabilities in Web applications. BertRLFuzzer works as follows: given a set of seed inputs, the fuzzer performs grammar-adhering and attack-provoking mutation operations on them to generate candidate attack vectors. The key insight of BertRLFuzzer is the use of RL with a BERT model as an agent to guide the fuzzer to efficiently learn grammar-adhering and attack-provoking mutation operators. To establish the efficacy of BertRLFuzzer, we compare it against a total of 13 black-box and white-box fuzzers over a benchmark of 9 victim websites with over 16K LOC. We observed a significant improvement relative to the nearest competing tool in terms of time to first attack (54% less), new vulnerabilities found (17 new vulnerabilities), and attack rate (4.4% more attack vectors generated). \ No newline at end of file diff --git a/data/2024/aaai/Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling b/data/2024/aaai/Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling new file mode 100644 index 0000000000..86ac6b420d --- /dev/null +++ b/data/2024/aaai/Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling @@ -0,0 +1 @@ +Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive and time-consuming. To save labor and costs, researchers usually perform human evaluation on a small subset of data sampled from the whole dataset in practice. However, different selected subsets will lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold-standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples for obtaining a more correct inter-system ranking. Experimental results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate that CASF achieves 93.18% top-ranked system recognition accuracy and ranks first or second on 90.91% of the human metrics, with an overall inter-system ranking Kendall correlation of 0.83. Code and data are publicly available online.
\ No newline at end of file diff --git a/data/2024/aaai/Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory b/data/2024/aaai/Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory new file mode 100644 index 0000000000..ae5bef387a --- /dev/null +++ b/data/2024/aaai/Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory @@ -0,0 +1 @@ +A major limitation on the scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute. Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy. Experiments with language modeling tasks show perplexity improvement as the number of processed input segments increases. These results underscore the effectiveness of our method, which has significant potential to enhance long-term dependency handling in natural language understanding and generation tasks, as well as enable large-scale context processing for memory-intensive applications. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Entities: A Large-Scale Multi-Modal Knowledge Graph with Triplet Fact Grounding b/data/2024/aaai/Beyond Entities: A Large-Scale Multi-Modal Knowledge Graph with Triplet Fact Grounding new file mode 100644 index 0000000000..a8cd58570b --- /dev/null +++ b/data/2024/aaai/Beyond Entities: A Large-Scale Multi-Modal Knowledge Graph with Triplet Fact Grounding @@ -0,0 +1 @@ +Much effort has been devoted to building multi-modal knowledge graphs by visualizing entities on images, while ignoring the multi-modal information of the relations between entities. Hence, in this paper, we aim to construct a new large-scale multi-modal knowledge graph with triplet facts grounded on images that reflect not only entities but also their relations. To achieve this purpose, we propose a novel pipeline method, including triplet fact filtering, image retrieving, entity-based image filtering, relation-based image filtering, and image clustering. In this way, a multi-modal knowledge graph named ImgFact is constructed, which contains 247,732 triplet facts and 3,730,805 images. In experiments, the manual and automatic evaluations prove the reliable quality of our ImgFact. We further use the obtained images to enhance model performance on two tasks. In particular, the model optimized by our ImgFact achieves an impressive 8.38% and 9.87% improvement over the solutions enhanced by an existing multi-modal knowledge graph and VisualChatGPT in F1 of relation classification. We release ImgFact and its instructions at https://github.com/kleinercubs/ImgFact. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Expected Return: Accounting for Policy Reproducibility When Evaluating Reinforcement Learning Algorithms b/data/2024/aaai/Beyond Expected Return: Accounting for Policy Reproducibility When Evaluating Reinforcement Learning Algorithms new file mode 100644 index 0000000000..9b551946e0 --- /dev/null +++ b/data/2024/aaai/Beyond Expected Return: Accounting for Policy Reproducibility When Evaluating Reinforcement Learning Algorithms @@ -0,0 +1,2 @@ +Many applications in Reinforcement Learning (RL) have noise or stochasticity present in the environment.
Beyond their impact on learning, these uncertainties lead the exact same policy to perform differently, i.e., yield different returns, from one roll-out to another. Common evaluation procedures in RL summarise the consequent return distributions using solely the expected return, which does not account for the spread of the distribution. Our work defines this spread as policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications. We highlight that existing procedures that only use the expected return are limited on two fronts: first, an infinite number of return distributions with a wide range of performance-reproducibility trade-offs can have the same expected return, limiting the metric's effectiveness when used for comparing policies; second, the expected return metric does not leave any room for practitioners to choose the best trade-off value for considered applications. In this work, we address these limitations by recommending the use of the Lower Confidence Bound, a metric taken from Bayesian optimisation that provides the user with a preference parameter to choose a desired performance-reproducibility trade-off. +We also formalise and quantify policy reproducibility, and demonstrate the benefit of our metrics using extensive experiments with popular RL algorithms on common uncertain RL tasks. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities b/data/2024/aaai/Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities new file mode 100644 index 0000000000..387d303f35 --- /dev/null +++ b/data/2024/aaai/Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities @@ -0,0 +1,4 @@ +Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist due to the same events being referred to on many semantic levels. For example, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and an airplane being "shot" (in text), leading to a hierarchical, multimodal relationship between the events. + + +In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method, which demonstrates improved performance on this task, and highlight opportunities for future research.
Data: https://github.com/hayyubi/multihieve \ No newline at end of file diff --git a/data/2024/aaai/Beyond Mimicking Under-Represented Emotions: Deep Data Augmentation with Emotional Subspace Constraints for EEG-Based Emotion Recognition b/data/2024/aaai/Beyond Mimicking Under-Represented Emotions: Deep Data Augmentation with Emotional Subspace Constraints for EEG-Based Emotion Recognition new file mode 100644 index 0000000000..5e7b7d626c --- /dev/null +++ b/data/2024/aaai/Beyond Mimicking Under-Represented Emotions: Deep Data Augmentation with Emotional Subspace Constraints for EEG-Based Emotion Recognition @@ -0,0 +1,2 @@ +In recent years, using Electroencephalography (EEG) to recognize emotions has garnered considerable attention. Despite advancements, limited EEG data restricts its potential. Thus, Generative Adversarial Networks (GANs) are proposed to mimic the observed distributions and generate EEG data. However, for imbalanced datasets, GANs struggle to produce reliable augmentations for under-represented minority emotions by merely mimicking them. Thus, we introduce Emotional Subspace Constrained Generative Adversarial Networks (ESC-GAN) as an alternative to existing frameworks. We first propose the EEG editing paradigm, editing reference EEG signals from well-represented to under-represented emotional subspaces. Then, we introduce diversity-aware and +boundary-aware losses to constrain the augmented subspace. Here, the diversity-aware loss encourages a diverse emotional subspace by enlarging the sample difference, while boundary-aware loss constrains the augmented subspace near the decision boundary where recognition models can be vulnerable. Experiments show ESC-GAN boosts emotion recognition performance on benchmark datasets, DEAP, AMIGOS, and SEED, while protecting against potential adversarial attacks. Finally, the proposed method opens new avenues for editing EEG signals under emotional subspace constraints, facilitating unbiased and secure EEG data augmentation. \ No newline at end of file diff --git a/data/2024/aaai/Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning b/data/2024/aaai/Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning new file mode 100644 index 0000000000..4d16ba78ff --- /dev/null +++ b/data/2024/aaai/Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning @@ -0,0 +1 @@ +Offline reinforcement learning (RL) aims to learn a policy using only pre-collected and fixed data. Although avoiding the time-consuming online interactions in RL, it poses challenges for out-of-distribution (OOD) state actions and often suffers from data inefficiency for training. Despite many efforts being devoted to addressing OOD state actions, the latter (data inefficiency) receives little attention in offline RL. To address this, this paper proposes the cross-domain offline RL, which assumes offline data incorporate additional source-domain data from varying transition dynamics (environments), and expects it to contribute to the offline data efficiency. To do so, we identify a new challenge of OOD transition dynamics, beyond the common OOD state actions issue, when utilizing cross-domain offline data. Then, we propose our method BOSA, which employs two support-constrained objectives to address the above OOD issues. 
Through extensive experiments in the cross-domain offline RL setting, we demonstrate BOSA can greatly improve offline data efficiency: using only 10% of the target data, BOSA could achieve 74.4% of the SOTA offline RL performance that uses 100% of the target data. Additionally, we also show BOSA can be effortlessly plugged into model-based offline RL and noising data augmentation techniques (used for generating source-domain data), which naturally avoids the potential dynamics mismatch between target-domain data and newly generated source-domain data. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning b/data/2024/aaai/Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning new file mode 100644 index 0000000000..93ef15b162 --- /dev/null +++ b/data/2024/aaai/Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning @@ -0,0 +1 @@ +One of the ultimate goals of representation learning is to achieve compactness within a class and well-separability between classes. Many outstanding metric-based and prototype-based methods following the Expectation-Maximization paradigm have been proposed for this objective. However, they inevitably introduce biases into the learning process, particularly with long-tail distributed training data. In this paper, we reveal that the class prototype does not necessarily have to be derived from training features, and propose a novel perspective in which pre-defined class anchors serve as feature centroids to unidirectionally guide feature learning. However, the pre-defined anchors may have a large semantic distance from the pixel features, which prevents them from being directly applied. To address this issue and generate a feature centroid independent from feature learning, a simple yet effective Semantic Anchor Regularization (SAR) is proposed. SAR ensures the inter-class separability of semantic anchors in the semantic space by employing a classifier-aware auxiliary cross-entropy loss during training via disentanglement learning. By pulling the learned features to these semantic anchors, several advantages can be attained: 1) intra-class compactness and naturally inter-class separability, 2) induced bias or errors from feature learning can be avoided, and 3) robustness to the long-tailed problem. The proposed SAR can be used in a plug-and-play manner in existing models. Extensive experiments demonstrate that SAR performs better than previous sophisticated prototype-based methods. The implementation is available at https://github.com/geyanqi/SAR. \ No newline at end of file diff --git a/data/2024/aaai/Beyond Traditional Threats: A Persistent Backdoor Attack on Federated Learning b/data/2024/aaai/Beyond Traditional Threats: A Persistent Backdoor Attack on Federated Learning new file mode 100644 index 0000000000..d4a48f264b --- /dev/null +++ b/data/2024/aaai/Beyond Traditional Threats: A Persistent Backdoor Attack on Federated Learning @@ -0,0 +1 @@ +Backdoors on federated learning will be diluted by subsequent benign updates. This is reflected in the significant reduction of attack success rate as iterations increase, ultimately failing. We use a new metric to quantify the degree of this weakened backdoor effect, called attack persistence. Given that research to improve this performance has not been widely noted, we propose a Full Combination Backdoor Attack (FCBA) method.
It aggregates more combined trigger information for a more complete backdoor pattern in the global model. The trained backdoored global model is more resilient to benign updates, leading to a higher attack success rate on the test set. We test on three datasets and evaluate with two models across various settings. FCBA's persistence outperforms SOTA federated learning backdoor attacks. On GTSRB, 120 rounds after the attack, our attack success rate exceeded the baseline by over 50%. The core code of our method is available at https://github.com/PhD-TaoLiu/FCBA. \ No newline at end of file diff --git a/data/2024/aaai/Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions for Tree Ensembles b/data/2024/aaai/Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions for Tree Ensembles new file mode 100644 index 0000000000..f3e06a7c07 --- /dev/null +++ b/data/2024/aaai/Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions for Tree Ensembles @@ -0,0 +1 @@ +While shallow decision trees may be interpretable, larger ensemble models like gradient-boosted trees, which often set the state of the art in machine learning problems involving tabular data, remain black-box models. As a remedy, the Shapley value (SV) is a well-known concept in explainable artificial intelligence (XAI) research for quantifying additive feature attributions of predictions. The model-specific TreeSHAP methodology solves the exponential complexity for retrieving exact SVs from tree-based models. Expanding beyond individual feature attribution, Shapley interactions reveal the impact of intricate feature interactions of any order. In this work, we present TreeSHAP-IQ, an efficient method to compute any-order additive Shapley interactions for predictions of tree-based models. TreeSHAP-IQ is supported by a mathematical framework that exploits polynomial arithmetic to compute the interaction scores in a single recursive traversal of the tree, akin to Linear TreeSHAP. We apply TreeSHAP-IQ on state-of-the-art tree ensembles and explore interactions on well-established benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Beyond the Label Itself: Latent Labels Enhance Semi-supervised Point Cloud Panoptic Segmentation b/data/2024/aaai/Beyond the Label Itself: Latent Labels Enhance Semi-supervised Point Cloud Panoptic Segmentation new file mode 100644 index 0000000000..d745c4567f --- /dev/null +++ b/data/2024/aaai/Beyond the Label Itself: Latent Labels Enhance Semi-supervised Point Cloud Panoptic Segmentation @@ -0,0 +1 @@ +Given the exorbitant expense of labeling autopilot datasets and the growing trend of utilizing unlabeled data, semi-supervised segmentation on point clouds becomes increasingly imperative. Intuitively, finding out more "unspoken words" (i.e., latent instance information) beyond the label itself should be helpful to improve performance. In this paper, we discover two types of latent labels behind the displayed label embedded in LiDAR and image data. First, in the LiDAR Branch, we propose a novel augmentation, Cylinder-Mix, which is able to augment additional yet reliable samples for training. Second, in the Image Branch, we propose the Instance Position-scale Learning (IPSL) Module to learn and fuse instance position and scale information, which comes from a pre-trained 2D detector and is a type of latent label obtained from 3D-to-2D projection. Finally, the two latent labels are embedded into the multi-modal panoptic segmentation network.
The ablation of the IPSL module demonstrates its robust adaptability, and the experiments on SemanticKITTI and nuScenes demonstrate that our model outperforms the state-of-the-art method, LaserMix. \ No newline at end of file diff --git a/data/2024/aaai/Bi-ViT: Pushing the Limit of Vision Transformer Quantization b/data/2024/aaai/Bi-ViT: Pushing the Limit of Vision Transformer Quantization new file mode 100644 index 0000000000..d160b8b714 --- /dev/null +++ b/data/2024/aaai/Bi-ViT: Pushing the Limit of Vision Transformer Quantization @@ -0,0 +1 @@ +Vision transformer (ViT) quantization offers a promising prospect to facilitate deploying large pre-trained networks on resource-limited devices. Fully-binarized ViTs (Bi-ViT), which push the quantization of ViTs to its limit, remain largely unexplored and are still a very challenging task due to their unacceptable performance. Through extensive empirical analyses, we identify that the severe drop in ViT binarization is caused by attention distortion in self-attention, which technically stems from gradient vanishing and ranking disorder. To address these issues, we first introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses. We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework. Bi-ViT achieves significant improvements over popular DeiT and Swin backbones in terms of Top-1 accuracy and FLOPs. For example, with DeiT-Tiny and Swin-Tiny, our method significantly outperforms baselines by 22.1% and 21.4% respectively, while achieving 61.5x and 56.1x theoretical acceleration in terms of FLOPs compared with real-valued counterparts on ImageNet. Our code and models are available at https://github.com/YanjingLi0202/Bi-ViT/. \ No newline at end of file diff --git a/data/2024/aaai/Bi-directional Adapter for Multimodal Tracking b/data/2024/aaai/Bi-directional Adapter for Multimodal Tracking new file mode 100644 index 0000000000..4be790bab0 --- /dev/null +++ b/data/2024/aaai/Bi-directional Adapter for Multimodal Tracking @@ -0,0 +1,2 @@ +Due to the rapid development of computer vision, single-modal (RGB) object tracking has made significant progress in recent years. Considering the limitation of a single imaging +sensor, multi-modal images (RGB, infrared, etc.) are introduced to compensate for this deficiency for all-weather object tracking in complex environments. However, as acquiring sufficient multi-modal tracking data is hard while the dominant modality changes with the open environment, most existing techniques fail to extract multi-modal complementary information dynamically, yielding unsatisfactory tracking performance. To handle this problem, we propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter, cross-prompting multiple modalities mutually. Our model consists of a universal bi-directional adapter and multiple modality-specific transformer encoder branches with shared parameters. The encoders extract features of each modality separately by using a frozen, pre-trained foundation model. We develop a simple but effective lightweight feature adapter to transfer modality-specific information from one modality to another, performing visual feature prompt fusion in an adaptive manner.
By adding only a small number (0.32M) of trainable parameters, our model achieves superior tracking performance in comparison with both full fine-tuning methods and prompt learning-based methods. Our code is available at: https://github.com/SparkTempest/BAT. \ No newline at end of file diff --git a/data/2024/aaai/Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video b/data/2024/aaai/Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video new file mode 100644 index 0000000000..d568bdc9ba --- /dev/null +++ b/data/2024/aaai/Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video @@ -0,0 +1 @@ +Temporal Sentence Grounding in Video (TSGV) is troubled by the dataset bias issue, which is caused by the uneven temporal distribution of the target moments for samples with similar semantic components in input videos or query texts. Existing methods resort to utilizing prior knowledge about bias to artificially break this uneven distribution, which only removes a limited amount of significant language biases. In this work, we propose the bias-conflict sample synthesis and adversarial removal debias strategy (BSSARD), which dynamically generates bias-conflict samples by explicitly leveraging potentially spurious correlations between single-modality features and the temporal position of the target moments. Through adversarial training, its bias generators continuously introduce biases and generate bias-conflict samples to deceive its grounding model. Meanwhile, the grounding model continuously eliminates the introduced biases, which requires it to model multi-modality alignment information. BSSARD will cover most kinds of coupling relationships and disrupt language and visual biases simultaneously. Extensive experiments on Charades-CD and ActivityNet-CD demonstrate the promising debiasing capability of BSSARD. Source codes are available at https://github.com/qzhb/BSSARD. \ No newline at end of file diff --git a/data/2024/aaai/Biases Mitigation and Expressiveness Preservation in Language Models: A Comprehensive Pipeline (Student Abstract) b/data/2024/aaai/Biases Mitigation and Expressiveness Preservation in Language Models: A Comprehensive Pipeline (Student Abstract) new file mode 100644 index 0000000000..fa658cd561 --- /dev/null +++ b/data/2024/aaai/Biases Mitigation and Expressiveness Preservation in Language Models: A Comprehensive Pipeline (Student Abstract) @@ -0,0 +1 @@ +Pre-trained language models (PLMs) have greatly transformed various downstream tasks, yet frequently display social biases from training data, raising fairness concerns. Recent efforts to debias PLMs come with limitations: they either fine-tune all the parameters of PLMs, which is time-consuming and disregards the expressiveness of PLMs, or ignore the biases reintroduced by downstream tasks when debiased models are applied to them. Hence, we propose a two-stage pipeline to mitigate biases from both internal and downstream contexts while preserving expressiveness in language models. Specifically, for the debiasing procedure, we resort to continuous prefix-tuning rather than fully fine-tuning the PLM, in which we design a debiasing term for optimization and an alignment term to keep words’ relative distances and ensure the model's expressiveness. For downstream tasks, we perform causal intervention across different demographic groups for invariant predictions.
Results on three GLUE tasks show that our method alleviates biases from internal and downstream contexts, while keeping PLM expressiveness intact. \ No newline at end of file diff --git a/data/2024/aaai/Bidirectional Contrastive Split Learning for Visual Question Answering b/data/2024/aaai/Bidirectional Contrastive Split Learning for Visual Question Answering new file mode 100644 index 0000000000..6204be3cbd --- /dev/null +++ b/data/2024/aaai/Bidirectional Contrastive Split Learning for Visual Question Answering @@ -0,0 +1 @@ +Visual Question Answering (VQA) based on multi-modal data facilitates real-life applications such as home robots and medical diagnoses. One significant challenge is to devise a robust decentralized learning framework for various client models where centralized data collection is avoided due to confidentiality concerns. This work aims to tackle privacy-preserving VQA by decoupling a multi-modal model into representation modules and a contrastive module, leveraging inter-module gradient sharing and inter-client weight sharing. To this end, we propose Bidirectional Contrastive Split Learning (BiCSL) to train a global multi-modal model on the entire data distribution of decentralized clients. We employ a contrastive loss that enables more efficient self-supervised learning of decentralized modules. Comprehensive experiments are conducted on the VQA-v2 dataset based on five SOTA VQA models, demonstrating the effectiveness of the proposed method. Furthermore, we inspect BiCSL's robustness against a dual-key backdoor attack on VQA. Consequently, BiCSL shows significantly enhanced resilience when exposed to the multi-modal adversarial attack compared to the centralized learning method, which provides a promising approach to decentralized multi-modal learning. \ No newline at end of file diff --git a/data/2024/aaai/Bidirectional Temporal Plan Graph: Enabling Switchable Passing Orders for More Efficient Multi-Agent Path Finding Plan Execution b/data/2024/aaai/Bidirectional Temporal Plan Graph: Enabling Switchable Passing Orders for More Efficient Multi-Agent Path Finding Plan Execution new file mode 100644 index 0000000000..d607ce4f6d --- /dev/null +++ b/data/2024/aaai/Bidirectional Temporal Plan Graph: Enabling Switchable Passing Orders for More Efficient Multi-Agent Path Finding Plan Execution @@ -0,0 +1 @@ +The Multi-Agent Path Finding (MAPF) problem involves planning collision-free paths for multiple agents in a shared environment. The majority of MAPF solvers rely on the assumption that an agent can arrive at a specific location at a specific timestep. However, real-world execution uncertainties can cause agents to deviate from this assumption, leading to collisions and deadlocks. Prior research solves this problem by having agents follow a Temporal Plan Graph (TPG), enforcing a consistent passing order at every location as defined in the MAPF plan. However, we show that TPGs are overly strict because, in some circumstances, satisfying the passing order requires agents to wait unnecessarily, leading to longer execution times. To overcome this issue, we introduce a new graphical representation called a Bidirectional Temporal Plan Graph (BTPG), which allows switching passing orders during execution to avoid unnecessary waiting time. We design two anytime algorithms for constructing a BTPG: BTPG-naïve and BTPG-optimized. Experimental results show that following BTPGs consistently outperforms following TPGs, reducing unnecessary waits by 8-20%.
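To make the switchable-passing-order idea in the BTPG abstract above concrete, here is a minimal, illustrative Python sketch (not the paper's BTPG-naïve or BTPG-optimized algorithm): a precedence constraint between two agents at a shared location is left undecided and is only fixed at execution time in favor of whichever agent arrives first. The class and method names are assumptions.

# Illustrative sketch of a switchable passing-order constraint; not the authors' implementation.
class SwitchableEdge:
    """Passing-order constraint between two agents at one shared location."""
    def __init__(self, agent_a, agent_b):
        self.agents = (agent_a, agent_b)
        self.first = None  # passing order is undecided until execution time

    def settle(self, arrived_agent):
        # The first agent to actually reach the location fixes the order.
        if self.first is None and arrived_agent in self.agents:
            self.first = arrived_agent
        return self.first

    def must_wait(self, agent, other_has_passed):
        # An agent waits only if the other agent was fixed to go first and has not passed yet.
        if self.first is None or self.first == agent:
            return False
        return not other_has_passed

edge = SwitchableEdge("a1", "a2")
edge.settle("a2")                                      # a2 happens to arrive first
print(edge.must_wait("a1", other_has_passed=False))    # True: a1 waits for a2
print(edge.must_wait("a1", other_has_passed=True))     # False: a1 may proceed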
\ No newline at end of file diff --git a/data/2024/aaai/Big Learning Expectation Maximization b/data/2024/aaai/Big Learning Expectation Maximization new file mode 100644 index 0000000000..a8105c0bce --- /dev/null +++ b/data/2024/aaai/Big Learning Expectation Maximization @@ -0,0 +1 @@ +Mixture models serve as a fundamental tool with versatile applications. However, their training techniques, like the popular Expectation Maximization (EM) algorithm, are notoriously sensitive to parameter initialization and often suffer from bad local optima that can be arbitrarily worse than the optimum. To address the long-lasting bad-local-optima challenge, we draw inspiration from the recent ground-breaking foundation models and propose to leverage their underlying big learning principle to upgrade EM. Specifically, we present the Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs joint, marginal, and orthogonally transformed marginal matchings between data and model distributions. Through simulated experiments, we empirically show that BigLearn-EM is capable of delivering the optimal solution with high probability; comparisons on benchmark clustering datasets further demonstrate its effectiveness and advantages over existing techniques. The code is available at https://github.com/YulaiCong/Big-Learning-Expectation-Maximization. \ No newline at end of file diff --git a/data/2024/aaai/Bilateral Gradual Semantics for Weighted Argumentation b/data/2024/aaai/Bilateral Gradual Semantics for Weighted Argumentation new file mode 100644 index 0000000000..859664de7d --- /dev/null +++ b/data/2024/aaai/Bilateral Gradual Semantics for Weighted Argumentation @@ -0,0 +1 @@ +Abstract argumentation is a reasoning model for evaluating arguments. Recently, gradual semantics has received considerable attention in weighted argumentation, which assigns an acceptability degree to each argument as its strength. In this paper, we aim to enhance gradual semantics by non-reciprocally incorporating the notion of rejectability degree. Such a setting offers a bilateral perspective on argument strength, enabling more comprehensive argument evaluations in practical situations. To this end, we first provide a set of principles for our semantics, taking both the acceptability and rejectability degrees into account, and propose three novel semantics conforming to the above principles. These semantics are defined as the limits of iterative sequences that always converge in any given weighted argumentation system, making them preferable for real-world applications. \ No newline at end of file diff --git a/data/2024/aaai/Biomedical Knowledge Graph Embedding with Householder Projection (Student Abstract) b/data/2024/aaai/Biomedical Knowledge Graph Embedding with Householder Projection (Student Abstract) new file mode 100644 index 0000000000..6597d43165 --- /dev/null +++ b/data/2024/aaai/Biomedical Knowledge Graph Embedding with Householder Projection (Student Abstract) @@ -0,0 +1 @@ +Researchers have applied knowledge graph embedding (KGE) techniques with advanced neural network techniques, such as capsule networks, for predicting drug-drug interactions (DDIs) and achieved remarkable results. However, most of them ignore the molecular structure and position features between drug pairs. They also cannot model the relational mapping properties (RMPs: 1-N, N-1, N-N) that are significant in the biomedical field.
To solve these problems, we propose CDHse, which consists of two crucial modules: 1) Entity embedding module: we obtain the position feature with PubMedBERT and a Convolutional Neural Network (CNN), obtain the molecular structure feature with a Graph Neural Network (GNN), obtain the entity embedding feature of drug pairs, and then incorporate these features into one synthetic feature. 2) Knowledge graph embedding module: the synthetic feature is transformed by Householder projections and then embedded in the complex vector space for training. In this paper, we have selected several advanced models for the DDI task and performed experiments on three standard BioKGs to validate the effectiveness of CDHse. \ No newline at end of file diff --git a/data/2024/aaai/BirdCollect: A Comprehensive Benchmark for Analyzing Dense Bird Flock Attributes b/data/2024/aaai/BirdCollect: A Comprehensive Benchmark for Analyzing Dense Bird Flock Attributes new file mode 100644 index 0000000000..c492564ace --- /dev/null +++ b/data/2024/aaai/BirdCollect: A Comprehensive Benchmark for Analyzing Dense Bird Flock Attributes @@ -0,0 +1 @@ +Automatic recognition of bird behavior from long-term, uncontrolled outdoor imagery can contribute to conservation efforts by enabling large-scale monitoring of bird populations. Current techniques in AI-based wildlife monitoring have focused on short-term tracking and monitoring birds individually rather than in species-rich flocks. We present Bird-Collect, a comprehensive benchmark dataset for monitoring dense bird flock attributes. It includes a unique collection of more than 6,000 high-resolution images of Demoiselle Cranes (Anthropoides virgo) feeding and nesting in the vicinity of the Khichan region of Rajasthan. Particularly, each image contains an average of 190 individual birds, illustrating the complex dynamics of densely populated bird flocks on a scale that has not previously been studied. In addition, a total of 433 distinct pictures captured at Keoladeo National Park, Bharatpur provide a comprehensive representation of 34 distinct bird species belonging to various taxonomic groups. These images offer insights into the diversity and behaviour of birds in vital natural ecosystems along the migratory flyways. Additionally, we provide a set of 2,500 point-annotated samples which serve as ground truth for benchmarking various computer vision tasks like crowd counting, density estimation, segmentation, and species classification. The benchmark performance for these tasks highlights the need for tailored approaches for specific wildlife applications, which involve varied conditions, including viewpoints, illumination, and resolutions. At around 46.2 GB in size, encompassing data collected from two distinct nesting grounds, it is the largest bird dataset containing detailed annotations, showcasing a substantial leap in bird research possibilities. We intend to publicly release the dataset to the research community.
The database is available at: https://iab-rubric.org/resources/wildlife-dataset/birdcollect \ No newline at end of file diff --git a/data/2024/aaai/Blind Face Restoration under Extreme Conditions: Leveraging 3D-2D Prior Fusion for Superior Structural and Texture Recovery b/data/2024/aaai/Blind Face Restoration under Extreme Conditions: Leveraging 3D-2D Prior Fusion for Superior Structural and Texture Recovery new file mode 100644 index 0000000000..1ad8fc4eca --- /dev/null +++ b/data/2024/aaai/Blind Face Restoration under Extreme Conditions: Leveraging 3D-2D Prior Fusion for Superior Structural and Texture Recovery @@ -0,0 +1 @@ +Blind face restoration under extreme conditions involves reconstructing high-quality face images from severely degraded inputs. These input images are often of poor quality and have extreme facial poses, leading to errors in facial structure and unnatural artifacts within the restored images. In this paper, we show that utilizing 3D priors effectively compensates for structure knowledge deficiencies in 2D priors while preserving the texture details. Based on this, we introduce FREx (Face Restoration under Extreme conditions) that combines structure-accurate 3D priors and texture-rich 2D priors in pretrained generative networks for blind face restoration under extreme conditions. To fuse the different information in 3D and 2D priors, we introduce an adaptive weight module that adjusts the importance of features based on the input image's condition. With this approach, our model can restore structure-accurate and natural-looking faces even when the images have lost substantial information due to degradation and extreme pose. Extensive experimental results on synthetic and real-world datasets validate the effectiveness of our methods. \ No newline at end of file diff --git a/data/2024/aaai/Blind-Touch: Homomorphic Encryption-Based Distributed Neural Network Inference for Privacy-Preserving Fingerprint Authentication b/data/2024/aaai/Blind-Touch: Homomorphic Encryption-Based Distributed Neural Network Inference for Privacy-Preserving Fingerprint Authentication new file mode 100644 index 0000000000..be653f356b --- /dev/null +++ b/data/2024/aaai/Blind-Touch: Homomorphic Encryption-Based Distributed Neural Network Inference for Privacy-Preserving Fingerprint Authentication @@ -0,0 +1 @@ +Fingerprint authentication is a popular security mechanism for smartphones and laptops. However, its adoption in web and cloud environments has been limited due to privacy concerns over storing and processing biometric data on servers. This paper introduces Blind-Touch, a novel machine learning-based fingerprint authentication system leveraging homomorphic encryption to address these privacy concerns. Homomorphic encryption allows computations on encrypted data without decrypting it. Thus, Blind-Touch can keep fingerprint data encrypted on the server while performing machine learning operations.
Blind-Touch combines three strategies to efficiently utilize homomorphic encryption in machine learning: (1) It optimizes the feature vector for a distributed architecture, processing the first fully connected layer (FC-16) in plaintext on the client side and the subsequent layer (FC-1) post-encryption on the server, thereby minimizing encrypted computations; (2) It employs a homomorphic encryption-compatible data compression technique capable of handling 8,192 authentication results concurrently; and (3) It utilizes a clustered server architecture to simultaneously process authentication results, thereby enhancing scalability with increasing user numbers. Blind-Touch achieves high accuracy on two benchmark fingerprint datasets, with a 93.6% F1-score for the PolyU dataset and a 98.2% F1-score for the SOKOTO dataset. Moreover, Blind-Touch can match a fingerprint among 5,000 candidates in about 0.65 seconds. With its privacy-focused design, high accuracy, and efficiency, Blind-Touch is a promising alternative to conventional fingerprint authentication for web and cloud applications. \ No newline at end of file diff --git a/data/2024/aaai/Block Image Compressive Sensing with Local and Global Information Interaction b/data/2024/aaai/Block Image Compressive Sensing with Local and Global Information Interaction new file mode 100644 index 0000000000..b3d6a56576 --- /dev/null +++ b/data/2024/aaai/Block Image Compressive Sensing with Local and Global Information Interaction @@ -0,0 +1,9 @@ +Block image compressive sensing methods, which divide a single image into small blocks for efficient sampling and reconstruction, have achieved significant success. +However, these methods process each block locally and thus disregard the global communication among different blocks in the reconstruction step. +Existing methods have attempted to address this issue with local filters or by directly reconstructing the entire image, but they achieve only limited communication among adjacent pixels or bypass the problem altogether. +To directly confront the communication problem among blocks and effectively resolve it, we propose a novel approach called Block Reconstruction with Blocks' Communication Network (BRBCN). +BRBCN focuses on both local and global information, while further taking their interactions into account. +Specifically, BRBCN comprises dual CNN and Transformer architectures, in which the CNN is used to reconstruct each block for powerful local processing and the Transformer is used to calculate the global communication among all the blocks. +Moreover, we propose a global-to-local module (G2L) and a local-to-global module (L2G) to effectively integrate the representations of the CNN and Transformer, with which our BRBCN network realizes the bidirectional interaction between local and global information. +Extensive experiments show our BRBCN method outperforms existing state-of-the-art methods by a large margin. +The code is available at https://github.com/kongxiuxiu/BRBCN \ No newline at end of file diff --git a/data/2024/aaai/Block-Level Goal Recognition Design b/data/2024/aaai/Block-Level Goal Recognition Design new file mode 100644 index 0000000000..27a658e77e --- /dev/null +++ b/data/2024/aaai/Block-Level Goal Recognition Design @@ -0,0 +1 @@ +Existing works on goal recognition design (GRD) consider the underlying domain as a classical planning domain and apply modifications to the domain to minimize the worst-case distinctiveness.
In this paper, we propose replacing existing modifications with blocks, which group several closely related modifications together such that a block can modify a region in a search space with respect to some design constraints. Moreover, there could be blocks within blocks such that the design space becomes hierarchical for modifications at different levels of granularity. We present 1) a new version of pruned-reduce, a successful pruning rule for GRD, for block-level GRD, and 2) a new pruning rule for pruning some branches in both hierarchical and non-hierarchical design space. Our experiments show that searching in hierarchical design spaces greatly speeds up the redesign process. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping b/data/2024/aaai/Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping new file mode 100644 index 0000000000..6f781c7c2d --- /dev/null +++ b/data/2024/aaai/Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping @@ -0,0 +1 @@ +Adversarial examples generated by a surrogate model typically exhibit limited transferability to unknown target systems. To address this problem, many transferability enhancement approaches (e.g., input transformation and model augmentation) have been proposed. However, they show poor performances in attacking systems having different model genera from the surrogate model. In this paper, we propose a novel and generic attacking strategy, called Deformation-Constrained Warping Attack (DeCoWA), that can be effectively applied to cross model genus attack. Specifically, DeCoWA firstly augments input examples via an elastic deformation, namely Deformation-Constrained Warping (DeCoW), to obtain rich local details of the augmented input. To avoid severe distortion of global semantics led by random deformation, DeCoW further constrains the strength and direction of the warping transformation by a novel adaptive control strategy. Extensive experiments demonstrate that the transferable examples crafted by our DeCoWA on CNN surrogates can significantly hinder the performance of Transformers (and vice versa) on various tasks, including image classification, video action recognition, and audio recognition. Code is made available at https://github.com/LinQinLiang/DeCoWA. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Few-Shot Learning via Attentive Feature Regularization b/data/2024/aaai/Boosting Few-Shot Learning via Attentive Feature Regularization new file mode 100644 index 0000000000..2ef1398135 --- /dev/null +++ b/data/2024/aaai/Boosting Few-Shot Learning via Attentive Feature Regularization @@ -0,0 +1 @@ +Few-shot learning (FSL) based on manifold regularization aims to improve the recognition capacity of novel objects with limited training samples by mixing two samples from different categories with a blending factor. However, this mixing operation weakens the feature representation due to the linear interpolation and the overlooking of the importance of specific channels. To solve these issues, this paper proposes attentive feature regularization (AFR) which aims to improve the feature representativeness and discriminability. In our approach, we first calculate the relations between different categories of semantic labels to pick out the related features used for regularization. 
Then, we design two attention-based calculations at both the instance and channel levels. These calculations enable the regularization procedure to focus on two crucial aspects: the feature complementarity through adaptive interpolation in related categories and the emphasis on specific feature channels. Finally, we combine these regularization strategies to significantly improve the classifier performance. Empirical studies on several popular FSL benchmarks demonstrate the effectiveness of AFR, which improves the recognition accuracy of novel categories without the need to retrain any feature extractor, especially in the 1-shot setting. Furthermore, the proposed AFR can seamlessly integrate into other FSL methods to improve classification performance. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Multiple Instance Learning Models for Whole Slide Image Classification: A Model-Agnostic Framework Based on Counterfactual Inference b/data/2024/aaai/Boosting Multiple Instance Learning Models for Whole Slide Image Classification: A Model-Agnostic Framework Based on Counterfactual Inference new file mode 100644 index 0000000000..efc56712a2 --- /dev/null +++ b/data/2024/aaai/Boosting Multiple Instance Learning Models for Whole Slide Image Classification: A Model-Agnostic Framework Based on Counterfactual Inference @@ -0,0 +1 @@ +Multiple instance learning is an effective paradigm for whole slide image (WSI) classification, where labels are only provided at the bag level. However, instance-level prediction is also crucial as it offers insights into fine-grained regions of interest. Existing multiple instance learning methods either solely focus on training a bag classifier or have the insufficient capability of exploring instance prediction. In this work, we propose a novel model-agnostic framework to boost existing multiple instance learning models, to improve the WSI classification performance in both bag and instance levels. Specifically, we propose a counterfactual inference-based sub-bag assessment method and a hierarchical instance searching strategy to help to search reliable instances and obtain their accurate pseudo labels. Furthermore, an instance classifier is well-trained to produce accurate predictions. The instance embedding it generates is treated as a prompt to refine the instance feature for bag prediction. This framework is model-agnostic, capable of adapting to existing multiple instance learning models, including those without specific mechanisms like attention. Extensive experiments on three datasets demonstrate the competitive performance of our method. Code will be available at https://github.com/centurion-crawler/CIMIL. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Neural Cognitive Diagnosis with Student's Affective State Modeling b/data/2024/aaai/Boosting Neural Cognitive Diagnosis with Student's Affective State Modeling new file mode 100644 index 0000000000..42b7bcecd1 --- /dev/null +++ b/data/2024/aaai/Boosting Neural Cognitive Diagnosis with Student's Affective State Modeling @@ -0,0 +1 @@ +Cognitive Diagnosis Modeling aims to infer students' proficiency level on knowledge concepts from their response logs. Existing methods typically model students’ response processes as the interaction between students and exercises or concepts based on hand-crafted or deeply-learned interaction functions. 
Despite their promising achievements, they fail to consider the relationship between students' cognitive states and affective states in learning, e.g., the feelings of frustration, boredom, or confusion with the learning content, which is insufficient for comprehensive cognitive diagnosis in intelligent education. To fill the research gap, we propose a novel Affect-aware Cognitive Diagnosis (ACD) model which can effectively diagnose the knowledge proficiency levels of students by taking into consideration the affective factors. Specifically, we first design a student affect perception module under the assumption that the affective state is jointly influenced by the student's affect trait and the difficulty of the exercise. Then, our inferred affective distribution is further used to estimate the student's subjective factors, i.e., guessing and slipping. Finally, we integrate the estimated guessing and slipping parameters with the basic neural cognitive diagnosis framework based on the DINA model, which facilitates the modeling of complex exercise interactions in a more accurate and interpretable fashion. Besides, we also extend our affect perception module to an unsupervised learning setting based on contrastive learning, thus significantly improving the compatibility of our ACD. To the best of our knowledge, we are the first to unify cognition modeling and affect modeling into the same framework for student cognitive diagnosis. Extensive experiments on real-world datasets clearly demonstrate the effectiveness of our ACD. Our code is available at https://github.com/zeng-zhen/ACD. \ No newline at end of file diff --git a/data/2024/aaai/Boosting Residual Networks with Group Knowledge b/data/2024/aaai/Boosting Residual Networks with Group Knowledge new file mode 100644 index 0000000000..1c11f2ab0a --- /dev/null +++ b/data/2024/aaai/Boosting Residual Networks with Group Knowledge @@ -0,0 +1 @@ +Recent research understands residual networks from a new perspective, as implicit ensemble models. From this view, previous methods such as stochastic depth and stimulative training have further improved the performance of the residual network by sampling and training its subnets. However, they both use the same supervision for all subnets of different capacities and neglect the valuable knowledge generated by subnets during training. In this manuscript, we mitigate the significant knowledge distillation gap caused by using the same kind of supervision and advocate leveraging the subnets to provide diverse knowledge. Based on this motivation, we propose a group knowledge based training framework for boosting the performance of residual networks. Specifically, we implicitly divide all subnets into hierarchical groups by subnet-in-subnet sampling, aggregate the knowledge of different subnets in each group during training, and exploit upper-level group knowledge to supervise lower-level subnet groups. Meanwhile, we also develop a subnet sampling strategy that naturally samples larger subnets, which are found to be more helpful than smaller subnets in boosting performance for hierarchical groups. Compared with typical subnet training and other methods, our method achieves the best efficiency and performance trade-offs on multiple datasets and network structures. The code is at https://github.com/tsj-001/AAAI24-GKT.
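As a rough illustration of the group supervision described in the abstract above (not the authors' released code at the linked repository), the sketch below aggregates the logits of subnets in an upper-level group and distills them into a lower-level subnet with a temperature-scaled KL loss. The function name, grouping, and temperature are assumptions.

import torch
import torch.nn.functional as F

def group_distill_loss(lower_logits, upper_group_logits, temperature=4.0):
    """KL distillation from the averaged knowledge of an upper-level subnet group
    to one lower-level subnet. Illustrative sketch only, not the paper's exact loss."""
    teacher = torch.stack(upper_group_logits, dim=0).mean(dim=0)   # aggregate group knowledge
    p_teacher = F.softmax(teacher / temperature, dim=-1)
    log_p_student = F.log_softmax(lower_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Example with random logits standing in for subnet outputs (batch of 8, 10 classes).
upper = [torch.randn(8, 10) for _ in range(3)]   # three subnets in the upper-level group
lower = torch.randn(8, 10)                       # one lower-level subnet
print(group_distill_loss(lower, upper).item())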
\ No newline at end of file diff --git a/data/2024/aaai/Bootstrapping Cognitive Agents with a Large Language Model b/data/2024/aaai/Bootstrapping Cognitive Agents with a Large Language Model new file mode 100644 index 0000000000..d6ce87d10b --- /dev/null +++ b/data/2024/aaai/Bootstrapping Cognitive Agents with a Large Language Model @@ -0,0 +1 @@ +Large language models contain noisy general knowledge of the world, yet are hard to train or fine-tune. In contrast, cognitive architectures have excellent interpretability and are flexible to update but require substantial manual work to instantiate. In this work, we combine the best of both worlds: bootstrapping a cognitive-based model with the noisy knowledge encoded in large language models. Through an embodied agent doing kitchen tasks, we show that our proposed framework yields better efficiency compared to an agent entirely based on large language models. Our experiments also indicate that the cognitive agent bootstrapped using this framework can generalize to novel environments and be scaled to complex tasks. \ No newline at end of file diff --git a/data/2024/aaai/Bootstrapping Large Language Models for Radiology Report Generation b/data/2024/aaai/Bootstrapping Large Language Models for Radiology Report Generation new file mode 100644 index 0000000000..aee8c6265f --- /dev/null +++ b/data/2024/aaai/Bootstrapping Large Language Models for Radiology Report Generation @@ -0,0 +1 @@ +Radiology report generation (RRG) aims to automatically generate a free-text description from a specific clinical radiograph, e.g., chest X-Ray images. Existing approaches tend to perform RRG with task-specific models trained from scratch on public yet limited data, which often leads to inferior performance owing to insufficient capability in both aligning visual and textual features and generating informative reports accordingly. Currently, large language models (LLMs) offer a promising solution to text generation with their power in learning from big data, especially for cross-modal scenarios such as RRG. However, most existing LLMs are pre-trained on general data, and suffer from the same problem as conventional approaches, caused by the knowledge gap between the general and medical domains, if they are applied to RRG. Therefore, in this paper, we propose an approach to bootstrapping LLMs for RRG with an in-domain instance induction and a coarse-to-fine decoding process. Specifically, the in-domain instance induction process learns to align the LLM to radiology reports from general texts through contrastive learning. The coarse-to-fine decoding performs a text elevating process for the reports from the ranker, further enhanced with visual features and refinement prompts. Experimental results on two prevailing RRG datasets, namely, IU X-Ray and MIMIC-CXR, demonstrate the superiority of our approach over previous state-of-the-art solutions. Further analyses illustrate that, for the LLM, the induction process enables it to better align with the medical domain and the coarse-to-fine generation allows it to conduct more precise text generation.
\ No newline at end of file diff --git a/data/2024/aaai/Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text b/data/2024/aaai/Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text new file mode 100644 index 0000000000..428a816881 --- /dev/null +++ b/data/2024/aaai/Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text @@ -0,0 +1 @@ +Recently, Transformer-based text detection techniques have sought to predict polygons by encoding the coordinates of individual boundary vertices using distinct query features. However, this approach incurs a significant memory overhead and struggles to effectively capture the intricate relationships between vertices belonging to the same instance. Consequently, irregular text layouts often lead to the prediction of outlined vertices, diminishing the quality of results. To address these challenges, we present an innovative approach rooted in Sparse R-CNN: a cascade decoding pipeline for polygon prediction. Our method ensures precision by iteratively refining polygon predictions, considering both the scale and location of preceding results. Leveraging this stabilized regression pipeline, even employing just a single feature vector to guide polygon instance regression yields promising detection results. Simultaneously, leveraging instance-level feature proposals substantially enhances memory efficiency (>50% less vs. the SOTA method DPText-DETR) and reduces inference time (>40% less vs. DPText-DETR) with comparable performance on benchmarks. The code is available at https://github.com/Albertchen98/Box2Poly.git. \ No newline at end of file diff --git a/data/2024/aaai/Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA b/data/2024/aaai/Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA new file mode 100644 index 0000000000..6066a7dcfd --- /dev/null +++ b/data/2024/aaai/Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA @@ -0,0 +1 @@ +In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are utilized in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach utilizes a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines 2D and 3D modalities and captures fine-grained correlations between modalities, allowing them to mutually augment each other. Integrating the mechanisms proposed above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art performance on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at https://github.com/matthewdm0816/BridgeQA.
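The question-conditional view selection described in the BridgeQA abstract above can be pictured with a small sketch (an assumption-laden illustration, not the code in the linked repository): candidate 2D views and the question are embedded with a vision-language model such as CLIP, and the top-scoring views by cosine similarity are kept as the 2D input.

import numpy as np

def select_views(view_embs, question_emb, top_k=2):
    """Rank candidate 2D view embeddings by cosine similarity with the question
    embedding and keep the top_k indices. Embeddings are assumed to come from a
    vision-language model (e.g., CLIP); this is an illustrative sketch."""
    v = view_embs / np.linalg.norm(view_embs, axis=1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    scores = v @ q
    return np.argsort(-scores)[:top_k]

# Example with random stand-in embeddings: 5 candidate views, 512-d features.
views = np.random.randn(5, 512)
question = np.random.randn(512)
print(select_views(views, question, top_k=2))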
\ No newline at end of file diff --git a/data/2024/aaai/Bridging the Gap between Source Code and Requirements Using GPT (Student Abstract) b/data/2024/aaai/Bridging the Gap between Source Code and Requirements Using GPT (Student Abstract) new file mode 100644 index 0000000000..026cb5e44b --- /dev/null +++ b/data/2024/aaai/Bridging the Gap between Source Code and Requirements Using GPT (Student Abstract) @@ -0,0 +1 @@ +Reverse engineering involves analyzing the design, architecture, and functionality of systems, and is crucial for legacy systems. Legacy systems are outdated software systems that are still in use and often lack proper documentation, which makes their maintenance and evolution challenging. To address this, we introduce SC2Req, utilizing the Generative Pre-trained Transformer (GPT) for automated code analysis and requirement generation. This approach aims to convert source code into understandable requirements and bridge the gap between the two. Through experiments on diverse software projects, SC2Req shows the potential to enhance the accuracy and efficiency of the translation process. This approach not only facilitates faster software development and easier maintenance of legacy systems but also lays a strong foundation for future research, promoting better understanding and communication in software development. \ No newline at end of file diff --git a/data/2024/aaai/Bridging the Semantic Latent Space between Brain and Machine: Similarity Is All You Need b/data/2024/aaai/Bridging the Semantic Latent Space between Brain and Machine: Similarity Is All You Need new file mode 100644 index 0000000000..ba53eab67b --- /dev/null +++ b/data/2024/aaai/Bridging the Semantic Latent Space between Brain and Machine: Similarity Is All You Need @@ -0,0 +1 @@ +How our brain encodes complex concepts has been a longstanding mystery in neuroscience. The answer to this problem can lead to new understandings about how the brain retrieves information in large-scale data with high efficiency and robustness. Neuroscience studies suggest the brain represents concepts using a locality-sensitive hashing (LSH) strategy, i.e., similar concepts will be represented by similar responses. This finding has inspired the design of similarity-based algorithms, especially in contrastive learning. Here, we hypothesize that the brain and large neural network models, both using similarity-based learning rules, could contain a similar semantic embedding space. To verify that, this paper proposes a functional Magnetic Resonance Imaging (fMRI) semantic learning network named BrainSem, aimed at seeking a joint semantic latent space that bridges the brain and a Contrastive Language-Image Pre-training (CLIP) model. Given that our perception is inherently cross-modal, we introduce a fuzzy (one-to-many) matching loss function to encourage the models to extract high-level semantic components from neural signals. Our results show that, using only a small set of fMRI recordings for semantic space alignment, we can obtain a shared embedding that remains valid for categories unseen during training, which provides potential evidence for the semantic representation similarity between the brain and large neural networks. In a zero-shot classification task, our BrainSem achieves an 11.6% improvement over the state-of-the-art.
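The fuzzy (one-to-many) matching loss is only named in the abstract; one plausible reading, sketched below under assumed details, is a soft cross-entropy in which each fMRI embedding is matched to a weighted set of CLIP concept embeddings rather than to a single hard positive. Everything here (weighting scheme, temperature, shapes) is an illustrative assumption.

```python
# Illustrative (not the authors') fuzzy one-to-many matching loss: each fMRI
# embedding is encouraged to be close to several weighted CLIP embeddings
# instead of one hard positive.
import torch
import torch.nn.functional as F

def fuzzy_match_loss(fmri_emb, clip_emb, soft_targets, temperature=0.1):
    """
    fmri_emb:     (batch, dim)          embeddings predicted from fMRI recordings
    clip_emb:     (num_concepts, dim)   CLIP embeddings of candidate concepts
    soft_targets: (batch, num_concepts) nonnegative match weights; each row sums to 1
    """
    f = F.normalize(fmri_emb, dim=-1)
    c = F.normalize(clip_emb, dim=-1)
    log_probs = F.log_softmax(f @ c.t() / temperature, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()   # soft cross-entropy

# Toy usage: 4 fMRI samples, 6 candidate concepts, two plausible matches per sample.
fmri = torch.randn(4, 128)
concepts = torch.randn(6, 128)
targets = torch.zeros(4, 6)
targets[torch.arange(4), torch.randint(0, 6, (4,))] = 0.5
targets[torch.arange(4), torch.randint(0, 6, (4,))] += 0.5
print(float(fuzzy_match_loss(fmri, concepts, targets)))
```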
\ No newline at end of file diff --git a/data/2024/aaai/Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model b/data/2024/aaai/Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model new file mode 100644 index 0000000000..1b4c7b3085 --- /dev/null +++ b/data/2024/aaai/Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model @@ -0,0 +1 @@ +Recently, diffusion-based image generation methods have been credited for their remarkable text-to-image generation capabilities, yet they still face challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Our model outputs a photo-realistic image given a text of any language along with a textual description of a scene. The model leverages rendered sketch images as priors, thus awakening the potential multilingual generation ability of the pre-trained Stable Diffusion. Based on the observed influence of the cross-attention map on object placement in generated images, we propose a localized attention constraint in the cross-attention layer to address the unreasonable positioning problem of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both the accuracy of text recognition and the naturalness of foreground-background blending. \ No newline at end of file diff --git a/data/2024/aaai/Build Your Own Robot Friend: An Open-Source Learning Module for Accessible and Engaging AI Education b/data/2024/aaai/Build Your Own Robot Friend: An Open-Source Learning Module for Accessible and Engaging AI Education new file mode 100644 index 0000000000..462c35ac66 --- /dev/null +++ b/data/2024/aaai/Build Your Own Robot Friend: An Open-Source Learning Module for Accessible and Engaging AI Education @@ -0,0 +1 @@ +As artificial intelligence (AI) is playing an increasingly important role in our society and global economy, AI education and literacy have become necessary components in college and K-12 education to prepare students for an AI-powered society. However, current AI curricula have not yet been made accessible and engaging enough for students and schools from all socio-economic backgrounds with different educational goals. In this work, we developed an open-source learning module for college and high school students, which allows students to build their own robot companion from the ground up. This open platform can be used to provide hands-on experience and introductory knowledge about various aspects of AI, including robotics, machine learning (ML), software engineering, and mechanical engineering. Because of the social and personal nature of a socially assistive robot companion, this module also puts a special emphasis on human-centered AI, enabling students to develop a better understanding of human-AI interaction and AI ethics through hands-on learning activities. With open-source documentation, assembly manuals, and affordable materials, students from different socio-economic backgrounds can personalize their learning experience based on their individual educational goals. To evaluate the student-perceived quality of our module, we conducted a usability testing workshop with 15 college students recruited from a minority-serving institution.
Our results indicate that our AI module is effective, easy-to-follow, and engaging, and it increases student interest in studying AI/ML and robotics in the future. We hope that this work will contribute toward accessible and engaging AI education in human-AI interaction for college and high school students. \ No newline at end of file diff --git a/data/2024/aaai/Building Conversational Artifacts to Enable Digital Assistant for APIs and RPAs b/data/2024/aaai/Building Conversational Artifacts to Enable Digital Assistant for APIs and RPAs new file mode 100644 index 0000000000..48cfa26d9f --- /dev/null +++ b/data/2024/aaai/Building Conversational Artifacts to Enable Digital Assistant for APIs and RPAs @@ -0,0 +1 @@ +In the realm of business automation, digital assistants/chatbots are emerging as the primary method for making automation software accessible to users in various business sectors. Access to automation primarily occurs through APIs and RPAs. To effectively convert APIs and RPAs into chatbots on a larger scale, it is crucial to establish an automated process for generating data and training models that can recognize user intentions, identify questions for conversational slot filling, and provide recommendations for subsequent actions. In this paper, we present a technique for enhancing and generating natural language conversational artifacts from API specifications using large language models (LLMs). The goal is to utilize LLMs in the "build" phase to assist humans in creating skills for digital assistants. As a result, the system doesn't need to rely on LLMs during conversations with business users, leading to efficient deployment. Experimental results highlight the effectiveness of our proposed approach. Our system is deployed in the IBM Watson Orchestrate product for general availability. \ No newline at end of file diff --git a/data/2024/aaai/Building Higher-Order Abstractions from the Components of Recommender Systems b/data/2024/aaai/Building Higher-Order Abstractions from the Components of Recommender Systems new file mode 100644 index 0000000000..b464718f74 --- /dev/null +++ b/data/2024/aaai/Building Higher-Order Abstractions from the Components of Recommender Systems @@ -0,0 +1 @@ +We present a modular recommender system framework that tightly integrates yet maintains the independence of individual components, thus satisfying two of the most critical aspects of industrial applications, generality and specificity. On the one hand, we ensure that each component remains self-contained and is ready to serve in other applications beyond recommender systems. On the other hand, when these components are combined, a unified theme emerges for recommender systems. We present the details of each component in the context of recommender systems and other applications. We release each component as an open-source library, and most importantly, we release their integration under MAB2REC, an industry-strength open-source software for building bandit-based recommender systems. By bringing standalone components together, Mab2Rec realizes a powerful and scalable toolchain to build and deploy business-relevant personalization applications. Finally, we share our experience and best practices for user training, adoption, performance evaluation, deployment, and model governance within the enterprise and the broader community. 
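As a concrete illustration of the bandit-based recommendation paradigm that Mab2Rec builds on, here is a generic epsilon-greedy sketch. It deliberately avoids the Mab2Rec API, whose interfaces are not shown here, and every name in it is illustrative rather than taken from the library.

```python
# Generic epsilon-greedy bandit recommender loop, shown only to illustrate the
# bandit-based recommendation idea referenced above. This is NOT the Mab2Rec API;
# consult the project's own documentation for its actual interfaces.
import random
from collections import defaultdict

class EpsilonGreedyRecommender:
    def __init__(self, items, epsilon=0.1):
        self.items = list(items)
        self.epsilon = epsilon
        self.counts = defaultdict(int)     # how often each item was recommended
        self.rewards = defaultdict(float)  # cumulative reward (e.g., clicks) per item

    def recommend(self):
        if random.random() < self.epsilon:               # explore
            return random.choice(self.items)
        # exploit: pick the item with the highest empirical mean reward so far
        return max(self.items,
                   key=lambda i: self.rewards[i] / self.counts[i] if self.counts[i] else 0.0)

    def update(self, item, reward):
        self.counts[item] += 1
        self.rewards[item] += reward

# Toy simulation: item "b" has the highest click probability and should dominate.
random.seed(0)
rec = EpsilonGreedyRecommender(["a", "b", "c"])
true_ctr = {"a": 0.1, "b": 0.5, "c": 0.2}
for _ in range(2000):
    item = rec.recommend()
    rec.update(item, 1.0 if random.random() < true_ctr[item] else 0.0)
print(max(rec.counts, key=rec.counts.get))  # most-recommended item, likely "b"
```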
\ No newline at end of file diff --git a/data/2024/aaai/Building Minimal and Reusable Causal State Abstractions for Reinforcement Learning b/data/2024/aaai/Building Minimal and Reusable Causal State Abstractions for Reinforcement Learning new file mode 100644 index 0000000000..032fe0e9b5 --- /dev/null +++ b/data/2024/aaai/Building Minimal and Reusable Causal State Abstractions for Reinforcement Learning @@ -0,0 +1,5 @@ +Two desiderata of reinforcement learning (RL) algorithms are the ability to learn from relatively little experience and the ability to learn policies that generalize to a range of problem specifications. +In factored state spaces, one approach towards achieving both goals is to learn state abstractions, which only keep the necessary variables for learning the tasks at hand. +This paper introduces Causal Bisimulation Modeling (CBM), a method that learns the causal relationships in the dynamics and reward functions for each task to derive a minimal, task-specific abstraction. +CBM leverages and improves implicit modeling to train a high-fidelity causal dynamics model that can be reused for all tasks in the same environment. +Empirical validation on two manipulation environments and four tasks reveals that CBM's learned implicit dynamics models identify the underlying causal relationships and state abstractions more accurately than explicit ones. Furthermore, the derived state abstractions allow a task learner to achieve near-oracle levels of sample efficiency and outperform baselines on all tasks. \ No newline at end of file diff --git a/data/2024/aaai/Building Variable-Sized Models via Learngene Pool b/data/2024/aaai/Building Variable-Sized Models via Learngene Pool new file mode 100644 index 0000000000..dd859b057f --- /dev/null +++ b/data/2024/aaai/Building Variable-Sized Models via Learngene Pool @@ -0,0 +1,2 @@ +Recently, Stitchable Neural Networks (SN-Net) was proposed to stitch pre-trained networks for quickly building numerous networks with different complexity and performance trade-offs. In this way, the burdens of designing or training variable-sized networks, which can be used in application scenarios with diverse resource constraints, are alleviated. However, SN-Net still faces a few challenges. 1) Stitching from multiple independently pre-trained anchors introduces high storage resource consumption. 2) SN-Net faces challenges in building smaller models for low resource constraints. 3) SN-Net uses an unlearned initialization method for stitch layers, limiting the final performance. +To overcome these challenges, motivated by the recently proposed Learngene framework, we propose a novel method called Learngene Pool. Briefly, Learngene distills the critical knowledge from a large pre-trained model into a small part (termed as learngene) and then expands this small part into a few variable-sized models. In our proposed method, we distill one pre-trained large model into multiple small models whose network blocks are used as learngene instances to construct the learngene pool. Since only one large model is used, we do not need to store multiple large models as SN-Net does, and after distillation, smaller learngene instances can be created to build small models to satisfy low resource constraints. We also insert learnable transformation matrices between the instances to stitch them into variable-sized models to improve the performance of these models.
Exhaustive experiments have been conducted, and the results validate the effectiveness of the proposed Learngene Pool compared with SN-Net. \ No newline at end of file diff --git a/data/2024/aaai/Byzantine-Robust Decentralized Learning via Remove-then-Clip Aggregation b/data/2024/aaai/Byzantine-Robust Decentralized Learning via Remove-then-Clip Aggregation new file mode 100644 index 0000000000..b2fcbc7f49 --- /dev/null +++ b/data/2024/aaai/Byzantine-Robust Decentralized Learning via Remove-then-Clip Aggregation @@ -0,0 +1,4 @@ +We consider decentralized learning over a network of workers with heterogeneous datasets, in the presence of Byzantine workers. +Byzantine workers may transmit arbitrary or malicious values to neighboring workers, leading to degradation in overall performance. The heterogeneous nature of the training data across various workers complicates the identification and mitigation of Byzantine workers. +To address this complex problem, we introduce a resilient decentralized learning approach that combines the gradient descent algorithm with a novel robust aggregator. Specifically, we propose a remove-then-clip aggregator, whereby each benign worker meticulously filters the neighbors' values and subsequently projects the remaining values to a sphere centered at its local value, with an appropriately selected radius. +We prove that our proposed method converges to a neighborhood of a stationary point for non-convex objectives under standard assumptions. Furthermore, empirical evaluations are provided to demonstrate the superior performance of our method in comparison to existing algorithms, under various Byzantine attack models. \ No newline at end of file diff --git a/data/2024/aaai/CAMEL: Capturing Metaphorical Alignment with Context Disentangling for Multimodal Emotion Recognition b/data/2024/aaai/CAMEL: Capturing Metaphorical Alignment with Context Disentangling for Multimodal Emotion Recognition new file mode 100644 index 0000000000..364cf77163 --- /dev/null +++ b/data/2024/aaai/CAMEL: Capturing Metaphorical Alignment with Context Disentangling for Multimodal Emotion Recognition @@ -0,0 +1 @@ +Understanding the emotional polarity of multimodal content with metaphorical characteristics, such as memes, poses a significant challenge in Multimodal Emotion Recognition (MER). Previous MER research has overlooked the phenomenon of metaphorical alignment in multimedia content, which involves non-literal associations between concepts to convey implicit emotional tones. Metaphor-agnostic MER methods may be misinformed by the isolated unimodal emotions, which are distinct from the real emotions blended in multimodal metaphors. Moreover, contextual semantics can further affect the emotions associated with similar metaphors, leading to the challenge of maintaining contextual compatibility. To address the issue of metaphorical alignment in MER, we propose to leverage a conditional generative approach for capturing metaphorical analogies. Our approach formulates schematic prompts and corresponding references based on theoretical foundations, which allows the model to better grasp metaphorical nuances. In order to maintain contextual sensitivity, we incorporate a disentangled contrastive matching mechanism, which undergoes curricular adjustment to regulate its intensity during the learning process.
The automatic and human evaluation experiments on two benchmarks show that our model provides considerable and stable improvements in recognizing multimodal emotion with metaphor attributes. \ No newline at end of file diff --git a/data/2024/aaai/CAR-Transformer: Cross-Attention Reinforcement Transformer for Cross-Lingual Summarization b/data/2024/aaai/CAR-Transformer: Cross-Attention Reinforcement Transformer for Cross-Lingual Summarization new file mode 100644 index 0000000000..59cc843b9a --- /dev/null +++ b/data/2024/aaai/CAR-Transformer: Cross-Attention Reinforcement Transformer for Cross-Lingual Summarization @@ -0,0 +1 @@ +Cross-Lingual Summarization (CLS) involves generating a summary for a given document in another language. Most of the existing approaches adopt multi-task training and knowledge distillation, which increases the training cost and improves the performance of CLS tasks intuitively but unexplainably. In this work, we propose a Cross-Attention Reinforcement (CAR) module and incorporate the module into the transformer backbone to formulate the CAR-Transformer. The CAR module formulates a pseudo summarization policy parameterized by the cross-attention weights reinforced by the ground-truth monolingual summary without introducing extra model parameters. Our approach demonstrates more consistent improvement across CLS tasks compared to traditional multi-task training methods and outperforms the fine-tuned vanilla mBART by 3.67 and the best-performing multi-task training approach by 1.48 in ROUGE-L F1 score on the WikiLingua Korean-to-English CLS task. \ No newline at end of file diff --git a/data/2024/aaai/CARAT: Contrastive Feature Reconstruction and Aggregation for Multi-Modal Multi-Label Emotion Recognition b/data/2024/aaai/CARAT: Contrastive Feature Reconstruction and Aggregation for Multi-Modal Multi-Label Emotion Recognition new file mode 100644 index 0000000000..e6ce31557d --- /dev/null +++ b/data/2024/aaai/CARAT: Contrastive Feature Reconstruction and Aggregation for Multi-Modal Multi-Label Emotion Recognition @@ -0,0 +1 @@ +Multi-modal multi-label emotion recognition (MMER) aims to identify relevant emotions from multiple modalities. The challenge of MMER is how to effectively capture discriminative features for multiple labels from heterogeneous data. Recent studies are mainly devoted to exploring various fusion strategies to integrate multi-modal information into a unified representation for all labels. However, such a learning scheme not only overlooks the specificity of each modality but also fails to capture individual discriminative features for different labels. Moreover, dependencies of labels and modalities cannot be effectively modeled. To address these issues, this paper presents ContrAstive feature Reconstruction and AggregaTion (CARAT) for the MMER task. Specifically, we devise a reconstruction-based fusion mechanism to better model fine-grained modality-to-label dependencies by contrastively learning modal-separated and label-specific features. To further exploit the modality complementarity, we introduce a shuffle-based aggregation strategy to enrich co-occurrence collaboration among labels. Experiments on two benchmark datasets, CMU-MOSEI and M3ED, demonstrate the effectiveness of CARAT over state-of-the-art methods. Code is available at https://github.com/chengzju/CARAT.
\ No newline at end of file diff --git a/data/2024/aaai/CASE: Exploiting Intra-class Compactness and Inter-class Separability of Feature Embeddings for Out-of-Distribution Detection b/data/2024/aaai/CASE: Exploiting Intra-class Compactness and Inter-class Separability of Feature Embeddings for Out-of-Distribution Detection new file mode 100644 index 0000000000..551e0f43ca --- /dev/null +++ b/data/2024/aaai/CASE: Exploiting Intra-class Compactness and Inter-class Separability of Feature Embeddings for Out-of-Distribution Detection @@ -0,0 +1 @@ +Detecting out-of-distribution (OOD) inputs is critical for reliable machine learning, but deep neural networks often make overconfident predictions, even for OOD inputs that deviate from the distribution of training data. Prior methods rely on the widely used softmax cross-entropy (CE) loss, which is adequate for classifying in-distribution (ID) samples but not optimally designed for OOD detection. To address this issue, we propose CASE, a simple and effective OOD detection method that explicitly improves intra-class Compactness And inter-class Separability of feature Embeddings. To enhance the separation between ID and OOD samples, CASE uses a dual-loss framework, which includes a separability loss that maximizes the inter-class Euclidean distance to promote separability among different class centers, along with a compactness loss that minimizes the intra-class Euclidean distance to encourage samples to be close to their class centers. In particular, the class centers are defined as a free optimization parameter of the model and updated by gradient descent, which is simple and further enhances the OOD detection performance. Extensive experiments demonstrate the superiority of CASE, which reduces the average FPR95 by 37.11% and improves the average AUROC by 15.89% compared to the baseline method using a softmax confidence score on the more challenging CIFAR-100 model. \ No newline at end of file diff --git a/data/2024/aaai/CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments b/data/2024/aaai/CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments new file mode 100644 index 0000000000..faea51f845 --- /dev/null +++ b/data/2024/aaai/CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments @@ -0,0 +1 @@ +Audio-visual navigation of an agent towards locating an audio goal is a challenging task, especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpreting free-form, potentially noisy responses from the oracle based on the audio-visual context.
To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language-based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large-scale dataset, AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that use only uni-directional interaction. \ No newline at end of file diff --git a/data/2024/aaai/CCTR: Calibrating Trajectory Prediction for Uncertainty-Aware Motion Planning in Autonomous Driving b/data/2024/aaai/CCTR: Calibrating Trajectory Prediction for Uncertainty-Aware Motion Planning in Autonomous Driving new file mode 100644 index 0000000000..53d64bb930 --- /dev/null +++ b/data/2024/aaai/CCTR: Calibrating Trajectory Prediction for Uncertainty-Aware Motion Planning in Autonomous Driving @@ -0,0 +1 @@ +Autonomous driving systems rely on precise trajectory prediction for safe and efficient motion planning. Despite considerable efforts to enhance prediction accuracy, inherent uncertainties persist due to data noise and incomplete observations. Many strategies entail formalizing prediction outcomes into distributions and utilizing variance to represent uncertainty. However, our experimental investigation reveals that existing trajectory prediction models yield unreliable uncertainty estimates, necessitating additional customized calibration processes. On the other hand, directly applying current calibration techniques to prediction outputs may yield sub-optimal results due to using a universal scaler for all predictions and neglecting informative data cues. In this paper, we propose Customized Calibration Temperature with Regularizer (CCTR), a generic framework that calibrates the output distribution. Specifically, CCTR 1) employs a calibration-based regularizer to align output variance with the discrepancy between prediction and ground truth and 2) generates a tailor-made temperature scaler for each prediction using a post-processing network guided by context and historical information. Extensive evaluation involving multiple prediction and planning methods demonstrates the superiority of CCTR over existing calibration algorithms and uncertainty-aware methods, with significant improvements of 11%-22% in calibration quality and 17%-46% in motion planning. \ No newline at end of file diff --git a/data/2024/aaai/CDPNet: Cross-Modal Dual Phases Network for Point Cloud Completion b/data/2024/aaai/CDPNet: Cross-Modal Dual Phases Network for Point Cloud Completion new file mode 100644 index 0000000000..7361ad734d --- /dev/null +++ b/data/2024/aaai/CDPNet: Cross-Modal Dual Phases Network for Point Cloud Completion @@ -0,0 +1 @@ +Point cloud completion aims at completing shapes from their partial observations. Most existing methods utilize shape prior information for point cloud completion, such as inputting the partial point cloud and obtaining the complete one through an encoder-decoder deep learning structure.
However, the generation process often loses information because the missing areas are not visible. Unlike most existing methods, which directly infer the missing points using shape priors, we address completion as a cross-modality task. We propose a new Cross-modal Dual Phases Network (CDPNet) for shape completion. Our key idea is that the global information of the shape is obtained from the extra single-view image, and the partial point clouds provide the geometric information. The multi-modal features then jointly guide the recovery of the specific structural information. To learn the geometric details of the shape, we use patches to preserve local geometric features. In this way, we can generate shapes with enough geometric details. Experimental results show that our method achieves state-of-the-art performance on point cloud completion. \ No newline at end of file diff --git a/data/2024/aaai/CEDFlow: Latent Contour Enhancement for Dark Optical Flow Estimation b/data/2024/aaai/CEDFlow: Latent Contour Enhancement for Dark Optical Flow Estimation new file mode 100644 index 0000000000..f01bf8ec9b --- /dev/null +++ b/data/2024/aaai/CEDFlow: Latent Contour Enhancement for Dark Optical Flow Estimation @@ -0,0 +1 @@ +Accurately computing optical flow in low-contrast and noisy dark images is challenging, especially when contour information is degraded or difficult to extract. This paper proposes CEDFlow, a latent-space contour enhancement method for estimating optical flow in dark environments. By leveraging spatial frequency feature decomposition, CEDFlow effectively encodes local and global motion features. Importantly, we introduce the 2nd-order Gaussian difference operation to select salient contour features in the latent space precisely. It is specifically designed for large-scale contour components essential in dark optical flow estimation. Experimental results on the FCDN and VBOF datasets demonstrate that CEDFlow outperforms state-of-the-art methods in terms of the EPE index and produces more accurate and robust flow estimation. Our code is available at: https://github.com/xautstuzfy. \ No newline at end of file diff --git a/data/2024/aaai/CEGAR-Based Approach for Solving Combinatorial Optimization Modulo Quantified Linear Arithmetics Problems b/data/2024/aaai/CEGAR-Based Approach for Solving Combinatorial Optimization Modulo Quantified Linear Arithmetics Problems new file mode 100644 index 0000000000..8f3e864321 --- /dev/null +++ b/data/2024/aaai/CEGAR-Based Approach for Solving Combinatorial Optimization Modulo Quantified Linear Arithmetics Problems @@ -0,0 +1 @@ +Bioinformatics has always been a prolific domain for generating complex satisfiability and optimization problems. For instance, the synthesis of multi-scale models of biological networks has recently been associated with the resolution of optimization problems mixing Boolean logic and universally quantified linear constraints (OPT+qLP), which can be benchmarked on real-world models. In this paper, we introduce a Counter-Example-Guided Abstraction Refinement (CEGAR) approach to solve such problems efficiently. Our CEGAR exploits monotone properties inherent to linear optimization in order to generalize counter-examples of Boolean relaxations. We implemented our approach by extending the Answer Set Programming (ASP) solver Clingo with a quantified linear constraints propagator. Our prototype exploits the independence of sub-formulas to further strengthen the generalization of counter-examples.
We evaluate the impact of refinement and partitioning on two sets of OPT+qLP problems inspired by system biology. Additionally, we conducted a comparison with the state-of-the-art ASP solver Clingo[lpx] that handles non-quantified linear constraints, showing the advantage of our CEGAR approach for solving large problems. \ No newline at end of file diff --git a/data/2024/aaai/CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning b/data/2024/aaai/CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning new file mode 100644 index 0000000000..5d085cad77 --- /dev/null +++ b/data/2024/aaai/CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning @@ -0,0 +1 @@ +Neural Radiance Fields have demonstrated impressive performance in novel view synthesis. However, NeRF and most of its variants still rely on traditional complex pipelines to provide extrinsic and intrinsic camera parameters, such as COLMAP. Recent works, like NeRFmm, BARF, and L2G-NeRF, directly treat camera parameters as learnable and estimate them through differential volume rendering. However, these methods work for forward-looking scenes with slight motions and fail to tackle the rotation scenario in practice. To overcome this limitation, we propose a novel camera parameter free neural radiance field (CF-NeRF), which incrementally reconstructs 3D representations and recovers the camera parameters inspired by incremental structure from motion. Given a sequence of images, CF-NeRF estimates camera parameters of images one by one and reconstructs the scene through initialization, implicit localization, and implicit optimization. To evaluate our method, we use a challenging real-world dataset, NeRFBuster, which provides 12 scenes under complex trajectories. Results demonstrate that CF-NeRF is robust to rotation and achieves state-of-the-art results without providing prior information and constraints. \ No newline at end of file diff --git a/data/2024/aaai/CFEVER: A Chinese Fact Extraction and VERification Dataset b/data/2024/aaai/CFEVER: A Chinese Fact Extraction and VERification Dataset new file mode 100644 index 0000000000..86d7b6ea29 --- /dev/null +++ b/data/2024/aaai/CFEVER: A Chinese Fact Extraction and VERification Dataset @@ -0,0 +1 @@ +We present CFEVER, a Chinese dataset designed for Fact Extraction and VERification. CFEVER comprises 30,012 manually created claims based on content in Chinese Wikipedia. Each claim in CFEVER is labeled as “Supports”, “Refutes”, or “Not Enough Info” to depict its degree of factualness. Similar to the FEVER dataset, claims in the “Supports” and “Refutes” categories are also annotated with corresponding evidence sentences sourced from single or multiple pages in Chinese Wikipedia. Our labeled dataset holds a Fleiss’ kappa value of 0.7934 for five-way inter-annotator agreement. In addition, through the experiments with the state-of-the-art approaches developed on the FEVER dataset and a simple baseline for CFEVER, we demonstrate that our dataset is a new rigorous benchmark for factual extraction and verification, which can be further used for developing automated systems to alleviate human fact-checking efforts. CFEVER is available at https://ikmlab.github.io/CFEVER. 
\ No newline at end of file diff --git a/data/2024/aaai/CFR-ICL: Cascade-Forward Refinement with Iterative Click Loss for Interactive Image Segmentation b/data/2024/aaai/CFR-ICL: Cascade-Forward Refinement with Iterative Click Loss for Interactive Image Segmentation new file mode 100644 index 0000000000..2ca3965309 --- /dev/null +++ b/data/2024/aaai/CFR-ICL: Cascade-Forward Refinement with Iterative Click Loss for Interactive Image Segmentation @@ -0,0 +1 @@ +Click-based interactive segmentation aims to extract the object of interest from an image with the guidance of user clicks. Recent work has achieved great overall performance by employing feedback from the output. However, in most state-of-the-art approaches, 1) the inference stage involves inflexible heuristic rules and requires a separate refinement model, and 2) the number of user clicks and model performance cannot be balanced. To address the challenges, we propose a click-based and mask-guided interactive image segmentation framework containing three novel components: Cascade-Forward Refinement (CFR), Iterative Click Loss (ICL), and SUEM image augmentation. The CFR offers a unified inference framework to generate segmentation results in a coarse-to-fine manner. The proposed ICL allows model training to improve segmentation and reduce user interactions simultaneously. The proposed SUEM augmentation is a comprehensive way to create large and diverse training sets for interactive image segmentation. Extensive experiments demonstrate the state-of-the-art performance of the proposed approach on five public datasets. Remarkably, our model reduces the number of clicks required to surpass an IoU of 0.95 by 33.2% and 15.5% relative to the previous state-of-the-art approach on the Berkeley and DAVIS sets, respectively. \ No newline at end of file diff --git a/data/2024/aaai/CGMGM: A Cross-Gaussian Mixture Generative Model for Few-Shot Semantic Segmentation b/data/2024/aaai/CGMGM: A Cross-Gaussian Mixture Generative Model for Few-Shot Semantic Segmentation new file mode 100644 index 0000000000..c6f0b2f2c2 --- /dev/null +++ b/data/2024/aaai/CGMGM: A Cross-Gaussian Mixture Generative Model for Few-Shot Semantic Segmentation @@ -0,0 +1 @@ +Few-shot semantic segmentation (FSS) aims to segment unseen objects in a query image using a few pixel-wise annotated support images, thus expanding the capabilities of semantic segmentation. The main challenge lies in extracting sufficient information from the limited support images to guide the segmentation process. Conventional methods typically address this problem by generating single or multiple prototypes from the support images and calculating their cosine similarity to the query image. However, these methods often fail to capture meaningful information for modeling the de facto joint distribution of pixel and category. Consequently, they result in incomplete segmentation of foreground objects and mis-segmentation of the complex background. To overcome this issue, we propose the Cross Gaussian Mixture Generative Model (CGMGM), a novel Gaussian Mixture Models (GMMs)-based FSS method, which establishes the joint distribution of pixel and category in both the support and query images. Specifically, our method initially matches the feature representations of the query image with those of the support images to generate and refine an initial segmentation mask. It then employs GMMs to accurately model the joint distribution of foreground and background using the support masks and the initial segmentation mask.
Subsequently, a parametric decoder applies Bayes' theorem to the joint distribution to obtain the posterior probability of the pixels in the query image and generate the final segmentation mask. Experimental results on the PASCAL-5i and COCO-20i datasets demonstrate our CGMGM's effectiveness and superior performance compared to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/CGS-Mask: Making Time Series Predictions Intuitive for All b/data/2024/aaai/CGS-Mask: Making Time Series Predictions Intuitive for All new file mode 100644 index 0000000000..654ca7ac9e --- /dev/null +++ b/data/2024/aaai/CGS-Mask: Making Time Series Predictions Intuitive for All @@ -0,0 +1 @@ +Artificial intelligence (AI) has immense potential in time series prediction, but most explainable tools have limited capabilities in providing a systematic understanding of important features over time. These tools typically rely on evaluating a single time point, overlook the time ordering of inputs, and neglect the time-sensitive nature of time series applications. These factors make it difficult for users, particularly those without domain knowledge, to comprehend AI model decisions and obtain meaningful explanations. We propose CGS-Mask, a post-hoc and model-agnostic cellular genetic strip mask-based saliency approach to address these challenges. CGS-Mask uses consecutive time steps as a cohesive entity to evaluate the impact of features on the final prediction, providing binary and sustained feature importance scores over time. Our algorithm optimizes the mask population iteratively to obtain the optimal mask in a reasonable time. We evaluated CGS-Mask on synthetic and real-world datasets, and it outperformed state-of-the-art methods in elucidating the importance of features over time. According to our pilot user study via a questionnaire survey, CGS-Mask is the most effective approach in presenting easily understandable time series prediction results, enabling users to comprehend the decision-making process of AI models with ease. \ No newline at end of file diff --git a/data/2024/aaai/CHICOT: A Developer-Assistance Toolkit for Code Search with High-Level Contextual Information b/data/2024/aaai/CHICOT: A Developer-Assistance Toolkit for Code Search with High-Level Contextual Information new file mode 100644 index 0000000000..b707174741 --- /dev/null +++ b/data/2024/aaai/CHICOT: A Developer-Assistance Toolkit for Code Search with High-Level Contextual Information @@ -0,0 +1,5 @@ +We propose a source code search system named CHICOT (Code search with HIgh level COnText) to assist developers in reusing existing code. +While previous studies have examined code search on the basis of code-level, fine-grained specifications such as functionality, logic, or implementation, CHICOT addresses a unique mission: code search with high-level contextual information, such as the purpose or domain of a developer's project. +It achieves this feature by first extracting the context information from codebases and then considering this context during the search. +It provides a VSCode plugin for daily coding assistance, and the built-in crawler ensures up-to-date code suggestions. +The case study attests to the utility of CHICOT in real-world scenarios.
\ No newline at end of file diff --git a/data/2024/aaai/CHRONOS: A Schema-Based Event Understanding and Prediction System b/data/2024/aaai/CHRONOS: A Schema-Based Event Understanding and Prediction System new file mode 100644 index 0000000000..c64dc7ac98 --- /dev/null +++ b/data/2024/aaai/CHRONOS: A Schema-Based Event Understanding and Prediction System @@ -0,0 +1 @@ +Chronological and Hierarchical Reasoning Over Naturally Occurring Schemas (CHRONOS) is a system that combines language model-based natural language processing with symbolic knowledge representations to analyze and make predictions about newsworthy events. CHRONOS consists of an event-centric information extraction pipeline and a complex event schema instantiation and prediction system. Resulting predictions are detailed with arguments, event types from Wikidata, schema-based justifications, and source document provenance. We evaluate our system by its ability to capture the structure of unseen events described in news articles and make plausible predictions as judged by human annotators. \ No newline at end of file diff --git a/data/2024/aaai/CI-STHPAN: Pre-trained Attention Network for Stock Selection with Channel-Independent Spatio-Temporal Hypergraph b/data/2024/aaai/CI-STHPAN: Pre-trained Attention Network for Stock Selection with Channel-Independent Spatio-Temporal Hypergraph new file mode 100644 index 0000000000..dd0f68fa6b --- /dev/null +++ b/data/2024/aaai/CI-STHPAN: Pre-trained Attention Network for Stock Selection with Channel-Independent Spatio-Temporal Hypergraph @@ -0,0 +1 @@ +Quantitative stock selection is one of the most challenging FinTech tasks due to the non-stationary dynamics and complex market dependencies. Existing studies rely on channel mixing methods, exacerbating the issue of distribution shift in financial time series. Additionally, the complex model structures they build make it difficult to handle very long sequences. Furthermore, most of them are based on predefined stock relationships, making it difficult to capture dynamic and highly volatile stock markets. To address the above issues, in this paper, we propose Channel-Independent based Spatio-Temporal Hypergraph Pre-trained Attention Networks (CI-STHPAN), a two-stage framework for stock selection, involving Transformer and HGAT based stock time series self-supervised pre-training and stock-ranking based downstream task fine-tuning. We calculate the similarity of stock time series of different channels in dynamic intervals based on Dynamic Time Warping (DTW), and further construct a channel-independent stock dynamic hypergraph based on the similarity. Experiments with NASDAQ and NYSE market data over five years show that our framework outperforms SOTA approaches in terms of investment return ratio (IRR) and Sharpe ratio (SR). Additionally, we find that even without introducing graph information, self-supervised learning based on the vanilla Transformer Encoder also surpasses SOTA results. Notable improvements are gained on the NYSE market. This is mainly attributed to the improvement of the fine-tuning approach on Information Coefficient (IC) and Information Ratio based IC (ICIR), indicating that the fine-tuning method enhances the accuracy and stability of the model prediction.
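The DTW-based hypergraph construction described in the CI-STHPAN abstract above can be illustrated with a small sketch: compute pairwise DTW distances between per-channel stock series and group sufficiently similar stocks into hyperedges. The plain quadratic DTW and the fixed threshold below are simplifying assumptions, not the CI-STHPAN implementation.

```python
# Illustrative sketch (assumptions: plain DTW, a fixed similarity threshold) of
# building stock hyperedges from time-series similarity, in the spirit of the
# construction the abstract outlines.
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Classic O(len(x)*len(y)) dynamic-time-warping distance between 1-D series."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def build_hyperedges(series: np.ndarray, threshold: float):
    """series: (num_stocks, time). One hyperedge per stock: itself plus all stocks within the DTW threshold."""
    n = series.shape[0]
    edges = []
    for i in range(n):
        members = [j for j in range(n)
                   if i == j or dtw_distance(series[i], series[j]) <= threshold]
        edges.append(members)
    return edges

# Toy usage on random "price" channels for 5 stocks over 30 days.
rng = np.random.default_rng(0)
prices = rng.normal(size=(5, 30)).cumsum(axis=1)
print(build_hyperedges(prices, threshold=15.0))
```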
\ No newline at end of file diff --git a/data/2024/aaai/CIDR: A Cooperative Integrated Dynamic Refining Method for Minimal Feature Removal Problem b/data/2024/aaai/CIDR: A Cooperative Integrated Dynamic Refining Method for Minimal Feature Removal Problem new file mode 100644 index 0000000000..49579eea14 --- /dev/null +++ b/data/2024/aaai/CIDR: A Cooperative Integrated Dynamic Refining Method for Minimal Feature Removal Problem @@ -0,0 +1,2 @@ +The minimal feature removal problem in the post-hoc explanation area aims to identify the minimal feature set (MFS). Prior studies that use the greedy algorithm to calculate the minimal feature set do not explore feature interactions and rely on a monotonicity assumption that cannot be satisfied in general scenarios. In order to address the above limitations, +we propose a Cooperative Integrated Dynamic Refining method (CIDR) to efficiently discover minimal feature sets. Specifically, we design Cooperative Integrated Gradients (CIG) to detect interactions between features. By incorporating CIG and characteristics of the minimal feature set, we transform the minimal feature removal problem into a knapsack problem. Additionally, we devise an auxiliary Minimal Feature Refinement algorithm to determine the minimal feature set from numerous candidate sets. To the best of our knowledge, our work is the first to address the minimal feature removal problem in the field of natural language processing. Extensive experiments demonstrate that CIDR is capable of tracing representative minimal feature sets with improved interpretability across various models and datasets. \ No newline at end of file diff --git a/data/2024/aaai/CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation b/data/2024/aaai/CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation new file mode 100644 index 0000000000..5260e3dea2 --- /dev/null +++ b/data/2024/aaai/CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation @@ -0,0 +1 @@ +New NLP benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present a meticulously designed evaluation benchmark that leverages the knowledge graph. This evaluation comprises 584 level-1 knowledge points and 1,989 level-2 knowledge points, thereby encompassing a comprehensive spectrum of K12 education domain knowledge. The primary objective is to comprehensively assess the high-level comprehension aptitude and reasoning capabilities of LLMs operating within the Chinese context. Our evaluation incorporates five distinct question types with 39,452 questions. We test current mainstream LLMs in three distinct modes. Firstly, four prompt evaluation modes were employed to assess the fundamental capacity. Additionally, for choice questions, a result-oriented evaluation approach was designed through data augmentation to assess the model's proficiency in advanced knowledge and reasoning. Moreover, a subset with reasoning processes is derived, and the process-oriented testing method is used to test the model's interpretability and higher-order reasoning capacity. We further report models' capabilities on our knowledge points, and anticipate that the evaluation can assist in the assessment of the strengths and deficiencies of LLMs on knowledge points, thus fostering their development within the Chinese context. Our dataset will be publicly available at https://github.com/tal-tech/chinese-k12-evaluation.
\ No newline at end of file diff --git a/data/2024/aaai/CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer b/data/2024/aaai/CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer new file mode 100644 index 0000000000..f816fc0225 --- /dev/null +++ b/data/2024/aaai/CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer @@ -0,0 +1 @@ +Cross-lingual cross-modal retrieval has garnered increasing attention recently, which aims to achieve the alignment between vision and target language (V-T) without using any annotated V-T data pairs. Current methods employ machine translation (MT) to construct pseudo-parallel data pairs, which are then used to learn a multi-lingual and multi-modal embedding space that aligns visual and target-language representations. However, the large heterogeneous gap between vision and text, along with the noise present in target language translations, poses significant challenges in effectively aligning their representations. To address these challenges, we propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and target language using cross-lingual transfer. This approach allows us to fully leverage the merits of multi-lingual pre-trained models (e.g., mBERT) and the benefits of the same modality structure, i.e., smaller gap, to provide reliable and comprehensive semantic correspondence (knowledge) for the cross-modal network. We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX. The results clearly demonstrate the effectiveness of our proposed method and its high potential for large-scale retrieval. \ No newline at end of file diff --git a/data/2024/aaai/CLIM: Contrastive Language-Image Mosaic for Region Representation b/data/2024/aaai/CLIM: Contrastive Language-Image Mosaic for Region Representation new file mode 100644 index 0000000000..c1401b7a95 --- /dev/null +++ b/data/2024/aaai/CLIM: Contrastive Language-Image Mosaic for Region Representation @@ -0,0 +1,2 @@ +Detecting objects accurately from a large or open vocabulary necessitates the vision-language alignment on region representations. However, learning such a region-text alignment by obtaining high-quality box annotations with text labels or descriptions is expensive and infeasible. In contrast, collecting image-text pairs is simpler but lacks precise object location information to associate regions with texts. In this paper, we propose a novel approach called Contrastive Language-Image Mosaic (CLIM), which leverages large-scale image-text pairs effectively for aligning region and text representations. CLIM combines multiple images into a mosaicked image and treats each image as a ‘pseudo region’. The feature of each pseudo region is extracted and trained to be similar to the corresponding text embedding while dissimilar from others by a contrastive loss, enabling the model to learn the region-text alignment without costly box annotations. As a generally +applicable approach, CLIM consistently improves different open-vocabulary object detection methods that use caption supervision. Furthermore, CLIM can effectively enhance the region representation of vision-language models, thus providing stronger backbones for open-vocabulary object detectors. 
Our experimental results demonstrate that CLIM improves different baseline open-vocabulary object detectors by a large margin on both the OV-COCO and OV-LVIS benchmarks. The code is available at https://github.com/wusize/CLIM. \ No newline at end of file diff --git a/data/2024/aaai/CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model b/data/2024/aaai/CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model new file mode 100644 index 0000000000..ec25bd8d02 --- /dev/null +++ b/data/2024/aaai/CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model @@ -0,0 +1 @@ +Gaze estimation methods often experience significant performance degradation when evaluated across different domains, due to the domain gap between the testing and training data. Existing methods try to address this issue using various domain generalization approaches, but with little success because of the limited diversity of gaze datasets in factors such as appearance, wearables, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to leverage the vision-and-language cross-modality approach for the gaze estimation task. Specifically, we extract the gaze-relevant feature by pushing it away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context optimization method for text prompt tuning. Furthermore, we utilize the relationship among gaze samples to refine the distribution of gaze-relevant features, thereby improving the generalization capability of the gaze estimation model. Extensive experiments demonstrate the excellent performance of CLIP-Gaze over existing methods on four cross-domain evaluations. \ No newline at end of file diff --git a/data/2024/aaai/CLIP-Guided Federated Learning on Heterogeneity and Long-Tailed Data b/data/2024/aaai/CLIP-Guided Federated Learning on Heterogeneity and Long-Tailed Data new file mode 100644 index 0000000000..d2ef05f9a5 --- /dev/null +++ b/data/2024/aaai/CLIP-Guided Federated Learning on Heterogeneity and Long-Tailed Data @@ -0,0 +1 @@ +Federated learning (FL) provides a decentralized machine learning paradigm where a server collaborates with a group of clients to learn a global model without accessing the clients' data. User heterogeneity is a significant challenge for FL, which together with class-distribution imbalance further increases the difficulty of FL. Great progress has been made in large vision-language models, such as Contrastive Language-Image Pre-training (CLIP), which paves a new way for image classification and object recognition. Inspired by the success of CLIP on few-shot and zero-shot learning, we use CLIP to optimize the federated learning between server and client models under its vision-language supervision. This is promising for mitigating user heterogeneity and class-distribution imbalance thanks to the powerful cross-modality representation and rich open-vocabulary prior knowledge. In this paper, we propose the CLIP-guided FL (CLIP2FL) method on heterogeneous and long-tailed data. In CLIP2FL, the knowledge of the off-the-shelf CLIP model is transferred to the client-server models, and a bridge is built between the client and server.
Specifically, for client-side learning, knowledge distillation is conducted between client models and CLIP to improve client-side feature representation. For server-side learning, in order to mitigate the heterogeneity and class-distribution imbalance, we generate federated features to retrain the server model. Prototype contrastive learning, supervised by the text encoder of CLIP, is introduced to generate federated features depending on the client-side gradients, and they are used to retrain a balanced server classifier. Extensive experimental results on several benchmarks demonstrate that CLIP2FL achieves impressive performance and effectively deals with data heterogeneity and long-tailed distributions. The code is available at https://github.com/shijiangming1/CLIP2FL. \ No newline at end of file diff --git a/data/2024/aaai/CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare b/data/2024/aaai/CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare new file mode 100644 index 0000000000..9aaa68f2be --- /dev/null +++ b/data/2024/aaai/CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare @@ -0,0 +1,2 @@ +In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of medical conditions, we introduce the Multimodal Medical Question Summarization (MMQS) Dataset. This dataset, a major contribution of our work, pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs. We also propose a framework that harnesses the power of Contrastive Language-Image Pretraining (CLIP), a multimodal foundation model, and various general-purpose Large Language Models (LLMs), comprising four main modules: the medical disorder identification module, the relevant context generation module, the context filtration module for distilling relevant medical concepts and knowledge, and finally, a general-purpose LLM to generate visually aware medical question summaries. Leveraging our MMQS dataset, we showcase how visual cues from images enhance the generation of medically nuanced summaries. This multimodal approach not only enhances the decision-making process in healthcare but also fosters a more nuanced understanding of patient queries, laying the groundwork for future research in personalized and responsive medical care. +Disclaimer: The article features graphic medical imagery, a result of the subject's inherent requirements.
\ No newline at end of file diff --git a/data/2024/aaai/CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection b/data/2024/aaai/CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection new file mode 100644 index 0000000000..9f81cf0277 --- /dev/null +++ b/data/2024/aaai/CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection @@ -0,0 +1 @@ +Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often do not generalize well to target domains outside the source (or training) data distribution. To reduce such domain gaps and thus to make 3DOD models more generalizable, we introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which (i) leverages visual semantic cues from an image modality (i.e., camera images) as an effective semantic bridge to close the domain gap in the cross-modal Bird's Eye View (BEV) representations. Further, (ii) we also introduce a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features, which disrupt the discrimination of whether a feature instance comes from a source or an unseen target domain. Overall, our CMDA framework guides the 3DOD model to generate highly informative and domain-adaptive features for novel data distributions. In our extensive experiments with large-scale benchmarks, such as nuScenes, Waymo, and KITTI, the components described above provide significant performance gains for UDA tasks, achieving state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/CMG-Net: Robust Normal Estimation for Point Clouds via Chamfer Normal Distance and Multi-Scale Geometry b/data/2024/aaai/CMG-Net: Robust Normal Estimation for Point Clouds via Chamfer Normal Distance and Multi-Scale Geometry new file mode 100644 index 0000000000..9c1260ba43 --- /dev/null +++ b/data/2024/aaai/CMG-Net: Robust Normal Estimation for Point Clouds via Chamfer Normal Distance and Multi-Scale Geometry @@ -0,0 +1 @@ +This work presents an accurate and robust method for estimating normals from point clouds. In contrast to predecessor approaches that minimize the deviations between the annotated and the predicted normals directly, leading to direction inconsistency, we first propose a new metric termed Chamfer Normal Distance to address this issue. This not only mitigates the challenge but also facilitates network training and substantially enhances the network robustness against noise. Subsequently, we devise an innovative architecture that encompasses Multi-scale Local Feature Aggregation and Hierarchical Geometric Information Fusion. This design empowers the network to capture intricate geometric details more effectively and alleviate the ambiguity in scale selection. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets, particularly in scenarios contaminated by noise. Our implementation is available at https://github.com/YingruiWoo/CMG-Net_Pytorch.
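A rough sketch of how a Chamfer-style normal distance could be computed, under the assumption that each predicted normal is compared against the ground-truth normal of its nearest neighbour in a clean reference cloud rather than the annotation at the noisy point itself; CMG-Net's exact metric may differ in detail.

```python
# Hedged sketch: Chamfer-style normal distance between predicted normals on a
# (possibly noisy) point cloud and ground-truth normals on a clean reference cloud.
import numpy as np

def chamfer_normal_distance(noisy_pts, pred_normals, clean_pts, gt_normals):
    """For each noisy point, find its nearest clean point and measure the angular
    error between the predicted normal and that point's ground-truth normal.
    Normals are assumed unit-length; sign ambiguity is handled with |cos|."""
    # Pairwise squared distances (N_noisy x N_clean); fine for small clouds.
    d2 = ((noisy_pts[:, None, :] - clean_pts[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)                      # index of the nearest clean point
    cos = np.abs((pred_normals * gt_normals[nn]).sum(-1)).clip(0.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()    # mean angular error in degrees

# Tiny usage example with random data.
rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 3))
noisy = clean + 0.01 * rng.normal(size=clean.shape)
normals = rng.normal(size=(100, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
print(chamfer_normal_distance(noisy, normals, clean, normals))
```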
\ No newline at end of file diff --git a/data/2024/aaai/COMBAT: Alternated Training for Effective Clean-Label Backdoor Attacks b/data/2024/aaai/COMBAT: Alternated Training for Effective Clean-Label Backdoor Attacks new file mode 100644 index 0000000000..52193b4180 --- /dev/null +++ b/data/2024/aaai/COMBAT: Alternated Training for Effective Clean-Label Backdoor Attacks @@ -0,0 +1 @@ +Backdoor attacks pose a critical concern to the practice of using third-party data for AI development. The data can be poisoned to make a trained model misbehave when a predefined trigger pattern appears, granting the attackers illegal benefits. While most proposed backdoor attacks are dirty-label, clean-label attacks are more desirable by keeping data labels unchanged to dodge human inspection. However, designing a working clean-label attack is a challenging task, and existing clean-label attacks show underwhelming performance. In this paper, we propose a novel mechanism to develop clean-label attacks with outstanding attack performance. The key component is a trigger pattern generator, which is trained together with a surrogate model in an alternating manner. Our proposed mechanism is flexible and customizable, allowing different backdoor trigger types and behaviors for either single or multiple target labels. Our backdoor attacks can reach near-perfect attack success rates and bypass all state-of-the-art backdoor defenses, as illustrated via comprehensive experiments on standard benchmark datasets. Our code is available at https://github.com/VinAIResearch/COMBAT. \ No newline at end of file diff --git a/data/2024/aaai/COMBHelper: A Neural Approach to Reduce Search Space for Graph Combinatorial Problems b/data/2024/aaai/COMBHelper: A Neural Approach to Reduce Search Space for Graph Combinatorial Problems new file mode 100644 index 0000000000..b1995e6cd5 --- /dev/null +++ b/data/2024/aaai/COMBHelper: A Neural Approach to Reduce Search Space for Graph Combinatorial Problems @@ -0,0 +1 @@ +Combinatorial Optimization (CO) problems over graphs appear routinely in many applications such as in optimizing traffic, viral marketing in social networks, and matching for job allocation. Due to their combinatorial nature, these problems are often NP-hard. Existing approximation algorithms and heuristics rely on the search space to find the solutions and become time-consuming when this space is large. In this paper, we design a neural method called COMBHelper to reduce this space and thus improve the efficiency of the traditional CO algorithms based on node selection. Specifically, it employs a Graph Neural Network (GNN) to identify promising nodes for the solution set. This pruned search space is then fed to the traditional CO algorithms. COMBHelper also uses a Knowledge Distillation (KD) module and a problem-specific boosting module to bring further efficiency and efficacy. Our extensive experiments show that the traditional CO algorithms with COMBHelper are at least 2 times faster than their original versions. \ No newline at end of file diff --git a/data/2024/aaai/COMMA: Co-articulated Multi-Modal Learning b/data/2024/aaai/COMMA: Co-articulated Multi-Modal Learning new file mode 100644 index 0000000000..be21c9011d --- /dev/null +++ b/data/2024/aaai/COMMA: Co-articulated Multi-Modal Learning @@ -0,0 +1 @@ +Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability over a series of downstream tasks. 
However, they are sensitive to the variation of input text prompts and require careful selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to dynamically learn the prompts as the textual inputs to avoid laborious hand-crafted prompt engineering in the fine-tuning process. We notice that these methods are suboptimal in two aspects. First, the prompts of the vision and language branches in these methods are usually separated or uni-directionally correlated. Thus, the prompts of both branches are not fully correlated and may not provide enough guidance to align the representations of both branches. Second, we observe that most previous methods achieve better performance on seen classes but cause performance degradation on unseen classes compared to CLIP. This is because the essential generic knowledge learned in the pretraining stage is partly forgotten in the fine-tuning process. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to handle the above limitations. Specifically, our method generates the prompts of each branch by considering prompts from both branches, enhancing the representation alignment between them. Besides, to alleviate forgetting of the essential generic knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of hand-crafted prompts in the pre-trained CLIP in the late transformer layers. We evaluate our method across three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Experimental results demonstrate the superiority of our method by exhibiting a favorable performance boost on all tasks with high efficiency. Code is available at https://github.com/hulianyuyy/COMMA. \ No newline at end of file diff --git a/data/2024/aaai/CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework b/data/2024/aaai/CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework new file mode 100644 index 0000000000..d6839e5e0c --- /dev/null +++ b/data/2024/aaai/CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework @@ -0,0 +1 @@ +Multilingual code retrieval aims to find code snippets relevant to a user's query from a multilingual codebase, which plays a crucial role in software development and expands application scenarios compared to classical monolingual code retrieval. Despite the performance improvements achieved by previous studies, two crucial problems are overlooked in the multilingual scenario. First, certain programming languages face data scarcity in specific domains, resulting in limited representation capabilities within those domains. Second, different programming languages can be used interchangeably within the same domain, making it challenging for multilingual models to accurately identify the intended programming language of a user's query. To address these issues, we propose the CommONalities and SpecIalties Driven Multilingual CodE Retrieval Framework (CONSIDER), which includes two modules. The first module enhances the representation of various programming languages by modeling pairwise and global commonalities among them. The second module introduces a novel contrastive learning negative sampling algorithm that leverages language confusion to automatically extract specific language features.
Through our experiments, we confirm the significant benefits of our model in real-world multilingual code retrieval scenarios in various aspects. Furthermore, an evaluation demonstrates the effectiveness of our proposed CONSIDER framework in monolingual scenarios as well. Our source code is available at https://github.com/smsquirrel/consider. \ No newline at end of file diff --git a/data/2024/aaai/CPN: Complementary Proposal Network for Unconstrained Text Detection b/data/2024/aaai/CPN: Complementary Proposal Network for Unconstrained Text Detection new file mode 100644 index 0000000000..86c22b653a --- /dev/null +++ b/data/2024/aaai/CPN: Complementary Proposal Network for Unconstrained Text Detection @@ -0,0 +1 @@ +Existing methods for scene text detection can be divided into two paradigms: segmentation-based and anchor-based. While segmentation-based methods are well-suited for irregular shapes, they struggle with compact or overlapping layouts. Conversely, anchor-based approaches excel at complex layouts but struggle with irregular shapes. To strengthen their merits and overcome their respective demerits, we propose a Complementary Proposal Network (CPN) that seamlessly integrates semantic and geometric information in parallel for superior performance. The CPN comprises two efficient networks for proposal generation: the Deformable Morphology Semantic Network, which generates semantic proposals employing an innovative deformable morphological operator, and the Balanced Region Proposal Network, which produces geometric proposals with pre-defined anchors. To further enhance the complementarity, we introduce an Interleaved Feature Attention module that enables semantic and geometric features to interact deeply before proposal generation. By leveraging both complementary proposals and features, CPN outperforms state-of-the-art approaches by significant margins under comparable computation cost. Specifically, our approach achieves improvements of 3.6%, 1.3%, and 1.0% on the challenging benchmarks ICDAR19-ArT, IC15, and MSRA-TD500, respectively. Code for our method will be released. \ No newline at end of file diff --git a/data/2024/aaai/CR-SAM: Curvature Regularized Sharpness-Aware Minimization b/data/2024/aaai/CR-SAM: Curvature Regularized Sharpness-Aware Minimization new file mode 100644 index 0000000000..9ef7efa138 --- /dev/null +++ b/data/2024/aaai/CR-SAM: Curvature Regularized Sharpness-Aware Minimization @@ -0,0 +1 @@ +The capacity to generalize to future unseen data stands as one of the most crucial attributes of deep neural networks. Sharpness-Aware Minimization (SAM) aims to enhance generalizability by minimizing the worst-case loss using one-step gradient ascent as an approximation. However, as training progresses, the non-linearity of the loss landscape increases, rendering one-step gradient ascent less effective. On the other hand, multi-step gradient ascent incurs higher training cost. In this paper, we introduce a normalized Hessian trace to accurately measure the curvature of the loss landscape on both training and test sets. In particular, to counter excessive non-linearity of the loss landscape, we propose Curvature Regularized SAM (CR-SAM), integrating the normalized Hessian trace as a SAM regularizer. Additionally, we present an efficient way to compute the trace via finite differences with parallelism. Our theoretical analysis based on PAC-Bayes bounds establishes the regularizer's efficacy in reducing generalization error.
Empirical evaluation on CIFAR and ImageNet datasets shows that CR-SAM consistently enhances classification performance for ResNet and Vision Transformer (ViT) models across various datasets. Our code is available at https://github.com/TrustAIoT/CR-SAM. \ No newline at end of file diff --git a/data/2024/aaai/CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers b/data/2024/aaai/CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers new file mode 100644 index 0000000000..9ec308022b --- /dev/null +++ b/data/2024/aaai/CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers @@ -0,0 +1 @@ +Point cloud completion is an indispensable task for recovering complete point clouds due to incompleteness caused by occlusion, limited sensor resolution, etc. The family of coarse-to-fine generation architectures has recently exhibited great success in point cloud completion and gradually became mainstream. In this work, we unveil one of the key ingredients behind these methods: meticulously devised feature extraction operations with explicit cross-resolution aggregation. We present Cross-Resolution Transformer that efficiently performs cross-resolution aggregation with local attention mechanisms. With the help of our recursive designs, the proposed operation can capture more scales of features than common aggregation operations, which is beneficial for capturing fine geometric characteristics. While prior methodologies have ventured into various manifestations of inter-level cross-resolution aggregation, the effectiveness of intra-level one and their combination has not been analyzed. With unified designs, Cross-Resolution Transformer can perform intra- or inter-level cross-resolution aggregation by switching inputs. We integrate two forms of Cross-Resolution Transformers into one up-sampling block for point generation, and following the coarse-to-fine manner, we construct CRA-PCN to incrementally predict complete shapes with stacked up-sampling blocks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin on several widely used benchmarks. Codes are available at https://github.com/EasyRy/CRA-PCN. \ No newline at end of file diff --git a/data/2024/aaai/CREAD: A Classification-Restoration Framework with Error Adaptive Discretization for Watch Time Prediction in Video Recommender Systems b/data/2024/aaai/CREAD: A Classification-Restoration Framework with Error Adaptive Discretization for Watch Time Prediction in Video Recommender Systems new file mode 100644 index 0000000000..62c41c3b10 --- /dev/null +++ b/data/2024/aaai/CREAD: A Classification-Restoration Framework with Error Adaptive Discretization for Watch Time Prediction in Video Recommender Systems @@ -0,0 +1 @@ +The watch time is a significant indicator of user satisfaction in video recommender systems. However, the prediction of watch time as a target variable is often hindered by its highly imbalanced distribution with a scarcity of observations for larger target values and over-populated samples for small values. State-of-the-art watch time prediction models discretize the continuous watch time into a set of buckets in order to consider the distribution of watch time. 
However, it remains largely uninvestigated how these discrete buckets should be created from the continuous watch time distribution, and existing discretization approaches suffer from either a large learning error or a large restoration error. To address this challenge, we propose a Classification-Restoration framework with Error-Adaptive-Discretization (CREAD) to accurately predict the watch time. The proposed framework contains a discretization module, a classification module, and a restoration module. It predicts the watch time through multiple classification problems. The discretization process is a key contribution of the CREAD framework. We theoretically analyze the impacts of the discretization on the learning error and the restoration error, and then propose the error-adaptive discretization (EAD) technique to better balance the two errors, which achieves better performance over traditional discretization approaches. We conduct detailed offline evaluations on a public dataset and an industrial dataset, both showing performance gains through the proposed approach. Moreover, we have fully launched our framework on an online video platform, where A/B testing showed a significant 0.29% increase in users' video watch time. These results highlight the effectiveness of the CREAD framework in watch time prediction in video recommender systems. \ No newline at end of file diff --git a/data/2024/aaai/CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen b/data/2024/aaai/CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen new file mode 100644 index 0000000000..b884ab6543 --- /dev/null +++ b/data/2024/aaai/CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen @@ -0,0 +1 @@ +Addressing Out-Of-Distribution (OOD) Segmentation and Zero-Shot Semantic Segmentation (ZS3) is challenging, as it necessitates segmenting unseen classes. Existing strategies adapt the class-agnostic Mask2Former (CA-M2F) tailored to specific tasks. However, these methods cater to singular tasks, demand training from scratch, and we demonstrate certain deficiencies in CA-M2F, which affect performance. We propose Class-Agnostic Structure-Constrained Learning (CSL), a plug-in framework that can integrate with existing methods, thereby embedding structural constraints and achieving performance gains on tasks that include the unseen, specifically OOD, ZS3, and domain adaptation (DA) tasks. There are two schemes for CSL to integrate with existing methods: (1) by distilling knowledge from a base teacher network, enforcing constraints across training and inference phases, or (2) by leveraging established models to obtain per-pixel distributions without retraining, appending constraints during the inference phase. Our soft assignment and mask split methodologies enhance OOD object segmentation. Empirical evaluations demonstrate CSL's prowess in boosting the performance of existing algorithms spanning OOD segmentation, ZS3, and DA segmentation, consistently transcending the state of the art across all three tasks.
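To make the classification-restoration idea in the CREAD entry above concrete, here is a generic sketch (our own illustration under stated assumptions, not the paper's code): watch time is discretized into buckets, a model outputs the probability of exceeding each bucket boundary, and the expectation is restored as a probability-weighted sum of bucket widths. The bucket boundaries and the input probabilities are placeholders; CREAD's error-adaptive discretization chooses the boundaries differently.

```python
# Illustrative classification-restoration sketch (not CREAD's exact formulation).
import numpy as np

boundaries = np.array([0.0, 5.0, 15.0, 30.0, 60.0, 120.0])  # placeholder buckets (seconds)
widths = np.diff(boundaries)                                  # width of each bucket

def restore_watch_time(exceed_probs):
    """exceed_probs[k] = predicted P(watch_time > boundaries[k]), one binary
    classifier per bucket; E[T] = integral of P(T > t) dt is approximated by a
    bucket-wise sum of widths weighted by the exceedance probabilities."""
    exceed_probs = np.clip(exceed_probs, 0.0, 1.0)
    return float((widths * exceed_probs).sum())

# Example: a user very likely to pass the early buckets, unlikely to finish.
print(restore_watch_time(np.array([0.95, 0.9, 0.7, 0.3, 0.05])))
```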
\ No newline at end of file diff --git a/data/2024/aaai/CTO-SLAM: Contour Tracking for Object-Level Robust 4D SLAM b/data/2024/aaai/CTO-SLAM: Contour Tracking for Object-Level Robust 4D SLAM new file mode 100644 index 0000000000..251125bdd9 --- /dev/null +++ b/data/2024/aaai/CTO-SLAM: Contour Tracking for Object-Level Robust 4D SLAM @@ -0,0 +1 @@ +The demand for 4D (3D+time) SLAM systems is increasingly urgent, especially for decision-making and scene understanding. However, most of the existing simultaneous localization and mapping (SLAM) systems primarily assume static environments. They fail to represent dynamic scenarios due to the challenge of establishing robust long-term spatiotemporal associations in dynamic object tracking. We address this limitation and propose CTO-SLAM, a monocular and RGB-D object-level 4D SLAM system to track moving objects and estimate their motion simultaneously. In this paper, we propose contour tracking, which introduces contour features to enhance the keypoint representation of dynamic objects and is coupled with pixel tracking to achieve long-term robust object tracking. Based on contour tracking, we propose a novel sampling-based object pose initialization algorithm and a subsequent adapted bundle adjustment (BA) optimization algorithm to estimate dynamic object poses with high accuracy. The CTO-SLAM system is verified on both the KITTI and VKITTI datasets. The experimental results demonstrate that our system effectively addresses cumulative errors in long-term spatiotemporal association and hence obtains substantial improvements over the state-of-the-art systems. The source code is available at https://github.com/realXiaohan/CTO-SLAM. \ No newline at end of file diff --git a/data/2024/aaai/CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning b/data/2024/aaai/CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning new file mode 100644 index 0000000000..e7ee7ae118 --- /dev/null +++ b/data/2024/aaai/CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning @@ -0,0 +1 @@ +Offline reinforcement learning (RL) aims to learn an effective policy from a pre-collected dataset. Most existing works focus on developing sophisticated learning algorithms, with less emphasis on improving the data collection process. Moreover, it is even more challenging to extend beyond the single-task setting and collect a task-agnostic dataset that allows an agent to perform multiple downstream tasks. In this paper, we propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to expand the feature space using adaptive temporal distances for task-agnostic data collection and ultimately improve learning efficiency and capabilities for multi-task offline RL. To achieve this, CUDC estimates the probability of the k-step future states being reachable from the current states, and adapts how many steps into the future the dynamics model should predict. With this adaptive reachability mechanism in place, the feature representation can be diversified, and the agent can navigate itself to collect higher-quality data with curiosity. Empirically, CUDC surpasses existing unsupervised methods in efficiency and learning performance in various downstream offline RL tasks of the DeepMind Control suite.
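One plausible reading of the adaptive temporal distance in the CUDC entry above, written as a toy sketch: when the k-step future is still easy to reach (high estimated reachability), the horizon k is increased to keep the prediction task challenging, and decreased otherwise. The thresholds and the reachability estimator are our assumptions, not the paper's mechanism.

```python
# Toy sketch of an adaptive temporal-distance rule (our assumption of the mechanism).
def adapt_horizon(k, reachability, k_min=1, k_max=30, hi=0.8, lo=0.3):
    """reachability: estimated probability that the k-step future state is
    reachable from the current state (e.g., from a learned classifier).
    Increase k when prediction is too easy, decrease it when too hard."""
    if reachability > hi:
        k = min(k + 1, k_max)
    elif reachability < lo:
        k = max(k - 1, k_min)
    return k

k = 5
for r in [0.9, 0.85, 0.2, 0.5]:
    k = adapt_horizon(k, r)
    print(k)   # 6, 7, 6, 6
```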
\ No newline at end of file diff --git a/data/2024/aaai/CUTS+: High-Dimensional Causal Discovery from Irregular Time-Series b/data/2024/aaai/CUTS+: High-Dimensional Causal Discovery from Irregular Time-Series new file mode 100644 index 0000000000..e137b1826e --- /dev/null +++ b/data/2024/aaai/CUTS+: High-Dimensional Causal Discovery from Irregular Time-Series @@ -0,0 +1 @@ +Causal discovery in time-series is a fundamental problem in the machine learning community, enabling causal reasoning and decision-making in complex scenarios. Recently, researchers have successfully discovered causality by combining neural networks with Granger causality, but performance degrades significantly on high-dimensional data because of highly redundant network designs and huge causal graphs. Moreover, missing entries in the observations further hamper causal structure learning. To overcome these limitations, we propose CUTS+, which builds on the Granger-causality-based causal discovery method CUTS and improves scalability by introducing a technique called Coarse-to-fine-discovery (C2FD) and leveraging a message-passing-based graph neural network (MPGNN). On simulated, quasi-real, and real datasets, we show that CUTS+ largely improves causal discovery performance over previous methods on high-dimensional data with different types of irregular sampling. \ No newline at end of file diff --git a/data/2024/aaai/CaMIL: Causal Multiple Instance Learning for Whole Slide Image Classification b/data/2024/aaai/CaMIL: Causal Multiple Instance Learning for Whole Slide Image Classification new file mode 100644 index 0000000000..2e8b6b6131 --- /dev/null +++ b/data/2024/aaai/CaMIL: Causal Multiple Instance Learning for Whole Slide Image Classification @@ -0,0 +1 @@ +Whole slide image (WSI) classification is a crucial component in automated pathology analysis. Due to the inherent challenges of high-resolution WSIs and the absence of patch-level labels, most of the proposed methods follow the multiple instance learning (MIL) formulation. While MIL has been equipped with excellent instance feature extractors and aggregators, it is prone to learning spurious associations that undermine the performance of the model. For example, relying solely on color features may lead to erroneous diagnoses due to spurious associations between the disease and the color of patches. To address this issue, we develop a causal MIL framework for WSI classification, effectively distinguishing between causal and spurious associations. Specifically, we use the expectation of the intervention P(Y | do(X)) for bag prediction rather than the traditional likelihood P(Y | X). By applying the front-door adjustment, the spurious association is effectively blocked, where the intervened mediator is aggregated from patch-level features. We evaluate our proposed method on two publicly available WSI datasets, Camelyon16 and TCGA-NSCLC. Our causal MIL framework shows outstanding performance and is plug-and-play, seamlessly integrating with various feature extractors and aggregators.
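For reference, the front-door adjustment that the CaMIL entry above relies on can be written in its standard textbook form (a general causal-inference identity for a mediator M between a bag X and label Y; how CaMIL parameterises each term is described in the paper itself):

```latex
P\bigl(Y \mid \mathrm{do}(X=x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P\bigl(Y \mid x', m\bigr)\, P(x')
```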
\ No newline at end of file diff --git a/data/2024/aaai/Cached Transformers: Improving Transformers with Differentiable Memory Cachde b/data/2024/aaai/Cached Transformers: Improving Transformers with Differentiable Memory Cachde new file mode 100644 index 0000000000..27a656146a --- /dev/null +++ b/data/2024/aaai/Cached Transformers: Improving Transformers with Differentiable Memory Cachde @@ -0,0 +1 @@ +This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOPs, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and demonstrates applicability to a broader range of situations. \ No newline at end of file diff --git a/data/2024/aaai/CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models b/data/2024/aaai/CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models new file mode 100644 index 0000000000..8cb5612400 --- /dev/null +++ b/data/2024/aaai/CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models @@ -0,0 +1 @@ +Camouflaged Object Detection (COD) is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. Existing COD methods struggle with nuanced object boundaries and overconfident incorrect predictions. In response, we propose a new paradigm that treats COD as a conditional mask-generation task leveraging diffusion models. Our method, dubbed CamoDiffusion, employs the denoising process to progressively refine predictions while incorporating image conditions. Due to the stochastic sampling process of diffusion, our model is capable of sampling multiple possible predictions, avoiding the problem of overconfident point estimation. Moreover, we develop specialized network architecture, training, and sampling strategies to enhance the model's expressive power and refinement capabilities, and to suppress overconfident mis-segmentations, thus aptly tailoring the diffusion model to the demands of COD. Extensive experiments on three COD datasets attest to the superior performance of our model compared to existing state-of-the-art methods, particularly on the most challenging COD10K dataset, where our approach achieves 0.019 in terms of MAE. Codes and models are available at https://github.com/Rapisurazurite/CamoDiffusion. \ No newline at end of file diff --git a/data/2024/aaai/Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation b/data/2024/aaai/Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation new file mode 100644 index 0000000000..91db7e7462 --- /dev/null +++ b/data/2024/aaai/Can LLM Replace Stack Overflow?
A Study on Robustness and Reliability of Large Language Model Code Generation @@ -0,0 +1 @@ +Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and generate programming code. It has been a common practice for software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of the code generated by LLMs have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. For example, the misuse of APIs in the generated code could lead to severe problems, such as resource leaks, program crashes, etc. Existing code evaluation benchmarks and datasets focus on crafting small tasks such as programming questions in coding interviews, which, however, deviates from the questions developers actually ask LLMs for real-world coding help. To fill the missing piece, in this work, we propose RobustAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from Stack Overflow on 18 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate them on current popular LLMs. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code is introduced into real-world software. \ No newline at end of file diff --git a/data/2024/aaai/Can LLMs Fix Issues with Reasoning Models? Towards More Likely Models for AI Planning b/data/2024/aaai/Can LLMs Fix Issues with Reasoning Models? Towards More Likely Models for AI Planning new file mode 100644 index 0000000000..37d4989b48 --- /dev/null +++ b/data/2024/aaai/Can LLMs Fix Issues with Reasoning Models? Towards More Likely Models for AI Planning @@ -0,0 +1 @@ +This is the first work to look at the application of large language models (LLMs) for the purpose of model space edits in automated planning tasks. To set the stage for this union, we explore two different flavors of model space problems that have been studied in the AI planning literature and explore the effect of an LLM on those tasks. We empirically demonstrate how the performance of an LLM contrasts with combinatorial search (CS) – an approach that has been traditionally used to solve model space tasks in planning, both with the LLM in the role of a standalone model space reasoner as well as in the role of a statistical signal in concert with the CS approach as part of a two-stage process. Our experiments show promising results suggesting further forays of LLMs into the exciting world of model space reasoning for planning tasks in the future. \ No newline at end of file diff --git a/data/2024/aaai/Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis b/data/2024/aaai/Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis new file mode 100644 index 0000000000..cd86f5d1f8 --- /dev/null +++ b/data/2024/aaai/Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis @@ -0,0 +1 @@ +Game theory, as an analytical tool, is frequently utilized to analyze human behavior in social science research.
With the high alignment between the behavior of Large Language Models (LLMs) and humans, a promising research direction is to employ LLMs as substitutes for humans in game experiments, enabling social science research. However, despite numerous empirical studies combining LLMs and game theory, the capability boundaries of LLMs in game theory remain unclear. In this research, we endeavor to systematically analyze LLMs in the context of game theory. Specifically, rationality, as the fundamental principle of game theory, serves as the metric for evaluating players' behavior --- building a clear desire, refining belief about uncertainty, and taking optimal actions. Accordingly, we select three classical games (dictator game, Rock-Paper-Scissors, and ring-network game) to analyze to what extent LLMs can achieve rationality in these three aspects. The experimental results indicate that even the current state-of-the-art LLM (GPT-4) exhibits substantial disparities compared to humans in game theory. For instance, LLMs struggle to build desires based on uncommon preferences, fail to refine belief from many simple patterns, and may overlook or modify refined belief when taking actions. Therefore, we argue that introducing LLMs into game experiments in the social sciences should be approached with greater caution. \ No newline at end of file diff --git a/data/2024/aaai/Can Large Language Models Understand Real-World Complex Instructions? b/data/2024/aaai/Can Large Language Models Understand Real-World Complex Instructions? new file mode 100644 index 0000000000..c9eaa0f56f --- /dev/null +++ b/data/2024/aaai/Can Large Language Models Understand Real-World Complex Instructions? @@ -0,0 +1 @@ +Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that require multiple tasks and constraints, or complex input that contains long context, noise, heterogeneous information and multi-turn format. Due to these features, LLMs often ignore semantic constraints from task descriptions, generate incorrect formats, violate length or sample count constraints, and are unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions, as they are close-ended and simple. To bridge this gap, we propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically. We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four criteria and develop corresponding metrics, as current ones are inadequate, biased, or too strict and coarse-grained. We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. \ No newline at end of file diff --git a/data/2024/aaai/Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It's Complicated b/data/2024/aaai/Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It's Complicated new file mode 100644 index 0000000000..2709df8c90 --- /dev/null +++ b/data/2024/aaai/Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning?
It's Complicated @@ -0,0 +1 @@ +Preference-based Reinforcement Learning (PbRL) enables non-experts to train Reinforcement Learning models using preference feedback. However, the effort required to collect preference labels from real humans means that PbRL research primarily relies on synthetic labellers. We validate the most common synthetic labelling strategy by comparing against labels collected from a crowd of humans on three DeepMind Control (DMC) suite tasks: stand, walk, and run. We find that: (1) the synthetic labels are a good proxy for real humans under some circumstances, (2) strong preference label agreement between human and synthetic labels is not necessary for similar policy performance, (3) policy performance is higher at the start of training with human feedback and higher at the end of training with synthetic feedback, and (4) training on only examples with high levels of inter-annotator agreement does not meaningfully improve policy performance. Our results justify the use of synthetic labellers to develop and ablate PbRL methods, and provide insight into how human labelling changes over the course of policy training. \ No newline at end of file diff --git a/data/2024/aaai/Carbon Footprint Reduction for Sustainable Data Centers in Real-Time b/data/2024/aaai/Carbon Footprint Reduction for Sustainable Data Centers in Real-Time new file mode 100644 index 0000000000..285e7b198c --- /dev/null +++ b/data/2024/aaai/Carbon Footprint Reduction for Sustainable Data Centers in Real-Time @@ -0,0 +1 @@ +As machine learning workloads are significantly increasing energy consumption, sustainable data centers with low carbon emissions are becoming a top priority for governments and corporations worldwide. This requires a paradigm shift in optimizing power consumption in cooling and IT loads, shifting flexible loads based on the availability of renewable energy in the power grid, and leveraging battery storage from the uninterruptible power supply in data centers, using collaborative agents. The complex association between these optimization strategies and their dependencies on variable external factors like weather and the power grid carbon intensity makes this a hard problem. Currently, a real-time controller to optimize all these goals simultaneously in a dynamic real-world setting is lacking. We propose a Data Center Carbon Footprint Reduction (DC-CFR) multi-agent Reinforcement Learning (MARL) framework that optimizes data centers for the multiple objectives of carbon footprint reduction, energy consumption, and energy cost. The results show that the DC-CFR MARL agents effectively resolved the complex interdependencies in optimizing cooling, load shifting, and energy storage in real time for various locations under real-world dynamic weather and grid carbon intensity conditions. DC-CFR significantly outperformed the industry-standard ASHRAE controller with a considerable reduction in carbon emissions (14.5%), energy usage (14.4%), and energy cost (13.7%) when evaluated over one year across multiple geographical regions.
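As an illustration of the kind of synthetic labelling strategy discussed in the PbRL entry above (commonly, a simulated teacher that prefers the trajectory segment with the larger ground-truth return, optionally with Boltzmann-style noise), here is a minimal sketch; the exact strategy validated in the paper may differ in its noise handling.

```python
# Minimal sketch of a synthetic preference labeller for PbRL (our illustration).
import numpy as np

def synthetic_label(returns_a, returns_b, beta=None, rng=None):
    """Prefer the segment with the larger ground-truth return.
    If beta is given, sample a Boltzmann-rational preference instead of the
    deterministic argmax (a common way to model a noisy teacher)."""
    ra, rb = float(np.sum(returns_a)), float(np.sum(returns_b))
    if beta is None:
        return 0 if ra >= rb else 1          # 0 = prefer segment A, 1 = prefer B
    rng = rng or np.random.default_rng()
    p_a = 1.0 / (1.0 + np.exp(-beta * (ra - rb)))
    return 0 if rng.random() < p_a else 1

# Example: segment A accumulates more reward, so the deterministic teacher picks it.
print(synthetic_label([1.0, 0.5, 0.2], [0.1, 0.1, 0.1]))   # -> 0
```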
\ No newline at end of file diff --git a/data/2024/aaai/CariesXrays: Enhancing Caries Detection in Hospital-Scale Panoramic Dental X-rays via Feature Pyramid Contrastive Learning b/data/2024/aaai/CariesXrays: Enhancing Caries Detection in Hospital-Scale Panoramic Dental X-rays via Feature Pyramid Contrastive Learning new file mode 100644 index 0000000000..773008ae67 --- /dev/null +++ b/data/2024/aaai/CariesXrays: Enhancing Caries Detection in Hospital-Scale Panoramic Dental X-rays via Feature Pyramid Contrastive Learning @@ -0,0 +1 @@ +Dental caries has been widely recognized as one of the most prevalent chronic diseases in the field of public health. Despite advancements in automated diagnosis across various medical domains, dental caries detection remains a substantial challenge due to its inherent variability and intricacies. To bridge this gap, we release a hospital-scale panoramic dental X-ray benchmark, namely “CariesXrays”, to facilitate the advancements in high-precision computer-aided diagnosis for dental caries. It comprises 6,000 panoramic dental X-ray images, with a total of 13,783 instances of dental caries, all meticulously annotated by dental professionals. In this paper, we propose a novel Feature Pyramid Contrastive Learning (FPCL) framework that jointly incorporates feature pyramid learning and contrastive learning within a unified diagnostic paradigm for automated dental caries detection. Specifically, a robust dual-directional feature pyramid network (D2D-FPN) is designed to adaptively capture rich and informative contextual information from multi-level feature maps, thus enhancing the generalization ability of caries detection across different scales. Furthermore, our model is augmented with an effective proposals-prototype contrastive regularization learning (P2P-CRL) mechanism, which can flexibly bridge the semantic gaps among diverse dental caries with varying appearances, resulting in high-quality dental caries proposals. Extensive experiments on our newly-established CariesXrays benchmark demonstrate the potential of FPCL to make a significant social impact on caries diagnosis. \ No newline at end of file diff --git a/data/2024/aaai/CatFormer: Category-Level 6D Object Pose Estimation with Transformer b/data/2024/aaai/CatFormer: Category-Level 6D Object Pose Estimation with Transformer new file mode 100644 index 0000000000..55f401264b --- /dev/null +++ b/data/2024/aaai/CatFormer: Category-Level 6D Object Pose Estimation with Transformer @@ -0,0 +1 @@ +Although there has been significant progress in category-level object pose estimation in recent years, there is still considerable room for improvement. In this paper, we propose a novel transformer-based category-level 6D pose estimation method called CatFormer to enhance pose estimation accuracy. CatFormer comprises three main parts: a coarse deformation part, a fine deformation part, and a recurrent refinement part. In the coarse and fine deformation parts, we introduce a transformer-based deformation module that performs point cloud deformation and completion in the feature space. Additionally, after each deformation, we incorporate a transformer-based graph module to adjust fused features and establish geometric and topological relationships between points based on these features. Furthermore, we present an end-to-end recurrent refinement module that enables the prior point cloud to deform multiple times according to real scene features.
We evaluate CatFormer's performance by training and testing it on the CAMERA25 and REAL275 datasets. Experimental results demonstrate that CatFormer surpasses state-of-the-art methods. Moreover, we extend the usage of CatFormer to instance-level object pose estimation on the LINEMOD dataset, as well as object pose estimation in real-world scenarios. The experimental results validate the effectiveness and generalization capabilities of CatFormer. Our code and the supplemental materials are available at https://github.com/BIT-robot-group/CatFormer. \ No newline at end of file diff --git a/data/2024/aaai/Catalyst for Clustering-Based Unsupervised Object Re-identification: Feature Calibration b/data/2024/aaai/Catalyst for Clustering-Based Unsupervised Object Re-identification: Feature Calibration new file mode 100644 index 0000000000..76f26667d2 --- /dev/null +++ b/data/2024/aaai/Catalyst for Clustering-Based Unsupervised Object Re-identification: Feature Calibration @@ -0,0 +1 @@ +Clustering-based methods are emerging as a ubiquitous technology in unsupervised object Re-Identification (ReID), which alternate between pseudo-label generation and representation learning. Recent advances in this field mainly fall into two groups: pseudo-label correction and robust representation learning. In contrast, in this work, we improve unsupervised object ReID through feature calibration, a completely different but complementary insight to current approaches. Specifically, we propose to insert a conceptually simple yet empirically powerful Feature Calibration Module (FCM) before pseudo-label generation. In practice, FCM calibrates the features using a nonparametric graph attention network, enforcing similar instances to move together in the feature space while allowing dissimilar instances to separate. As a result, we can generate more reliable pseudo-labels using the calibrated features and further improve subsequent representation learning. FCM is simple, effective, parameter-free, training-free, plug-and-play, and can be considered as a catalyst, increasing the ’chemical reaction’ between pseudo-label generation and representation learning. Moreover, it maintains testing-time efficiency with negligible impact on training time. In this paper, we insert FCM into a simple baseline. Experiments across different scenarios and benchmarks show that FCM consistently improves the baseline (e.g., 8.2% mAP gain on MSMT17), and achieves new state-of-the-art results. Code is available at: https://github.com/lhf12278/FCM-ReID. \ No newline at end of file diff --git a/data/2024/aaai/Catch-Up Mix: Catch-Up Class for Struggling Filters in CNN b/data/2024/aaai/Catch-Up Mix: Catch-Up Class for Struggling Filters in CNN new file mode 100644 index 0000000000..7aa5973f2e --- /dev/null +++ b/data/2024/aaai/Catch-Up Mix: Catch-Up Class for Struggling Filters in CNN @@ -0,0 +1 @@ +Deep learning has made significant advances in computer vision, particularly in image classification tasks. Despite their high accuracy on training data, deep learning models often face challenges related to complexity and overfitting. One notable concern is that such models often rely heavily on a limited subset of filters for making predictions. This dependency can result in compromised generalization and an increased vulnerability to minor variations. While regularization techniques like weight decay, dropout, and data augmentation are commonly used to address this issue, they may not directly tackle the reliance on specific filters.
Our observations reveal that the heavy reliance problem becomes severe when slow-learning filters are deprived of learning opportunities due to fast-learning filters. Drawing inspiration from image augmentation research that combats over-reliance on specific image regions by removing and replacing parts of images, our idea is to mitigate the problem of over-reliance on strong filters by substituting highly activated features. To this end, we present a novel method called Catch-up Mix, which provides learning opportunities to a wide range of filters during training, focusing on filters that may lag behind. By mixing activation maps with relatively lower norms, Catch-up Mix promotes the development of more diverse representations and reduces reliance on a small subset of filters. Experimental results demonstrate the superiority of our method on various vision classification datasets, providing enhanced robustness. \ No newline at end of file diff --git a/data/2024/aaai/CatmullRom Splines-Based Regression for Image Forgery Localization b/data/2024/aaai/CatmullRom Splines-Based Regression for Image Forgery Localization new file mode 100644 index 0000000000..1c739f17ad --- /dev/null +++ b/data/2024/aaai/CatmullRom Splines-Based Regression for Image Forgery Localization @@ -0,0 +1 @@ +Image Forgery Localization (IFL) helps secure digital media forensics. However, many methods suffer from false detections (i.e., FPs) and inaccurate boundaries. In this paper, we propose the CatmullRom Splines-based Regression Network (CSR-Net), which first rethinks the IFL task from the perspective of regression to deal with this problem. Specifically, we propose an adaptive CatmullRom splines fitting scheme for coarse localization of the tampered regions. Then, for false positive cases, we first develop a novel re-scoring mechanism, which aims to filter out samples that cannot have responses on both the classification branch and the instance branch. Subsequently, to further constrain the boundaries, we design a learnable texture extraction module, which refines and enhances the contour representation by decoupling horizontal and vertical forgery features, thus suppressing FPs. Compared to segmentation-based methods, our method is simple but effective because it requires no post-processing. Extensive experiments show the superiority of CSR-Net to existing state-of-the-art methods, not only on standard natural image datasets but also on social media datasets. \ No newline at end of file diff --git a/data/2024/aaai/Causal Adversarial Perturbations for Individual Fairness and Robustness in Heterogeneous Data Spaces b/data/2024/aaai/Causal Adversarial Perturbations for Individual Fairness and Robustness in Heterogeneous Data Spaces new file mode 100644 index 0000000000..dee41fa08f --- /dev/null +++ b/data/2024/aaai/Causal Adversarial Perturbations for Individual Fairness and Robustness in Heterogeneous Data Spaces @@ -0,0 +1 @@ +As responsible AI gains importance in machine learning algorithms, properties like fairness, adversarial robustness, and causality have received considerable attention in recent years. However, despite their individual significance, there remains a critical gap in simultaneously exploring and integrating these properties.
In this paper, we propose a novel approach that examines the relationship between individual fairness, adversarial robustness, and structural causal models (SCMs) in heterogeneous data spaces, particularly when dealing with discrete sensitive attributes. We use SCMs and sensitive attributes to create a fair metric and apply it to measure semantic similarity among individuals. By introducing a novel causal adversarial perturbation (CAP) and applying adversarial training, we create a new regularizer that combines individual fairness, causality, and robustness in the classifier. Our method is evaluated on both real-world and synthetic datasets, demonstrating its effectiveness in achieving an accurate classifier that simultaneously exhibits fairness, adversarial robustness, and causal awareness. \ No newline at end of file diff --git a/data/2024/aaai/Causal Discovery from Poisson Branching Structural Causal Model Using High-Order Cumulant with Path Analysis b/data/2024/aaai/Causal Discovery from Poisson Branching Structural Causal Model Using High-Order Cumulant with Path Analysis new file mode 100644 index 0000000000..2bf289e9b8 --- /dev/null +++ b/data/2024/aaai/Causal Discovery from Poisson Branching Structural Causal Model Using High-Order Cumulant with Path Analysis @@ -0,0 +1 @@ +Count data naturally arise in many fields, such as finance, neuroscience, and epidemiology, and discovering causal structure among count data is a crucial task in various scientific and industrial scenarios. One of the most common characteristics of count data is the inherent branching structure described by a binomial thinning operator and an independent Poisson distribution that captures both branching and noise. For instance, in a population count scenario, mortality and immigration contribute to the count, where survival follows a Bernoulli distribution, and immigration follows a Poisson distribution. However, causal discovery from such data is challenging due to the non-identifiability issue: a single causal pair is Markov equivalent, i.e., X->Y and Y->X are distributionally equivalent. Fortunately, in this work, we find that the causal order from X to its child Y is identifiable if X is a root vertex and has at least two directed paths to Y, or if the ancestor of X with the most directed paths to X has a directed path to Y that does not pass through X. Specifically, we propose a Poisson Branching Structural Causal Model (PB-SCM) and perform a path analysis on PB-SCM using high-order cumulants. Theoretical results establish the connection between paths and cumulants and demonstrate that the path information can be obtained from the cumulants. With the path information, causal order is identifiable under some graphical conditions. A practical algorithm for learning causal structure under PB-SCM is proposed, and experiments verify the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Causal Representation Learning via Counterfactual Intervention b/data/2024/aaai/Causal Representation Learning via Counterfactual Intervention new file mode 100644 index 0000000000..28ba024467 --- /dev/null +++ b/data/2024/aaai/Causal Representation Learning via Counterfactual Intervention @@ -0,0 +1 @@ +Existing causal representation learning methods are based on the causal graph they build. However, due to the omission of bias within the causal graph, they essentially encourage models to learn biased causal effects in latent space.
In this paper, we propose a novel causally disentangling framework that aims to learn unbiased causal effects. We first introduce inductive and dataset biases into traditional causal graph for the physical concepts of interest. Then, we eliminate the negative effects from these two biases by counterfactual intervention with reweighted loss function for learning unbiased causal effects. Finally, we employ the causal effects into the VAE to endow the latent representations with causality. In particular, we highlight that removing biases in this paper is regarded as a part of learning process for unbiased causal effects, which is crucial for causal disentanglement performance improvement. Through extensive experiments on real-world and synthetic datasets, we show that our method outperforms different baselines and obtains the state-of-the-art results for achieving causal representation learning. \ No newline at end of file diff --git a/data/2024/aaai/Causal Strategic Learning with Competitive Selection b/data/2024/aaai/Causal Strategic Learning with Competitive Selection new file mode 100644 index 0000000000..9eff4887e7 --- /dev/null +++ b/data/2024/aaai/Causal Strategic Learning with Competitive Selection @@ -0,0 +1,10 @@ +We study the problem of agent selection in causal strategic learning under multiple decision makers and address two key challenges that come with it. +Firstly, while much of prior work focuses on studying a fixed pool of agents that remains static regardless of their evaluations, we consider the impact of selection procedure by which agents are not only evaluated, but also selected. +When each decision maker unilaterally selects agents by maximising their own utility, we show that the optimal selection rule is a trade-off between selecting the best agents and providing incentives to maximise the agents' improvement. +Furthermore, this optimal selection rule relies on incorrect predictions of agents' outcomes. +Hence, we study the conditions under which a decision maker's optimal selection rule will not lead to deterioration of agents' outcome nor cause unjust reduction in agents' selection chance. +To that end, we provide an analytical form of the optimal selection rule and a mechanism to retrieve the causal parameters from observational data, under certain assumptions on agents' behaviour. +Secondly, when there are multiple decision makers, the interference between selection rules introduces another source of biases in estimating the underlying causal parameters. +To address this problem, we provide a cooperative protocol which all decision makers must collectively adopt to recover the true causal parameters. +Lastly, we complement our theoretical results with simulation studies. +Our results highlight not only the importance of causal modeling as a strategy to mitigate the effect of gaming, as suggested by previous work, but also the need of a benevolent regulator to enable it. \ No newline at end of file diff --git a/data/2024/aaai/Causal Walk: Debiasing Multi-Hop Fact Verification with Front-Door Adjustment b/data/2024/aaai/Causal Walk: Debiasing Multi-Hop Fact Verification with Front-Door Adjustment new file mode 100644 index 0000000000..e4b6848e98 --- /dev/null +++ b/data/2024/aaai/Causal Walk: Debiasing Multi-Hop Fact Verification with Front-Door Adjustment @@ -0,0 +1 @@ +Multi-hop fact verification aims to detect the veracity of the given claim by integrating and reasoning over multiple pieces of evidence. 
Conventional multi-hop fact verification models are prone to rely on spurious correlations from annotation artifacts, leading to an obvious performance decline on unbiased datasets. Among the various debiasing works, causal inference-based methods have become popular by performing theoretically guaranteed debiasing such as causal intervention or counterfactual reasoning. However, existing causal inference-based debiasing methods, which mainly formulate fact verification as a single-hop reasoning task to tackle shallow bias patterns, cannot deal with the complicated bias patterns hidden in multiple hops of evidence. To address the challenge, we propose Causal Walk, a novel method for debiasing multi-hop fact verification from a causal perspective with front-door adjustment. Specifically, in the structural causal model, the reasoning path between the treatment (the input claim-evidence graph) and the outcome (the veracity label) is introduced as the mediator to block the confounder. With the front-door adjustment, the causal effect between the treatment and the outcome is decomposed into the causal effect between the treatment and the mediator, which is estimated by applying the idea of random walk, and the causal effect between the mediator and the outcome, which is estimated with normalized weighted geometric mean approximation. To investigate the effectiveness of the proposed method, an adversarial multi-hop fact verification dataset and a symmetric multi-hop fact verification dataset are constructed with the help of a large language model. Experimental results show that Causal Walk outperforms some previous debiasing methods on both existing datasets and the newly constructed datasets. Code and data will be released at https://github.com/zcccccz/CausalWalk. \ No newline at end of file diff --git a/data/2024/aaai/Causal-Driven Skill Prerequisite Structure Discovery b/data/2024/aaai/Causal-Driven Skill Prerequisite Structure Discovery new file mode 100644 index 0000000000..027def2432 --- /dev/null +++ b/data/2024/aaai/Causal-Driven Skill Prerequisite Structure Discovery @@ -0,0 +1 @@ +Knowing a prerequisite structure among skills in a subject domain effectively enables several educational applications, including intelligent tutoring systems and curriculum planning. Traditionally, educators or domain experts use intuition to determine the skills' prerequisite relationships, which is time-consuming and prone to blind spots. In this paper, we focus on inferring the prerequisite structure given access to students' performance on exercises in a subject. Nevertheless, this is challenging since students' mastery of skills cannot be directly observed but can only be estimated, i.e., it is latent in nature. To tackle this problem, we propose a causal-driven skill prerequisite structure discovery (CSPS) method in a two-stage learning framework. In the first stage, we learn the skills' correlation relationships, captured in a covariance matrix, from the student performance data; in the second stage, we apply a heuristic method based on conditional independence tests and standardized partial variance to the predicted covariance matrix to discover the prerequisite structure. We demonstrate the performance of the new approach with both simulated and real-world data. The experimental results show the effectiveness of the proposed model for identifying the skills' prerequisite structure.
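For readers unfamiliar with the front-door adjustment invoked in the Causal Walk abstract above, the textbook identity (with treatment X, mediator M, and outcome Y, under the standard front-door criterion) is sketched below; this is the general formula, not the paper's specific estimator:

```latex
% Front-door adjustment (Pearl): effect of do(X = x) on Y through mediator M
P(y \mid \mathrm{do}(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m)\, P(x')
```

The two inner factors correspond to the two causal effects the abstract decomposes: treatment-to-mediator and mediator-to-outcome.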
\ No newline at end of file diff --git a/data/2024/aaai/Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval b/data/2024/aaai/Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval new file mode 100644 index 0000000000..7b174503fb --- /dev/null +++ b/data/2024/aaai/Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval @@ -0,0 +1 @@ +Text-based Person Retrieval (TPR) aims to retrieve relevant images of specific pedestrians based on the given textual query. The mainstream approaches primarily leverage pretrained deep neural networks to learn the mapping of visual and textual modalities into a common latent space for cross-modality matching. Despite their remarkable achievements, existing efforts mainly focus on learning the statistical cross-modality correlation found in training data, rather than the intrinsic causal correlation. As a result, they often struggle to retrieve accurately in the face of environmental changes such as illumination, pose, and occlusion, or when encountering images with similar attributes. In this regard, we pioneer the study of TPR from a causal view. Specifically, we assume that each image is composed of a mixture of causal factors (which are semantically consistent with text descriptions) and non-causal factors (retrieval-irrelevant, e.g., background), and only the former can lead to reliable retrieval judgments. Our goal is to extract text-critical robust visual representations (i.e., causal factors) and establish domain-invariant cross-modality correlations for accurate and reliable retrieval. However, causal/non-causal factors are unobserved, so we emphasize that ideal causal factors that can simulate causal scenes should satisfy two basic principles: 1) Independence: being independent of non-causal factors, and 2) Sufficiency: being causally sufficient for TPR across different environments. Building on that, we propose an Invariant Representation Learning method for TPR (IRLT) that enforces the visual representations to satisfy the two aforementioned critical properties. Extensive experiments on three datasets clearly demonstrate the advantages of IRLT over leading baselines in terms of accuracy and generalization. \ No newline at end of file diff --git a/data/2024/aaai/Causally Aware Generative Adversarial Networks for Light Pollution Control b/data/2024/aaai/Causally Aware Generative Adversarial Networks for Light Pollution Control new file mode 100644 index 0000000000..0424e12101 --- /dev/null +++ b/data/2024/aaai/Causally Aware Generative Adversarial Networks for Light Pollution Control @@ -0,0 +1 @@ +Artificial light plays an integral role in modern cities, significantly enhancing human productivity and the efficiency of civilization. However, excessive illumination can lead to light pollution, imposing non-negligible economic burdens and threatening ecosystems and human health. Despite its critical importance, the exploration of its causes remains relatively limited within the field of artificial intelligence, leaving the factors contributing to light pollution incompletely understood and sustainable illumination planning a distant goal. To address this gap, we introduce a novel framework named Causally Aware Generative Adversarial Networks (CAGAN).
This innovative approach aims to uncover the fundamental drivers of light pollution within cities and offer intelligent solutions for optimal illumination resource allocation in the context of sustainable urban development. We commence by examining light pollution across 33,593 residential areas in seven global metropolises. Our findings reveal substantial influences on light pollution levels from various building types, notably grasslands, commercial centers and residential buildings as significant contributors. These discovered causal relationships are seamlessly integrated into the generative modeling framework, guiding the process of generating light pollution maps for diverse residential areas. Extensive experiments showcase CAGAN’s potential to inform and guide the implementation of effective strategies to mitigate light pollution. Our code and data are publicly available at https://github.com/zhangyuuao/Light_Pollution_CAGAN. \ No newline at end of file diff --git a/data/2024/aaai/Cautiously-Optimistic Knowledge Sharing for Cooperative Multi-Agent Reinforcement Learning b/data/2024/aaai/Cautiously-Optimistic Knowledge Sharing for Cooperative Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..274d76b04a --- /dev/null +++ b/data/2024/aaai/Cautiously-Optimistic Knowledge Sharing for Cooperative Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +While decentralized training is attractive in multi-agent reinforcement learning (MARL) for its excellent scalability and robustness, its inherent coordination challenges in collaborative tasks result in numerous interactions for agents to learn good policies. To alleviate this problem, action advising methods make experienced agents share their knowledge about what to do, while less experienced agents strictly follow the received advice. However, this method of sharing and utilizing knowledge may hinder the team's exploration of better states, as agents can be unduly influenced by suboptimal or even adverse advice, especially in the early stages of learning. Inspired by the fact that humans can learn not only from the success but also from the failure of others, this paper proposes a novel knowledge sharing framework called Cautiously-Optimistic kNowledge Sharing (CONS). CONS enables each agent to share both positive and negative knowledge and cautiously assimilate knowledge from others, thereby enhancing the efficiency of early-stage exploration and the agents' robustness to adverse advice. Moreover, considering the continuous improvement of policies, agents value negative knowledge more in the early stages of learning and shift their focus to positive knowledge in the later stages. Our framework can be easily integrated into existing Q-learning based methods without introducing additional training costs. We evaluate CONS in several challenging multi-agent tasks and find it excels in environments where optimal behavioral patterns are difficult to discover, surpassing the baselines in terms of convergence rate and final performance. 
\ No newline at end of file diff --git a/data/2024/aaai/CcDPM: A Continuous Conditional Diffusion Probabilistic Model for Inverse Design b/data/2024/aaai/CcDPM: A Continuous Conditional Diffusion Probabilistic Model for Inverse Design new file mode 100644 index 0000000000..c6ef25bb84 --- /dev/null +++ b/data/2024/aaai/CcDPM: A Continuous Conditional Diffusion Probabilistic Model for Inverse Design @@ -0,0 +1 @@ +Engineering design methods aim to generate new designs that meet desired performance requirements. Past work has directly introduced conditional Generative Adversarial Networks (cGANs) into this field and achieved promising results in single-point design problems (one performance requirement under one working condition). However, these methods assume that the performance requirements are distributed in categorical space, which is not reasonable in these scenarios. Although Continuous conditional GANs (CcGANs) introduce Vicinal Risk Minimization (VRM) to reduce the performance loss caused by this assumption, they still face the following challenges: 1) CcGANs cannot handle multi-point design problems (multiple performance requirements under multiple working conditions). 2) Their training process is time-consuming due to the high computational complexity of the vicinal loss. To address these issues, a Continuous conditional Diffusion Probabilistic Model (CcDPM) is proposed, which for the first time introduces the diffusion model into the engineering design area and VRM into the diffusion model. CcDPM adopts a novel sampling method called multi-point design sampling to deal with multi-point design problems. Moreover, a k-d tree is used in the training process of CcDPM to shorten the calculation time of the vicinal loss, speeding up the training process by 2-300 times in our experiments. Experiments on a synthetic problem and three real-world design problems demonstrate that CcDPM outperforms the state-of-the-art GAN models. \ No newline at end of file diff --git a/data/2024/aaai/Ced-NeRF: A Compact and Efficient Method for Dynamic Neural Radiance Fields b/data/2024/aaai/Ced-NeRF: A Compact and Efficient Method for Dynamic Neural Radiance Fields new file mode 100644 index 0000000000..8561380f10 --- /dev/null +++ b/data/2024/aaai/Ced-NeRF: A Compact and Efficient Method for Dynamic Neural Radiance Fields @@ -0,0 +1 @@ +Rendering photorealistic dynamic scenes has been a focus of recent research, with applications in virtual and augmented reality. While the Neural Radiance Field (NeRF) has shown remarkable rendering quality for static scenes, achieving real-time rendering of dynamic scenes remains challenging due to the expensive computation over the time dimension. The incorporation of explicit-based methods, specifically voxel grids, has been proposed to accelerate the training and rendering of neural radiance fields with a hybrid representation. However, employing a hybrid representation for dynamic scenes results in overfitting due to fast convergence, which can produce artifacts (e.g., floaters, noisy geometry) on novel views. To address this, we propose a compact and efficient method for dynamic neural radiance fields, namely Ced-NeRF, which only requires a small number of additional parameters to construct a hybrid representation of dynamic NeRF. Evaluation on dynamic scene datasets shows that our Ced-NeRF achieves fast rendering speeds while maintaining high-quality rendering results.
Our method outperforms the current state-of-the-art methods in terms of quality, training and rendering speed. \ No newline at end of file diff --git a/data/2024/aaai/Cell Graph Transformer for Nuclei Classification b/data/2024/aaai/Cell Graph Transformer for Nuclei Classification new file mode 100644 index 0000000000..2b12059160 --- /dev/null +++ b/data/2024/aaai/Cell Graph Transformer for Nuclei Classification @@ -0,0 +1 @@ +Nuclei classification is a critical step in computer-aided diagnosis with histopathology images. In the past, various methods have employed graph neural networks (GNN) to analyze cell graphs that model inter-cell relationships by considering nuclei as vertices. However, they are limited by the GNN mechanism that only passes messages among local nodes via fixed edges. To address the issue, we develop a cell graph transformer (CGT) that treats nodes and edges as input tokens to enable learnable adjacency and information exchange among all nodes. Nevertheless, training the transformer with a cell graph presents another challenge. Poorly initialized features can lead to noisy self-attention scores and inferior convergence, particularly when processing the cell graphs with numerous connections. Thus, we further propose a novel topology-aware pretraining method that leverages a graph convolutional network (GCN) to learn a feature extractor. The pre-trained features may suppress unreasonable correlations and hence ease the finetuning of CGT. Experimental results suggest that the proposed cell graph transformer with topology-aware pretraining significantly improves the nuclei classification results, and achieves the state-of-the-art performance. Code and models are available at https://github.com/lhaof/CGT \ No newline at end of file diff --git a/data/2024/aaai/Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control b/data/2024/aaai/Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control new file mode 100644 index 0000000000..76301cc825 --- /dev/null +++ b/data/2024/aaai/Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control @@ -0,0 +1 @@ +This study aims to improve the generation of 3D gestures by utilizing multimodal information from human speech. Previous studies have focused on incorporating additional modalities to enhance the quality of generated gestures. However, these methods perform poorly when certain modalities are missing during inference. To address this problem, we suggest using speech-derived multimodal priors to improve gesture generation. We introduce a novel method that separates priors from speech and employs multimodal priors as constraints for generating gestures. Our approach utilizes a chain-like modeling method to generate facial blendshapes, body movements, and hand gestures sequentially. Specifically, we incorporate rhythm cues derived from facial deformation and stylization prior based on speech emotions, into the process of generating gestures. By incorporating multimodal priors, our method improves the quality of generated gestures and eliminate the need for expensive setup preparation during inference. Extensive experiments and user studies confirm that our proposed approach achieves state-of-the-art performance. 
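To make the chain-like conditioning described in the Chain of Generation abstract above concrete, here is a minimal sketch of cascaded conditional generation in PyTorch; the module names, feature dimensions, and simple MLP stages are illustrative assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class CascadedGestureGenerator(nn.Module):
    """Illustrative chain: speech features -> facial blendshapes -> body motion -> hand motion.
    Each later stage is conditioned on the speech features plus all earlier outputs."""
    def __init__(self, d_speech=256, d_face=52, d_body=63, d_hands=90):
        super().__init__()
        self.face_net = nn.Sequential(nn.Linear(d_speech, 256), nn.ReLU(), nn.Linear(256, d_face))
        self.body_net = nn.Sequential(nn.Linear(d_speech + d_face, 256), nn.ReLU(), nn.Linear(256, d_body))
        self.hand_net = nn.Sequential(nn.Linear(d_speech + d_face + d_body, 256), nn.ReLU(), nn.Linear(256, d_hands))

    def forward(self, speech_feat):
        face = self.face_net(speech_feat)                                 # stage 1: facial blendshapes
        body = self.body_net(torch.cat([speech_feat, face], dim=-1))      # stage 2: body conditioned on face
        hands = self.hand_net(torch.cat([speech_feat, face, body], dim=-1))  # stage 3: hands conditioned on both
        return face, body, hands
```

The design point is simply that each later stage receives all earlier outputs as additional conditions, so the priors derived from earlier modalities constrain the later ones.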
\ No newline at end of file diff --git a/data/2024/aaai/Chain-of-Thought Improves Text Generation with Citations in Large Language Models b/data/2024/aaai/Chain-of-Thought Improves Text Generation with Citations in Large Language Models new file mode 100644 index 0000000000..3e436730a1 --- /dev/null +++ b/data/2024/aaai/Chain-of-Thought Improves Text Generation with Citations in Large Language Models @@ -0,0 +1 @@ +Previous studies disclose that Large Language Models (LLMs) suffer from hallucinations when generating texts, giving rise to a novel and challenging research topic that centers on enabling LLMs to generate texts with citations. Existing work exposes two limitations when using LLMs to generate answers to questions with provided documents: unsatisfactory answer correctness and poor citation quality. To tackle the above issues, we investigate using Chain-of-Thought (CoT) to elicit LLMs’ ability to synthesize correct answers from multiple documents, as well as properly cite these documents. Moreover, we propose a Citation Insurance Mechanism, which enables LLMs to detect missing citations and add them. We conduct experiments on the ALCE benchmark with six open-source LLMs. Experimental results demonstrate that: (1) the CoT prompting strategy significantly improves the quality of text generation with citations; (2) the Citation Insurance Mechanism delivers impressive gains in citation quality at a low cost; (3) our best approach performs comparably to the previous best ChatGPT-based baselines. Extensive analyses further validate the effectiveness of the proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Characterizing Information Seeking Events in Health-Related Social Discourse b/data/2024/aaai/Characterizing Information Seeking Events in Health-Related Social Discourse new file mode 100644 index 0000000000..a2aff280a3 --- /dev/null +++ b/data/2024/aaai/Characterizing Information Seeking Events in Health-Related Social Discourse @@ -0,0 +1 @@ +Social media sites have become a popular platform for individuals to seek and share health information. Despite the progress in natural language processing for social media mining, a gap remains in analyzing health-related texts in social discourse in the context of events. Event-driven analysis can offer insights into different facets of healthcare at an individual and collective level, including treatment options, misconceptions, knowledge gaps, etc. This paper presents a paradigm to characterize health-related information-seeking in social discourse through the lens of events. Events here are broad categories defined with domain experts that capture the trajectory of the treatment/medication. To illustrate the value of this approach, we analyze Reddit posts regarding medications for Opioid Use Disorder (OUD), a critical global health concern. To the best of our knowledge, this is the first attempt to define event categories for characterizing information-seeking in OUD social discourse. Guided by domain experts, we develop TREAT-ISE, a novel multilabel treatment information-seeking event dataset to analyze online discourse within an event-based framework. This dataset contains Reddit posts on information-seeking events related to recovery from OUD, where each post is annotated based on the type of event. We also establish a strong performance benchmark (77.4% F1 score) for the task by employing several machine learning and deep learning classifiers.
Finally, we thoroughly investigate the performance and errors of ChatGPT on this task, providing valuable insights into the LLM's capabilities and ongoing characterization efforts. \ No newline at end of file diff --git a/data/2024/aaai/Chasing Fairness in Graphs: A GNN Architecture Perspective b/data/2024/aaai/Chasing Fairness in Graphs: A GNN Architecture Perspective new file mode 100644 index 0000000000..704b60614d --- /dev/null +++ b/data/2024/aaai/Chasing Fairness in Graphs: A GNN Architecture Perspective @@ -0,0 +1,3 @@ +There has been significant progress in improving the performance of graph neural networks (GNNs) through enhancements in graph data, model architecture design, and training strategies. For fairness in graphs, recent studies achieve fair representations and predictions through either graph data pre-processing (e.g., node feature masking and topology rewiring) or fair training strategies (e.g., regularization, adversarial debiasing, and fair contrastive learning). How to achieve fairness in graphs from the model architecture perspective is less explored. More importantly, GNNs exhibit worse fairness performance compared to multilayer perceptrons (MLPs) since their model architecture (i.e., neighbor aggregation) amplifies biases. To this end, we aim to achieve fairness via a new GNN architecture. We propose Fair Message Passing (FMP) designed within a unified optimization framework for GNNs. Notably, FMP explicitly renders sensitive attribute usage in forward propagation for the node classification task using cross-entropy loss, without data pre-processing. In FMP, the aggregation is first adopted to utilize neighbors' information, and then a bias mitigation step explicitly pushes demographic group node representation centers together. +In this way, the FMP scheme can aggregate useful information from neighbors and mitigate bias to achieve a better fairness-prediction tradeoff. +Experiments on node classification tasks demonstrate that the proposed FMP outperforms several baselines in terms of fairness and accuracy on three real-world datasets. The code is available at https://github.com/zhimengj0326/FMP. \ No newline at end of file diff --git a/data/2024/aaai/ChatGPT-Generated Code Assignment Detection Using Perplexity of Large Language Models (Student Abstract) b/data/2024/aaai/ChatGPT-Generated Code Assignment Detection Using Perplexity of Large Language Models (Student Abstract) new file mode 100644 index 0000000000..8943ccd0a1 --- /dev/null +++ b/data/2024/aaai/ChatGPT-Generated Code Assignment Detection Using Perplexity of Large Language Models (Student Abstract) @@ -0,0 +1 @@ +In the era of large language models like ChatGPT, maintaining academic integrity in programming education has become challenging due to potential misuse. There is a pressing need for reliable detectors to identify ChatGPT-generated code. While previous studies have tackled model-generated text detection, identifying such code remains uncharted territory. In this paper, we introduce a novel method to discern ChatGPT-generated code. We employ targeted masking perturbation, emphasizing code sections with high perplexity. Fine-tuned CodeBERT is utilized to replace these masked sections, generating subtly perturbed samples. Our scoring system amalgamates overall perplexity, variations in code line perplexity, and burstiness. In this scoring scheme, a higher rank for the original code suggests it is more likely to be ChatGPT-generated.
The underlying principle is that code generated by models typically exhibits consistent, low perplexity and reduced burstiness, with its ranking remaining relatively stable even after subtle modifications. In contrast, human-written code, when perturbed, is more likely to produce samples that the model prefers. Our approach significantly outperforms current detectors, especially against OpenAI's text-davinci-003 model, with the average AUC rising from 0.56 (GPTZero baseline) to 0.87. \ No newline at end of file diff --git a/data/2024/aaai/Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing b/data/2024/aaai/Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing new file mode 100644 index 0000000000..0ab31a238e --- /dev/null +++ b/data/2024/aaai/Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing @@ -0,0 +1 @@ +Deep reinforcement learning (DRL) has gained immense success in many applications, including gaming AI, robotics, and system scheduling. Distributed algorithms and architectures (e.g., the actor-learner architecture) have been widely proposed to accelerate DRL training with large-scale server-based clusters. However, training on-policy algorithms with the actor-learner architecture unavoidably induces resource waste due to synchronization between learners and actors, thus resulting in significant extra billing. As a promising alternative, serverless computing naturally fits on-policy synchronization and alleviates resource waste in distributed DRL training with pay-as-you-go pricing. Yet, no prior work has leveraged serverless computing to facilitate DRL training. This paper proposes MinionsRL, the first serverless distributed DRL training framework, which aims to improve DRL training speed and cost-efficiency with dynamic actor scaling. We prototype MinionsRL on top of Microsoft Azure Container Instances and evaluate it with popular DRL tasks from OpenAI Gym. Extensive experiments show that MinionsRL reduces total training time by up to 52% and training cost by 86% compared to the latest solutions. \ No newline at end of file diff --git a/data/2024/aaai/Check-In Desk Scheduling Optimisation at CDG International Airport b/data/2024/aaai/Check-In Desk Scheduling Optimisation at CDG International Airport new file mode 100644 index 0000000000..1ee0554d6f --- /dev/null +++ b/data/2024/aaai/Check-In Desk Scheduling Optimisation at CDG International Airport @@ -0,0 +1,4 @@ +More than ever, air transport players (i.e., airline and airport companies) in an intensely competitive climate need to benefit from a carefully optimized management of airport resources to improve the quality of service and control the induced costs. +In this paper, we investigate the Airport Check-in Desk Assignment Problem. +We propose a Constraint Programming (CP) model for this problem, and present some promising experimental results on data from ADP (Aéroport de Paris). +Our work has been deployed in a pre-production environment for one year. \ No newline at end of file diff --git a/data/2024/aaai/Chinese Spelling Correction as Rephrasing Language Model b/data/2024/aaai/Chinese Spelling Correction as Rephrasing Language Model new file mode 100644 index 0000000000..f93427d1ce --- /dev/null +++ b/data/2024/aaai/Chinese Spelling Correction as Rephrasing Language Model @@ -0,0 +1 @@ +This paper studies Chinese Spelling Correction (CSC), which aims to detect and correct potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. However, we note a critical flaw in the process of tagging one character to another: the correction is excessively conditioned on the error. This is the opposite of the human mindset, where individuals rephrase the complete sentence based on its semantics, rather than relying solely on previously memorized error patterns. Such a counter-intuitive learning process creates a bottleneck in the generalizability and transferability of machine spelling correction. To address this, we propose Rephrasing Language Modeling (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging. This novel training paradigm achieves new state-of-the-art results across fine-tuned and zero-shot CSC benchmarks, outperforming previous counterparts by a large margin. Our method also learns transferable language representations when CSC is jointly trained with other tasks. \ No newline at end of file diff --git a/data/2024/aaai/ChromaFusionNet (CFNet): Natural Fusion of Fine-Grained Color Editing b/data/2024/aaai/ChromaFusionNet (CFNet): Natural Fusion of Fine-Grained Color Editing new file mode 100644 index 0000000000..dfcd005335 --- /dev/null +++ b/data/2024/aaai/ChromaFusionNet (CFNet): Natural Fusion of Fine-Grained Color Editing @@ -0,0 +1 @@ +Digital image enhancement aims to deliver visually striking, pleasing images that align with human perception. While global techniques can elevate the image's overall aesthetics, fine-grained color enhancement can further boost visual appeal and expressiveness. However, colorists frequently face challenges in achieving accurate, localized color adjustments. Direct composition of these local edits can result in spatial color inconsistencies. Existing methods, including color style transfer and image harmonization, exhibit inconsistencies, especially at boundary regions. Addressing this, we present ChromaFusionNet (CFNet), a novel approach that views the color fusion problem through the lens of image color inpainting. Built on the Vision Transformer architecture, CFNet captures global context and delivers high-fidelity outputs, seamlessly blending colors while preserving boundary integrity. Empirical studies on ImageNet and COCO datasets demonstrate CFNet's superiority over existing methods in maintaining color harmony and color fidelity. Robustness evaluations and user studies have further validated the effectiveness of CFNet. In conclusion, CFNet introduces an innovative approach to seamless, fine-grained color fusion, paving the way for advancements in the domain of fine-grained color editing. Code and pretrained models are available at our project page: https://yidong.pro/projects/cfnet. \ No newline at end of file diff --git a/data/2024/aaai/Chronic Poisoning: Backdoor Attack against Split Learning b/data/2024/aaai/Chronic Poisoning: Backdoor Attack against Split Learning new file mode 100644 index 0000000000..c1f7fdefba --- /dev/null +++ b/data/2024/aaai/Chronic Poisoning: Backdoor Attack against Split Learning @@ -0,0 +1 @@ +Split learning is a computing resource-friendly distributed learning framework that protects client training data by splitting the model between the client and server. Previous work has proved that split learning faces a severe risk of privacy leakage, as a malicious server can recover the client's private data by hijacking the training process.
In this paper, we first explore the vulnerability of split learning to server-side backdoor attacks, where our goal is to compromise the model's integrity. Since the server-side attacker cannot access the training data and client model in split learning, traditional poisoning-based backdoor attack methods are no longer applicable. Therefore, constructing backdoor attacks in split learning poses significant challenges. Our strategy involves the attacker establishing a shadow model on the server side that can encode backdoor samples and guiding the client model to learn from this model during the training process, thereby enabling the client to acquire the same capability. Based on these insights, we propose a three-stage backdoor attack framework named SFI. Our attack framework minimizes assumptions about the attacker's background knowledge and ensures that the attack process remains imperceptible to the client. We implement SFI on various benchmark datasets, and extensive experimental results demonstrate its effectiveness and generality. For example, the success rates of our attack on the MNIST, Fashion, and CIFAR10 datasets all exceed 90%, with limited impact on the main task. \ No newline at end of file diff --git a/data/2024/aaai/CityPulse: Fine-Grained Assessment of Urban Change with Street View Time Series b/data/2024/aaai/CityPulse: Fine-Grained Assessment of Urban Change with Street View Time Series new file mode 100644 index 0000000000..b71824f585 --- /dev/null +++ b/data/2024/aaai/CityPulse: Fine-Grained Assessment of Urban Change with Street View Time Series @@ -0,0 +1 @@ +Urban transformations have profound societal impact on both individuals and communities at large. Accurately assessing these shifts is essential for understanding their underlying causes and ensuring sustainable urban planning. Traditional measurements often encounter constraints in spatial and temporal granularity, failing to capture real-time physical changes. Street view imagery, capturing the heartbeat of urban spaces from a pedestrian point of view, can serve as a high-definition, up-to-date, and on-the-ground visual proxy of urban change. We curate the largest street view time series dataset to date, and propose an end-to-end change detection model to effectively capture physical alterations in the built environment at scale. We demonstrate the effectiveness of our proposed method through benchmark comparisons with previous literature and by implementing it at the city-wide level. Our approach has the potential to supplement existing datasets and serve as a fine-grained and accurate assessment of urban change. \ No newline at end of file diff --git a/data/2024/aaai/Clarifying the Behavior and the Difficulty of Adversarial Training b/data/2024/aaai/Clarifying the Behavior and the Difficulty of Adversarial Training new file mode 100644 index 0000000000..8f9ea93192 --- /dev/null +++ b/data/2024/aaai/Clarifying the Behavior and the Difficulty of Adversarial Training @@ -0,0 +1 @@ +Adversarial training is usually difficult to optimize. This paper provides conceptual and analytic insights into the difficulty of adversarial training via a simple theoretical study, where we derive the approximate dynamics of a recursive multi-step attack in a simple setting. Despite the simplicity of our theory, it still reveals verifiable predictions about various phenomena in adversarial training under real-world settings.
First, compared to vanilla training, adversarial training is more likely to boost the influence of input samples with large gradient norms in an exponential manner. Besides, adversarial training also strengthens the influence of the Hessian matrix of the loss w.r.t. network parameters, which is more likely to make network parameters oscillate and boosts the difficulty of adversarial training. \ No newline at end of file diff --git a/data/2024/aaai/Class-Attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective b/data/2024/aaai/Class-Attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective new file mode 100644 index 0000000000..c79992fd80 --- /dev/null +++ b/data/2024/aaai/Class-Attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective @@ -0,0 +1 @@ +Modern classification problems exhibit heterogeneities across individual classes: Each class may have unique attributes, such as sample size, label quality, or predictability (easy vs difficult), and variable importance at test-time. Without care, these heterogeneities impede the learning process, most notably, when optimizing fairness objectives. Confirming this, under a gaussian mixture setting, we show that the optimal SVM classifier for balanced accuracy needs to be adaptive to the class attributes. This motivates us to propose CAP: An effective and general method that generates a class-specific learning strategy (e.g.~hyperparameter) based on the attributes of that class. This way, optimization process better adapts to heterogeneities. CAP leads to substantial improvements over the naive approach of assigning separate hyperparameters to each class. We instantiate CAP for loss function design and post-hoc logit adjustment, with emphasis on label-imbalanced problems. We show that CAP is competitive with prior art and its flexibility unlocks clear benefits for fairness objectives beyond balanced accuracy. Finally, we evaluate CAP on problems with label noise as well as weighted test objectives to showcase how CAP can jointly adapt to different heterogeneities. \ No newline at end of file diff --git a/data/2024/aaai/Cluster-Based Sampling in Hindsight Experience Replay for Robotic Tasks (Student Abstract) b/data/2024/aaai/Cluster-Based Sampling in Hindsight Experience Replay for Robotic Tasks (Student Abstract) new file mode 100644 index 0000000000..4ad5b6c930 --- /dev/null +++ b/data/2024/aaai/Cluster-Based Sampling in Hindsight Experience Replay for Robotic Tasks (Student Abstract) @@ -0,0 +1 @@ +In multi-goal reinforcement learning with a sparse binary reward, training agents is particularly challenging, due to a lack of successful experiences. To solve this problem, hindsight experience replay (HER) generates successful experiences even from unsuccessful ones. However, generating successful experiences from uniformly sampled ones is not an efficient process. In this paper, the impact of exploiting the property of achieved goals in generating successful experiences is investigated and a novel cluster-based sampling strategy is proposed. The proposed sampling strategy groups episodes with different achieved goals by using a cluster model and samples experiences in the manner of HER to create the training batch. The proposed method is validated by experiments with three robotic control tasks of the OpenAI Gym. The results of experiments demonstrate that the proposed method is substantially sample efficient and achieves better performance than baseline approaches. 
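As a concrete illustration of the cluster-based sampling strategy sketched in the HER student abstract above, the following minimal Python sketch groups episodes by their achieved goals and samples relabeled transitions per cluster; the use of k-means over final achieved goals, the per-cluster batch composition, and the "future" relabeling probability are illustrative assumptions rather than the authors' exact algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_cluster_her_batch(episodes, batch_size, n_clusters=8, future_p=0.8, rng=np.random):
    """episodes: list of dicts with 'obs', 'actions', 'achieved_goals' arrays of length T.
    Clusters episodes by final achieved goal, then samples transitions evenly across clusters,
    relabeling goals with the standard HER 'future' strategy."""
    final_goals = np.stack([ep["achieved_goals"][-1] for ep in episodes])
    k = min(n_clusters, len(episodes))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(final_goals)

    batch = []
    per_cluster = max(1, batch_size // k)
    for c in set(labels):
        idx = np.flatnonzero(labels == c)
        for _ in range(per_cluster):
            ep = episodes[rng.choice(idx)]
            T = len(ep["actions"])
            t = rng.randint(T)
            goal = ep["achieved_goals"][-1]
            if rng.rand() < future_p:                      # HER: relabel with a future achieved goal
                goal = ep["achieved_goals"][rng.randint(t, T)]
            batch.append((ep["obs"][t], ep["actions"][t], goal))
    return batch[:batch_size]
```

The intent is simply that every cluster of achieved goals contributes relabeled experiences to the batch, rather than sampling episodes uniformly.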
\ No newline at end of file diff --git a/data/2024/aaai/Co-designing AI Education Curriculum with Cross-Disciplinary High School Teachers b/data/2024/aaai/Co-designing AI Education Curriculum with Cross-Disciplinary High School Teachers new file mode 100644 index 0000000000..b667cfc4b6 --- /dev/null +++ b/data/2024/aaai/Co-designing AI Education Curriculum with Cross-Disciplinary High School Teachers @@ -0,0 +1 @@ +High school teachers from many disciplines have growing interests in teaching about artificial intelligence (AI). This cross-disciplinary interest reflects the prevalence of AI tools across society, such as Generative AI tools built upon Large Language Models (LLM). However, high school classes are unique and complex environments, led by teachers with limited time and resources with priorities that vary by class and the students they serve. Therefore, developing curricula about AI for classes that span many disciplines (e.g. history, art, math) must involve centering the expertise of cross-disciplinary teachers. In this study, we conducted five collaborative curricular co-design sessions with eight teachers who taught high school humanities and STEM classes. We sought to understand how teachers considered AI when it was taught in art, math, and social studies contexts, as well as opportunities and challenges they identified with incorporating AI tools into their instruction. We found that teachers considered technical skills and ethical debates around AI, opportunities for "dual exploration" between AI and disciplinary learning, and limitations of AI tools as supporting engagement and reflection but also potentially distracting. We interpreted our findings relative to co-designing adaptable AI curricula to support teaching about and with AI across high school disciplines. \ No newline at end of file diff --git a/data/2024/aaai/CoLAL: Co-learning Active Learning for Text Classification b/data/2024/aaai/CoLAL: Co-learning Active Learning for Text Classification new file mode 100644 index 0000000000..ae7d56f68e --- /dev/null +++ b/data/2024/aaai/CoLAL: Co-learning Active Learning for Text Classification @@ -0,0 +1 @@ +In the machine learning field, the challenge of effectively learning with limited data has become increasingly crucial. Active Learning (AL) algorithms play a significant role in this by enhancing model performance. We introduce a novel AL algorithm, termed Co-learning (CoLAL), designed to select the most diverse and representative samples within a training dataset. This approach utilizes noisy labels and predictions made by the primary model on unlabeled data. By leveraging a probabilistic graphical model, we combine two multi-class classifiers into a binary one. This classifier determines if both the main and the peer models agree on a prediction. If they do, the unlabeled sample is assumed to be easy to classify and is thus not beneficial to increase the target model's performance. We prioritize data that represents the unlabeled set without overlapping decision boundaries. The discrepancies between these boundaries can be estimated by the probability that two models result in the same prediction. Through theoretical analysis and experimental validation, we reveal that the integration of noisy labels into the peer model effectively identifies target model's potential inaccuracies. 
We evaluated the CoLAL method across seven benchmark datasets: four text datasets (AGNews, DBPedia, PubMed, SST-2) against text-based state-of-the-art (SOTA) baselines, and three image datasets (CIFAR100, MNIST, OpenML-155) against computer vision SOTA baselines. The results show that our CoLAL method significantly outperforms existing SOTA in text-based AL, and is competitive with SOTA image-based AL techniques. \ No newline at end of file diff --git a/data/2024/aaai/CoPL: Contextual Prompt Learning for Vision-Language Understanding b/data/2024/aaai/CoPL: Contextual Prompt Learning for Vision-Language Understanding new file mode 100644 index 0000000000..ace61af228 --- /dev/null +++ b/data/2024/aaai/CoPL: Contextual Prompt Learning for Vision-Language Understanding @@ -0,0 +1,2 @@ +Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. More recently, their generalization ability has been further extended by incorporating trainable prompts, borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we observe that these prompts are trained based on global image features, which limits them in two ways: First, by using global features, these prompts could focus less on the discriminative foreground of the image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally whereas intuitively, prompts should be reweighted according to the semantics of the image. We address these as part of our proposed Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to +the localized features of the image. Our key innovations over earlier works include using local image features as part of the prompt learning process, and more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand. This gives us dynamic prompts that are both aligned to local image features and aware of local contextual relationships. Our extensive experiments on a variety of standard and few-shot datasets show that our method produces substantially improved performance compared to current state-of-the-art methods. We also demonstrate both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features. \ No newline at end of file diff --git a/data/2024/aaai/CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding b/data/2024/aaai/CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding new file mode 100644 index 0000000000..0f6cd10a45 --- /dev/null +++ b/data/2024/aaai/CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding @@ -0,0 +1 @@ +This paper studies the spatio-temporal video grounding task, which aims to localize a spatio-temporal tube in an untrimmed video based on the given text description of an event. Existing one-stage approaches suffer from insufficient space-time interaction in two aspects: i) less precise prediction of event temporal boundaries, and ii) inconsistency in object prediction for the same event across adjacent frames. To address these issues, we propose a framework of Comprehensive Space-Time entAnglement (CoSTA) to densely entangle space-time multi-modal features for spatio-temporal localization.
Specifically, we propose a space-time collaborative encoder to extract comprehensive video features and leverage Transformer to perform spatio-temporal multi-modal understanding. Our entangled decoder couples temporal boundary prediction and spatial localization via an entangled query, boasting an enhanced ability to capture object-event relationships. We conduct extensive experiments on the challenging benchmarks of HC-STVG and VidSTG, where CoSTA outperforms existing state-of-the-art methods, demonstrating its effectiveness for this task. \ No newline at end of file diff --git a/data/2024/aaai/CoVR: Learning Composed Video Retrieval from Web Video Captions b/data/2024/aaai/CoVR: Learning Composed Video Retrieval from Web Video Captions new file mode 100644 index 0000000000..8b9f6722f0 --- /dev/null +++ b/data/2024/aaai/CoVR: Learning Composed Video Retrieval from Web Video Captions @@ -0,0 +1 @@ +Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr. \ No newline at end of file diff --git a/data/2024/aaai/Coalition Formation for Task Allocation Using Multiple Distance Metrics (Student Abstract) b/data/2024/aaai/Coalition Formation for Task Allocation Using Multiple Distance Metrics (Student Abstract) new file mode 100644 index 0000000000..6f790cbfd0 --- /dev/null +++ b/data/2024/aaai/Coalition Formation for Task Allocation Using Multiple Distance Metrics (Student Abstract) @@ -0,0 +1 @@ +Simultaneous Coalition Structure Generation and Assignment (SCSGA) is an important research problem in multi-agent systems. Given n agents and m tasks, the aim of SCSGA is to form m disjoint coalitions of n agents such that between the coalitions and tasks there is a one-to-one mapping, which ensures each coalition is capable of accomplishing the assigned task. SCSGA with Multi-dimensional Features (SCSGA-MF) extends the problem by introducing a d-dimensional vector for each agent and task. We propose a heuristic algorithm called Multiple Distance Metric (MDM) approach to solve SCSGA-MF. Experimental results confirm that MDM produces near optimal solutions, while being feasible for large-scale inputs within a reasonable time frame. 
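To illustrate the flavor of distance-based heuristics for SCSGA-MF described in the student abstract above, here is a minimal greedy sketch in Python; the Euclidean metric and the gap-filling assignment rule are illustrative assumptions, not the authors' MDM algorithm:

```python
import numpy as np

def greedy_coalition_assignment(agent_feats, task_feats):
    """agent_feats: (n, d) array of agent feature vectors; task_feats: (m, d) array of task requirements.
    Greedily assigns each agent to the task whose remaining requirement gap it matches best
    under Euclidean distance, yielding m disjoint coalitions."""
    n, d = agent_feats.shape
    m = task_feats.shape[0]
    coalitions = [[] for _ in range(m)]
    covered = np.zeros((m, d))
    for a in range(n):
        gaps = task_feats - covered                          # what each task still needs
        dists = np.linalg.norm(gaps - agent_feats[a], axis=1)
        best = int(np.argmin(dists))                         # task whose gap the agent fits best
        coalitions[best].append(a)
        covered[best] += agent_feats[a]
    return coalitions
```

Swapping the distance function (e.g., Manhattan or cosine) is where a multiple-distance-metric variant would differ from this single-metric sketch.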
\ No newline at end of file diff --git a/data/2024/aaai/Code-Style In-Context Learning for Knowledge-Based Question Answering b/data/2024/aaai/Code-Style In-Context Learning for Knowledge-Based Question Answering new file mode 100644 index 0000000000..b5ed9fc674 --- /dev/null +++ b/data/2024/aaai/Code-Style In-Context Learning for Knowledge-Based Question Answering @@ -0,0 +1 @@ +Current methods for Knowledge-Based Question Answering (KBQA) usually rely on complex training techniques and model frameworks, leading to many limitations in practical applications. Recently, the emergence of In-Context Learning (ICL) capabilities in Large Language Models (LLMs) provides a simple and training-free semantic parsing paradigm for KBQA: Given a small number of questions and their labeled logical forms as demo examples, LLMs can understand the task intent and generate the logic form for a new question. However, current powerful LLMs have little exposure to logic forms during pre-training, resulting in a high format error rate. To solve this problem, we propose a code-style in-context learning method for KBQA, which converts the generation process of unfamiliar logical form into the more familiar code generation process for LLMs. Experimental results on three mainstream datasets show that our method dramatically mitigated the formatting error problem in generating logic forms while realizing a new SOTA on WebQSP, GrailQA, and GraphQ under the few-shot setting. The code and supplementary files are released at https://github.com/Arthurizijar/KB-Coder. \ No newline at end of file diff --git a/data/2024/aaai/Coevolutionary Algorithm for Building Robust Decision Trees under Minimax Regret b/data/2024/aaai/Coevolutionary Algorithm for Building Robust Decision Trees under Minimax Regret new file mode 100644 index 0000000000..8855985627 --- /dev/null +++ b/data/2024/aaai/Coevolutionary Algorithm for Building Robust Decision Trees under Minimax Regret @@ -0,0 +1 @@ +In recent years, there has been growing interest in developing robust machine learning (ML) models that can withstand adversarial attacks, including one of the most widely adopted, efficient, and interpretable ML algorithms—decision trees (DTs). This paper proposes a novel coevolutionary algorithm (CoEvoRDT) designed to create robust DTs capable of handling noisy high-dimensional data in adversarial contexts. Motivated by the limitations of traditional DT algorithms, we leverage adaptive coevolution to allow DTs to evolve and learn from interactions with perturbed input data. CoEvoRDT alternately evolves competing populations of DTs and perturbed features, enabling construction of DTs with desired properties. CoEvoRDT is easily adaptable to various target metrics, allowing the use of tailored robustness criteria such as minimax regret. Furthermore, CoEvoRDT has potential to improve the results of other state-of-the-art methods by incorporating their outcomes (DTs they produce) into the initial population and optimize them in the process of coevolution. Inspired by the game theory, CoEvoRDT utilizes mixed Nash equilibrium to enhance convergence. The method is tested on 20 popular datasets and shows superior performance compared to 4 state-of-the-art algorithms. It outperformed all competing methods on 13 datasets with adversarial accuracy metrics, and on all 20 considered datasets with minimax regret. 
Strong experimental results and flexibility in choosing the error measure make CoEvoRDT a promising approach for constructing robust DTs in real-world applications. \ No newline at end of file diff --git a/data/2024/aaai/ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field b/data/2024/aaai/ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field new file mode 100644 index 0000000000..b377a5ead4 --- /dev/null +++ b/data/2024/aaai/ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have demonstrated impressive potential in synthesizing novel views from dense input, however, their effectiveness is challenged when dealing with sparse input. Existing approaches that incorporate additional depth or semantic supervision can alleviate this issue to an extent. However, the process of supervision collection is not only costly but also potentially inaccurate. In our work, we introduce a novel model: the Collaborative Neural Radiance Fields (ColNeRF) designed to work with sparse input. The collaboration in ColNeRF includes the cooperation among sparse input source images and the cooperation among the output of the NeRF. Through this, we construct a novel collaborative module that aligns information from various views and meanwhile imposes self-supervised constraints to ensure multi-view consistency in both geometry and appearance. A Collaborative Cross-View Volume Integration module (CCVI) is proposed to capture complex occlusions and implicitly infer the spatial location of objects. Moreover, we introduce self-supervision of target rays projected in multiple directions to ensure geometric and color consistency in adjacent regions. Benefiting from the collaboration at the input and output ends, ColNeRF is capable of capturing richer and more generalized scene representation, thereby facilitating higher-quality results of the novel view synthesis. Our extensive experimental results demonstrate that ColNeRF outperforms state-of-the-art sparse input generalizable NeRF methods. Furthermore, our approach exhibits superiority in fine-tuning towards adapting to new scenes, achieving competitive performance compared to per-scene optimized NeRF-based methods while significantly reducing computational costs. Our code is available at: https://github.com/eezkni/ColNeRF. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Consortium of Foundation Models for Open-World Few-Shot Learning b/data/2024/aaai/Collaborative Consortium of Foundation Models for Open-World Few-Shot Learning new file mode 100644 index 0000000000..f085cfcd72 --- /dev/null +++ b/data/2024/aaai/Collaborative Consortium of Foundation Models for Open-World Few-Shot Learning @@ -0,0 +1 @@ +Open-World Few-Shot Learning (OFSL) is a crucial research field dedicated to accurately identifying target samples in scenarios where data is limited and labels are unreliable. This research holds significant practical implications and is highly relevant to real-world applications. Recently, the advancements in foundation models like CLIP and DINO have showcased their robust representation capabilities even in resource-constrained settings with scarce data. This realization has brought about a transformative shift in focus, moving away from “building models from scratch” towards “effectively harnessing the potential of foundation models to extract pertinent prior knowledge suitable for OFSL and utilizing it sensibly”. 
Motivated by this perspective, we introduce the Collaborative Consortium of Foundation Models (CO3), which leverages CLIP, DINO, GPT-3, and DALL-E to collectively address the OFSL problem. CO3 comprises four key blocks: (1) the Label Correction Block (LC-Block) corrects unreliable labels, (2) the Data Augmentation Block (DA-Block) enhances available data, (3) the Feature Extraction Block (FE-Block) extracts multi-modal features, and (4) the Text-guided Fusion Adapter (TeFu-Adapter) integrates multiple features while mitigating the impact of noisy labels through semantic constraints. Only the adapter's parameters are adjustable, while the others remain frozen. Through collaboration among these foundation models, CO3 effectively unlocks their potential and unifies their capabilities to achieve state-of-the-art performance on multiple benchmark datasets. https://github.com/The-Shuai/CO3. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Learning across Heterogeneous Systems with Pre-Trained Models b/data/2024/aaai/Collaborative Learning across Heterogeneous Systems with Pre-Trained Models new file mode 100644 index 0000000000..a766ab47ba --- /dev/null +++ b/data/2024/aaai/Collaborative Learning across Heterogeneous Systems with Pre-Trained Models @@ -0,0 +1 @@ +The increasingly decentralized and private nature of data in our digital society has motivated the development of personalized, collaborative intelligent systems that enable knowledge aggregation across multiple data owners while accommodating for their data privacy and system constraints. However, collaborative learning has only been investigated in simple and limited settings: isolated task scenarios where learning begins from scratch and does not build on prior expertise; learned model is represented in task-specific forms which are not generalizable to unseen, emerging scenarios; and more often, a universal model representation is assumed across collaborators, ignoring their local compute constraints or input representations. This restricts its practicality in continual learning scenarios with limited task data, which demand continuous adaptation and knowledge transfer across different information silos, tasks, and learning models, as well as the utilization of prior solution expertises. To overcome these limitations, my research has been focused on developing effective and scalable resource-aware collaborative learning frameworks across heterogeneous systems. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Synthesis of Patient Records through Multi-Visit Health State Inference b/data/2024/aaai/Collaborative Synthesis of Patient Records through Multi-Visit Health State Inference new file mode 100644 index 0000000000..1fc1ba3936 --- /dev/null +++ b/data/2024/aaai/Collaborative Synthesis of Patient Records through Multi-Visit Health State Inference @@ -0,0 +1 @@ +Electronic health records (EHRs) have become the foundation of machine learning applications in healthcare, while the utility of real patient records is often limited by privacy and security concerns. Synthetic EHR generation provides an additional perspective to compensate for this limitation. Most existing methods synthesize new records based on real EHR data, without consideration of different types of events in EHR data, which cannot control the event combinations in line with medical common sense. 
In this paper, we propose MSIC, a Multi-visit health Status Inference model for Collaborative EHR synthesis, to address these limitations. First, we formulate the synthetic EHR generation process as a probabilistic graphical model and tightly connect different types of events by modeling the latent health states. Then, we derive a health state inference method tailored for the multi-visit scenario to effectively utilize previous records to synthesize current and future records. Furthermore, we propose to generate medical reports to add textual descriptions for each medical event, providing broader applications for synthesized EHR data. For generating different paragraphs in each visit, we incorporate a multi-generator deliberation framework to coordinate message passing among multiple generators and employ a two-phase decoding strategy to generate high-quality reports. Our extensive experiments on the widely used benchmarks, MIMIC-III and MIMIC-IV, demonstrate that MSIC advances state-of-the-art results on the quality of synthetic data while maintaining low privacy risks. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Tooth Motion Diffusion Model in Digital Orthodontics b/data/2024/aaai/Collaborative Tooth Motion Diffusion Model in Digital Orthodontics new file mode 100644 index 0000000000..9b54f76846 --- /dev/null +++ b/data/2024/aaai/Collaborative Tooth Motion Diffusion Model in Digital Orthodontics @@ -0,0 +1,9 @@ +Tooth motion generation is an essential task in digital orthodontic treatment for precise and quick dental healthcare; it aims to generate the whole intermediate tooth motion process given the initial pathological and target ideal tooth alignments. +Most prior works on such multi-agent motion planning problems result in complex solutions. +Moreover, the occlusal relationship between upper and lower teeth is often overlooked. +In this paper, we propose a collaborative tooth motion diffusion model. +The critical insight is to remodel the problem as a diffusion process. +In this sense, we model the whole tooth motion distribution with a diffusion model and transform the planning problem into a sampling process from this distribution. +We design a tooth latent representation to provide accurate conditional guides, consisting of two key components: the tooth frame represents the position and posture, and the tooth latent shape code represents the geometric morphology. +Subsequently, we present a collaborative diffusion model to learn the multi-tooth motion distribution based on inter-tooth and occlusal constraints, which are implemented by a graph structure and new loss functions, respectively. +Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in the application of orthodontics compared with state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis b/data/2024/aaai/Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis new file mode 100644 index 0000000000..1855495122 --- /dev/null +++ b/data/2024/aaai/Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis @@ -0,0 +1 @@ +Video Correlation Learning (VCL), which aims to analyze the relationships between videos, has been widely studied and applied in various general video tasks. 
However, applying VCL to instructional videos is still quite challenging due to their intrinsic procedural temporal structure. Specifically, procedural knowledge is critical for accurate correlation analyses on instructional videos. Nevertheless, current procedure-learning methods heavily rely on step-level annotations, which are costly and not scalable. To address this problem, we introduce a weakly supervised framework called Collaborative Procedure Alignment (CPA) for procedure-aware correlation learning on instructional videos. Our framework comprises two core modules: collaborative step mining and frame-to-step alignment. The collaborative step mining module enables simultaneous and consistent step segmentation for paired videos, leveraging the semantic and temporal similarity between frames. Based on the identified steps, the frame-to-step alignment module performs alignment between the frames and steps across videos. The alignment result serves as a measurement of the correlation distance between two videos. We instantiate our framework in two distinct instructional video tasks: sequence verification and action quality assessment. Extensive experiments validate the effectiveness of our approach in providing accurate and interpretable correlation analyses for instructional videos. \ No newline at end of file diff --git a/data/2024/aaai/Color Event Enhanced Single-Exposure HDR Imaging b/data/2024/aaai/Color Event Enhanced Single-Exposure HDR Imaging new file mode 100644 index 0000000000..da7f65d0b6 --- /dev/null +++ b/data/2024/aaai/Color Event Enhanced Single-Exposure HDR Imaging @@ -0,0 +1,27 @@ +Single-exposure high dynamic range (HDR) imaging aims +to reconstruct the wide-range intensities of a scene by using +its single low dynamic range (LDR) image, thus providing +significant efficiency. Existing methods pay high attention to +restoring the luminance by inversing the tone-mapping process, +while the color in the over-/under-exposed area cannot +be well restored due to the information loss of the single +LDR image. To address this issue, we introduce color +events into the imaging pipeline, which record asynchronous +pixel-wise color changes in a high dynamic range, enabling +edge-like scene perception under challenging lighting conditions. +Specifically, we propose a joint framework that incorporates +color events and a single LDR image to restore +both content and color of an HDR image, where an exposureaware +transformer (EaT) module is designed to propagate the +informative hints, provided by the normal-exposed LDR regions +and the event streams, to the missing areas. In this +module, an exposure-aware mask is estimated to suppress +distractive information and strengthen the restoration of the +over-/under-exposed regions. To our knowledge, we are the +first to use color events to enhance single-exposure HDR +imaging. We also contribute corresponding datasets, consisting +of synthesized datasets and a real-world dataset collected +by a DAVIS346-color camera. The datasets can be found at +https://www.kaggle.com/datasets/mengyaocui/ce-hdr. Extensive +experiments demonstrate the effectiveness of the proposed +method. 
\ No newline at end of file diff --git a/data/2024/aaai/Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling b/data/2024/aaai/Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling new file mode 100644 index 0000000000..84e5908657 --- /dev/null +++ b/data/2024/aaai/Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling @@ -0,0 +1 @@ +Proximal Policy Optimization (PPO), a popular on-policy deep reinforcement learning method, employs a stochastic policy for exploration. In this paper, we propose a colored noise-based stochastic policy variant of PPO. Previous research highlighted the importance of temporal correlation in action noise for effective exploration in off-policy reinforcement learning. Building on this, we investigate whether correlated noise can also enhance exploration in on-policy methods like PPO. We discovered that correlated noise for action selection improves learning performance and outperforms the currently popular uncorrelated white noise approach in on-policy methods. Unlike off-policy learning, where pink noise was found to be highly effective, we found that a colored noise, intermediate between white and pink, performed best for on-policy learning in PPO. We examined the impact of varying the amount of data collected for each update by modifying the number of parallel simulation environments for data collection and observed that with a larger number of parallel environments, more strongly correlated noise is beneficial. Due to the significant impact and ease of implementation, we recommend switching to correlated noise as the default noise source in PPO. \ No newline at end of file diff --git a/data/2024/aaai/Colorizing Monochromatic Radiance Fields b/data/2024/aaai/Colorizing Monochromatic Radiance Fields new file mode 100644 index 0000000000..f270bf1c81 --- /dev/null +++ b/data/2024/aaai/Colorizing Monochromatic Radiance Fields @@ -0,0 +1 @@ +Though Neural Radiance Fields (NeRF) can produce colorful 3D representations of the world by using a set of 2D images, such ability becomes non-existent when only monochromatic images are provided. Since color is necessary in representing the world, reproducing color from monochromatic radiance fields becomes crucial. To achieve this goal, instead of manipulating the monochromatic radiance fields directly, we consider it as a representation-prediction task in the Lab color space. By first constructing the luminance and density representation using monochromatic images, our prediction stage can recreate color representation on the basis of an image colorization module. We then reproduce a colorful implicit model through the representation of luminance, density, and color. Extensive experiments have been conducted to validate the effectiveness of our approaches. Our project page: https://liquidammonia.github.io/color-nerf. \ No newline at end of file diff --git a/data/2024/aaai/Colour Passing Revisited: Lifted Model Construction with Commutative Factors b/data/2024/aaai/Colour Passing Revisited: Lifted Model Construction with Commutative Factors new file mode 100644 index 0000000000..bde6aab26a --- /dev/null +++ b/data/2024/aaai/Colour Passing Revisited: Lifted Model Construction with Commutative Factors @@ -0,0 +1 @@ +Lifted probabilistic inference exploits symmetries in a probabilistic model to allow for tractable probabilistic inference with respect to domain sizes. 
To apply lifted inference, a lifted representation has to be obtained, and to do so, the so-called colour passing algorithm is the state of the art. The colour passing algorithm, however, is bound to a specific inference algorithm and we found that it ignores commutativity of factors while constructing a lifted representation. We contribute a modified version of the colour passing algorithm that uses logical variables to construct a lifted representation independent of a specific inference algorithm while at the same time exploiting commutativity of factors during an offline-step. Our proposed algorithm efficiently detects more symmetries than the state of the art and thereby drastically increases compression, yielding significantly faster online query times for probabilistic inference when the resulting model is applied. \ No newline at end of file diff --git a/data/2024/aaai/Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators b/data/2024/aaai/Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators new file mode 100644 index 0000000000..bbd42d1744 --- /dev/null +++ b/data/2024/aaai/Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators @@ -0,0 +1 @@ +Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) emerges to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. This work studies a more practical and challenging scenario of FSSL, where data distribution is different not only across clients but also within a client between labeled and unlabeled data. To address this challenge, we propose a novel FSSL framework with dual regulators, FedDure. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with two regulators. Theoretically, we show the convergence guarantee of the dual regulators. Empirically, we demonstrate that FedDure is superior to the existing methods across a wide range of settings, notably by more than 11% on CIFAR-10 and CINIC-10 datasets. \ No newline at end of file diff --git a/data/2024/aaai/Combating Insider Threat in the Open-World Environments: Identification, Monitoring, and Data Augmentation b/data/2024/aaai/Combating Insider Threat in the Open-World Environments: Identification, Monitoring, and Data Augmentation new file mode 100644 index 0000000000..4539670c91 --- /dev/null +++ b/data/2024/aaai/Combating Insider Threat in the Open-World Environments: Identification, Monitoring, and Data Augmentation @@ -0,0 +1 @@ +Recent years have witnessed a dramatic increase in a class of security threats known as "insider threats". 
These threats occur when individuals with authorized access to an organization's network engage in harmful activities, potentially leading to the disclosure of vital information or adversely affecting the organization's systems (e.g., financial loss, system crashes, and national security challenges). Distinct from other types of terror attacks, combating insider threats exhibits several unique challenges, including (1) rarity, (2) non-separability, (3) label scarcity, (4) dynamics, and (5) heterogeneity, making them extremely difficult to identify and mitigate. We target the challenging problem of combating insider threats in open-world environments by leveraging a variety of data sources (e.g., internal system logs, employee networks, human trafficking and smuggling networks). To effectively combat these intricate threats, we introduce an interactive learning mechanism that is composed of three mutually beneficial learning modules: insider identification, insider monitoring, and data augmentation. Each module plays a crucial role in enhancing our ability to detect and mitigate insider threats, thereby contributing to a more secure and resilient organizational environment. \ No newline at end of file diff --git a/data/2024/aaai/Combinatorial CNN-Transformer Learning with Manifold Constraints for Semi-supervised Medical Image Segmentation b/data/2024/aaai/Combinatorial CNN-Transformer Learning with Manifold Constraints for Semi-supervised Medical Image Segmentation new file mode 100644 index 0000000000..fc0f826816 --- /dev/null +++ b/data/2024/aaai/Combinatorial CNN-Transformer Learning with Manifold Constraints for Semi-supervised Medical Image Segmentation @@ -0,0 +1,7 @@ +Semi-supervised learning (SSL), one of the dominant approaches, aims to leverage unlabeled data to deal with the annotation dilemma of supervised learning and has attracted much attention in medical image segmentation. +Most of the existing approaches leverage a unitary network built on convolutional neural networks (CNNs), enforcing consistency of predictions under small perturbations applied to inputs or models. +The drawbacks of such a learning paradigm are that (1) CNN-based models place severe limitations on global learning; (2) rich and diverse class-level distributions are inhibited. +In this paper, we present a novel CNN-Transformer learning framework in the manifold space for semi-supervised medical image segmentation. +First, at the intra-student level, we propose a novel class-wise consistency loss to facilitate the learning of both discriminative and compact target feature representations. +Then, at the inter-student level, we align the CNN and Transformer features using a prototype-based optimal transport method. +Extensive experiments show that our method outperforms previous state-of-the-art methods on three public medical image segmentation benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Combinatorial Stochastic-Greedy Bandit b/data/2024/aaai/Combinatorial Stochastic-Greedy Bandit new file mode 100644 index 0000000000..664dcfc4c1 --- /dev/null +++ b/data/2024/aaai/Combinatorial Stochastic-Greedy Bandit @@ -0,0 +1 @@ +We propose a novel combinatorial stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of n arms at each time step t in [T] is observed. 
SGB adopts an optimized stochastic-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms. Unlike existing methods that explore the entire set of unselected base arms during each selection step, our SGB algorithm samples only an optimized proportion of unselected arms and selects actions from this subset. We prove that our algorithm achieves a (1-1/e)-regret bound of O(n^(1/3) k^(2/3) T^(2/3) log(T)^(2/3)) for monotone stochastic submodular rewards, which outperforms the state-of-the-art in terms of the cardinality constraint k. Furthermore, we empirically evaluate the performance of our algorithm in the context of online constrained social influence maximization. Our results demonstrate that our proposed approach consistently outperforms the other algorithms, increasing the performance gap as k grows. \ No newline at end of file diff --git a/data/2024/aaai/Combining Deep Learning and Street View Imagery to Map Smallholder Crop Types b/data/2024/aaai/Combining Deep Learning and Street View Imagery to Map Smallholder Crop Types new file mode 100644 index 0000000000..68bffd2981 --- /dev/null +++ b/data/2024/aaai/Combining Deep Learning and Street View Imagery to Map Smallholder Crop Types @@ -0,0 +1,3 @@ +Accurate crop type maps are an essential source of information for monitoring yield progress at scale, projecting global crop production, and planning effective policies. To date, however, crop type maps remain challenging to create in low- and middle-income countries due to a lack of ground truth labels for training machine learning models. Field surveys are the gold standard in terms of accuracy but require an often-prohibitively large amount of time, money, and statistical capacity. +In recent years, street-level imagery, such as Google Street View, KartaView, and Mapillary, has become available around the world. Such imagery contains rich information about crop types grown at particular locations and times. +In this work, we develop an automated system to generate crop type ground references using deep learning and Google Street View imagery. The method efficiently curates a set of street-view images containing crop fields, trains a model to predict crop types using either weakly-labeled images from disparate out-of-domain sources or zero-shot labeled street view images with GPT-4V, and combines the predicted labels with remote sensing time series to create a wall-to-wall crop type map. We show that, in Thailand, the resulting country-wide map of rice, cassava, maize, and sugarcane achieves an accuracy of 93%. We publicly release the first-ever crop type map for all of Thailand 2022 at 10m-resolution with no gaps. To our knowledge, this is the first time a 10m-resolution, multi-crop map has been created for any smallholder country. As the availability of roadside imagery expands, our pipeline provides a way to map crop types at scale around the globe, especially in underserved smallholder regions. 
\ No newline at end of file diff --git a/data/2024/aaai/Combining Graph Transformers Based Multi-Label Active Learning and Informative Data Augmentation for Chest Xray Classification b/data/2024/aaai/Combining Graph Transformers Based Multi-Label Active Learning and Informative Data Augmentation for Chest Xray Classification new file mode 100644 index 0000000000..56ef24e059 --- /dev/null +++ b/data/2024/aaai/Combining Graph Transformers Based Multi-Label Active Learning and Informative Data Augmentation for Chest Xray Classification @@ -0,0 +1 @@ +Informative sample selection in active learning (AL) helps a machine learning system attain optimum performance with minimum labeled samples, thus improving human-in-the-loop computer-aided diagnosis systems with limited labeled data. Data augmentation is highly effective for enlarging datasets when labeled data are scarce. Combining informative sample selection and data augmentation should leverage their respective advantages and improve the performance of AL systems. We propose a novel approach to combine informative sample selection and data augmentation for multi-label active learning. Conventional informative sample selection approaches have mostly focused on the single-label case and do not perform optimally in the multi-label setting. We improve upon state-of-the-art multi-label active learning techniques by representing disease labels as graph nodes and using graph attention transformers (GAT) to learn more effective inter-label relationships and identify the most informative samples. We then generate transformations of these informative samples that are themselves informative. Experiments on public chest xray datasets show improved results over state-of-the-art multi-label AL techniques in terms of classification performance, learning rates, and robustness. We also perform qualitative analysis to determine the realism of generated images. \ No newline at end of file diff --git a/data/2024/aaai/Combining Machine Learning and Queueing Theory for Data-Driven Incarceration-Diversion Program Management b/data/2024/aaai/Combining Machine Learning and Queueing Theory for Data-Driven Incarceration-Diversion Program Management new file mode 100644 index 0000000000..fb9b6ca60b --- /dev/null +++ b/data/2024/aaai/Combining Machine Learning and Queueing Theory for Data-Driven Incarceration-Diversion Program Management @@ -0,0 +1 @@ +Incarceration-diversion programs have proven effective in reducing recidivism. Accurate prediction of the number of individuals with different characteristics in the program and their program outcomes based on given eligibility criteria is crucial for successful implementation, because this prediction serves as the foundation for determining the appropriate program size and the consequent staffing requirements. However, this task poses challenges due to the complexities arising from varied outcomes and lengths-of-stay for the diverse individuals in incarceration-diversion programs. In collaboration with an Illinois government agency, we develop a framework to address these issues. Our framework combines ML and queueing model simulation, providing accurate predictions for the program census and interpretable insights into program dynamics and the impact of different decisions in counterfactual scenarios. Additionally, we deploy a beta version of a user-friendly web app that allows program managers to visualize census data by counties and race groups. 
We showcase two decision support use cases: Changing program admission criteria and launching similar programs in new counties. \ No newline at end of file diff --git a/data/2024/aaai/Combining Multiple Supervision for Robust Zero-Shot Dense Retrieval b/data/2024/aaai/Combining Multiple Supervision for Robust Zero-Shot Dense Retrieval new file mode 100644 index 0000000000..624cba8da0 --- /dev/null +++ b/data/2024/aaai/Combining Multiple Supervision for Robust Zero-Shot Dense Retrieval @@ -0,0 +1,8 @@ +Recently, dense retrieval (DR) models, which represent queries and documents with fixed-width vectors and retrieve relevant ones via nearest neighbor search, have drawn increasing attention from the IR community. +However, previous studies have shown that the effectiveness of DR critically relies on sufficient training signals, which leads to severe performance degradation when applied in out-of-domain scenarios, where large-scale training data are usually unavailable. +To solve this problem, existing studies adopt a data-augmentation-plus-joint-training paradigm to construct weak/pseudo supervisions on the target domain and combine them with the large-scale human annotated data on the source domain to train the DR models. However, they don't explicitly distinguish the data and the supervision signals in the training process and simply assume that the DR models are mighty enough to capture and memorize different domain knowledge and relevance matching patterns without guidance, which, as shown in this paper, is not true. +Based on this observation, we propose a Robust Multi-Supervision Combining strategy (RMSC) that +decouples the domain and supervision signals by explicitly telling the DR models how the domain data and supervision signals are combined in the training data with specially designed soft tokens. +With the extra soft tokens to store the domain-specific and supervision-specific knowledge, RMSC allows the DR models +to conduct retrieval based on human-like relevance matching patterns and target-specific language distribution on the target domain without human annotations. +Extensive experiments on zero-shot DR benchmarks show that RMSC significantly improves the ranking performance on the target domain compared to strong DR baselines and domain adaptation methods, while being stable during training and can be combined with query generation or second-stage pre-training. \ No newline at end of file diff --git a/data/2024/aaai/Commonsense for Zero-Shot Natural Language Video Localization b/data/2024/aaai/Commonsense for Zero-Shot Natural Language Video Localization new file mode 100644 index 0000000000..875a479c0d --- /dev/null +++ b/data/2024/aaai/Commonsense for Zero-Shot Natural Language Video Localization @@ -0,0 +1 @@ +Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. 
CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL. \ No newline at end of file diff --git a/data/2024/aaai/Communication Efficient Distributed Newton Method over Unreliable Networks b/data/2024/aaai/Communication Efficient Distributed Newton Method over Unreliable Networks new file mode 100644 index 0000000000..a57aac48da --- /dev/null +++ b/data/2024/aaai/Communication Efficient Distributed Newton Method over Unreliable Networks @@ -0,0 +1 @@ +Distributed optimization on resource-constrained devices demands both communication efficiency and fast convergence rates. Newton-type methods are becoming preferable due to their superior convergence rates compared to first-order methods. In this paper, we study a new problem concerning second-order distributed optimization over unreliable networks. The working devices are power-limited or operate in unfavorable wireless channels, experiencing packet losses during their uplink transmission to the server. Our scenario is very common in the real world and leads to instability of classical distributed optimization methods, especially second-order methods, because of their sensitivity to imprecision in the local Hessian matrices. To achieve robustness to high packet loss, communication efficiency, and fast convergence rates, we propose a novel distributed second-order method, called RED-New (Packet loss Resilient Distributed Approximate Newton). Each iteration of RED-New comprises two rounds of lightweight and lossy transmissions, in which the server aggregates the local information with a newly developed scaling strategy. We prove the linear-quadratic convergence rate of RED-New. Experimental results demonstrate its advantage over first-order and second-order baselines, and its tolerance to packet loss rates ranging from 5% to 40%. \ No newline at end of file diff --git a/data/2024/aaai/Communication-Efficient Collaborative Regret Minimization in Multi-Armed Bandits b/data/2024/aaai/Communication-Efficient Collaborative Regret Minimization in Multi-Armed Bandits new file mode 100644 index 0000000000..2bf75895f9 --- /dev/null +++ b/data/2024/aaai/Communication-Efficient Collaborative Regret Minimization in Multi-Armed Bandits @@ -0,0 +1 @@ +In this paper, we study the collaborative learning model, which concerns the tradeoff between parallelism and communication overhead in multi-agent multi-armed bandits. For regret minimization in multi-armed bandits, we present the first set of tradeoffs between the number of rounds of communication between the agents and the regret of the collaborative learning process. 
\ No newline at end of file diff --git a/data/2024/aaai/Compact HD Map Construction via Douglas-Peucker Point Transformer b/data/2024/aaai/Compact HD Map Construction via Douglas-Peucker Point Transformer new file mode 100644 index 0000000000..741cc8e75a --- /dev/null +++ b/data/2024/aaai/Compact HD Map Construction via Douglas-Peucker Point Transformer @@ -0,0 +1 @@ +High-definition (HD) map construction requires a comprehensive understanding of traffic environments, encompassing centimeter-level localization and rich semantic information. Previous works face challenges from redundant point representations or high-complexity curve modeling. In this paper, we present a flexible yet effective map element detector that synthesizes hierarchical information with a compact Douglas-Peucker (DP) point representation in a transformer architecture for robust and reliable predictions. Specifically, our proposed representation approximates class-agnostic map elements with DP points, which are sparsely located at crucial positions of the structures and avoid redundancy and complexity. In addition, we design a position constraint with uncertainty to avoid potential ambiguities. Moreover, pairwise-point shape matching constraints are proposed to balance local structural information of different scales. Experiments on the public nuScenes dataset demonstrate that our method outperforms current state-of-the-art methods. Extensive ablation studies validate each component of our method. Codes will be released at https://github.com/sweety121/DPFormer. \ No newline at end of file diff --git a/data/2024/aaai/Complementary Knowledge Distillation for Robust and Privacy-Preserving Model Serving in Vertical Federated Learning b/data/2024/aaai/Complementary Knowledge Distillation for Robust and Privacy-Preserving Model Serving in Vertical Federated Learning new file mode 100644 index 0000000000..03e8e4e22e --- /dev/null +++ b/data/2024/aaai/Complementary Knowledge Distillation for Robust and Privacy-Preserving Model Serving in Vertical Federated Learning @@ -0,0 +1 @@ +Vertical Federated Learning (VFL) enables an active party with labeled data to enhance model performance (utility) by collaborating with multiple passive parties that possess auxiliary features corresponding to the same sample identifiers (IDs). Model serving in VFL is vital for real-world, delay-sensitive applications, and it faces two major challenges: 1) robustness against arbitrarily-aligned data and stragglers; and 2) privacy protection, ensuring minimal label leakage to passive parties. Existing methods fail to transfer knowledge among parties to improve robustness in a privacy-preserving way. In this paper, we introduce a privacy-preserving knowledge transfer framework, Complementary Knowledge Distillation (CKD), designed to enhance the robustness and privacy of multi-party VFL systems. Specifically, we formulate a Complementary Label Coding (CLC) objective to encode only complementary label information of the active party's local model for passive parties to learn. Then, CKD selectively transfers the CLC-encoded complementary knowledge 1) from the passive parties to the active party, and 2) among the passive parties themselves. Experimental results on four real-world datasets demonstrate that CKD outperforms existing approaches in terms of robustness against arbitrarily-aligned data, while also minimizing label privacy leakage. 
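(Context for the Douglas-Peucker (DP) point representation in the DPFormer abstract above: DP is the classic polyline-simplification algorithm that keeps a curve's endpoints and recursively retains only the points that deviate from the current chord by more than a tolerance. The NumPy sketch below illustrates that classic algorithm only; it is not the paper's learned DP-style representation, and the function name and epsilon value are assumptions.)

```python
import numpy as np

def douglas_peucker(points, epsilon):
    """Classic Douglas-Peucker simplification of a 2D polyline: keep the two
    endpoints, find the interior point farthest from the chord joining them,
    and recurse only where that distance exceeds the tolerance epsilon."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    if norm == 0.0:
        # degenerate chord (coincident endpoints): use distance from the endpoint
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # perpendicular distance of every point to the start-end chord
        dists = np.abs(chord[0] * (points[:, 1] - start[1])
                       - chord[1] * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        left = douglas_peucker(points[:idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return np.vstack([left[:-1], right])  # drop the duplicated split point
    return np.vstack([start, end])

# Example: a gently curved lane boundary collapses to a handful of key points.
curve = np.array([[x, 0.02 * x ** 2] for x in np.linspace(0.0, 10.0, 50)])
print(douglas_peucker(curve, epsilon=0.1))
```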
\ No newline at end of file diff --git a/data/2024/aaai/Complete Neural Networks for Complete Euclidean Graphs b/data/2024/aaai/Complete Neural Networks for Complete Euclidean Graphs new file mode 100644 index 0000000000..5d81c867a1 --- /dev/null +++ b/data/2024/aaai/Complete Neural Networks for Complete Euclidean Graphs @@ -0,0 +1 @@ +Neural networks for point clouds, which respect their natural invariance to permutation and rigid motion, have enjoyed recent success in modeling geometric phenomena, from molecular dynamics to recommender systems. Yet, to date, no architecture with polynomial complexity is known to be complete, that is, able to distinguish between any pair of non-isomorphic point clouds. We fill this theoretical gap by showing that point clouds can be completely determined, up to permutation and rigid motion, by applying the 3-WL graph isomorphism test to the point cloud's centralized Gram matrix. Moreover, we formulate an Euclidean variant of the 2-WL test and show that it is also sufficient to achieve completeness. We then show how our complete Euclidean WL tests can be simulated by an Euclidean graph neural network of moderate size and demonstrate their separation capability on highly symmetrical point clouds. \ No newline at end of file diff --git a/data/2024/aaai/Completing Priceable Committees: Utilitarian and Representation Guarantees for Proportional Multiwinner Voting b/data/2024/aaai/Completing Priceable Committees: Utilitarian and Representation Guarantees for Proportional Multiwinner Voting new file mode 100644 index 0000000000..2a4d68ea50 --- /dev/null +++ b/data/2024/aaai/Completing Priceable Committees: Utilitarian and Representation Guarantees for Proportional Multiwinner Voting @@ -0,0 +1 @@ +When selecting committees based on preferences of voters, a variety of different criteria can be considered. Two natural objectives are maximizing the utilitarian welfare (the sum of voters' utilities) and coverage (the number of represented voters) of the selected committee. Previous work has studied the impact on utilitarian welfare and coverage when requiring the committee to satisfy minimal requirements such as justified representation or weak proportionality. In this paper, we consider the impact of imposing much more demanding proportionality axioms. We identify a class of voting rules that achieve strong guarantees on utilitarian welfare and coverage when combined with appropriate completions. This class is defined via a weakening of priceability and contains prominent rules such as the Method of Equal Shares. We show that committees selected by these rules (i) can be completed to achieve optimal coverage and (ii) can be completed to achieve an asymptotically optimal approximation to the utilitarian welfare if they additionally satisfy EJR+. Answering an open question of Elkind et al. (2022), we use the Greedy Justified Candidate Rule to obtain the best possible utilitarian guarantee subject to proportionality. We also consider completion methods suggested in the participatory budgeting literature and other objectives besides welfare and coverage. 
\ No newline at end of file diff --git a/data/2024/aaai/Complexity of Credulous and Skeptical Acceptance in Epistemic Argumentation Framework b/data/2024/aaai/Complexity of Credulous and Skeptical Acceptance in Epistemic Argumentation Framework new file mode 100644 index 0000000000..dffcbd1d9b --- /dev/null +++ b/data/2024/aaai/Complexity of Credulous and Skeptical Acceptance in Epistemic Argumentation Framework @@ -0,0 +1 @@ +Dung’s Argumentation Framework (AF) has been extended in several directions. Among the numerous proposed extensions, three are of particular interest and closely related to one another. These extensions are: constrained AF (CAF), where AF is augmented with (strong) constraints; epistemic AF (EAF), where AF is augmented with epistemic constraints; and incomplete AF (iAF), where arguments and attacks can be uncertain. While the complexity and expressiveness of CAF and iAF have been studied, that of EAF has not been explored so far. In this paper we investigate the complexity and expressiveness of EAF. To this end, we first introduce the Labeled CAF (LCAF), a variation of CAF where constraints are defined over the alphabet of labeled arguments. Then, we investigate the complexity of credulous and skeptical reasoning and show that: i) EAF is more expressive than iAF (under preferred semantics), ii) although LCAF is a restriction of EAF where modal operators are not allowed, these frameworks have the same complexity, and iii) the results for LCAF close a gap in the characterization of the complexity of CAF. Interestingly, even though EAF has the same complexity as LCAF, it allows modeling domain knowledge in a more natural and easy-to-understand way. \ No newline at end of file diff --git a/data/2024/aaai/Component Fourier Neural Operator for Singularly Perturbed Differential Equations b/data/2024/aaai/Component Fourier Neural Operator for Singularly Perturbed Differential Equations new file mode 100644 index 0000000000..0ededecae9 --- /dev/null +++ b/data/2024/aaai/Component Fourier Neural Operator for Singularly Perturbed Differential Equations @@ -0,0 +1 @@ +Solving Singularly Perturbed Differential Equations (SPDEs) poses computational challenges arising from the rapid transitions in their solutions within thin regions. The effectiveness of deep learning in addressing differential equations motivates us to employ these methods for solving SPDEs. In this paper, we introduce the Component Fourier Neural Operator (ComFNO), an innovative operator learning method that builds upon the Fourier Neural Operator (FNO) while incorporating valuable prior knowledge obtained from asymptotic analysis. Our approach is not limited to FNO and can be applied to other neural network frameworks, such as the Deep Operator Network (DeepONet), potentially yielding similar solvers for SPDEs. Experimental results across diverse classes of SPDEs demonstrate that ComFNO significantly improves accuracy compared to vanilla FNO. Furthermore, ComFNO exhibits natural adaptability to diverse data distributions and performs well in few-shot scenarios, showcasing its excellent generalization ability in practical situations. 
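(Background for the ComFNO abstract above, which builds on the Fourier Neural Operator: the core of a standard FNO layer is a spectral convolution that moves the input to Fourier space, keeps a fixed number of low-frequency modes, multiplies them by learned complex weights, and transforms back, alongside a pointwise linear path. The PyTorch sketch below shows only such a vanilla 1D layer; it is not the authors' ComFNO, which additionally injects priors from asymptotic analysis, and the class and parameter names are assumptions.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralConv1d(nn.Module):
    """Spectral convolution of a vanilla 1D FNO layer: FFT the input, keep a
    fixed number of low-frequency modes, multiply them by learned complex
    weights, and inverse-FFT back onto the spatial grid."""
    def __init__(self, in_channels, out_channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (in_channels * out_channels)
        self.weights = nn.Parameter(
            scale * torch.randn(in_channels, out_channels, modes, dtype=torch.cfloat))

    def forward(self, x):                         # x: (batch, in_channels, grid)
        x_ft = torch.fft.rfft(x)                  # (batch, in_channels, grid//2 + 1)
        m = min(self.modes, x_ft.size(-1))
        out_ft = torch.zeros(x.size(0), self.weights.size(1), x_ft.size(-1),
                             dtype=torch.cfloat, device=x.device)
        # channel mixing, applied independently to each retained Fourier mode
        out_ft[:, :, :m] = torch.einsum("bim,iom->bom",
                                        x_ft[:, :, :m], self.weights[:, :, :m])
        return torch.fft.irfft(out_ft, n=x.size(-1))

class FNOBlock(nn.Module):
    """One FNO block: spectral path plus a pointwise (1x1 conv) linear path."""
    def __init__(self, channels, modes):
        super().__init__()
        self.spectral = SpectralConv1d(channels, channels, modes)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        return F.gelu(self.spectral(x) + self.pointwise(x))

# Example: a batch of 4 functions sampled on a 128-point grid with 32 channels.
y = FNOBlock(channels=32, modes=16)(torch.randn(4, 32, 128))
```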
\ No newline at end of file diff --git a/data/2024/aaai/Composing Biases by Using CP to Decompose Minimal Functional Dependencies for Acquiring Complex Formulae b/data/2024/aaai/Composing Biases by Using CP to Decompose Minimal Functional Dependencies for Acquiring Complex Formulae new file mode 100644 index 0000000000..0b6da97521 --- /dev/null +++ b/data/2024/aaai/Composing Biases by Using CP to Decompose Minimal Functional Dependencies for Acquiring Complex Formulae @@ -0,0 +1 @@ +Given a table with a minimal set of input columns that functionally determines an output column, we introduce a method that tries to gradually decompose the corresponding minimal functional dependency (mfd) to acquire a formula expressing the output column in terms of the input columns. A first key element of the method is to create sub-problems that are easier to solve than the original formula acquisition problem, either because it learns formulae with fewer inputs parameters, or as it focuses on formulae of a particular class, such as Boolean formulae; as a result, the acquired formulae can mix different learning biases such as polynomials, conditionals or Boolean expressions. A second key feature of the method is that it can be applied recursively to find formulae that combine polynomial, conditional or Boolean sub-terms in a nested manner. The method was tested on data for eight families of combinatorial objects; new conjectures were found that were previously unattainable. The method often creates conjectures that combine several formulae into one with a limited number of automatically found Boolean terms. \ No newline at end of file diff --git a/data/2024/aaai/Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions b/data/2024/aaai/Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions new file mode 100644 index 0000000000..14acd3ade3 --- /dev/null +++ b/data/2024/aaai/Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions @@ -0,0 +1 @@ +Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for ‘numbats.’ Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., “numbat digging in the ground.” In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of “difficult-to-name but easy-to-draw” objects and text describing “difficult-to-sketch but easy-to-verbalize” object's attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of ~2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNet (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. 
Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at: https://vl2g.github.io/projects/cstbir. \ No newline at end of file diff --git a/data/2024/aaai/Compositional Generalization for Multi-Label Text Classification: A Data-Augmentation Approach b/data/2024/aaai/Compositional Generalization for Multi-Label Text Classification: A Data-Augmentation Approach new file mode 100644 index 0000000000..50791c2fb2 --- /dev/null +++ b/data/2024/aaai/Compositional Generalization for Multi-Label Text Classification: A Data-Augmentation Approach @@ -0,0 +1 @@ +Despite significant advancements in multi-label text classification, the ability of existing models to generalize to novel and seldom-encountered complex concepts, which are compositions of elementary ones, remains underexplored. This research addresses this gap. By creating unique data splits across three benchmarks, we assess the compositional generalization ability of existing multi-label text classification models. Our results show that these models often fail to generalize to compositional concepts encountered infrequently during training, leading to inferior performance on tests with these new combinations. To address this, we introduce a data augmentation method that leverages two innovative text generation models designed to enhance the classification models' capacity for compositional generalization. Our experiments show that this data augmentation approach significantly improves the compositional generalization capabilities of classification models on our benchmarks, with both generation models surpassing other text generation baselines. Our codes available at https://github.com/yychai74/LD-VAE. \ No newline at end of file diff --git a/data/2024/aaai/Compositional Inversion for Stable Diffusion Models b/data/2024/aaai/Compositional Inversion for Stable Diffusion Models new file mode 100644 index 0000000000..98881cb2ab --- /dev/null +++ b/data/2024/aaai/Compositional Inversion for Stable Diffusion Models @@ -0,0 +1 @@ +Inversion methods, such as Textual Inversion, generate personalized images by incorporating concepts of interest provided by user images. However, existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space. To address this issue, we propose a method that guides the inversion process towards the core distribution for compositional embeddings. Additionally, we introduce a spatial regularization approach to balance the attention on the concepts being composed. Our method is designed as a post-training approach and can be seamlessly integrated with other inversion methods. Experimental results demonstrate the effectiveness of our proposed approach in mitigating the overfitting problem and generating more diverse and balanced compositions of concepts in the synthesized images. The source code is available at https://github.com/zhangxulu1996/Compositional-Inversion. 
\ No newline at end of file diff --git a/data/2024/aaai/Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models b/data/2024/aaai/Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models new file mode 100644 index 0000000000..a9d53fc4e2 --- /dev/null +++ b/data/2024/aaai/Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models @@ -0,0 +1 @@ +Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, they fail to semantically align the generated images with the prompts due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, a unique mask control is applied to the cross- and self-attention maps. Our approach produces a more semantically accurate synthesis by constraining the attention regions of each token in the prompt to the image. In addition, the proposed method is straightforward and effective and can be readily integrated into existing cross-attention-based T2I generators. We compare our approach to competing methods and demonstrate that it can faithfully convey the semantics of the original text to the generated content and achieve high availability as a ready-to-use plugin. Please refer to https://github.com/OPPO-Mente-Lab/attention-mask-control. \ No newline at end of file diff --git a/data/2024/aaai/Compound Text-Guided Prompt Tuning via Image-Adaptive Cues b/data/2024/aaai/Compound Text-Guided Prompt Tuning via Image-Adaptive Cues new file mode 100644 index 0000000000..561eacf9b0 --- /dev/null +++ b/data/2024/aaai/Compound Text-Guided Prompt Tuning via Image-Adaptive Cues @@ -0,0 +1 @@ +Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable generalization capabilities to downstream tasks. However, existing prompt-tuning-based frameworks need to parallelize learnable textual inputs for all categories, suffering from massive GPU memory consumption when there is a large number of categories in the target dataset. Moreover, previous works require category names to be included within prompts, exhibiting subpar performance when dealing with ambiguous category names. To address these shortcomings, we propose Compound Text-Guided Prompt Tuning (TGP-T), which significantly reduces resource demand while achieving superior performance. We introduce text supervision to the optimization of prompts, which enables two benefits: 1) releasing the model's reliance on pre-defined category names during inference, thereby enabling more flexible prompt generation; and 2) reducing the number of inputs to the text encoder, which decreases GPU memory consumption significantly. Specifically, we found that compound text supervision, i.e., category-wise and content-wise, is highly effective, since the two forms provide inter-class separability and capture intra-class variations, respectively. Moreover, we condition the prompt generation on visual features through a module called Bonder, which facilitates the alignment between prompts and visual features. 
Extensive experiments on few-shot recognition and domain generalization demonstrate that TGP-T achieves superior performance with consistently lower training costs. It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet. The code is available at https://github.com/EricTan7/TGP-T. \ No newline at end of file diff --git a/data/2024/aaai/Comprehensive View Embedding Learning for Single-Cell Multimodal Integration b/data/2024/aaai/Comprehensive View Embedding Learning for Single-Cell Multimodal Integration new file mode 100644 index 0000000000..0ea1d5dd72 --- /dev/null +++ b/data/2024/aaai/Comprehensive View Embedding Learning for Single-Cell Multimodal Integration @@ -0,0 +1 @@ +Motivation: Advances in single-cell measurement techniques provide rich multimodal data, which helps us to explore the life state of cells more deeply. However, multimodal integration, i.e., learning joint embeddings from multimodal data, remains a challenge. The difficulty in integrating unpaired single-cell multimodal data is that different modalities have different feature spaces, which easily leads to information loss in joint embedding. Moreover, few existing methods have fully exploited and fused the information in single-cell multimodal data. Result: In this study, we propose CoVEL, a deep learning method for unsupervised integration of single-cell multimodal data. CoVEL learns single-cell representations from a comprehensive view, including regulatory relationships between modalities, fine-grained representations of cells, and relationships between different cells. The comprehensive view embedding enables CoVEL to remove the gap between modalities while protecting biological heterogeneity. Experimental results on multiple public datasets show that CoVEL is accurate and robust for single-cell multimodal integration. Data availability: https://github.com/shapsider/scintegration. \ No newline at end of file diff --git a/data/2024/aaai/Comprehensive Visual Grounding for Video Description b/data/2024/aaai/Comprehensive Visual Grounding for Video Description new file mode 100644 index 0000000000..572b5f1649 --- /dev/null +++ b/data/2024/aaai/Comprehensive Visual Grounding for Video Description @@ -0,0 +1 @@ +The grounding accuracy of existing video captioners still falls short of expectations. The majority of existing methods perform grounded video captioning on sparse entity annotations, and the captioning accuracy often suffers from degenerated object appearances in the annotated area, such as motion blur and video defocus. Moreover, these methods seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network to improve video captioning, by explicitly linking the entities and actions to the visual clues across the video frames. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames, even though the entity is annotated in only one frame of a video. The action grounding dynamically associates the verbs with related subjects and the corresponding context, which keeps fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by a soft grounding supervision, which brings architecture simplification and improves training efficiency as well. 
We conduct extensive experiments on two challenging datasets, and demonstrate significant performance improvements of +2.3 CIDEr on ActivityNet-Entities and +2.2 CIDEr on MSR-VTT compared to state-of-the-arts. \ No newline at end of file diff --git a/data/2024/aaai/Compressing Image-to-Image Translation GANs Using Local Density Structures on Their Learned Manifold b/data/2024/aaai/Compressing Image-to-Image Translation GANs Using Local Density Structures on Their Learned Manifold new file mode 100644 index 0000000000..242b07f031 --- /dev/null +++ b/data/2024/aaai/Compressing Image-to-Image Translation GANs Using Local Density Structures on Their Learned Manifold @@ -0,0 +1 @@ +Generative Adversarial Networks (GANs) have shown remarkable success in modeling complex data distributions for image-to-image translation. Still, their high computational demands prohibit their deployment in practical scenarios like edge devices. Existing GAN compression methods mainly rely on knowledge distillation or convolutional classifiers' pruning techniques. Thus, they neglect the critical characteristic of GANs: their local density structure over their learned manifold. Accordingly, we approach GAN compression from a new perspective by explicitly encouraging the pruned model to preserve the density structure of the original parameter-heavy model on its learned manifold. We facilitate this objective for the pruned model by partitioning the learned manifold of the original generator into local neighborhoods around its generated samples. Then, we propose a novel pruning objective to regularize the pruned model to preserve the local density structure over each neighborhood, resembling the kernel density estimation method. Also, we develop a collaborative pruning scheme in which the discriminator and generator are pruned by two pruning agents. We design the agents to capture interactions between the generator and discriminator by exchanging their peer's feedback when determining corresponding models' architectures. Thanks to such a design, our pruning method can efficiently find performant sub-networks and can maintain the balance between the generator and discriminator more effectively compared to baselines during pruning, thereby showing more stable pruning dynamics. Our experiments on image translation GAN models, Pix2Pix and CycleGAN, with various benchmark datasets and architectures demonstrate our method's effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Computing the Why-Provenance for Datalog Queries via SAT Solvers b/data/2024/aaai/Computing the Why-Provenance for Datalog Queries via SAT Solvers new file mode 100644 index 0000000000..4b6ea726a7 --- /dev/null +++ b/data/2024/aaai/Computing the Why-Provenance for Datalog Queries via SAT Solvers @@ -0,0 +1 @@ +Explaining an answer to a Datalog query is an essential task towards Explainable AI, especially nowadays where Datalog plays a critical role in the development of ontology-based applications. A well-established approach for explaining a query answer is the so-called why-provenance, which essentially collects all the subsets of the input database that can be used to obtain that answer via some derivation process, typically represented as a proof tree. It is well known, however, that computing the why-provenance for Datalog queries is computationally expensive, and thus, very few attempts can be found in the literature. 
The goal of this work is to demonstrate how off-the-shelf SAT solvers can be exploited towards an efficient computation of the why-provenance for Datalog queries. Interestingly, our SAT-based approach allows us to build the why-provenance in an incremental fashion, that is, one explanation at a time, which is much more useful in a practical context than the one-shot computation of the whole set of explanations as done by existing approaches. \ No newline at end of file diff --git a/data/2024/aaai/ConSequence: Synthesizing Logically Constrained Sequences for Electronic Health Record Generation b/data/2024/aaai/ConSequence: Synthesizing Logically Constrained Sequences for Electronic Health Record Generation new file mode 100644 index 0000000000..eadfa2723e --- /dev/null +++ b/data/2024/aaai/ConSequence: Synthesizing Logically Constrained Sequences for Electronic Health Record Generation @@ -0,0 +1 @@ +Generative models can produce synthetic patient records for analytical tasks when real data is unavailable or limited. However, current methods struggle with adhering to domain-specific knowledge and removing invalid data. We present ConSequence, an effective approach to integrating domain knowledge into sequential generative neural network outputs. Our rule-based formulation includes temporal aggregation and antecedent evaluation modules, ensured by an efficient matrix multiplication formulation, to satisfy hard and soft logical constraints across time steps. Existing constraint methods often fail to guarantee constraint satisfaction, lack the ability to handle temporal constraints, and hinder the learning and computational efficiency of the model. In contrast, our approach efficiently handles all types of constraints with guaranteed logical coherence. We demonstrate ConSequence's effectiveness in generating electronic health records, outperforming competitors in achieving complete temporal and spatial constraint satisfaction without compromising runtime performance or generative quality. Specifically, ConSequence successfully prevents all rule violations while improving the model quality in reducing its test perplexity by 5% and incurring less than a 13% slowdown in generation speed compared to an unconstrained model. \ No newline at end of file diff --git a/data/2024/aaai/ConVQG: Contrastive Visual Question Generation with Multimodal Guidance b/data/2024/aaai/ConVQG: Contrastive Visual Question Generation with Multimodal Guidance new file mode 100644 index 0000000000..53896d6443 --- /dev/null +++ b/data/2024/aaai/ConVQG: Contrastive Visual Question Generation with Multimodal Guidance @@ -0,0 +1 @@ +Asking questions about visual environments is a crucial way for intelligent agents to understand rich multi-faceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded to the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or leverage external commonsense knowledge that can not be obtained from the image content only. However, generating focused questions using textual constraints while enforcing a high relevance to the image content remains a challenge, as VQG systems often ignore one or both forms of grounding. 
In this work, we propose Contrastive Visual Question Generation (ConVQG), a method using a dual contrastive objective to discriminate questions generated using both modalities from those based on a single one. Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions. Our human evaluation results also show preference for ConVQG questions compared to non-contrastive baselines. \ No newline at end of file diff --git a/data/2024/aaai/ConcaveQ: Non-monotonic Value Function Factorization via Concave Representations in Deep Multi-Agent Reinforcement Learning b/data/2024/aaai/ConcaveQ: Non-monotonic Value Function Factorization via Concave Representations in Deep Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..30af9f30df --- /dev/null +++ b/data/2024/aaai/ConcaveQ: Non-monotonic Value Function Factorization via Concave Representations in Deep Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Value function factorization has achieved great success in multi-agent reinforcement learning by optimizing joint action-value functions through the maximization of factorized per-agent utilities. To ensure Individual-Global-Maximum property, existing works often focus on value factorization using monotonic functions, which are known to result in restricted representation expressiveness. In this paper, we analyze the limitations of monotonic factorization and present ConcaveQ, a novel non-monotonic value function factorization approach that goes beyond monotonic mixing functions and employs neural network representations of concave mixing functions. Leveraging the concave property in factorization, an iterative action selection scheme is developed to obtain optimal joint actions during training. It is used to update agents’ local policy networks, enabling fully decentralized execution. The effectiveness of the proposed ConcaveQ is validated across scenarios involving multi-agent predator-prey environment and StarCraft II micromanagement tasks. Empirical results exhibit significant improvement of ConcaveQ over state-of-the-art multi-agent reinforcement learning approaches. \ No newline at end of file diff --git a/data/2024/aaai/Concealing Sensitive Samples against Gradient Leakage in Federated Learning b/data/2024/aaai/Concealing Sensitive Samples against Gradient Leakage in Federated Learning new file mode 100644 index 0000000000..11e4c050b5 --- /dev/null +++ b/data/2024/aaai/Concealing Sensitive Samples against Gradient Leakage in Federated Learning @@ -0,0 +1,2 @@ +Federated Learning (FL) is a distributed learning paradigm that enhances users' privacy by eliminating the need for clients to share raw, private data with the server. +Despite the success, recent studies expose the vulnerability of FL to model inversion attacks, where adversaries reconstruct users’ private data via eavesdropping on the shared gradient information. We hypothesize that a key factor in the success of such attacks is the low entanglement among gradients per data within the batch during stochastic optimization. This creates a vulnerability that an adversary can exploit to reconstruct the sensitive data. Building upon this insight, we present a simple, yet effective defense strategy that obfuscates the gradients of the sensitive data with concealed samples. 
To achieve this, we propose synthesizing concealed samples to mimic the sensitive data at the gradient level while ensuring their visual dissimilarity from the actual sensitive data. Compared to the previous art, our empirical evaluations suggest that the proposed technique provides the strongest protection while simultaneously maintaining the FL performance. Code is located at https://github.com/JingWu321/DCS-2. \ No newline at end of file diff --git a/data/2024/aaai/Concept-Guided Prompt Learning for Generalization in Vision-Language Models b/data/2024/aaai/Concept-Guided Prompt Learning for Generalization in Vision-Language Models new file mode 100644 index 0000000000..6004374a16 --- /dev/null +++ b/data/2024/aaai/Concept-Guided Prompt Learning for Generalization in Vision-Language Models @@ -0,0 +1,11 @@ +The Contrastive Language-Image Pretraining (CLIP) model has exhibited remarkable efficacy in establishing cross-modal connections between texts and images, yielding impressive +performance across a broad spectrum of downstream applications through fine-tuning. However, for generalization tasks, the current fine-tuning methods for CLIP, such as CoOp and +CoCoOp, demonstrate relatively low performance on some fine-grained datasets. We recognize that the underlying reason is that these previous methods only projected global features +into the prompt, neglecting the various visual concepts, such as colors, shapes, and sizes, which are naturally transferable +across domains and play a crucial role in generalization tasks. To address this issue, in this work, we propose +Concept-Guided Prompt Learning (CPL) for vision-language models. Specifically, we leverage the well-learned knowledge +of CLIP to create a visual concept cache to enable concept-guided prompting. In order to refine the text features, we further +develop a projector that transforms multi-level visual features into text features. We observe that this concept-guided +prompt learning approach is able to achieve enhanced consistency between visual and linguistic modalities. Extensive +experimental results demonstrate that our CPL method significantly improves generalization capabilities compared to +the current state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models b/data/2024/aaai/ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models new file mode 100644 index 0000000000..6bcbe4d198 --- /dev/null +++ b/data/2024/aaai/ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models @@ -0,0 +1 @@ +The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-definition and realistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models in learning and synthesizing novel visual concepts (a.k.a. personalized T2I), we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts.
Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in target images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving the compositionality, which existing approaches struggle to overcome. The data, code, and interactive demo are available at: https://conceptbed.github.io/ \ No newline at end of file diff --git a/data/2024/aaai/ConditionVideo: Training-Free Condition-Guided Video Generation b/data/2024/aaai/ConditionVideo: Training-Free Condition-Guided Video Generation new file mode 100644 index 0000000000..13c5f4b766 --- /dev/null +++ b/data/2024/aaai/ConditionVideo: Training-Free Condition-Guided Video Generation @@ -0,0 +1 @@ +Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and requiring a large amount of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D ControlNet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method exhibits superior performance in terms of frame consistency, CLIP score, and conditional accuracy, outperforming other compared methods. \ No newline at end of file diff --git a/data/2024/aaai/Conditional Backdoor Attack via JPEG Compression b/data/2024/aaai/Conditional Backdoor Attack via JPEG Compression new file mode 100644 index 0000000000..fd445dde13 --- /dev/null +++ b/data/2024/aaai/Conditional Backdoor Attack via JPEG Compression @@ -0,0 +1 @@ +Deep neural network (DNN) models have been proven vulnerable to backdoor attacks. One trend of backdoor attacks is developing more invisible and dynamic triggers to make attacks stealthier. However, these invisible and dynamic triggers can be inadvertently mitigated by some widely used passive denoising operations, such as image compression, making the efforts under this trend questionable. Another trend is to exploit the full potential of backdoor attacks by proposing new triggering paradigms, such as hibernated or opportunistic backdoors. In line with these trends, our work investigates the first conditional backdoor attack, where the backdoor is activated by a specific condition rather than pre-defined triggers.
Specifically, we take the JPEG compression as our condition and jointly optimize the compression operator and the target model's loss function, which can force the target model to accurately learn the JPEG compression behavior as the triggering condition. In this case, besides the conditional triggering feature, our attack is also stealthy and robust to denoising operations. Extensive experiments on the MNIST, GTSRB, and CelebA datasets verify our attack's effectiveness, stealthiness, and resistance to existing backdoor defenses and denoising operations. As a new triggering paradigm, the conditional backdoor attack brings a new angle for assessing the vulnerability of DNN models, and conditioning on JPEG compression magnifies its threat due to the universal usage of JPEG. \ No newline at end of file diff --git a/data/2024/aaai/Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment b/data/2024/aaai/Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment new file mode 100644 index 0000000000..ebbf000317 --- /dev/null +++ b/data/2024/aaai/Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment @@ -0,0 +1 @@ +Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multi-modal task, there exists an inherent modality gap between sign language videos and spoken language text, which makes the cross-modal alignment between visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to help alleviate the cross-modal problem, thereby neglecting the alignment across modalities, which may lead to compromised results. To address this issue, we propose a novel framework based on Conditional Variational autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, our CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences to regularize the outputs of the encoder and decoder, respectively. In the prior path, the model solely relies on visual information to predict the target text; whereas in the posterior path, it simultaneously encodes visual information and textual knowledge to reconstruct the target text. The first KL divergence optimizes the conditional variational autoencoder and regularizes the encoder outputs, while the second KL divergence performs a self-distillation from the posterior path to the prior path, ensuring the consistency of decoder outputs. We further enhance the integration of textual information into the posterior path by employing a shared Attention Residual Gaussian Distribution (ARGD), which considers the textual information in the posterior path as a residual component relative to the prior path. Extensive experiments conducted on public datasets demonstrate the effectiveness of our framework, achieving new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy. The code and models are available at https://github.com/rzhao-zhsq/CV-SLT.
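As a rough illustration of the two-path objective described in the CV-SLT abstract above, the sketch below (not the authors' code; module names, tensor shapes, and the equal loss weighting are assumptions) pairs a reconstruction term on the posterior path with one KL between the latent distributions of the two paths and one KL that distills the posterior decoder distribution into the prior path.

```python
# Hedged sketch of a two-path CVAE-style SLT objective; illustrative only.
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, averaged over the batch.
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                  - 1.0).sum(-1).mean()

def two_path_loss(prior_mu, prior_logvar, post_mu, post_logvar,
                  prior_logits, post_logits, target_ids, pad_id=0):
    # Reconstruction: the posterior path (video + text) reconstructs the target text.
    recon = F.cross_entropy(post_logits.flatten(0, 1), target_ids.flatten(),
                            ignore_index=pad_id)
    # KL 1: regularize the text-aware posterior latent toward the video-only prior latent.
    kl_latent = gaussian_kl(post_mu, post_logvar, prior_mu, prior_logvar)
    # KL 2: self-distillation, pushing prior-path decoder outputs toward posterior-path outputs.
    kl_decoder = F.kl_div(F.log_softmax(prior_logits, dim=-1),
                          F.softmax(post_logits.detach(), dim=-1),
                          reduction="batchmean")
    return recon + kl_latent + kl_decoder  # weights between the terms are assumed equal here
```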
\ No newline at end of file diff --git a/data/2024/aaai/Confidence Is All You Need for MI Attacks (Student Abstract) b/data/2024/aaai/Confidence Is All You Need for MI Attacks (Student Abstract) new file mode 100644 index 0000000000..177c63fcc7 --- /dev/null +++ b/data/2024/aaai/Confidence Is All You Need for MI Attacks (Student Abstract) @@ -0,0 +1 @@ +In this evolving era of machine learning security, membership inference attacks have emerged as a potent threat to the confidentiality of sensitive data. In this attack, adversaries aim to determine whether a particular point was used during the training of a target model. This paper proposes a new method to gauge a data point’s membership in a model’s training set. Instead of correlating loss with membership, as is traditionally done, we have leveraged the fact that training examples generally exhibit higher confidence values when classified into their actual class. During training, the model is essentially being 'fit' to the training data and might face particular difficulties in generalization to unseen data. This asymmetry leads to the model achieving higher confidence on the training data as it exploits the specific patterns and noise present in the training data. Our proposed approach leverages the confidence values generated by the machine-learning model. These confidence values provide a probabilistic measure of the model’s certainty in its predictions and can further be used to infer the membership of a given data point. Additionally, we also introduce another variant of our method that allows us to carry out this attack without knowing the ground truth (true class) of a given data point, thus offering an edge over existing label-dependent attack methods. \ No newline at end of file diff --git a/data/2024/aaai/Conformal Autoregressive Generation: Beam Search with Coverage Guarantees b/data/2024/aaai/Conformal Autoregressive Generation: Beam Search with Coverage Guarantees new file mode 100644 index 0000000000..5ff64f4d86 --- /dev/null +++ b/data/2024/aaai/Conformal Autoregressive Generation: Beam Search with Coverage Guarantees @@ -0,0 +1,4 @@ +We introduce two new extensions to the beam search algorithm based on conformal predictions (CP) to produce sets of sequences with theoretical coverage guarantees. +The first method is very simple and proposes dynamically-sized subsets of beam search results but, unlike typical CP procedures, has an upper bound on the achievable guarantee depending on a post-hoc calibration measure. +Our second algorithm introduces the conformal set prediction procedure as part of the decoding process, producing a variable beam width which adapts to the current uncertainty. +While more complex, this procedure can achieve coverage guarantees selected a priori. We provide marginal coverage bounds as well as calibration-conditional guarantees for each method, and evaluate them empirically on a selection of tasks drawing from natural language processing and chemistry.
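The first extension described above can be pictured with a toy split-conformal filter over beam candidates: calibrate a score threshold on held-out references and keep every hypothesis scoring under it, which yields a dynamically sized set. Function names and the choice of nonconformity score are assumptions, and, as the abstract notes, the achievable guarantee is capped by how often the reference sequence appears in the beam at all.

```python
# Toy split-conformal selection over beam-search outputs; illustrative sketch only.
import numpy as np

def calibrate_threshold(cal_scores, delta):
    """cal_scores: nonconformity score of the reference sequence for each calibration
    example (e.g., its negative length-normalized log-likelihood under the model)."""
    n = len(cal_scores)
    level = min(1.0, np.ceil((n + 1) * (1 - delta)) / n)  # finite-sample correction
    return np.quantile(cal_scores, level, method="higher")

def conformal_beam_set(hypotheses, scores, q_hat):
    # Keep every beam hypothesis whose nonconformity score is below the threshold.
    return [h for h, s in zip(hypotheses, scores) if s <= q_hat]

# Toy usage: calibrate on 500 fake scores, then filter a beam of five hypotheses.
q_hat = calibrate_threshold(np.random.rand(500), delta=0.1)
kept = conformal_beam_set([f"hyp{i}" for i in range(5)], np.random.rand(5), q_hat)
```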
\ No newline at end of file diff --git a/data/2024/aaai/Conformal Crystal Graph Transformer with Robust Encoding of Periodic Invariance b/data/2024/aaai/Conformal Crystal Graph Transformer with Robust Encoding of Periodic Invariance new file mode 100644 index 0000000000..bf21c36952 --- /dev/null +++ b/data/2024/aaai/Conformal Crystal Graph Transformer with Robust Encoding of Periodic Invariance @@ -0,0 +1 @@ +Machine learning techniques, especially in the realm of materials design, hold immense promise in predicting the properties of crystal materials and aiding in the discovery of novel crystals with desirable traits. However, crystals possess unique geometric constraints—namely, E(3) invariance for primitive cell and periodic invariance—which need to be accurately reflected in crystal representations. Though past research has explored various construction techniques to preserve periodic invariance in crystal representations, their robustness remains inadequate. Furthermore, effectively capturing angular information within 3D crystal structures continues to pose a significant challenge for graph-based approaches. This study introduces novel solutions to these challenges. We first present a graph construction method that robustly encodes periodic invariance and a strategy to capture angular information in neural networks without compromising efficiency. We further introduce CrystalFormer, a pioneering graph transformer architecture that emphasizes angle preservation and enhances long-range information. Through comprehensive evaluation, we verify our model's superior performance in 5 crystal prediction tasks, reaffirming the efficiency of our proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/Conformal Prediction Regions for Time Series Using Linear Complementarity Programming b/data/2024/aaai/Conformal Prediction Regions for Time Series Using Linear Complementarity Programming new file mode 100644 index 0000000000..99840453ec --- /dev/null +++ b/data/2024/aaai/Conformal Prediction Regions for Time Series Using Linear Complementarity Programming @@ -0,0 +1 @@ +Conformal prediction is a statistical tool for producing prediction regions of machine learning models that are valid with high probability. However, applying conformal prediction to time series data leads to conservative prediction regions. In fact, to obtain prediction regions over T time steps with confidence 1--delta, previous works require that each individual prediction region is valid with confidence 1--delta/T. We propose an optimization-based method for reducing this conservatism to enable long horizon planning and verification when using learning-enabled time series predictors. Instead of considering prediction errors individually at each time step, we consider a parameterized prediction error over multiple time steps. By optimizing the parameters over an additional dataset, we find prediction regions that are not conservative. We show that this problem can be cast as a mixed integer linear complementarity program (MILCP), which we then relax into a linear complementarity program (LCP). Additionally, we prove that the relaxed LP has the same optimal cost as the original MILCP. Finally, we demonstrate the efficacy of our method on case studies using pedestrian trajectory predictors and F16 fighter jet altitude predictors. 
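The conservatism the time-series abstract starts from, and the remedy it proposes, can both be sketched in a few lines: the naive route calibrates every step at level 1 - delta/T (a union bound), while a single joint score over all T steps is calibrated once at level 1 - delta using per-step scalings. In the sketch below the scaling vector alpha is simply assumed to be given, whereas the paper obtains it by optimizing over an additional dataset via the (MI)LCP formulation.

```python
# Split-conformal regions for a T-step forecast; hedged sketch, not the paper's code.
import numpy as np

def bonferroni_regions(cal_errors, delta):
    """Per-step calibration with a union bound: each of the T steps at level 1 - delta/T.
    cal_errors: array (n_cal, T) of absolute prediction errors per time step."""
    n, T = cal_errors.shape
    level = min(1.0, np.ceil((n + 1) * (1 - delta / T)) / n)
    return np.quantile(cal_errors, level, axis=0)  # one (conservative) radius per step

def parameterized_regions(cal_errors, delta, alpha):
    """One joint score max_t |e_t| / alpha_t calibrated once at level 1 - delta.
    alpha: positive per-step scalings of shape (T,), assumed given here."""
    n = cal_errors.shape[0]
    scores = (cal_errors / alpha).max(axis=1)
    level = min(1.0, np.ceil((n + 1) * (1 - delta)) / n)
    q = np.quantile(scores, level)
    return q * alpha  # radius of the prediction region at each of the T steps
```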
\ No newline at end of file diff --git a/data/2024/aaai/Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum b/data/2024/aaai/Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum new file mode 100644 index 0000000000..5ccc58f25f --- /dev/null +++ b/data/2024/aaai/Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum @@ -0,0 +1 @@ +Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extending the capability of LLMs. Although there are some works that employ open-source LLMs for the tool-learning task, most of them are trained in a controlled environment in which LLMs only learn to execute the human-provided tools. However, selecting proper tools from a large toolset is also a crucial ability for the tool-learning model to be applied in real-world applications. Existing methods usually directly employ self-instruction methods to train the model, which ignores differences in tool complexity. In this paper, we propose Confucius, a novel tool-learning framework to train LLMs to use complicated tools in real-world scenarios, which contains two main phases: (1) we first propose a multi-stage learning method to teach the LLM to use various tools from an easy-to-difficult curriculum; (2) we then propose the Iterative Self-instruct from Introspective Feedback (ISIF) to dynamically construct the dataset to improve the ability to use complicated tools. Extensive experiments conducted in both controlled and real-world settings demonstrate the superiority of our tool-learning framework in the real-world application scenario compared to both tuning-free (e.g., ChatGPT, Claude) and tuning-based baselines (e.g., GPT4Tools). \ No newline at end of file diff --git a/data/2024/aaai/Confusing Pair Correction Based on Category Prototype for Domain Adaptation under Noisy Environments b/data/2024/aaai/Confusing Pair Correction Based on Category Prototype for Domain Adaptation under Noisy Environments new file mode 100644 index 0000000000..de365913a9 --- /dev/null +++ b/data/2024/aaai/Confusing Pair Correction Based on Category Prototype for Domain Adaptation under Noisy Environments @@ -0,0 +1 @@ +In this paper, we address unsupervised domain adaptation under noisy environments, which is more challenging and practical than traditional domain adaptation. In this scenario, the model is prone to overfitting noisy labels, resulting in a more pronounced domain shift and a notable decline in the overall model performance. Previous methods employed prototype methods for domain adaptation on robust feature spaces. However, these approaches struggle to effectively classify classes with similar features under noisy environments. To address this issue, we propose a new method to detect and correct confusing class pairs. We first divide classes into easy and hard classes based on the small-loss criterion. We then leverage the top-2 predictions for each sample after aligning the source and target domains to find the confusing pair among the hard classes. We apply label correction to the noisy samples within the confusing pair. With the proposed label correction method, we can train our model with more accurate labels. Extensive experiments confirm the effectiveness of our method and demonstrate its favorable performance compared with existing state-of-the-art methods. Our codes are publicly available at https://github.com/Hehxcf/CPC/.
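The pair-detection step that the abstract above walks through can be caricatured in a few lines: among the hard classes, count how often two classes co-occur as a target sample's top-2 predictions and flag the most frequent pairs. The helper name and the simple counting rule are illustrative assumptions rather than the released implementation.

```python
# Toy sketch of detecting confusing class pairs from top-2 predictions.
from collections import Counter

def find_confusing_pairs(top2_preds, hard_classes, k=1):
    """top2_preds: iterable of (first_choice, second_choice) predicted classes for
    target-domain samples; hard_classes: set of classes flagged by the small-loss split."""
    counts = Counter(
        tuple(sorted(pair)) for pair in top2_preds
        if pair[0] != pair[1] and pair[0] in hard_classes and pair[1] in hard_classes
    )
    return [pair for pair, _ in counts.most_common(k)]

# Toy usage: classes 3 and 7 keep appearing together, so they form the confusing pair.
pairs = find_confusing_pairs([(3, 7), (7, 3), (3, 7), (1, 2)], hard_classes={3, 7, 9})
```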
\ No newline at end of file diff --git a/data/2024/aaai/Considering Nonstationary within Multivariate Time Series with Variational Hierarchical Transformer for Forecasting b/data/2024/aaai/Considering Nonstationary within Multivariate Time Series with Variational Hierarchical Transformer for Forecasting new file mode 100644 index 0000000000..144049a23e --- /dev/null +++ b/data/2024/aaai/Considering Nonstationary within Multivariate Time Series with Variational Hierarchical Transformer for Forecasting @@ -0,0 +1 @@ +The forecasting of Multivariate Time Series (MTS) has long been an important but challenging task. Due to the non-stationary problem across long-distance time steps, previous studies primarily adopt stationarization methods to attenuate the non-stationary problem of the original series for better predictability. However, existing methods always adopt the stationarized series, which ignores the inherent non-stationarity, and have difficulty in modeling MTS with complex distributions due to the lack of stochasticity. To tackle these problems, we first develop a powerful hierarchical probabilistic generative module to consider the non-stationarity and stochasticity characteristics within MTS, and then combine it with a Transformer for a well-defined variational generative dynamic model named Hierarchical Time series Variational Transformer (HTV-Trans), which recovers the intrinsic non-stationary information into temporal dependencies. Being a powerful probabilistic model, HTV-Trans is utilized to learn expressive representations of MTS and applied to forecasting tasks. Extensive experiments on diverse datasets show the efficiency of HTV-Trans on MTS forecasting tasks. \ No newline at end of file diff --git a/data/2024/aaai/ConsistNER: Towards Instructive NER Demonstrations for LLMs with the Consistency of Ontology and Context b/data/2024/aaai/ConsistNER: Towards Instructive NER Demonstrations for LLMs with the Consistency of Ontology and Context new file mode 100644 index 0000000000..0e17826f13 --- /dev/null +++ b/data/2024/aaai/ConsistNER: Towards Instructive NER Demonstrations for LLMs with the Consistency of Ontology and Context @@ -0,0 +1 @@ +Named entity recognition (NER) aims to identify and classify specific entities mentioned in textual sentences. Most existing superior NER models employ the standard fully supervised paradigm, which requires a large amount of annotated data during training. In order to maintain performance with insufficient annotation resources (i.e., low resources), in-context learning (ICL) has drawn a lot of attention, due to its plug-and-play nature compared to other methods (e.g., meta-learning and prompt learning). In this manner, how to retrieve highly correlated demonstrations for target sentences serves as the key to emerging ICL ability. For the NER task, the correlation implies the consistency of both ontology (i.e., generalized entity type) and context (i.e., sentence semantics), which is ignored by previous NER demonstration retrieval techniques. To address this issue, we propose ConsistNER, a novel three-stage framework that incorporates ontological and contextual information for low-resource NER. Firstly, ConsistNER employs large language models (LLMs) to pre-recognize potential entities in a zero-shot manner.
Secondly, ConsistNER retrieves the sentence-specific demonstrations for each target sentence based on the two following considerations: (1) Regarding ontological consistency, demonstrations are filtered into a candidate set based on ontology distribution. (2) Regarding contextual consistency, an entity-aware self-attention mechanism is introduced to focus more on the potential entities and semantic-correlated tokens. Finally, ConsistNER feeds the retrieved demonstrations for all target sentences into LLMs for prediction. We conduct experiments on four widely-adopted NER datasets, including both general and specific domains. Experimental results show that ConsistNER achieves a 6.01%-26.37% and 3.07%-21.18% improvement over the state-of-the-art baselines on Micro-F1 scores under 1- and 5-shot settings, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Consistency-GAN: Training GANs with Consistency Model b/data/2024/aaai/Consistency-GAN: Training GANs with Consistency Model new file mode 100644 index 0000000000..82478150ab --- /dev/null +++ b/data/2024/aaai/Consistency-GAN: Training GANs with Consistency Model @@ -0,0 +1 @@ +For generative learning tasks, there are three crucial criteria for generating samples from the models: quality, coverage/diversity, and sampling speed. Among the existing generative models, Generative adversarial networks (GANs) and diffusion models demonstrate outstanding quality performance while suffering from notable limitations. GANs can generate high-quality results and enable fast sampling, their drawbacks, however, lie in the limited diversity of the generated samples. On the other hand, diffusion models excel at generating high-quality results with a commendable diversity. Yet, its iterative generation process necessitates hundreds to thousands of sampling steps, leading to slow speeds that are impractical for real-time scenarios. To address the aforementioned problem, this paper proposes a novel Consistency-GAN model. In particular, to aid in the training of the GAN, we introduce instance noise, which employs consistency models using only a few steps compared to the conventional diffusion process. Our evaluations on various datasets indicate that our approach significantly accelerates sampling speeds compared to traditional diffusion models, while preserving sample quality and diversity. Furthermore, our approach also has better model coverage than traditional adversarial training methods. \ No newline at end of file diff --git a/data/2024/aaai/Consistency-Guided Temperature Scaling Using Style and Content Information for Out-of-Domain Calibration b/data/2024/aaai/Consistency-Guided Temperature Scaling Using Style and Content Information for Out-of-Domain Calibration new file mode 100644 index 0000000000..0f055c412e --- /dev/null +++ b/data/2024/aaai/Consistency-Guided Temperature Scaling Using Style and Content Information for Out-of-Domain Calibration @@ -0,0 +1 @@ +Research interests in the robustness of deep neural networks against domain shifts have been rapidly increasing in recent years. Most existing works, however, focus on improving the accuracy of the model, not the calibration performance which is another important requirement for trustworthy AI systems. Temperature scaling (TS), an accuracy-preserving post-hoc calibration method, has been proven to be effective in in-domain settings, but not in out-of-domain (OOD) due to the difficulty in obtaining a validation set for the unseen domain beforehand. 
In this paper, we propose consistency-guided temperature scaling (CTS), a new temperature scaling strategy that can significantly enhance the OOD calibration performance by providing mutual supervision among data samples in the source domains. Motivated by our observation that over-confidence stemming from inconsistent sample predictions is the main obstacle to OOD calibration, we propose to guide the scaling process by taking consistencies into account in terms of two different aspects - style and content - which are the key components that can well-represent data samples in multi-domain settings. Experimental results demonstrate that our proposed strategy outperforms existing works, achieving superior OOD calibration performance on various datasets. This can be accomplished by employing only the source domains without compromising accuracy, making our scheme directly applicable to various trustworthy AI systems. \ No newline at end of file diff --git a/data/2024/aaai/ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference b/data/2024/aaai/ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference new file mode 100644 index 0000000000..8f66930112 --- /dev/null +++ b/data/2024/aaai/ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference @@ -0,0 +1 @@ +Early Exiting is one of the most popular methods to achieve efficient inference. Current early exiting methods adopt the (weighted) sum of the cross entropy loss of all internal classifiers as the objective function during training, imposing all these classifiers to predict all instances correctly. However, during inference, as long as one internal classifier predicts an instance correctly, it can accelerate without losing accuracy. Thus, there is a notable gap between training and inference. We propose ConsistentEE, an early exiting method that is consistent in training and inference. ConsistentEE formulates the early exiting process as a reinforcement learning problem. A policy network is added to decide whether an instance should exit or continue. The training objective of ConsistentEE only requires each instance to be predicted correctly by one internal classifier. Additionally, we introduce the concept "Memorized Layer" to measure the hardness of an instance. We incorporate the memorized layer into reward function design, which allows "easy'' instances to focus more on acceleration while ``hard'' instances to focus more on accuracy. Experimental results show that our method outperforms other baselines on various natural language understanding and generation tasks using PLMs and LLMs as backbones respectively. \ No newline at end of file diff --git a/data/2024/aaai/Constrained Bayesian Optimization under Partial Observations: Balanced Improvements and Provable Convergence b/data/2024/aaai/Constrained Bayesian Optimization under Partial Observations: Balanced Improvements and Provable Convergence new file mode 100644 index 0000000000..723b75fbd1 --- /dev/null +++ b/data/2024/aaai/Constrained Bayesian Optimization under Partial Observations: Balanced Improvements and Provable Convergence @@ -0,0 +1 @@ +The partially observable constrained optimization problems (POCOPs) impede data-driven optimization techniques since an infeasible solution of POCOPs can provide little information about the objective as well as the constraints. 
We endeavor to design an efficient and provable method for expensive POCOPs under the framework of constrained Bayesian optimization. Our method consists of two key components. Firstly, we present an improved design of the acquisition functions that introduce balanced exploration during optimization. We rigorously study the convergence properties of this design to demonstrate its effectiveness. Secondly, we propose Gaussian processes embedding different likelihoods as the surrogate model for partially observable constraints. This model leads to a more accurate representation of the feasible regions compared to traditional classification-based models. Our proposed method is empirically studied on both synthetic and real-world problems. The results demonstrate the competitiveness of our method for solving POCOPs. \ No newline at end of file diff --git a/data/2024/aaai/Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming b/data/2024/aaai/Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming new file mode 100644 index 0000000000..61c2a5263d --- /dev/null +++ b/data/2024/aaai/Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming @@ -0,0 +1 @@ +Despite remarkable achievements in artificial intelligence, the deployability of learning-enabled systems in high-stakes real-world environments still faces persistent challenges. For example, in safety-critical domains like autonomous driving, robotic manipulation, and healthcare, it is crucial not only to achieve high performance but also to comply with given constraints. Furthermore, adaptability becomes paramount in non-stationary domains, where environmental parameters are subject to change. While safety and adaptability are recognized as key qualities for the new generation of AI, current approaches have not demonstrated effective adaptable performance in constrained settings. Hence, this paper breaks new ground by studying the unique challenges of ensuring safety in nonstationary environments by solving constrained problems through the lens of the meta-learning approach (learning to learn). While unconstrained meta-learning already encounters complexities in end to end differentiation of the loss due to the bi-level nature, its constrained counterpart introduces an additional layer of difficulty, since the constraints imposed on task-level updates complicate the differentiation process. To address the issue, we first employ successive convex-constrained policy updates across multiple tasks with differentiable convex programming, which allows meta-learning in constrained scenarios by enabling end-to-end differentiation. This approach empowers the agent to rapidly adapt to new tasks under nonstationarity while ensuring compliance with safety constraints. We also provide a theoretical analysis demonstrating guaranteed monotonic improvement of our approach, justifying our algorithmic designs. Extensive simulations across diverse environments provide empirical validation with significant improvement over established benchmarks. 
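The building block underneath the constrained meta-learning approach sketched above is a convex-constrained policy update. A minimal single-task version of such an update (using cvxpy as an off-the-shelf convex solver; the linearization, the trust-region radius, and all names are assumptions, and none of the paper's end-to-end differentiable meta-learning machinery is reproduced) looks roughly like this:

```python
# Hedged sketch of one convex-constrained policy step around current parameters theta.
import numpy as np
import cvxpy as cp

def constrained_step(theta, grad_return, grad_cost, cost_value, budget, radius=0.1):
    d = cp.Variable(theta.shape[0])                       # parameter update direction
    objective = cp.Maximize(grad_return @ d)              # linearized expected return
    constraints = [cost_value + grad_cost @ d <= budget,  # linearized safety constraint
                   cp.norm(d, 2) <= radius]               # trust region keeps the step small
    cp.Problem(objective, constraints).solve()
    return theta + d.value

# Toy usage with random gradients for a 5-dimensional policy parameterization.
theta = np.zeros(5)
theta_new = constrained_step(theta, np.random.randn(5), np.random.randn(5),
                             cost_value=0.8, budget=1.0)
```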
\ No newline at end of file diff --git a/data/2024/aaai/Constraint Latent Space Matters: An Anti-anomalous Waveform Transformation Solution from Photoplethysmography to Arterial Blood Pressure b/data/2024/aaai/Constraint Latent Space Matters: An Anti-anomalous Waveform Transformation Solution from Photoplethysmography to Arterial Blood Pressure new file mode 100644 index 0000000000..ca05a42478 --- /dev/null +++ b/data/2024/aaai/Constraint Latent Space Matters: An Anti-anomalous Waveform Transformation Solution from Photoplethysmography to Arterial Blood Pressure @@ -0,0 +1 @@ +Arterial blood pressure (ABP) holds substantial promise for proactive cardiovascular health management. Notwithstanding its potential, the invasive nature of ABP measurements confines their utility primarily to clinical environments, limiting their applicability for continuous monitoring beyond medical facilities. The conversion of photoplethysmography (PPG) signals into ABP equivalents has garnered significant attention due to its potential in revolutionizing cardiovascular disease management. Recent strides in PPG-to-ABP prediction encompass the integration of generative and discriminative models. Despite these advances, the efficacy of these models is curtailed by the latent space shift predicament, stemming from alterations in PPG data distribution across disparate hardware and individuals, potentially leading to distorted ABP waveforms. To tackle this problem, we present an innovative solution named the Latent Space Constraint Transformer (LSCT), leveraging a quantized codebook to yield robust latent spaces by employing multiple discretizing bases. To facilitate improved reconstruction, the Correlation-boosted Attention Module (CAM) is introduced to systematically query pertinent bases on a global scale. Furthermore, to enhance expressive capacity, we propose the Multi-Spectrum Enhancement Knowledge (MSEK), which fosters local information flow within the channels of latent code and provides additional embedding for reconstruction. Through comprehensive experimentation on both publicly available datasets and a private downstream task dataset, the proposed approach demonstrates noteworthy performance enhancements compared to existing methods. Extensive ablation studies further substantiate the effectiveness of each introduced module. \ No newline at end of file diff --git a/data/2024/aaai/Constructing Dreams Using Generative AI b/data/2024/aaai/Constructing Dreams Using Generative AI new file mode 100644 index 0000000000..958f189569 --- /dev/null +++ b/data/2024/aaai/Constructing Dreams Using Generative AI @@ -0,0 +1 @@ +Generative AI tools introduce new and accessible forms of media creation for youth. They also raise ethical concerns about the generation of fake media, data protection, privacy and ownership of AI-generated art. Since generative AI is already being used in products used by youth, it is critical that they understand how these tools work and how they can be used or misused. In this work, we facilitated students’ generative AI learning through expression of their imagined future identities. We designed a learning workshop - Dreaming with AI - where students learned about the inner workings of generative AI tools, used text-to-image generation algorithms to create their imaged future dreams, reflected on the potential benefits and harms of generative AI tools and voiced their opinions about policies for the use of these tools in classrooms. 
In this paper, we present the learning activities and experiences of 34 high school students who engaged in our workshops. Students reached creative learning objectives by using prompt engineering to create their future dreams, gained technical knowledge by learning the abilities, limitations, text-visual mappings and applications of generative AI, and identified most potential societal benefits and harms of generative AI. \ No newline at end of file diff --git a/data/2024/aaai/ContactGen: Contact-Guided Interactive 3D Human Generation for Partners b/data/2024/aaai/ContactGen: Contact-Guided Interactive 3D Human Generation for Partners new file mode 100644 index 0000000000..b76a8c29e7 --- /dev/null +++ b/data/2024/aaai/ContactGen: Contact-Guided Interactive 3D Human Generation for Partners @@ -0,0 +1 @@ +Among various interactions between humans, such as eye contact and gestures, physical interactions by contact can act as an essential moment in understanding human behaviors. Inspired by this fact, given a 3D partner human with the desired interaction label, we introduce a new task of 3D human generation in terms of physical contact. Unlike previous works of interacting with static objects or scenes, a given partner human can have diverse poses and different contact regions according to the type of interaction. To handle this challenge, we propose a novel method of generating interactive 3D humans for a given partner human based on a guided diffusion framework (ContactGen in short). Specifically, we newly present a contact prediction module that adaptively estimates potential contact regions between two input humans according to the interaction label. Using the estimated potential contact regions as complementary guidances, we dynamically enforce ContactGen to generate interactive 3D humans for a given partner human within a guided diffusion model. We demonstrate ContactGen on the CHI3D dataset, where our method generates physically plausible and diverse poses compared to comparison methods. \ No newline at end of file diff --git a/data/2024/aaai/Content Filtering with Inattentive Information Consumers b/data/2024/aaai/Content Filtering with Inattentive Information Consumers new file mode 100644 index 0000000000..eb471d8a30 --- /dev/null +++ b/data/2024/aaai/Content Filtering with Inattentive Information Consumers @@ -0,0 +1 @@ +We develop a model of content filtering as a game between the filter and the content consumer, where the latter incurs information costs for examining the content. Motivating examples include censoring misinformation, spam/phish filtering, and recommender systems acting on a stream of content. When the attacker is exogenous, we show that improving the filter’s quality is weakly Pareto improving, but has no impact on equilibrium payoffs until the filter becomes sufficiently accurate. Further, if the filter does not internalize the consumer’s information costs, its lack of commitment power may render it useless and lead to inefficient outcomes. When the attacker is also strategic, improvements in filter quality may decrease equilibrium payoffs. 
\ No newline at end of file diff --git a/data/2024/aaai/Context Enhanced Transformer for Single Image Object Detection in Video Data b/data/2024/aaai/Context Enhanced Transformer for Single Image Object Detection in Video Data new file mode 100644 index 0000000000..2ed336e645 --- /dev/null +++ b/data/2024/aaai/Context Enhanced Transformer for Single Image Object Detection in Video Data @@ -0,0 +1 @@ +With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple inputs and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single image object detection, called Context Enhanced TRansformer (CETR), by incorporating temporal context into DETR using a newly designed memory module. To efficiently store temporal information, we construct a class-wise memory that collects contextual information across data. Additionally, we present a classification-based sampling technique to selectively utilize the relevant memory for the current image. At test time, we introduce a memory adaptation method that updates individual memory functions by considering the test distribution. Experiments with the CityCam and ImageNet VID datasets demonstrate the efficiency of the framework on various video systems. The project page and code will be made available at: https://ku-cvlab.github.io/CETR. \ No newline at end of file diff --git a/data/2024/aaai/Context-Aware Iteration Policy Network for Efficient Optical Flow Estimation b/data/2024/aaai/Context-Aware Iteration Policy Network for Efficient Optical Flow Estimation new file mode 100644 index 0000000000..04a5d4b91f --- /dev/null +++ b/data/2024/aaai/Context-Aware Iteration Policy Network for Efficient Optical Flow Estimation @@ -0,0 +1 @@ +Existing recurrent optical flow estimation networks are computationally expensive since they use a fixed, large number of iterations to update the flow field for each sample. An efficient network should skip iterations when the flow improvement is limited. In this paper, we develop a Context-Aware Iteration Policy Network for efficient optical flow estimation, which determines the optimal number of iterations per sample. The policy network achieves this by learning contextual information to realize whether flow improvement is bottlenecked or minimal. On the one hand, we use an iteration embedding and the historical hidden cell, which include information from previous iterations, to convey how the flow has changed over previous iterations. On the other hand, we use the incremental loss to make the policy network implicitly perceive the magnitude of optical flow improvement in the subsequent iteration. Furthermore, the computational complexity in our dynamic network is controllable, allowing us to satisfy various resource preferences with a single trained model. Our policy network can be easily integrated into state-of-the-art optical flow networks. Extensive experiments show that our method maintains performance while reducing FLOPs by about 40%/20% for the Sintel/KITTI datasets.
\ No newline at end of file diff --git a/data/2024/aaai/Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval b/data/2024/aaai/Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval new file mode 100644 index 0000000000..211db82bbd --- /dev/null +++ b/data/2024/aaai/Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval @@ -0,0 +1 @@ +Different from the Composed Image Retrieval task that requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent that could be related to domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to learn a more accurate image representation that has adaptive attention to the reference image for various manipulation descriptions. In this paper, we propose a novel context-dependent mapping network, named Context-I2W, for adaptively converting description-relevant Image information into a pseudo-word token composed of the description for accurate ZS-CIR. Specifically, an Intent View Selector first dynamically learns a rotation rule to map the identical image to a task-specific manipulation view. Then a Visual Target Extractor further captures local information covering the main targets in ZS-CIR tasks under the guidance of multiple learnable queries. The two complementary modules work together to map an image to a context-dependent pseudo-word token without extra supervision. Our model shows strong generalization ability on four ZS-CIR tasks, including domain conversion, object composition, object manipulation, and attribute manipulation. It obtains consistent and significant performance boosts ranging from 1.88% to 3.60% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://anonymous.4open.science/r/Context-I2W-4224/. \ No newline at end of file diff --git a/data/2024/aaai/Contextual Pandora's Box b/data/2024/aaai/Contextual Pandora's Box new file mode 100644 index 0000000000..5058e5408d --- /dev/null +++ b/data/2024/aaai/Contextual Pandora's Box @@ -0,0 +1 @@ +Pandora’s Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative, while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate distributions are given for the values of all the alternatives, while recent work studies the online variant of Pandora’s Box where the distributions are originally unknown. In this work, we study Pandora’s Box in the online setting, while incorporating context. At each round, we are presented with a number of alternatives each having a context, an exploration cost and an unknown value drawn from an unknown distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably well against the optimal algorithm which knows all prior distributions exactly. Our algorithm works even in the bandit setting where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to a sufficient statistic of each alternative’s distribution (its reservation value) rather than its mean. 
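The "reservation value" that the last sentence above refers to is the classic Pandora's Box index: the sigma solving E[(V - sigma)+] = inspection cost for an alternative with value distribution V. A small numerical sketch for discrete distributions (function names are illustrative) is given below; Weitzman's policy then opens boxes in decreasing order of sigma and stops once the best observed value exceeds every unopened box's reservation value.

```python
# Reservation value of a single box with a discrete value distribution; toy sketch.
import numpy as np

def reservation_value(values, probs, cost, iters=60):
    """Solve E[(V - sigma)_+] = cost for sigma by bisection."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    lo, hi = values.min() - 1.0, values.max()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        expected_gain = np.sum(probs * np.clip(values - mid, 0.0, None))
        # Gain is decreasing in sigma: too much gain means sigma is still too small.
        lo, hi = (mid, hi) if expected_gain > cost else (lo, mid)
    return 0.5 * (lo + hi)

# Toy usage: a box worth 10 with probability 0.3 (else 0) and inspection cost 1.
sigma = reservation_value([0.0, 10.0], [0.7, 0.3], cost=1.0)  # roughly 6.67
```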
\ No newline at end of file diff --git a/data/2024/aaai/Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning b/data/2024/aaai/Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning new file mode 100644 index 0000000000..b96b2438d6 --- /dev/null +++ b/data/2024/aaai/Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning @@ -0,0 +1 @@ +Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RMs), state machine abstractions that induce subtasks based on the current task’s rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Empirical results show that our representations improve sample efficiency and few-shot transfer in a variety of domains. \ No newline at end of file diff --git a/data/2024/aaai/Continual Learning in an Open and Dynamic World b/data/2024/aaai/Continual Learning in an Open and Dynamic World new file mode 100644 index 0000000000..8ba76dc87d --- /dev/null +++ b/data/2024/aaai/Continual Learning in an Open and Dynamic World @@ -0,0 +1,2 @@ +Building autonomous agents that can process massive amounts of real-time sensor-captured data is essential for many real-world applications, including autonomous vehicles, robotics, and AI in medicine. As the agent often needs to explore in a dynamic environment, it is a desirable as well as challenging goal to enable the agent to learn over time without performance degradation. Continual learning aims to build a continual learner which can learn new concepts over the data stream while preserving previously learnt concepts. In the talk, I will survey three pieces of my recent research on continual learning: (i) supervised continual learning, (ii) unsupervised continual learning, and (iii) multi-modal continual learning. In the first work, I will discuss a supervised +continual learning algorithm called MEGA which dynamically balances the old tasks and the new task. In the second work, I will discuss unsupervised continual learning algorithms which learn representations continually without access to the labels. In the third work, I will elaborate on an efficient continual learning algorithm that can learn multiple modalities continually without forgetting. \ No newline at end of file diff --git a/data/2024/aaai/Continual Relation Extraction via Sequential Multi-Task Learning b/data/2024/aaai/Continual Relation Extraction via Sequential Multi-Task Learning new file mode 100644 index 0000000000..96271efcf4 --- /dev/null +++ b/data/2024/aaai/Continual Relation Extraction via Sequential Multi-Task Learning @@ -0,0 +1 @@ +Building continual relation extraction (CRE) models that can adapt to an ever-growing ontology of relations is a cornerstone information extraction task that serves various dynamic real-world domains.
To mitigate catastrophic forgetting in CRE, existing state-of-the-art approaches have effectively utilized rehearsal techniques from continual learning and achieved remarkable success. However, managing multiple objectives associated with memory-based rehearsal remains underexplored, often relying on simple summation and overlooking complex trade-offs. In this paper, we propose Continual Relation Extraction via Sequential Multi-task Learning (CREST), a novel CRE approach built upon a tailored Multi-task Learning framework for continual learning. CREST takes into consideration the disparity in the magnitudes of gradient signals of different objectives, thereby effectively handling the inherent difference between multi-task learning and continual learning. Through extensive experiments on multiple datasets, CREST demonstrates significant improvements in CRE performance as well as superiority over other state-of-the-art Multi-task Learning frameworks, offering a promising solution to the challenges of continual learning in this domain. \ No newline at end of file diff --git a/data/2024/aaai/Continual Vision-Language Retrieval via Dynamic Knowledge Rectification b/data/2024/aaai/Continual Vision-Language Retrieval via Dynamic Knowledge Rectification new file mode 100644 index 0000000000..a4ea736a25 --- /dev/null +++ b/data/2024/aaai/Continual Vision-Language Retrieval via Dynamic Knowledge Rectification @@ -0,0 +1 @@ +The recent large-scale pre-trained models like CLIP have aroused great concern in vision-language tasks. However, when required to match image-text data collected in a streaming manner, namely Continual Vision-Language Retrieval (CVRL), their performances are still limited due to the catastrophic forgetting of the learned old knowledge. To handle this issue, advanced methods are proposed to distill the affinity knowledge between images and texts from the old model to the new one for anti-forgetting. Unfortunately, existing approaches neglect the impact of incorrect affinity, which prevents the balance between the anti-forgetting of old knowledge and the acquisition of new knowledge. Therefore, we propose a novel framework called Dynamic Knowledge Rectification (DKR) that simultaneously achieves incorrect knowledge filtering and rectification. Specifically, we first filter the incorrect affinity knowledge calculated by the old model on the new data. Then, a knowledge rectification method is designed to rectify the incorrect affinities while preserving the correct ones. In particular, for the new data that can only be correctly retrieved by the new model, we rectify them with the corresponding new affinity to protect them from negative transfer. Additionally, for those that can not be retrieved by either the old or the new model, we introduce paired ground-truth labels to promote the acquisition of both old and new knowledge. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our DKR and its superiority against state-of-the-art methods. 
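A rough sketch of the rectification idea described in the DKR abstract above, written against batch-level image-text affinity (similarity) matrices: keep the old model's affinities where it still retrieves correctly, switch to the new model's affinities where only the new model is correct, and fall back to the ground-truth pairing where neither is. The hard one-hot fallback and all variable names are assumptions, not the authors' exact formulation.

```python
# Hedged sketch of building rectified distillation targets for continual retrieval.
import torch

def rectified_targets(old_sim, new_sim, labels):
    """old_sim, new_sim: (B, B) image-to-text similarity matrices from the old and new
    models on a batch of paired data; labels: index of the matching text for each image."""
    old_correct = old_sim.argmax(dim=1) == labels
    new_correct = new_sim.argmax(dim=1) == labels
    target = old_sim.clone()
    # Only the new model retrieves correctly: trust its affinity to avoid negative transfer.
    only_new = ~old_correct & new_correct
    target[only_new] = new_sim[only_new]
    # Neither model retrieves correctly: fall back to the ground-truth one-hot pairing.
    neither = ~old_correct & ~new_correct
    one_hot = torch.zeros_like(old_sim)
    one_hot[torch.arange(labels.numel()), labels] = 1.0
    target[neither] = one_hot[neither]
    return target  # used as the (detached) target of a distillation loss on new_sim
```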
\ No newline at end of file diff --git a/data/2024/aaai/Continuous Piecewise-Affine Based Motion Model for Image Animation b/data/2024/aaai/Continuous Piecewise-Affine Based Motion Model for Image Animation new file mode 100644 index 0000000000..e1378fa83e --- /dev/null +++ b/data/2024/aaai/Continuous Piecewise-Affine Based Motion Model for Image Animation @@ -0,0 +1 @@ +Image animation aims to bring static images to life according to driving videos and create engaging visual content that can be used for various purposes such as animation, entertainment, and education. Recent unsupervised methods utilize affine and thin-plate spline transformations based on keypoints to transfer the motion in driving frames to the source image. However, limited by the expressive power of the transformations used, these methods always produce poor results when the gap between the motion in the driving frame and the source image is large. To address this issue, we propose to model motion from the source image to the driving frame in highly-expressive diffeomorphism spaces. Firstly, we introduce Continuous Piecewise-Affine based (CPAB) transformation to model the motion and present a well-designed inference algorithm to generate CPAB transformation from control keypoints. Secondly, we propose a SAM-guided keypoint semantic loss to further constrain the keypoint extraction process and improve the semantic consistency between the corresponding keypoints on the source and driving images. Finally, we design a structure alignment loss to align the structure-related features extracted from driving and generated images, thus helping the generator generate results that are more consistent with the driving action. Extensive experiments on four datasets demonstrate the effectiveness of our method against state-of-the-art competitors quantitatively and qualitatively. Code will be publicly available at: https://github.com/DevilPG/AAAI2024-CPABMM. \ No newline at end of file diff --git a/data/2024/aaai/Continuous Rotation Group Equivariant Network Inspired by Neural Population Coding b/data/2024/aaai/Continuous Rotation Group Equivariant Network Inspired by Neural Population Coding new file mode 100644 index 0000000000..75c02c30df --- /dev/null +++ b/data/2024/aaai/Continuous Rotation Group Equivariant Network Inspired by Neural Population Coding @@ -0,0 +1 @@ +Neural population coding can represent continuous information by neurons with a series of discrete preferred stimuli, and we find that the bell-shaped tuning curve plays an important role in this mechanism. Inspired by this, we incorporate a bell-shaped tuning curve into the discrete group convolution to achieve continuous group equivariance. Simply, we modulate group convolution kernels by Gauss functions to obtain bell-shaped tuning curves. Benefiting from the modulation, kernels also gain smooth gradients on geometric dimensions (e.g., location dimension and orientation dimension). It allows us to generate group convolution kernels from sparse weights with learnable geometric parameters, which can achieve both competitive performances and parameter efficiencies. Furthermore, we quantitatively prove that discrete group convolutions with proper tuning curves (bigger than 1x sampling step) can achieve continuous equivariance. 
Experimental results show that 1) our approach achieves very competitive performance on MNIST-rot with at least 75% fewer parameters than previous SOTA methods, demonstrating its parameter efficiency; 2) especially with small sample sizes, our approach exhibits more pronounced performance improvements (up to 24%); 3) it also has excellent rotation generalization ability on various datasets such as MNIST, CIFAR, and ImageNet with both plain and ResNet architectures. \ No newline at end of file diff --git a/data/2024/aaai/Continuous Treatment Effect Estimation Using Gradient Interpolation and Kernel Smoothing b/data/2024/aaai/Continuous Treatment Effect Estimation Using Gradient Interpolation and Kernel Smoothing new file mode 100644 index 0000000000..839f539eae --- /dev/null +++ b/data/2024/aaai/Continuous Treatment Effect Estimation Using Gradient Interpolation and Kernel Smoothing @@ -0,0 +1,25 @@ +We address the Individualized continuous treatment effect (ICTE) estimation problem where we predict the effect of any continuous-valued treatment on an individual using observational data. The main challenge in this estimation task is the potential confounding of treatment assignment with the individual's covariates in the training data, whereas during inference ICTE requires prediction on independently sampled treatments. In contrast to prior work that relied on regularizers or unstable GAN training, we advocate the direct approach of augmenting training individuals with independently sampled treatments and inferred counterfactual outcomes. We infer counterfactual outcomes using a two-pronged strategy: Gradient Interpolation for close-to-observed treatments, and Gaussian Process based Kernel Smoothing, which allows us to down-weight high-variance inferences. We evaluate our method on five benchmarks and show that our method outperforms six state-of-the-art methods on the counterfactual estimation error. We analyze the superior performance of our method by showing that (1) our inferred counterfactual responses are more accurate, and (2) adding them to the training data reduces the distributional distance between the confounded training distribution and the test distribution where treatment is independent of covariates. Our proposed method is model-agnostic and we show that it improves ICTE accuracy of several existing models.
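To make the kernel-smoothing half of the two-pronged strategy above concrete, here is a minimal, illustrative numpy sketch (assumed details, not the authors' code): a counterfactual outcome for a newly sampled treatment is inferred by Nadaraya-Watson smoothing over observed (treatment, outcome) pairs, and inferences with high local variance receive a low confidence weight when added to the training set.

import numpy as np

def kernel_smoothed_counterfactual(t_new, t_obs, y_obs, bandwidth=0.1):
    """Infer a counterfactual outcome for treatment t_new from observed
    (treatment, outcome) pairs via Nadaraya-Watson kernel smoothing, and return
    a variance-based confidence weight for the augmented sample."""
    w = np.exp(-0.5 * ((t_obs - t_new) / bandwidth) ** 2)
    w = w / (w.sum() + 1e-12)
    y_hat = np.sum(w * y_obs)
    # Local variance of the smoothed estimate: high variance -> low confidence.
    var = np.sum(w * (y_obs - y_hat) ** 2)
    confidence = 1.0 / (1.0 + var)
    return y_hat, confidence

# Toy usage: outcomes observed at confounded treatments, queried at a new treatment value.
rng = np.random.default_rng(0)
t_obs = rng.uniform(0, 1, 200)
y_obs = np.sin(3 * t_obs) + 0.1 * rng.normal(size=200)
y_cf, conf = kernel_smoothed_counterfactual(0.42, t_obs, y_obs)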
This forms a generic new likelihood specification explicitly accounting for intermittent edge-persistent networks, namely GraSSP: Graph Representation with Sequential Survival Process. We apply the developed framework to a recent continuous-time dynamic latent distance model characterizing network dynamics in terms of a sequence of piecewise linear movements of nodes in latent space. We quantitatively assess the developed framework in various downstream tasks, such as link prediction and network completion, demonstrating that the developed modeling framework, by accounting for link persistence and absence, accurately tracks the intrinsic trajectories of nodes in the latent space and captures the underlying characteristics of the evolving network structure. \ No newline at end of file diff --git a/data/2024/aaai/Contrastive Balancing Representation Learning for Heterogeneous Dose-Response Curves Estimation b/data/2024/aaai/Contrastive Balancing Representation Learning for Heterogeneous Dose-Response Curves Estimation new file mode 100644 index 0000000000..a35f007a1c --- /dev/null +++ b/data/2024/aaai/Contrastive Balancing Representation Learning for Heterogeneous Dose-Response Curves Estimation @@ -0,0 +1 @@ +Estimating individuals' potential responses to varying treatment doses is crucial for decision-making in areas such as precision medicine and management science. Most recent studies predict counterfactual outcomes by learning a covariate representation that is independent of the treatment variable. However, such independence constraints neglect much of the covariate information that is useful for counterfactual prediction, especially when the treatment variables are continuous. To tackle this issue, in this paper, we first theoretically demonstrate the importance of balancing and prognostic representations for unbiased estimation of heterogeneous dose-response curves; that is, the learned representations are constrained so that the covariates are conditionally independent of both the treatment variables and the potential responses. Based on this, we propose a novel Contrastive balancing Representation learning Network using a partial distance measure, called CRNet, for estimating the heterogeneous dose-response curves without losing the continuity of treatments. Extensive experiments are conducted on synthetic and real-world datasets, demonstrating that our proposal significantly outperforms previous methods. \ No newline at end of file diff --git a/data/2024/aaai/Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation b/data/2024/aaai/Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation new file mode 100644 index 0000000000..55d6c96f8c --- /dev/null +++ b/data/2024/aaai/Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation @@ -0,0 +1 @@ +Recently, because of the high-quality representations of contrastive learning methods, rehearsal-based contrastive continual learning has been proposed to explore how to continually learn transferable representation embeddings to avoid the catastrophic forgetting issue in traditional continual settings.
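Referring back to the GraSSP abstract earlier in this span: as a minimal illustration under an assumed functional form (not the paper's actual likelihood), a sequential survival process over a link's alternating present/absent durations can be scored with exponential survival terms whose hazards depend on the latent distance between the two nodes, so that nearby nodes form links faster and keep them longer.

import numpy as np

def link_duration_loglik(durations, states, z_u, z_v, a=0.0, b=0.0):
    """Log-likelihood of alternating link-on / link-off durations for one node pair
    under exponential survival times whose rates depend on latent distance."""
    dist = np.linalg.norm(z_u - z_v)
    rates = {1: np.exp(a + dist),    # state 1 = link present; dissolution rate grows with distance
             0: np.exp(b - dist)}    # state 0 = link absent; formation rate shrinks with distance
    ll = 0.0
    for d, s in zip(durations, states):
        ll += np.log(rates[s]) - rates[s] * d   # exponential log-density of the observed duration
    return ll

# Toy usage: a link that stayed on for 2.0, off for 0.5, then on for 1.3 time units.
print(link_duration_loglik([2.0, 0.5, 1.3], [1, 0, 1],
                           z_u=np.array([0.0, 0.0]), z_v=np.array([0.3, 0.4])))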
Based on this framework, we propose Contrastive Continual Learning via Importance Sampling (CCLIS) to preserve knowledge by recovering previous data distributions with a new strategy for Replay Buffer Selection (RBS), which minimizes the estimated variance so as to preserve hard negative samples for high-quality representation learning. Furthermore, we present the Prototype-instance Relation Distillation (PRD) loss, a technique designed to maintain the relationship between prototypes and sample representations using a self-distillation process. Experiments on standard continual learning benchmarks reveal that our method notably outperforms existing baselines in terms of knowledge preservation and thereby effectively counteracts catastrophic forgetting in online contexts. The code is available at https://github.com/lijy373/CCLIS. \ No newline at end of file diff --git a/data/2024/aaai/Contrastive Credibility Propagation for Reliable Semi-supervised Learning b/data/2024/aaai/Contrastive Credibility Propagation for Reliable Semi-supervised Learning new file mode 100644 index 0000000000..80f8e00dee --- /dev/null +++ b/data/2024/aaai/Contrastive Credibility Propagation for Reliable Semi-supervised Learning @@ -0,0 +1 @@ +Producing labels for unlabeled data is error-prone, making semi-supervised learning (SSL) troublesome. Often, little is known about when and why an algorithm fails to outperform a supervised baseline. Using benchmark datasets, we craft five common real-world SSL data scenarios: few-label, open-set, noisy-label, and class distribution imbalance and misalignment in the labeled and unlabeled sets. We propose a novel algorithm called Contrastive Credibility Propagation (CCP) for deep SSL via iterative transductive pseudo-label refinement. CCP unifies semi-supervised learning and noisy label learning for the goal of reliably outperforming a supervised baseline in any data scenario. Compared to prior methods which focus on a subset of scenarios, CCP uniquely outperforms the supervised baseline in all scenarios, supporting practitioners when the qualities of labeled or unlabeled data are unknown. \ No newline at end of file diff --git a/data/2024/aaai/Contrastive Learning for Low-Light Raw Denoising (Student Abstract) b/data/2024/aaai/Contrastive Learning for Low-Light Raw Denoising (Student Abstract) new file mode 100644 index 0000000000..c33ba3e5da --- /dev/null +++ b/data/2024/aaai/Contrastive Learning for Low-Light Raw Denoising (Student Abstract) @@ -0,0 +1 @@ +Image/video denoising in low-light scenes is an extremely challenging problem due to limited photon count and high noise. In this paper, we propose a novel approach with contrastive learning to address this issue. Inspired by the success of contrastive learning in some high-level computer vision tasks, we bring this idea to the low-level denoising task. To achieve this goal, we introduce a new denoising contrastive regularization (DCR) to exploit the information of noisy images and clean images. In the feature space, DCR makes the denoised image closer to the clean image and far away from the noisy image. In addition, we build a new feature embedding network called Wnet, which is more effective at extracting high-frequency information. We conduct experiments on a real low-light dataset that captures still images taken on a moonless clear night in 0.6 millilux and videos under starlight (no moon present). The results show that our method can achieve a higher PSNR and better visual quality compared with existing methods.
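A minimal sketch of the denoising contrastive regularization (DCR) idea just described (illustrative only, with a stand-in feature extractor rather than the paper's Wnet): in feature space, the denoised output is pulled toward the clean target and pushed away from the noisy input.

import torch
import torch.nn as nn

# Stand-in feature embedding network (the paper uses its own Wnet; this is a placeholder).
feat = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(16, 16, 3, padding=1))

def dcr_loss(denoised, clean, noisy, eps=1e-6):
    """Denoising contrastive regularization: the denoised image should be close to
    the clean image (numerator) and far from the noisy one (denominator)."""
    f_d, f_c, f_n = feat(denoised), feat(clean), feat(noisy)
    pos = torch.mean(torch.abs(f_d - f_c))   # attract toward the clean anchor
    neg = torch.mean(torch.abs(f_d - f_n))   # repel from the noisy anchor
    return pos / (neg + eps)

# Usage inside a denoiser's training step (shapes: batch x 1 x H x W raw frames).
loss = dcr_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))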
\ No newline at end of file diff --git a/data/2024/aaai/Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget b/data/2024/aaai/Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget new file mode 100644 index 0000000000..ac27bf2599 --- /dev/null +++ b/data/2024/aaai/Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget @@ -0,0 +1,2 @@ +Masked Image Modeling (MIM) methods, like Masked Autoencoders (MAE), efficiently learn a rich representation of the input. However, for adapting to downstream tasks, they require a sufficient amount of labeled data since their rich features encode not only objects but also less relevant image background. In contrast, Instance Discrimination (ID) methods focus on objects. In this work, we study how to combine the efficiency and scalability of MIM with the ability of ID to perform downstream classification in the absence of large amounts of labeled data. To this end, we introduce Masked Autoencoder Contrastive Tuning (MAE-CT), a sequential approach that utilizes the implicit clustering of the Nearest Neighbor Contrastive Learning (NNCLR) objective to induce abstraction in the topmost layers of a pre-trained MAE. MAE-CT tunes the rich features such that they form semantic clusters of objects without using any labels. Notably, MAE-CT does not rely on hand-crafted augmentations and frequently achieves its best performance while using only minimal augmentations (crop & flip). Further, MAE-CT is compute-efficient as it requires at most 10% overhead compared to MAE re-training. Applied to large and huge Vision Transformer (ViT) models, MAE-CT excels over previous self-supervised methods trained on ImageNet in linear probing, k-NN and low-shot classification accuracy as well as in unsupervised clustering accuracy. With ViT-H/16, MAE-CT achieves a new state of the art in linear probing of 82.2%. +Project page: github.com/ml-jku/MAE-CT. \ No newline at end of file diff --git a/data/2024/aaai/Controllable 3D Face Generation with Conditional Style Code Diffusion b/data/2024/aaai/Controllable 3D Face Generation with Conditional Style Code Diffusion new file mode 100644 index 0000000000..0376c0eedf --- /dev/null +++ b/data/2024/aaai/Controllable 3D Face Generation with Conditional Style Code Diffusion @@ -0,0 +1,2 @@ +Generating photorealistic 3D faces from given conditions is a challenging task. Existing methods often rely on time-consuming one-by-one optimization approaches, which are inefficient for modeling content from the same distribution, e.g., faces. Additionally, an ideal controllable 3D face generation model should consider both facial attributes and expressions. +Thus, we propose a novel approach called TEx-Face (TExt & Expression-to-Face) that addresses these challenges by dividing the task into three components, i.e., 3D GAN Inversion, Conditional Style Code Diffusion, and 3D Face Decoding. For 3D GAN inversion, we introduce two methods, which aim to enhance the representation of style codes and alleviate 3D inconsistencies. Furthermore, we design a style code denoiser to incorporate multiple conditions into the style code and propose a data augmentation strategy to address the issue of insufficient paired visual-language data. Extensive experiments conducted on FFHQ, CelebA-HQ, and CelebA-Dialog demonstrate the promising performance of our TEx-Face in achieving the efficient and controllable generation of photorealistic 3D faces. The code will be publicly available.
\ No newline at end of file diff --git a/data/2024/aaai/Controllable Mind Visual Diffusion Model b/data/2024/aaai/Controllable Mind Visual Diffusion Model new file mode 100644 index 0000000000..05af62c4a5 --- /dev/null +++ b/data/2024/aaai/Controllable Mind Visual Diffusion Model @@ -0,0 +1 @@ +Brain signal visualization has emerged as an active research area, serving as a critical interface between the human visual system and computer vision models. Diffusion-based methods have recently shown promise in analyzing functional magnetic resonance imaging (fMRI) data, including the reconstruction of high-quality images consistent with original visual stimuli. Nonetheless, it remains a critical challenge to effectively harness the semantic and silhouette information extracted from brain signals. In this paper, we propose a novel approach, termed Controllable Mind Visual Diffusion Model (CMVDM). Specifically, CMVDM first extracts semantic and silhouette information from fMRI data using attribute alignment and assistant networks. Then, a control model is introduced in conjunction with a residual block to fully exploit the extracted information for image synthesis, generating high-quality images that closely resemble the original visual stimuli in both semantic content and silhouette characteristics. Through extensive experimentation, we demonstrate that CMVDM outperforms existing state-of-the-art methods both qualitatively and quantitatively. Our code is available at https://github.com/zengbohan0217/CMVDM. \ No newline at end of file diff --git a/data/2024/aaai/Controller-Guided Partial Label Consistency Regularization with Unlabeled Data b/data/2024/aaai/Controller-Guided Partial Label Consistency Regularization with Unlabeled Data new file mode 100644 index 0000000000..e2901b6061 --- /dev/null +++ b/data/2024/aaai/Controller-Guided Partial Label Consistency Regularization with Unlabeled Data @@ -0,0 +1 @@ +Partial label learning (PLL) learns from training examples each associated with multiple candidate labels, among which only one is valid. In recent years, benefiting from the strong capability of dealing with ambiguous supervision and the impetus of modern data augmentation methods, consistency regularization-based PLL methods have achieved a series of successes and become mainstream. However, as the partial annotation becomes insufficient, their performance drops significantly. In this paper, we leverage easily accessible unlabeled examples to facilitate partial label consistency regularization. In addition to a partial supervised loss, our method performs controller-guided consistency regularization at both the label level and the representation level with the help of unlabeled data. To mitigate the limited capability of the initial supervised model, we use the controller to estimate the confidence of each current prediction to guide the subsequent consistency regularization. Furthermore, we dynamically adjust the confidence thresholds so that the number of samples of each class participating in consistency regularization remains roughly equal, alleviating the problem of class imbalance. Experiments show that our method achieves satisfactory performance in more practical situations, and its modules can be applied to existing PLL methods to enhance their capabilities.
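The class-balanced dynamic thresholding described at the end of the previous abstract can be pictured with a small, assumption-laden sketch (not the authors' code): confidence thresholds are chosen per class so that roughly the same number of confident predictions per class enter the consistency regularization.

import numpy as np

def class_balanced_thresholds(probs, per_class_quota):
    """Pick a confidence threshold per class so that roughly `per_class_quota`
    predictions of each class pass and join the consistency regularization."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    n_classes = probs.shape[1]
    thresholds = np.ones(n_classes)
    for c in range(n_classes):
        c_conf = np.sort(conf[preds == c])[::-1]     # class-c confidences, descending
        if len(c_conf) == 0:
            continue
        k = min(per_class_quota, len(c_conf)) - 1
        thresholds[c] = c_conf[k]                    # keep the top-k most confident samples
    mask = conf >= thresholds[preds]                 # samples selected for regularization
    return thresholds, mask

# Toy usage on softmax outputs for 1000 unlabeled samples and 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
thr, mask = class_balanced_thresholds(probs, per_class_quota=30)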
\ No newline at end of file diff --git a/data/2024/aaai/Conversational Modeling for Constraint Satisfaction b/data/2024/aaai/Conversational Modeling for Constraint Satisfaction new file mode 100644 index 0000000000..90e370c66a --- /dev/null +++ b/data/2024/aaai/Conversational Modeling for Constraint Satisfaction @@ -0,0 +1 @@ +Many problems, from Sudoku to factory scheduling, can be regarded as constraint satisfaction problems. A key component of real world problem solving is a conversation between a constraint programming expert and a problem domain expert to specify the problem to be solved. This presentation argues that the time is ripe for progress in automating the constraint programmer side of this conversation and suggests promising avenues for this pursuit. \ No newline at end of file diff --git a/data/2024/aaai/Convolutional Channel-Wise Competitive Learning for the Forward-Forward Algorithm b/data/2024/aaai/Convolutional Channel-Wise Competitive Learning for the Forward-Forward Algorithm new file mode 100644 index 0000000000..2d5381f6dd --- /dev/null +++ b/data/2024/aaai/Convolutional Channel-Wise Competitive Learning for the Forward-Forward Algorithm @@ -0,0 +1 @@ +The Forward-Forward (FF) Algorithm has been recently proposed to alleviate the issues of backpropagation (BP) commonly used to train deep neural networks. However, its current formulation exhibits limitations such as the generation of negative data, slower convergence, and inadequate performance on complex tasks. In this paper we take the main ideas of FF and improve them by leveraging channel-wise competitive learning in the context of convolutional neural networks for image classification tasks. A layer-wise loss function is introduced that promotes competitive learning and eliminates the need for negative data construction. To enhance both the learning of compositional features and feature space partitioning, a channel-wise feature separator and extractor block is proposed that complements the competitive learning process. Our method outperforms recent FF-based models on image classification tasks, achieving testing errors of 0.58%, 7.69%, 21.89%, and 48.77% on MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100 respectively. Our approach bridges the performance gap between FF learning and BP methods, indicating the potential of our proposed approach to learn useful representations in a layer-wise modular fashion, enabling more efficient and flexible learning. Our source code and supplementary material are available at https://github.com/andreaspapac/CwComp. \ No newline at end of file diff --git a/data/2024/aaai/Convolutional Spectral Kernel Learning with Generalization Guarantees (Abstract Reprint) b/data/2024/aaai/Convolutional Spectral Kernel Learning with Generalization Guarantees (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Cooper: Coordinating Specialized Agents towards a Complex Dialogue Goal b/data/2024/aaai/Cooper: Coordinating Specialized Agents towards a Complex Dialogue Goal new file mode 100644 index 0000000000..a3e3f1340b --- /dev/null +++ b/data/2024/aaai/Cooper: Coordinating Specialized Agents towards a Complex Dialogue Goal @@ -0,0 +1 @@ +In recent years, there has been a growing interest in exploring dialogues with more complex goals, such as negotiation, persuasion, and emotional support, which go beyond traditional service-focused dialogue systems. 
Apart from the requirement for much more sophisticated strategic reasoning and communication skills, a significant challenge of these tasks lies in the difficulty of objectively measuring the achievement of their goals in a quantifiable way, making it difficult for existing research to directly optimize the dialogue procedure towards them. In our work, we emphasize the multifaceted nature of complex dialogue goals and argue that it is more feasible to accomplish them by comprehensively considering and jointly promoting their different aspects. To this end, we propose a novel dialogue framework, Cooper, which coordinates multiple specialized agents, each dedicated to a specific dialogue goal aspect, to approach the complex objective. Through this divide-and-conquer manner, we make complex dialogue goals more approachable and elicit greater intelligence via the collaboration of individual agents. Experiments on persuasion and emotional support dialogues demonstrate the superiority of our method over a set of competitive baselines. Our code is available at https://github.com/YiCheng98/Cooper. \ No newline at end of file diff --git a/data/2024/aaai/Cooperative Knowledge Distillation: A Learner Agnostic Approach b/data/2024/aaai/Cooperative Knowledge Distillation: A Learner Agnostic Approach new file mode 100644 index 0000000000..e7c076a11e --- /dev/null +++ b/data/2024/aaai/Cooperative Knowledge Distillation: A Learner Agnostic Approach @@ -0,0 +1 @@ +Knowledge distillation is a simple but powerful way to transfer knowledge from a teacher model to a student model. Existing work suffers from at least one of the following key limitations in terms of the direction and scope of transfer, which restrict its use: all knowledge is transferred from teacher to student regardless of whether or not that knowledge is useful, the student is the only one learning in this exchange, and typically distillation transfers knowledge only from a single teacher to a single student. We formulate a novel form of knowledge distillation in which many models can act as both students and teachers, which we call cooperative distillation. The models cooperate as follows: a model (the student) identifies specific deficiencies in its performance and searches for another model (the teacher) that encodes learned knowledge into instructional virtual instances via counterfactual instance generation. Because different models may have different strengths and weaknesses, all models can act as either students or teachers (cooperation) when appropriate and only distill knowledge in areas specific to their strengths (focus). Since counterfactuals as a paradigm are not tied to any specific algorithm, we can use this method to distill knowledge between learners of different architectures, algorithms, and even feature spaces. We demonstrate that our approach not only outperforms baselines such as transfer learning, self-supervised learning, and multiple knowledge distillation algorithms on several datasets, but can also be used in settings where the aforementioned techniques cannot.
\ No newline at end of file diff --git a/data/2024/aaai/Coordination of Emergent Demand Changes via Value-Based Negotiation for Supply Chain Management (Student Abstract) b/data/2024/aaai/Coordination of Emergent Demand Changes via Value-Based Negotiation for Supply Chain Management (Student Abstract) new file mode 100644 index 0000000000..e72c1ef12d --- /dev/null +++ b/data/2024/aaai/Coordination of Emergent Demand Changes via Value-Based Negotiation for Supply Chain Management (Student Abstract) @@ -0,0 +1,4 @@ +We propose an automated negotiation mechanism that enables a reinforcement learning agent to adapt to unexpected situations such as demand changes in supply chain management (SCM).
+Existing studies that consider reinforcement learning and SCM assume a centralized environment where the coordination of chain components is hierarchical rather than through negotiations between agents.
+This study focuses on a negotiation agent that uses the value function of reinforcement learning for SCM as its utility function in automated negotiation.
+We demonstrate that the proposed approach can avoid inventory shortages under increased demand requests from the terminal customer. \ No newline at end of file diff --git a/data/2024/aaai/CoreRec: A Counterfactual Correlation Inference for Next Set Recommendation b/data/2024/aaai/CoreRec: A Counterfactual Correlation Inference for Next Set Recommendation new file mode 100644 index 0000000000..710bc6777d --- /dev/null +++ b/data/2024/aaai/CoreRec: A Counterfactual Correlation Inference for Next Set Recommendation @@ -0,0 +1 @@ +Next set recommendation aims to predict the items that are likely to be bought in the next purchase. Central to this endeavor is the task of capturing intra-set and cross-set correlations among items. However, the modeling of cross-set correlations poses challenges due to specific issues. Primarily, these correlations are often implicit, and the prevailing approach of establishing an indiscriminate link across the entire set of objects neglects factors like purchase frequency and correlations between purchased items. Such hastily formed connections across sets introduce substantial noise. Additionally, the preeminence of high-frequency items in numerous sets could potentially overshadow and distort correlation modeling with respect to low-frequency items. Thus, we devote this work to mitigating misleading inter-set correlations. With a fresh perspective rooted in causality, we delve into the question of whether correlations between a particular item and items from other sets should be relied upon for item representation learning and set prediction. Technically, we introduce the Counterfactual Correlation Inference framework for next set recommendation, denoted as CoreRec. This framework establishes a counterfactual scenario in which the recommendation model impedes cross-set correlations to generate intervened predictions. By contrasting these intervened predictions with the original ones, we gauge the causal impact of inter-set neighbors on set prediction, essentially assessing whether they contribute to spurious correlations. During testing, we introduce a post-trained switch module that selects between set-aware item representations derived from either the original or the counterfactual scenarios. To validate our approach, we conduct extensive experiments on three real-world datasets, affirming both the effectiveness of CoreRec and the cogency of our analytical approach.
\ No newline at end of file diff --git a/data/2024/aaai/Coreference Graph Guidance for Mind-Map Generation b/data/2024/aaai/Coreference Graph Guidance for Mind-Map Generation new file mode 100644 index 0000000000..2f50016f74 --- /dev/null +++ b/data/2024/aaai/Coreference Graph Guidance for Mind-Map Generation @@ -0,0 +1 @@ +Mind-map generation aims to process a document into a hierarchical structure to show its central idea and branches. Such a structure is more conducive to understanding the logic and semantics of the document than plain text. Recently, a state-of-the-art method encodes the sentences of a document sequentially and converts them to a relation graph via sequence-to-graph. Though this method can efficiently generate mind-maps in parallel, its mechanism focuses more on sequential features while hardly capturing structural information. Moreover, it struggles to model long-range semantic relations. In this work, we propose a coreference-guided mind-map generation network (CMGN) to incorporate external structure knowledge. Specifically, we construct a coreference graph based on coreference semantic relationships to introduce graph structure information. Then we employ a coreference graph encoder to mine the potential governing relations between sentences. In order to exclude noise and better utilize the information of the coreference graph, we adopt a graph enhancement module in a contrastive learning manner. Experimental results demonstrate that our model outperforms all the existing methods. The case study further proves that our model can more accurately and concisely reveal the structure and semantics of a document. Code and data are available at https://github.com/Cyno2232/CMGN. \ No newline at end of file diff --git a/data/2024/aaai/Correlation Matching Transformation Transformers for UHD Image Restoration b/data/2024/aaai/Correlation Matching Transformation Transformers for UHD Image Restoration new file mode 100644 index 0000000000..883b594b95 --- /dev/null +++ b/data/2024/aaai/Correlation Matching Transformation Transformers for UHD Image Restoration @@ -0,0 +1 @@ +This paper proposes UHDformer, a general Transformer for Ultra-High-Definition (UHD) image restoration. UHDformer contains two learning spaces: (a) learning in high-resolution space and (b) learning in low-resolution space. The former learns multi-level high-resolution features, fuses low- and high-resolution features, and reconstructs the residual images, while the latter learns more representative features from the high-resolution ones to facilitate better restoration. To better improve feature representation in low-resolution space, we propose to build a feature transformation from the high-resolution space to the low-resolution one. To that end, we propose two new modules: the Dual-path Correlation Matching Transformation module (DualCMT) and the Adaptive Channel Modulator (ACM). The DualCMT selects the top C/r (where r, greater than or equal to 1, controls the squeezing level) correlation channels from the max-pooling/mean-pooling high-resolution features to replace the low-resolution ones in Transformers, which effectively squeezes out useless content to improve the feature representation in low-resolution space and facilitate better recovery. The ACM is exploited to adaptively modulate multi-level high-resolution features, enabling it to provide more useful features to the low-resolution space for better learning.
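One way to picture the top-C/r correlation-channel selection just described (a loose, assumption-heavy sketch, not the released UHDformer code) is to rank high-resolution channels by their correlation with the low-resolution feature map and splice the best-matching ones in:

import torch
import torch.nn.functional as F

def select_top_correlation_channels(high_feat, low_feat, r=2):
    """Replace low-resolution channels with the top C/r high-resolution channels
    that correlate most strongly with the (pooled-to-match) low-res features."""
    b, c, h, w = low_feat.shape
    high = F.adaptive_avg_pool2d(high_feat, (h, w))           # mean-pool high-res to low-res size
    hf = high.flatten(2) - high.flatten(2).mean(-1, keepdim=True)
    lf = low_feat.flatten(2) - low_feat.flatten(2).mean(-1, keepdim=True)
    corr = F.cosine_similarity(hf, lf, dim=-1)                # per-channel correlation, (b, c)
    k = max(c // r, 1)
    idx = corr.topk(k, dim=1).indices                         # top C/r channels per sample
    out = low_feat.clone()
    for i in range(b):                                        # splice the selected channels in
        out[i, idx[i]] = high[i, idx[i]]
    return out

# Toy usage: 64x64 high-res features guiding 16x16 low-res features with 32 channels.
y = select_top_correlation_channels(torch.randn(2, 32, 64, 64), torch.randn(2, 32, 16, 16), r=4)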
Experimental results show that our UHDformer reduces model size by about ninety-seven percent compared with most state-of-the-art methods while significantly improving performance under different training sets on three UHD image restoration tasks, including low-light image enhancement, image dehazing, and image deblurring. The source code will be made available at https://github.com/supersupercong/UHDformer. \ No newline at end of file diff --git a/data/2024/aaai/Count What You Want: Exemplar Identification and Few-Shot Counting of Human Actions in the Wild b/data/2024/aaai/Count What You Want: Exemplar Identification and Few-Shot Counting of Human Actions in the Wild new file mode 100644 index 0000000000..60775ec71d --- /dev/null +++ b/data/2024/aaai/Count What You Want: Exemplar Identification and Few-Shot Counting of Human Actions in the Wild @@ -0,0 +1 @@ +This paper addresses the task of counting human actions of interest using sensor data from wearable devices. We propose a novel exemplar-based framework, allowing users to provide exemplars of the actions they want to count by vocalizing predefined sounds ``one'', ``two'', and ``three''. Our method first localizes temporal positions of these utterances from the audio sequence. These positions serve as the basis for identifying exemplars representing the action class of interest. A similarity map is then computed between the exemplars and the entire sensor data sequence, which is further fed into a density estimation module to generate a sequence of estimated density values. Summing these density values provides the final count. To develop and evaluate our approach, we introduce a diverse and realistic dataset consisting of real-world data from 37 subjects and 50 action categories, encompassing both sensor and audio data. The experiments on this dataset demonstrate the viability of the proposed method in counting instances of actions from new classes and subjects that were not part of the training data. On average, the discrepancy between the predicted count and the ground truth value is 7.47, significantly lower than the errors of the frequency-based and transformer-based methods. Our project, code and dataset can be found at https://github.com/cvlab-stonybrook/ExRAC.
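As a rough sketch of the exemplar-matching-and-density pipeline described in the counting abstract above (simplified and with invented module names; the real model lives in the linked repository), one can correlate exemplar embeddings with per-timestep sensor embeddings and regress a density sequence whose sum is the predicted count:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyActionCounter(nn.Module):
    """Exemplar-based counting: embed the sensor stream, compare it with exemplar
    embeddings to form a similarity map, and turn the map into a density sequence."""
    def __init__(self, in_ch=6, dim=32):
        super().__init__()
        self.encoder = nn.Conv1d(in_ch, dim, kernel_size=9, padding=4)
        self.density_head = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, sensor, exemplars):
        # sensor: (B, C, T); exemplars: list of (1, C, L) windows cut around the vocalized markers.
        seq = F.normalize(self.encoder(sensor), dim=1)                      # (B, dim, T)
        ex = torch.cat([self.encoder(e).mean(dim=-1) for e in exemplars])   # (K, dim)
        ex = F.normalize(ex, dim=1)
        sim = torch.einsum('bdt,kd->bkt', seq, ex).mean(dim=1, keepdim=True)  # (B, 1, T) similarity map
        density = F.softplus(self.density_head(sim))                        # non-negative density values
        return density.sum(dim=(1, 2))                                      # predicted count per sequence

# Toy usage: a 6-channel IMU stream of length 500 with two exemplar windows of length 40.
model = ToyActionCounter()
count = model(torch.randn(1, 6, 500), [torch.randn(1, 6, 40), torch.randn(1, 6, 40)])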
This gap in the literature is addressed here using a novel methodology that (i) gathers human-generated counterfactual explanations for misclassified images in two user studies and then (ii) compares these human-generated explanations to computationally generated explanations for the same misclassifications. Results indicate that humans do not “minimally edit” images when generating counterfactual explanations. Instead, they make larger, “meaningful” edits that better approximate prototypes in the counterfactual class. An analysis based on “explanation goals” is proposed to account for this divergence between human and machine explanations. The implications of these proposals for future work are discussed. \ No newline at end of file diff --git a/data/2024/aaai/Counterfactual Graph Learning for Anomaly Detection with Feature Disentanglement and Generation (Student Abstract) b/data/2024/aaai/Counterfactual Graph Learning for Anomaly Detection with Feature Disentanglement and Generation (Student Abstract) new file mode 100644 index 0000000000..d1d076cd3a --- /dev/null +++ b/data/2024/aaai/Counterfactual Graph Learning for Anomaly Detection with Feature Disentanglement and Generation (Student Abstract) @@ -0,0 +1 @@ +Graph anomaly detection has received remarkable research interest, and various techniques have been employed for enhancing detection performance. However, existing models tend to learn dataset-specific spurious correlations based on statistical associations. A well-trained model might suffer from performance degradation when applied to newly observed nodes with different environments. To handle this situation, we propose a CounterFactual Graph Anomaly Detection model, CFGAD. In this model, we design a gradient-based separator to disentangle node features into class features and environment features. Then, we present a weight-varying diffusion model to combine class features and environment features from different nodes to generate counterfactual samples. These counterfactual samples are then adopted to enhance model robustness. Comprehensive experiments demonstrate the effectiveness of our CFGAD. \ No newline at end of file diff --git a/data/2024/aaai/Counterfactual-Enhanced Information Bottleneck for Aspect-Based Sentiment Analysis b/data/2024/aaai/Counterfactual-Enhanced Information Bottleneck for Aspect-Based Sentiment Analysis new file mode 100644 index 0000000000..6bfa3d034c --- /dev/null +++ b/data/2024/aaai/Counterfactual-Enhanced Information Bottleneck for Aspect-Based Sentiment Analysis @@ -0,0 +1 @@ +Despite having achieved notable success for aspect-based sentiment analysis (ABSA), deep neural networks are susceptible to spurious correlations between input features and output labels, leading to poor robustness. In this paper, we propose a novel Counterfactual-Enhanced Information Bottleneck framework (called CEIB) to reduce spurious correlations for ABSA. CEIB extends the information bottleneck (IB) principle to a factual-counterfactual balancing setting by integrating augmented counterfactual data, with the goal of learning a robust ABSA model. Concretely, we first devise a multi-pattern prompting method, which utilizes a large language model (LLM) to generate high-quality counterfactual samples from the original samples. Then, we employ the information bottleneck principle and separate the mutual information into factual and counterfactual parts.
In this way, we can learn effective and robust representations for the ABSA task by balancing the predictive information of these two parts. Extensive experiments on five benchmark ABSA datasets show that our CEIB approach achieves superior prediction performance and robustness over the state-of-the-art baselines. Code and data to reproduce the results in this paper are available at: https://github.com/shesshan/CEIB. \ No newline at end of file diff --git a/data/2024/aaai/Coupled Confusion Correction: Learning from Crowds with Sparse Annotations b/data/2024/aaai/Coupled Confusion Correction: Learning from Crowds with Sparse Annotations new file mode 100644 index 0000000000..bb99688c39 --- /dev/null +++ b/data/2024/aaai/Coupled Confusion Correction: Learning from Crowds with Sparse Annotations @@ -0,0 +1 @@ +As datasets grow larger, accurately annotating them becomes increasingly impractical due to the expense in both time and money. Therefore, crowd-sourcing has been widely adopted to alleviate the cost of collecting labels, which also inevitably introduces label noise and eventually degrades the performance of the model. To learn from crowd-sourcing annotations, modeling the expertise of each annotator is a common but challenging paradigm, because the annotations collected by crowd-sourcing are usually highly sparse. To alleviate this problem, we propose Coupled Confusion Correction (CCC), where two models are simultaneously trained to correct the confusion matrices learned by each other. Via bi-level optimization, the confusion matrices learned by one model can be corrected by the distilled data from the other. Moreover, we cluster annotators into ``annotator groups'' that share similar expertise so that their confusion matrices can be corrected together. In this way, the expertise of the annotators, especially of those who seldom provide labels, can be better captured. Notably, we point out that annotation sparsity not only means that the average number of labels is low, but also that there are always some annotators who provide very few labels, which is neglected by previous works when constructing synthetic crowd-sourcing annotations. Based on that, we propose to use a Beta distribution to control the generation of the crowd-sourcing labels so that the synthetic annotations are more consistent with real-world ones. Extensive experiments are conducted on two types of synthetic datasets and three real-world datasets, the results of which demonstrate that CCC significantly outperforms state-of-the-art approaches. Source code is available at: https://github.com/Hansong-Zhang/CCC. \ No newline at end of file diff --git a/data/2024/aaai/Coupling Graph Neural Networks with Fractional Order Continuous Dynamics: A Robustness Study b/data/2024/aaai/Coupling Graph Neural Networks with Fractional Order Continuous Dynamics: A Robustness Study new file mode 100644 index 0000000000..9563b06f36 --- /dev/null +++ b/data/2024/aaai/Coupling Graph Neural Networks with Fractional Order Continuous Dynamics: A Robustness Study @@ -0,0 +1 @@ +In this work, we rigorously investigate the robustness of graph neural fractional-order differential equation (FDE) models. This framework extends beyond traditional graph neural (integer-order) ordinary differential equation (ODE) models by implementing the time-fractional Caputo derivative.
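To illustrate the Beta-distribution idea at the end of the CCC abstract above (a toy sketch under assumed parameters, not the authors' protocol), per-annotator labeling rates can be drawn from a Beta prior so that a few annotators label many samples while most label very few, and each provided label is corrupted through that annotator's confusion matrix:

import numpy as np

def synth_crowd_labels(true_labels, n_annotators=30, n_classes=10,
                       beta_a=0.3, beta_b=3.0, flip_prob=0.2, seed=0):
    """Generate sparse, noisy crowd-sourced annotations: labeling rates follow a
    Beta(a, b) prior (most annotators label little), and each label passes through
    a simple per-annotator confusion matrix."""
    rng = np.random.default_rng(seed)
    n = len(true_labels)
    rates = rng.beta(beta_a, beta_b, size=n_annotators)          # per-annotator labeling rate
    # Per-annotator confusion matrix: mostly correct, uniform noise otherwise.
    conf = np.full((n_annotators, n_classes, n_classes), flip_prob / (n_classes - 1))
    for a in range(n_annotators):
        np.fill_diagonal(conf[a], 1.0 - flip_prob)
    ann = np.full((n, n_annotators), -1)                          # -1 means "not annotated"
    for a in range(n_annotators):
        chosen = rng.random(n) < rates[a]
        for i in np.where(chosen)[0]:
            ann[i, a] = rng.choice(n_classes, p=conf[a, true_labels[i]])
    return ann

labels = synth_crowd_labels(np.random.default_rng(1).integers(0, 10, size=500))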
Utilizing fractional calculus allows our model to consider long-term memory during the feature updating process, diverging from the memoryless Markovian updates seen in traditional graph neural ODE models. The superiority of graph neural FDE models over graph neural ODE models has been established in environments free from attacks or perturbations. While traditional graph neural ODE models have been verified to possess a degree of stability and resilience in the presence of adversarial attacks in existing literature, the robustness of graph neural FDE models, especially under adversarial conditions, remains largely unexplored. This paper undertakes a detailed assessment of the robustness of graph neural FDE models. We establish a theoretical foundation outlining the robustness characteristics of graph neural FDE models, highlighting that they maintain more stringent output perturbation bounds in the face of input and graph topology disturbances, compared to their integer-order counterparts. Our empirical evaluations further confirm the enhanced robustness of graph neural FDE models, highlighting their potential in adversarially robust applications. \ No newline at end of file diff --git a/data/2024/aaai/Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data b/data/2024/aaai/Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data new file mode 100644 index 0000000000..4f02428d3c --- /dev/null +++ b/data/2024/aaai/Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data @@ -0,0 +1 @@ +Out-of-distribution (OOD) generalization has attracted increasing research attention in recent years, due to its promising experimental results in real-world applications. In this paper, we study the confidence set prediction problem in the OOD generalization setting. Split conformal prediction (SCP) is an efficient framework for handling the confidence set prediction problem. However, the validity of SCP requires the examples to be exchangeable, which is violated in the OOD setting. Empirically, we show that trivially applying SCP results in a failure to maintain the marginal coverage when the unseen target domain is different from the source domain. To address this issue, we develop a method for forming confident prediction sets in the OOD setting and theoretically prove the validity of our method. Finally, we conduct experiments on simulated data to empirically verify the correctness of our theory and the validity of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Critic-Guided Decision Transformer for Offline Reinforcement Learning b/data/2024/aaai/Critic-Guided Decision Transformer for Offline Reinforcement Learning new file mode 100644 index 0000000000..399d81dd94 --- /dev/null +++ b/data/2024/aaai/Critic-Guided Decision Transformer for Offline Reinforcement Learning @@ -0,0 +1 @@ +Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. 
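Referring to the split conformal prediction (SCP) baseline discussed in the Coverage-Guaranteed Prediction Sets abstract above, a minimal sketch of standard SCP for classification (the textbook recipe that the paper builds on and then adapts to the OOD setting, not the paper's own method) looks like this:

import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Standard split conformal prediction: calibrate a nonconformity quantile on
    held-out data, then return prediction sets with (1 - alpha) marginal coverage,
    assuming calibration and test points are exchangeable."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]            # nonconformity: 1 - prob of true class
    q_level = np.ceil((n + 1) * (1 - alpha)) / n                  # finite-sample corrected quantile level
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # Class y enters the prediction set iff its nonconformity score is at most qhat.
    return test_probs >= 1.0 - qhat

# Toy usage with softmax outputs for 3 classes.
rng = np.random.default_rng(0)
cal_p = rng.dirichlet(np.ones(3), size=200)
test_p = rng.dirichlet(np.ones(3), size=5)
sets = split_conformal_sets(cal_p, rng.integers(0, 3, 200), test_p, alpha=0.1)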
Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations in stochastic environments and on D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Class Feature Augmentation for Class Incremental Learning b/data/2024/aaai/Cross-Class Feature Augmentation for Class Incremental Learning new file mode 100644 index 0000000000..7c11fc9271 --- /dev/null +++ b/data/2024/aaai/Cross-Class Feature Augmentation for Class Incremental Learning @@ -0,0 +1 @@ +We propose a novel class incremental learning approach, which incorporates a feature augmentation technique motivated by adversarial attacks. We employ a classifier learned in the past to complement training examples of previous tasks. The proposed approach offers a unique perspective on utilizing previous knowledge in class incremental learning, since it augments features of arbitrary target classes using examples in other classes via adversarial attacks on a previously learned classifier. By allowing Cross-Class Feature Augmentation (CCFA), each class in the old tasks conveniently populates samples in the feature space, which alleviates the collapse of the decision boundaries caused by sample deficiency for the previous tasks, especially when the number of stored exemplars is small. This idea can be easily incorporated into existing class incremental learning algorithms without any architecture modification. Extensive experiments on the standard benchmarks show that our method consistently outperforms existing class incremental learning methods by significant margins in various scenarios, especially under an environment with an extremely limited memory budget.
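A minimal sketch of the adversarial-attack-style feature augmentation described in the CCFA abstract above (illustrative only; the step sizes, iteration count, and names are assumptions): features of samples from other classes are perturbed by gradient steps so that the frozen, previously learned classifier assigns them to a chosen old target class.

import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_class_feature_augment(old_classifier, feats, target_class, steps=10, step_size=0.1):
    """Push features of arbitrary samples toward a target old class via gradient
    steps against the frozen previously-learned classifier (an adversarial-attack
    style augmentation performed in feature space)."""
    old_classifier.eval()
    aug = feats.clone().detach().requires_grad_(True)
    target = torch.full((feats.size(0),), target_class, dtype=torch.long)
    for _ in range(steps):
        loss = F.cross_entropy(old_classifier(aug), target)
        grad, = torch.autograd.grad(loss, aug)
        # FGSM-style descent step that moves the features toward the target class.
        aug = (aug - step_size * grad.sign()).detach().requires_grad_(True)
    return aug.detach()

# Toy usage: a frozen linear head over 64-d features, synthesizing class-3 features
# from features that originally belonged to other classes.
old_head = nn.Linear(64, 10)
fake_old_class_feats = cross_class_feature_augment(old_head, torch.randn(8, 64), target_class=3)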
Current pose estimation methods are one-time feed-forward and lack the capability to gather feedback and adapt the inference outcome. To address this problem, we propose to explore the concept of progressive inference, where the network learns an observer to continuously detect the prediction error based on constraint matching, as well as an adjuster to refine its inference outcome based on these constraint errors. Within the context of 3D hand pose estimation, we find that this observer-adjuster design is relatively unstable since the observer operates in the 2D image domain while the adjuster operates in the 3D domain. To address this issue, we propose to construct two sets of observer-adjusters with complementary constraints from different perspectives. They operate in a dynamic sequential manner controlled by a decision network to progressively improve the 3D pose estimation. We refer to this method as Cross-Constrained Progressive Inference (CCPI). Our extensive experimental results on the FreiHAND and HO-3D benchmark datasets demonstrate that the proposed CCPI method is able to significantly improve the generalization capability and performance of 3D hand pose estimation. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Covariate Gait Recognition: A Benchmark b/data/2024/aaai/Cross-Covariate Gait Recognition: A Benchmark new file mode 100644 index 0000000000..c3ef5cd726 --- /dev/null +++ b/data/2024/aaai/Cross-Covariate Gait Recognition: A Benchmark @@ -0,0 +1 @@ +Gait datasets are essential for gait research. However, this paper observes that present benchmarks, whether conventional constrained or emerging real-world datasets, fall short regarding covariate diversity. To bridge this gap, we undertake an arduous 20-month effort to collect a cross-covariate gait recognition (CCGR) dataset. The CCGR dataset has 970 subjects and about 1.6 million sequences; almost every subject has 33 views and 53 different covariates. Compared to existing datasets, CCGR has both population-level and individual-level diversity. In addition, the views and covariates are well labeled, enabling the analysis of the effects of different factors. CCGR provides multiple types of gait data, including RGB, parsing, silhouette, and pose, offering researchers a comprehensive resource for exploration. In order to delve deeper into addressing cross-covariate gait recognition, we propose parsing-based gait recognition (ParsingGait) by utilizing the newly proposed parsing data. We have conducted extensive experiments. Our main results show: 1) Cross-covariate variation emerges as a pivotal challenge for practical applications of gait recognition. 2) ParsingGait demonstrates remarkable potential for further advancement. 3) Alarmingly, existing SOTA methods achieve less than 43% accuracy on CCGR, highlighting the urgency of exploring cross-covariate gait recognition. Link: https://github.com/ShinanZou/CCGR. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Domain Contrastive Learning for Time Series Clustering b/data/2024/aaai/Cross-Domain Contrastive Learning for Time Series Clustering new file mode 100644 index 0000000000..bcccf1ef05 --- /dev/null +++ b/data/2024/aaai/Cross-Domain Contrastive Learning for Time Series Clustering @@ -0,0 +1,3 @@ +Most deep learning-based time series clustering models concentrate on data representation in a separate process from clustering. As a result, the clustering loss cannot guide feature extraction.
Moreover, most methods solely analyze data from the temporal domain, disregarding the potential within the frequency domain. + +To address these challenges, we introduce a novel end-to-end Cross-Domain Contrastive learning model for time series Clustering (CDCC). Firstly, it integrates the clustering process and feature extraction using contrastive constraints at both the cluster level and the instance level. Secondly, the data is encoded simultaneously in both the temporal and frequency domains, leveraging contrastive learning to enhance within-domain representation. Thirdly, cross-domain constraints are proposed to align the latent representations and category distributions across domains. With the above strategies, CDCC not only operates end to end but also effectively integrates the temporal and frequency domains. Extensive experiments and visualization analysis are conducted on 40 time series datasets from UCR, demonstrating the superior performance of the proposed model. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Gate MLP with Protein Complex Invariant Embedding Is a One-Shot Antibody Designer b/data/2024/aaai/Cross-Gate MLP with Protein Complex Invariant Embedding Is a One-Shot Antibody Designer new file mode 100644 index 0000000000..e7fb81f8a0 --- /dev/null +++ b/data/2024/aaai/Cross-Gate MLP with Protein Complex Invariant Embedding Is a One-Shot Antibody Designer @@ -0,0 +1 @@ +Antibodies are crucial proteins produced by the immune system in response to foreign substances or antigens. The specificity of an antibody is determined by its complementarity-determining regions (CDRs), which are located in the variable domains of the antibody chains and form the antigen-binding site. Previous studies have utilized complex techniques to generate CDRs, but they suffer from inadequate geometric modeling. Moreover, the common iterative refinement strategies lead to inefficient inference. In this paper, we propose a simple yet effective model that can co-design 1D sequences and 3D structures of CDRs in a one-shot manner. To achieve this, we decouple the antibody CDR design problem into two stages: (i) geometric modeling of protein complex structures and (ii) sequence-structure co-learning. We develop a novel macromolecular structure invariant embedding, specifically for protein complexes, that captures both intra- and inter-component interactions among the backbone atoms, including Calpha, N, C, and O atoms, to achieve comprehensive geometric modeling. Then, we introduce a simple cross-gate MLP for sequence-structure co-learning, allowing sequence and structure representations to implicitly refine each other. This enables our model to design desired sequences and structures in a one-shot manner. Extensive experiments are conducted to evaluate our results at both the sequence and structure levels, which demonstrate that our model achieves superior performance compared to state-of-the-art antibody CDR design methods.
\ No newline at end of file diff --git a/data/2024/aaai/Cross-Layer and Cross-Sample Feature Optimization Network for Few-Shot Fine-Grained Image Classification b/data/2024/aaai/Cross-Layer and Cross-Sample Feature Optimization Network for Few-Shot Fine-Grained Image Classification new file mode 100644 index 0000000000..2880f02950 --- /dev/null +++ b/data/2024/aaai/Cross-Layer and Cross-Sample Feature Optimization Network for Few-Shot Fine-Grained Image Classification @@ -0,0 +1 @@ +Recently, a number of Few-Shot Fine-Grained Image Classification (FS-FGIC) methods have been proposed, but they primarily focus on better fine-grained feature extraction while overlooking two important issues. The first is how to extract discriminative features for Fine-Grained Image Classification tasks while reducing the trivial and non-generalizable sample-level noise introduced in this procedure, so as to overcome the over-fitting problem under the Few-Shot Learning setting. The second is how to achieve satisfactory feature matching between limited support and query samples with variable spatial positions and angles. To address these issues, we propose a novel Cross-layer and Cross-sample feature optimization Network for FS-FGIC, C2-Net for short. The proposed method consists of two main modules: the Cross-Layer Feature Refinement (CLFR) module and the Cross-Sample Feature Adjustment (CSFA) module. The CLFR module further refines the extracted features while integrating outputs from multiple layers to suppress sample-level feature noise interference. Additionally, the CSFA module addresses the feature mismatch between query and support samples through both channel activation and position matching operations. Extensive experiments have been conducted on five fine-grained benchmark datasets, and the results show that C2-Net outperforms other state-of-the-art methods by a significant margin in most cases. Our code is available at: https://github.com/zenith0923/C2-Net. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering b/data/2024/aaai/Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering new file mode 100644 index 0000000000..fff8fd4ab2 --- /dev/null +++ b/data/2024/aaai/Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering @@ -0,0 +1 @@ +Few-shot Visual Question Answering (VQA) realizes few-shot cross-modal learning, which is an emerging and challenging task in computer vision. Currently, most few-shot VQA methods are confined to simply extending few-shot classification methods to cross-modal tasks while ignoring the spatial distribution properties of multimodal features and cross-modal information interaction. To address this problem, we propose a novel Cross-modal feature Distribution Calibration Inference Network (CDCIN), in which a new concept named visual information entropy is proposed to realize multimodal feature distribution calibration via cross-modal information interaction for more effective few-shot VQA. Visual information entropy is a statistical variable that represents the spatial distribution of visual features guided by the question; it is aligned before and after the reasoning process by our proposed visual information entropy calibration module to mitigate redundant information and improve multi-modal features.
To further enhance the inference ability of cross-modal features, we additionally propose a novel pre-training method, where the reasoning sub-network of CDCIN is pretrained on the base class in a VQA classification paradigm and fine-tuned on the few-shot VQA datasets. Extensive experiments demonstrate that our proposed CDCIN achieves excellent performance on few-shot VQA and outperforms state-of-the-art methods on three widely used benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Modal Match for Language Conditioned 3D Object Grounding b/data/2024/aaai/Cross-Modal Match for Language Conditioned 3D Object Grounding new file mode 100644 index 0000000000..d50748b6cc --- /dev/null +++ b/data/2024/aaai/Cross-Modal Match for Language Conditioned 3D Object Grounding @@ -0,0 +1 @@ +Language conditioned 3D object grounding aims to find the object within the 3D scene mentioned by natural language descriptions, which mainly depends on the matching between the visual and natural language modalities. Considerable improvement in grounding performance is achieved by improving the multimodal fusion mechanism or bridging the gap between detection and matching. However, several mismatches are ignored, i.e., mismatch in local visual representation and global sentence representation, and mismatch in visual space and corresponding label word space. In this paper, we propose a cross-modal match method for 3D grounding from the perspective of mitigating these mismatches. Specifically, to match local visual features with the global description sentence, we propose a BEV (Bird’s-eye-view) based global information embedding module. It projects multiple object proposal features into the BEV, and the relations of different objects are accessed by the visual transformer, which can model both positions and features with long-range dependencies. To circumvent the mismatch in feature spaces of different modalities, we propose cross-modal consistency learning. It performs cross-modal consistency constraints to convert the visual feature space into the label word feature space, resulting in easier matching. Besides, we introduce a label distillation loss and a global distillation loss to drive the learning of these matches in a distillation manner. We evaluate our method in mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval b/data/2024/aaai/Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval new file mode 100644 index 0000000000..5cce15b247 --- /dev/null +++ b/data/2024/aaai/Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval @@ -0,0 +1 @@ +Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. 
Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method can consistently improve the performance of image-text retrieval and achieve new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling it to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa. \ No newline at end of file diff --git a/data/2024/aaai/Cross-Sentence Gloss Consistency for Continuous Sign Language Recognition b/data/2024/aaai/Cross-Sentence Gloss Consistency for Continuous Sign Language Recognition new file mode 100644 index 0000000000..d131a45ea8 --- /dev/null +++ b/data/2024/aaai/Cross-Sentence Gloss Consistency for Continuous Sign Language Recognition @@ -0,0 +1 @@ +Continuous sign language recognition (CSLR) aims to recognize gloss sequences from continuous sign videos. Recent works enhance the gloss representation consistency by mining correlations between visual and contextual modules within individual sentences. However, there still remain much richer correlations among glosses across different sentences. In this paper, we present a simple yet effective Cross-Sentence Gloss Consistency (CSGC), which enforces glosses belonging to the same category to be more consistent in representation than those belonging to different categories, across all training sentences. Specifically, in CSGC, a prototype is maintained for each gloss category and benefits the gloss discrimination in a contrastive way. Thanks to the well-distinguished gloss prototype, an auxiliary similarity classifier is devised to enhance the recognition clues, thus yielding more accurate results. Extensive experiments conducted on three CSLR datasets show that our proposed CSGC significantly boosts the performance of CSLR, surpassing existing state-of-the-art works by large margins (i.e., 1.6% on PHOENIX14, 2.4% on PHOENIX14-T, and 5.7% on CSL-Daily). \ No newline at end of file diff --git a/data/2024/aaai/CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues b/data/2024/aaai/CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues new file mode 100644 index 0000000000..ee12acda6f --- /dev/null +++ b/data/2024/aaai/CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues @@ -0,0 +1 @@ +Accurate identification of protein nucleic acid binding residues poses a significant challenge with important implications for various biological processes and drug design. Many typical computational methods for protein analysis rely on a single model that could ignore either the semantic context of the protein or the global 3D geometric information. Consequently, these approaches may result in incomplete or inaccurate protein analysis. To address the above issue, in this paper, we present CrossBind, a novel collaborative cross-modal approach for identifying binding residues by exploiting both protein geometric structure and its sequence prior knowledge extracted from a large-scale protein language model. 
Specifically, our multi modal approach leverages a contrastive learning technique and atom wise attention to capture the positional relationships between atoms and residues, thereby incorporating fine grained local geometric knowledge, for better binding residue prediction. Extensive experimental results demonstrate that our approach outperforms the next best state of the art methods, GraphSite and GraphBind, on DNA and RNA datasets by 10.8/17.3% in terms of the harmonic mean of precision and recall (F1 Score) and 11.9/24.8% in Matthews correlation coefficient (MCC), respectively. We release the code at https://github.com/BEAM-Labs/CrossBind. \ No newline at end of file diff --git a/data/2024/aaai/CrystalBox: Future-Based Explanations for Input-Driven Deep RL Systems b/data/2024/aaai/CrystalBox: Future-Based Explanations for Input-Driven Deep RL Systems new file mode 100644 index 0000000000..a6461c85fd --- /dev/null +++ b/data/2024/aaai/CrystalBox: Future-Based Explanations for Input-Driven Deep RL Systems @@ -0,0 +1 @@ +We present CrystalBox, a novel, model-agnostic, posthoc explainability framework for Deep Reinforcement Learning (DRL) controllers in the large family of input-driven environments which includes computer systems. We combine the natural decomposability of reward functions in input-driven environments with the explanatory power of decomposed returns. We propose an efficient algorithm to generate future-based explanations across both discrete and continuous control environments. Using applications such as adaptive bitrate streaming and congestion control, we demonstrate CrystalBox's capability to generate high-fidelity explanations. We further illustrate its higher utility across three practical use cases: contrastive explanations, network observability, and guided reward design, as opposed to prior explainability techniques that identify salient features. \ No newline at end of file diff --git a/data/2024/aaai/Cumulative Difference Learning VAE for Time-Series with Temporally Correlated Inflow-Outflow b/data/2024/aaai/Cumulative Difference Learning VAE for Time-Series with Temporally Correlated Inflow-Outflow new file mode 100644 index 0000000000..3c732bc6ca --- /dev/null +++ b/data/2024/aaai/Cumulative Difference Learning VAE for Time-Series with Temporally Correlated Inflow-Outflow @@ -0,0 +1 @@ +Time-series generation has crucial practical significance for decision-making under uncertainty. Existing methods have various limitations like accumulating errors over time, significantly impacting downstream tasks. We develop a novel generation method, DT-VAE, that incorporates generalizable domain knowledge, is mathematically justified, and significantly outperforms existing methods by mitigating error accumulation through a cumulative difference learning mechanism. We evaluate the performance of DT-VAE on several downstream tasks using both semi-synthetic and real time-series datasets, including benchmark datasets and our newly curated COVID-19 hospitalization datasets. The COVID-19 datasets enrich existing resources for time-series analysis. Additionally, we introduce Diverse Trend Preserving (DTP), a time-series clustering-based evaluation for direct and interpretable assessments of generated samples, serving as a valuable tool for evaluating time-series generative models. 
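The cumulative difference learning mechanism mentioned in the DT-VAE abstract above can be pictured with the toy decoder below: the network predicts per-step increments and the series is rebuilt by a cumulative sum from a known starting value, so per-step errors are not fed back autoregressively. This is only an illustrative sketch under assumed names and shapes (DiffDecoder, latent_dim, horizon), not the DT-VAE implementation.

# Illustrative cumulative-difference decoder: predict increments, then cumsum
# from the known initial value instead of feeding predictions back step by step.
import torch
import torch.nn as nn

class DiffDecoder(nn.Module):
    def __init__(self, latent_dim=16, horizon=24):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, horizon))

    def forward(self, z, x0):
        deltas = self.net(z)                                     # per-step increments, (batch, horizon)
        return x0.unsqueeze(-1) + torch.cumsum(deltas, dim=-1)   # reconstructed trajectory

decoder = DiffDecoder()
z = torch.randn(4, 16)      # latent codes, e.g. from a VAE encoder
x0 = torch.zeros(4)         # known starting values of each series
series = decoder(z, x0)     # (4, 24) generated time series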
\ No newline at end of file diff --git a/data/2024/aaai/Cumulative Regret Analysis of the Piyavskii-Shubert Algorithm and Its Variants for Global Optimization b/data/2024/aaai/Cumulative Regret Analysis of the Piyavskii-Shubert Algorithm and Its Variants for Global Optimization new file mode 100644 index 0000000000..0b6d3ccaca --- /dev/null +++ b/data/2024/aaai/Cumulative Regret Analysis of the Piyavskii-Shubert Algorithm and Its Variants for Global Optimization @@ -0,0 +1 @@ +We study the problem of global optimization, where we analyze the performance of the Piyavskii-Shubert algorithm and its variants. For any given time duration T, instead of the extensively studied simple regret (which is the difference of the losses between the best estimate up to T and the global minimum), we study the cumulative regret up to time T. For L-Lipschitz continuous functions, we show that the cumulative regret is O(L log T). For H-Lipschitz smooth functions, we show that the cumulative regret is O(H). We analytically extend our results for functions with Hölder continuous derivatives, which cover both the Lipschitz continuous and the Lipschitz smooth functions, individually. We further show that a simpler variant of the Piyavskii-Shubert algorithm performs just as well as the traditional variants for the Lipschitz continuous or the Lipschitz smooth functions. We further extend our results to broader classes of functions, and show that our algorithm efficiently determines its queries and achieves nearly minimax optimal (up to log factors) cumulative regret for general convex or even concave regularity conditions on the extrema of the objective (which encompasses many preceding regularities). We consider further extensions by investigating the performance of the Piyavskii-Shubert variants in scenarios with unknown regularity, noisy evaluations and multivariate domains. \ No newline at end of file diff --git a/data/2024/aaai/Curvature-Invariant Adversarial Attacks for 3D Point Clouds b/data/2024/aaai/Curvature-Invariant Adversarial Attacks for 3D Point Clouds new file mode 100644 index 0000000000..e6917d1182 --- /dev/null +++ b/data/2024/aaai/Curvature-Invariant Adversarial Attacks for 3D Point Clouds @@ -0,0 +1 @@ +Imperceptibility is one of the crucial requirements for adversarial examples. Previous adversarial attacks on 3D point cloud recognition suffer from noticeable outliers, resulting in low imperceptibility. We think that these drawbacks can be alleviated by taking the local curvature of the point cloud into consideration. Existing approaches introduce the local geometry distance into the attack objective function. However, their definition of the local geometry distance neglects the different perceptibility of distortions along different directions. In this paper, we aim to enhance the imperceptibility of adversarial attacks on 3D point cloud recognition by better preserving the local curvature of the original 3D point clouds. To this end, we propose the Curvature-Invariant Method (CIM), which directly regularizes the back-propagated gradient during the generation of adversarial point clouds based on two assumptions. Specifically, we first decompose the back-propagated gradients into the tangent plane and the normal direction. Then we directly reduce the gradient along the large curvature direction on the tangent plane and only keep the gradient along the negative normal direction. Comprehensive experimental comparisons confirm the superiority of our approach. 
Notably, our strategy can achieve 7.2% and 14.5% improvements in the Hausdorff distance and Gaussian curvature measurements of imperceptibility, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Curved Representation Space of Vision Transformers b/data/2024/aaai/Curved Representation Space of Vision Transformers new file mode 100644 index 0000000000..d1584bd345 --- /dev/null +++ b/data/2024/aaai/Curved Representation Space of Vision Transformers @@ -0,0 +1 @@ +Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs). However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs, while not being overconfident. This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by empirically investigating how the output of the penultimate layer moves in the representation space as the input data moves linearly within a small area. In particular, we show the following. (1) While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers show a nonlinear relationship for some data. For such data points, the output of Transformers moves in a curved trajectory as the input moves linearly. (2) When a data point is located in a curved region, it is hard to move it out of the decision region since the output moves along a curved trajectory instead of a straight line to the decision boundary, resulting in high robustness of Transformers. (3) If a data point is slightly modified to jump out of the curved region, the movements afterwards become linear and the output goes to the decision boundary directly. In other words, there does exist a decision boundary near the data, which is hard to find only because of the curved representation space. This explains the underconfident prediction of Transformers. Also, we examine mathematical properties of the attention operation that induce nonlinear response to linear perturbation. Finally, we share our additional findings regarding what contributes to the curved representation space of Transformers, and how the curvedness evolves during training. \ No newline at end of file diff --git a/data/2024/aaai/Customizing Language Model Responses with Contrastive In-Context Learning b/data/2024/aaai/Customizing Language Model Responses with Contrastive In-Context Learning new file mode 100644 index 0000000000..ae0bbf628d --- /dev/null +++ b/data/2024/aaai/Customizing Language Model Responses with Contrastive In-Context Learning @@ -0,0 +1,2 @@ +Large language models (LLMs) are becoming increasingly important for machine learning applications. However, it can be challenging to align LLMs with our intent, particularly when we want to generate content that is preferable to others or when we want the LLM to respond in a certain style or tone that is hard to describe. To address this challenge, we propose an approach that uses contrastive examples to better describe our intent. This involves providing positive examples that illustrate the true intent, along with negative examples that show what characteristics we want LLMs to avoid. The negative examples can be retrieved from labeled data, written by a human, or generated by the LLM itself. +Before generating an answer, we ask the model to analyze the examples to teach itself what to avoid. 
This reasoning step provides the model with the appropriate articulation of the user's need and guides it towards generating a better answer. We tested our approach on both synthesized and real-world datasets, including StackExchange and Reddit, and found that it significantly improves performance compared to standard few-shot prompting. \ No newline at end of file diff --git a/data/2024/aaai/CutFreq: Cut-and-Swap Frequency Components for Low-Level Vision Augmentation b/data/2024/aaai/CutFreq: Cut-and-Swap Frequency Components for Low-Level Vision Augmentation new file mode 100644 index 0000000000..c1b4b20005 --- /dev/null +++ b/data/2024/aaai/CutFreq: Cut-and-Swap Frequency Components for Low-Level Vision Augmentation @@ -0,0 +1 @@ +Low-level vision plays a crucial role in a wide range of imaging quality and image recognition applications. However, the limited size, quality, and diversity of datasets often pose significant challenges for low-level tasks. Data augmentation is the most effective and practical way of sample expansion, but the commonly used augmentation methods in high-level tasks bring limited improvement in low-level tasks due to boundary effects or non-realistic context information. In this paper, we propose the Cut-and-Swap Frequency Components (CutFreq) method for low-level vision, which aims to preserve high-level representations with directionality and improve image synthesis quality. Observing the significant frequency domain differences between reconstructed images and real ones, in CutFreq, we propose to transform the input and real images separately in the frequency domain, then define two stages for the model training process, and finally swap the specified frequency bands respectively and inversely transform to generate augmented samples. The experimental results show the superior performance of CutFreq on five low-level vision tasks. Moreover, we demonstrate the effectiveness of CutFreq in the low-data regime. Code is available at https://github.com/DreamerCCC/CutFreq. \ No newline at end of file diff --git a/data/2024/aaai/CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs b/data/2024/aaai/CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs new file mode 100644 index 0000000000..ba3c1a1567 --- /dev/null +++ b/data/2024/aaai/CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs @@ -0,0 +1 @@ +Building a skilled cybersecurity workforce is paramount to building a safer digital world. However, the diverse skill set, constantly emerging vulnerabilities, and deployment of new cyber threats make learning cybersecurity challenging. Traditional education methods struggle to cope with cybersecurity's rapidly evolving landscape and keep students engaged and motivated. Different studies on students' behaviors show that an interactive mode of education that engages learners through a question-answering system or dialogue is one of the most effective learning methodologies. There is a strong need to create advanced AI-enabled education tools to promote interactive learning in cybersecurity. Unfortunately, there are no publicly available standard question-answer datasets to build such systems for students and novice learners to learn cybersecurity concepts, tools, and techniques. 
The education course material and online question banks are unstructured and need to be validated and updated by domain experts, which is tedious when done manually. In this paper, we propose CyberGen, a novel unification of large language models (LLMs) and knowledge graphs (KG) to generate the questions and answers for cybersecurity automatically. Augmenting the structured knowledge from knowledge graphs in prompts improves factual reasoning and reduces hallucinations in LLMs. We used the knowledge triples from cybersecurity knowledge graphs (AISecKG) to design prompts for ChatGPT and generate questions and answers using different prompting techniques. Our question-answer dataset, CyberQ, contains around 4k pairs of questions and answers. A domain expert manually evaluated random samples for consistency and correctness. We train the generative model using the CyberQ dataset for the question answering task. \ No newline at end of file diff --git a/data/2024/aaai/Cycle Self-Refinement for Multi-Source Domain Adaptation b/data/2024/aaai/Cycle Self-Refinement for Multi-Source Domain Adaptation new file mode 100644 index 0000000000..98fda9fad0 --- /dev/null +++ b/data/2024/aaai/Cycle Self-Refinement for Multi-Source Domain Adaptation @@ -0,0 +1 @@ +Multi-source domain adaptation (MSDA) aims to transfer knowledge from multiple source domains to the unlabeled target domain. In this paper, we propose a cycle self-refinement domain adaptation method, which progressively attempts to learn the dominant transferable knowledge in each source domain in a cyclic manner. Specifically, several source-specific networks and a domain-ensemble network are adopted in the proposed method. The source-specific networks are adopted to provide the dominant transferable knowledge in each source domain for instance-level ensemble on predictions of the samples in the target domain. Then these samples with high-confidence ensemble predictions are adopted to refine the domain-ensemble network. Meanwhile, to guide each source-specific network to learn more dominant transferable knowledge, we force the features of the target domain from the domain-ensemble network and the features of each source domain from the corresponding source-specific network to be aligned with their predictions from the corresponding networks. Thus the adaptation ability of source-specific networks and the domain-ensemble network can be improved progressively. Extensive experiments on Office-31, Office-Home and DomainNet show that the proposed method outperforms the state-of-the-art methods for most tasks. \ No newline at end of file diff --git a/data/2024/aaai/Cycle-Consistency Learning for Captioning and Grounding b/data/2024/aaai/Cycle-Consistency Learning for Captioning and Grounding new file mode 100644 index 0000000000..c64d9595e5 --- /dev/null +++ b/data/2024/aaai/Cycle-Consistency Learning for Captioning and Grounding @@ -0,0 +1 @@ +We show that visual grounding and image captioning, which perform as two mutually inverse processes, can be bridged together for collaborative training through careful designs. By consolidating this idea, we introduce CyCo, a cyclic-consistent learning framework to ameliorate the independent training pipelines of visual grounding and image captioning. The proposed framework (1) allows the semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; (3) yields a general captioning model that can describe arbitrary image regions. 
Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one also exhibits competitive performance compared to the fully supervised counterparts. Our image captioning model has the capability to freely describe image regions and meanwhile shows impressive performance on prevalent captioning benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/CycleVTON: A Cycle Mapping Framework for Parser-Free Virtual Try-On b/data/2024/aaai/CycleVTON: A Cycle Mapping Framework for Parser-Free Virtual Try-On new file mode 100644 index 0000000000..37e493ac68 --- /dev/null +++ b/data/2024/aaai/CycleVTON: A Cycle Mapping Framework for Parser-Free Virtual Try-On @@ -0,0 +1 @@ +Image-based virtual try-on aims to transfer a target clothing item onto a specific person. A significant challenge is that arbitrarily matched clothing and person pairs lack the corresponding ground truth needed for supervised learning. A recent pioneering work leveraged an improved cycleGAN to enable one network to generate the desired image for another network during training. However, there is no difference in the result distribution before and after the clothing changes. Therefore, using two different networks is unnecessary and may even increase the difficulty of convergence. Furthermore, the introduced human parsing used to provide body structure information in the input also has a negative impact on the try-on result. How can a single network be employed for supervised learning while eliminating human parsing? To tackle these issues, we present a Cycle mapping Virtual Try-On Network (CycleVTON), which can produce photo-realistic try-on results by using a cycle mapping framework without the parser. In particular, we introduce a flow constraint loss to achieve supervised learning of arbitrarily matched clothing and person as inputs to the deformer, thus naturally mimicking the interaction between clothing and the human body. Additionally, we design a skin generation strategy that can adapt to the shape of the target clothing by dynamically adjusting the skin region, i.e., by first removing and then filling skin areas. Extensive experiments conducted on challenging benchmarks demonstrate that our proposed method exhibits superior performance compared to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/D3: A Methodological Exploration of Domain Division, Modeling, and Balance in Multi-Domain Recommendations b/data/2024/aaai/D3: A Methodological Exploration of Domain Division, Modeling, and Balance in Multi-Domain Recommendations new file mode 100644 index 0000000000..2468811ecd --- /dev/null +++ b/data/2024/aaai/D3: A Methodological Exploration of Domain Division, Modeling, and Balance in Multi-Domain Recommendations @@ -0,0 +1 @@ +To enhance the efficacy of multi-scenario services in industrial recommendation systems, multi-domain recommendation has become prominent, which entails simultaneous modeling of all domains through a unified model, effectively capturing commonalities and differences among them. However, current methods rely on manual domain partitioning, which overlooks the intricate domain relationships and the heterogeneity of different domains during joint optimization, hindering the integration of domain commonalities and differences. To address these challenges, this paper proposes a universal and flexible framework D3 aimed at optimizing the multi-domain recommendation pipeline from three key aspects. 
Firstly, an attention-based domain adaptation module is introduced to automatically identify and incorporate domain-sensitive features during training. Secondly, we propose a fusion gate module that enables the seamless integration of commonalities and diversities among domains, allowing for implicit characterization of intricate domain relationships. Lastly, we tackle the issue of joint optimization by deriving loss weights from two complementary viewpoints: domain complexity and domain specificity, alleviating inconsistencies among different domains during the training phase. Experiments on three public datasets demonstrate the effectiveness and superiority of our proposed framework. In addition, D3 has been implemented on a real-life, high-traffic internet platform catering to millions of users daily. \ No newline at end of file diff --git a/data/2024/aaai/DA-Net: A Disentangled and Adaptive Network for Multi-Source Cross-Lingual Transfer Learning b/data/2024/aaai/DA-Net: A Disentangled and Adaptive Network for Multi-Source Cross-Lingual Transfer Learning new file mode 100644 index 0000000000..6261b139de --- /dev/null +++ b/data/2024/aaai/DA-Net: A Disentangled and Adaptive Network for Multi-Source Cross-Lingual Transfer Learning @@ -0,0 +1 @@ +Multi-source cross-lingual transfer learning deals with the transfer of task knowledge from multiple labelled source languages to an unlabeled target language under the language shift. Existing methods typically focus on weighting the predictions produced by language-specific classifiers of different sources that follow a shared encoder. However, all source languages share the same encoder, which is updated by all these languages. The extracted representations inevitably contain different source languages' information, which may disturb the learning of the language-specific classifiers. Additionally, due to the language gap, language-specific classifiers trained with source labels are unable to make accurate predictions for the target language. Both facts impair the model's performance. To address these challenges, we propose a Disentangled and Adaptive Network (DA-Net). Firstly, we devise a feedback-guided collaborative disentanglement method that seeks to purify input representations of classifiers, thereby mitigating mutual interference from multiple sources. Secondly, we propose a class-aware parallel adaptation method that aligns class-level distributions for each source-target language pair, thereby alleviating the language gap of each language pair. Experimental results on three different tasks involving 38 languages validate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/DAG-Aware Variational Autoencoder for Social Propagation Graph Generation b/data/2024/aaai/DAG-Aware Variational Autoencoder for Social Propagation Graph Generation new file mode 100644 index 0000000000..edb938005d --- /dev/null +++ b/data/2024/aaai/DAG-Aware Variational Autoencoder for Social Propagation Graph Generation @@ -0,0 +1 @@ +Propagation models in social networks are critical, with extensive applications across various fields and downstream tasks. However, existing propagation models are often oversimplified, scenario-specific, and lack real-world user social attributes. These limitations, which detach the models from real-world analysis, lead to inaccurate representations of the propagation process in social networks. 
To address these issues, we propose a User Features Attention-based DAG-Aware Variational Autoencoder (DAVA) for propagation graph generation. First, nearly 1 million pieces of user attribute data are collected. Then DAVA can integrate the analysis of propagation graph topology and corresponding user attributes as prior knowledge. By leveraging a lightweight attention-based framework and a sliding window mechanism based on BFS permutations weighted by user influence, DAVA significantly enhances the ability to generate realistic, large-scale propagation data, yielding graph scales ten times greater than those produced by existing SOTA methods. Every module of DAVA is flexible and extensible, allowing easy substitution to suit other generation tasks. Additionally, we provide a comprehensive evaluation of DAVA; one focus is the effectiveness of generated data in improving the performance of downstream tasks. During the generation process, we discover the Credibility Erosion Effect by modifying the generation rules, revealing a social phenomenon in social network propagation. \ No newline at end of file diff --git a/data/2024/aaai/DALDet: Depth-Aware Learning Based Object Detection for Autonomous Driving b/data/2024/aaai/DALDet: Depth-Aware Learning Based Object Detection for Autonomous Driving new file mode 100644 index 0000000000..b20a14d0ab --- /dev/null +++ b/data/2024/aaai/DALDet: Depth-Aware Learning Based Object Detection for Autonomous Driving @@ -0,0 +1 @@ +3D object detection achieves good detection performance in autonomous driving. However, it requires substantial computational resources, which prevents its practical application. 2D object detection has less computational burden but lacks spatial and geometric information embedded in depth. Therefore, we present DALDet, an efficient depth-aware learning based 2D detector, achieving high-performance object detection for autonomous driving. We design an efficient one-stage detection framework and seamlessly integrate depth cues into the convolutional neural network by introducing depth-aware convolution and depth-aware average pooling, which effectively improve the detector's ability to perceive 3D space. Moreover, we propose a depth-guided loss function for training DALDet, which effectively improves the localization ability of the detector. Due to the use of the depth map, DALDet can also output the distance of the object, which is of great importance for driving applications such as obstacle avoidance. Extensive experiments demonstrate the superiority and efficiency of DALDet. In particular, our DALDet ranks 1st on both KITTI Car and Cyclist 2D detection test leaderboards among all 2D detectors with high efficiency as well as yielding competitive performance among many leading 3D detectors. Code will be available at https://github.com/hukefy/DALDet. \ No newline at end of file diff --git a/data/2024/aaai/DART: Dual-Modal Adaptive Online Prompting and Knowledge Retention for Test-Time Adaptation b/data/2024/aaai/DART: Dual-Modal Adaptive Online Prompting and Knowledge Retention for Test-Time Adaptation new file mode 100644 index 0000000000..ebbd705d4c --- /dev/null +++ b/data/2024/aaai/DART: Dual-Modal Adaptive Online Prompting and Knowledge Retention for Test-Time Adaptation @@ -0,0 +1 @@ +As an up-and-coming area, CLIP-based pre-trained vision-language models can readily facilitate downstream tasks in a zero-shot or few-shot fine-tuning manner. 
However, they still face critical challenges in test-time generalization due to the shifts between the training and test data distributions, hindering the further improvement of the performance. To address this crucial problem, the latest works have introduced Test-Time Adaptation (TTA) techniques to CLIP which dynamically learn text prompts using only test samples. However, their limited learning capacity, caused by overlooking visual modality information, and the underutilization of knowledge from previously seen test samples result in reduced performance. In this paper, we propose a novel Dual-modal Adaptive online prompting and knowledge ReTention method called DART to overcome these challenges. To increase the learning capacity, DART captures knowledge from each test sample by learning class-specific text prompts and instance-level image prompts. Additionally, to fully leverage the knowledge from previously seen test samples, DART utilizes dual-modal knowledge retention prompts to adaptively retain the acquired knowledge, thereby enhancing the predictions on subsequent test samples. Extensive experiments on various large-scale benchmarks demonstrate the effectiveness of our proposed DART against state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/DC-NAS: Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification b/data/2024/aaai/DC-NAS: Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification new file mode 100644 index 0000000000..ff4cff642b --- /dev/null +++ b/data/2024/aaai/DC-NAS: Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification @@ -0,0 +1 @@ +Neural architecture search-based multi-modal classification (NAS-MMC) methods can individually obtain the optimal classifier for different multi-modal data sets in an automatic manner. However, most existing NAS-MMC methods are dramatically time-consuming due to the requirement for training and evaluating enormous models. In this paper, we propose an efficient evolutionary-based NAS-MMC method called divide-and-conquer neural architecture search (DC-NAS). Specifically, the evolved population is first divided into k+1 sub-populations; k of these sub-populations evolve on k small-scale data sets, respectively, which are obtained by splitting the entire data set using the k-fold stratified sampling technique, while the remaining one evolves on the entire data set. To solve the sub-optimal fusion model problem caused by training on partial data, the two kinds of sub-populations, trained on partial data and on the entire data set, exchange the learned knowledge via two special knowledge bases. With the two techniques mentioned above, DC-NAS achieves both training time reduction and classification performance improvement. Experimental results show that DC-NAS achieves state-of-the-art results in terms of classification performance, training efficiency, and the number of model parameters compared with existing NAS-MMC methods on three popular multi-modal tasks, including multi-label movie genre classification, action recognition with RGB and body joints, and dynamic hand gesture recognition. 
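The divide-and-conquer split described in the DC-NAS abstract above can be approximated with a few lines of scikit-learn: k stratified folds give the k small-scale data sets for k sub-populations, while one extra sub-population keeps the full set. The helper below is an illustrative sketch under assumed names; population encoding, evolution, and the knowledge-base exchange are omitted.

# Sketch of the data-splitting step: k disjoint, label-stratified subsets plus the full set.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_subpopulation_data(X, y, k=3, seed=0):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    # each held-out fold becomes one small-scale dataset, preserving label proportions
    partial = [(X[idx], y[idx]) for _, idx in skf.split(X, y)]
    full = (X, y)                      # the remaining sub-population sees all data
    return partial + [full]

X = np.random.randn(300, 32)
y = np.random.randint(0, 3, size=300)
datasets = make_subpopulation_data(X, y, k=3)   # 4 datasets for the k+1 sub-populations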
\ No newline at end of file diff --git a/data/2024/aaai/DCLP: Neural Architecture Predictor with Curriculum Contrastive Learning b/data/2024/aaai/DCLP: Neural Architecture Predictor with Curriculum Contrastive Learning new file mode 100644 index 0000000000..ff4aa682e7 --- /dev/null +++ b/data/2024/aaai/DCLP: Neural Architecture Predictor with Curriculum Contrastive Learning @@ -0,0 +1 @@ +Neural predictors have shown great potential in the evaluation process of neural architecture search (NAS). However, current predictor-based approaches overlook the fact that training a predictor necessitates a considerable number of trained neural networks as the labeled training set, which is costly to obtain. Therefore, the critical issue in utilizing predictors for NAS is to train a high-performance predictor using as few trained neural networks as possible. Although some methods attempt to address this problem through unsupervised learning, they often result in inaccurate predictions. We argue that the unsupervised tasks intended for the common graph data are too challenging for neural networks, causing unsupervised training to be susceptible to performance crashes in NAS. To address this issue, we propose a CurricuLum-guided Contrastive Learning framework for neural Predictor (DCLP). Our method simplifies the contrastive task by designing a novel curriculum to enhance the stability of the unlabeled training data distribution during contrastive training. Specifically, we propose a scheduler that ranks the training data according to the contrastive difficulty of each data point and then inputs them to the contrastive learner in order. This approach concentrates the training data distribution and makes contrastive training more efficient. By using our method, the contrastive learner incrementally learns feature representations via unsupervised data on a smooth learning curve, avoiding performance crashes that may occur with excessively variable training data distributions. We experimentally demonstrate that DCLP has high accuracy and efficiency compared with existing predictors, and shows promising potential to discover superior architectures in various search spaces when combined with search strategies. Our code is available at: https://github.com/Zhengsh123/DCLP. \ No newline at end of file diff --git a/data/2024/aaai/DCV2I: A Practical Approach for Supporting Geographers' Visual Interpretation in Dune Segmentation with Deep Vision Models b/data/2024/aaai/DCV2I: A Practical Approach for Supporting Geographers' Visual Interpretation in Dune Segmentation with Deep Vision Models new file mode 100644 index 0000000000..d87219f908 --- /dev/null +++ b/data/2024/aaai/DCV2I: A Practical Approach for Supporting Geographers' Visual Interpretation in Dune Segmentation with Deep Vision Models @@ -0,0 +1 @@ +Visual interpretation is extremely important in human geography as the primary technique for geographers to use photograph data in identifying, classifying, and quantifying geographic and topological objects or regions. However, it is also time-consuming and requires overwhelming manual effort from professional geographers. This paper describes our interdisciplinary team's efforts in integrating computer vision models with geographers' visual image interpretation process to reduce their workload in interpreting images. Focusing on the dune segmentation task, we proposed an approach featuring a deep dune segmentation model to identify dunes and label their ranges in an automated way. 
By developing a tool to connect our model with ArcGIS, one of the most popular workbenches for visual interpretation, geographers can further refine the automatically-generated dune segmentation on images without learning any CV or deep learning techniques. Our approach thus realized a non-invasive change to geographers' visual interpretation routines, reducing their manual efforts while incurring minimal interruptions to the work routines and tools they are familiar with. Deployment with a leading Chinese geography research institution demonstrated the potential of our approach in supporting geographers in researching and solving drylands desertification. \ No newline at end of file diff --git a/data/2024/aaai/DDAE: Towards Deep Dynamic Vision BERT Pretraining b/data/2024/aaai/DDAE: Towards Deep Dynamic Vision BERT Pretraining new file mode 100644 index 0000000000..4235f11620 --- /dev/null +++ b/data/2024/aaai/DDAE: Towards Deep Dynamic Vision BERT Pretraining @@ -0,0 +1 @@ +Recently, masked image modeling (MIM) has demonstrated promising prospects in self-supervised representation learning. However, existing MIM frameworks recover all masked patches equivalently, ignoring that the reconstruction difficulty of different patches can vary sharply due to their diverse distance from visible patches. In this paper, we propose a novel deep dynamic supervision to enable MIM methods to dynamically reconstruct patches with different degrees of difficulty at different pretraining phases and depths of the model. Our deep dynamic supervision helps to provide more locality inductive bias for ViTs, especially in deep layers, which inherently makes up for the absence of a local prior in the self-attention mechanism. Built upon the deep dynamic supervision, we propose Deep Dynamic AutoEncoder (DDAE), a simple yet effective MIM framework that utilizes dynamic mechanisms for pixel regression and feature self-distillation simultaneously. Extensive experiments across a variety of vision tasks including ImageNet classification, semantic segmentation on ADE20K and object detection on COCO demonstrate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/DDViT: Double-Level Fusion Domain Adapter Vision Transformer (Student Abstract) b/data/2024/aaai/DDViT: Double-Level Fusion Domain Adapter Vision Transformer (Student Abstract) new file mode 100644 index 0000000000..a147acfe09 --- /dev/null +++ b/data/2024/aaai/DDViT: Double-Level Fusion Domain Adapter Vision Transformer (Student Abstract) @@ -0,0 +1 @@ +With the help of Vision transformers (ViTs), medical image segmentation was able to achieve outstanding performance. In particular, they overcome the limitation of convolutional neural networks (CNNs) which rely on local receptive fields. ViTs use self-attention mechanisms to consider relationships between all image pixels or patches simultaneously. However, they require large datasets for training and do not perform well at capturing low-level features. To that end, we propose DDViT, a novel ViT model that unites a CNN to alleviate data hunger for medical image segmentation, with two multi-scale feature representations. Significantly, our approach incorporates a ViT with a plug-in domain adapter (DA) and a Double-Level Fusion (DLF) technique, complemented by a mutual knowledge distillation paradigm, facilitating the seamless exchange of knowledge between a universal network and specialized domain-specific network branches. 
The DLF framework plays a pivotal role in our encoder-decoder architecture, combining the innovation of the TransFuse module with a robust CNN-based encoder. Extensive experimentation across diverse medical image segmentation datasets underscores the remarkable efficacy of DDViT when compared to alternative approaches based on CNNs and Transformer-based models. \ No newline at end of file diff --git a/data/2024/aaai/DGA-GNN: Dynamic Grouping Aggregation GNN for Fraud Detection b/data/2024/aaai/DGA-GNN: Dynamic Grouping Aggregation GNN for Fraud Detection new file mode 100644 index 0000000000..5a2c9583e9 --- /dev/null +++ b/data/2024/aaai/DGA-GNN: Dynamic Grouping Aggregation GNN for Fraud Detection @@ -0,0 +1,2 @@ +Fraud detection has increasingly become a prominent research field due to the dramatically increased incidents of fraud. The complex connections involving thousands, or even millions of nodes, present challenges for fraud detection tasks. Many researchers have developed various graph-based methods to detect fraud from these intricate graphs. However, those methods neglect two distinct characteristics of the fraud graph: the non-additivity of certain attributes and the distinguishability of grouped messages from neighbor nodes. +This paper introduces the Dynamic Grouping Aggregation Graph Neural Network (DGA-GNN) for fraud detection, which addresses these two characteristics by dynamically grouping attribute value ranges and neighbor nodes. In DGA-GNN, we initially propose the decision tree binning encoding to transform non-additive node attributes into bin vectors. This approach aligns well with the GNN’s aggregation operation and avoids nonsensical feature generation. Furthermore, we devise a feedback dynamic grouping strategy to classify graph nodes into two distinct groups and then employ a hierarchical aggregation. This method extracts more discriminative features for fraud detection tasks. Extensive experiments on five datasets suggest that our proposed method achieves a 3% ~ 16% improvement over existing SOTA methods. Code is available at https://github.com/AtwoodDuan/DGA-GNN. \ No newline at end of file diff --git a/data/2024/aaai/DGCLUSTER: A Neural Framework for Attributed Graph Clustering via Modularity Maximization b/data/2024/aaai/DGCLUSTER: A Neural Framework for Attributed Graph Clustering via Modularity Maximization new file mode 100644 index 0000000000..b61eeb167c --- /dev/null +++ b/data/2024/aaai/DGCLUSTER: A Neural Framework for Attributed Graph Clustering via Modularity Maximization @@ -0,0 +1 @@ +Graph clustering is a fundamental and challenging task in the field of graph mining where the objective is to group the nodes into clusters taking into consideration the topology of the graph. It has several applications in diverse domains spanning social network analysis, recommender systems, computer vision, and bioinformatics. In this work, we propose a novel method, DGCluster, which primarily optimizes the modularity objective using graph neural networks and scales linearly with the graph size. Our method does not require the number of clusters to be specified as a part of the input and can also leverage the availability of auxiliary node level information. We extensively test DGCluster on several real-world datasets of varying sizes, across multiple popular cluster quality metrics. Our approach consistently outperforms the state-of-the-art methods, demonstrating significant performance gains in almost all settings. 
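For readers unfamiliar with modularity maximization, the snippet below shows a standard differentiable soft-modularity objective that a GNN head could optimize over soft cluster assignments; it follows the usual Q = trace(C^T B C) / 2m form with B = A - d d^T / 2m, and is a generic sketch rather than necessarily the exact loss used by DGCluster.

# Differentiable soft-modularity loss over GNN-produced assignment logits (illustrative).
import torch

def soft_modularity(adj, assign_logits):
    C = torch.softmax(assign_logits, dim=-1)        # (n_nodes, n_clusters) soft assignments
    deg = adj.sum(dim=1)                            # node degrees
    two_m = adj.sum()                               # 2 * number of edges for an unweighted graph
    B = adj - torch.outer(deg, deg) / two_m         # modularity matrix
    return torch.trace(C.T @ B @ C) / two_m         # higher is better

adj = (torch.rand(50, 50) < 0.1).float()
adj = torch.triu(adj, 1); adj = adj + adj.T         # symmetric adjacency, no self-loops
logits = torch.randn(50, 4, requires_grad=True)     # would come from the GNN head
loss = -soft_modularity(adj, logits)                # minimizing the negative maximizes modularity
loss.backward()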
\ No newline at end of file diff --git a/data/2024/aaai/DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval b/data/2024/aaai/DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval new file mode 100644 index 0000000000..cac484ab2c --- /dev/null +++ b/data/2024/aaai/DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval @@ -0,0 +1 @@ +Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) The visual encoder could only encode frame-level features and failed to extract global-level general video information. (2) Equipping the visual and text encoder with separated prompts failed to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video in a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL. \ No newline at end of file diff --git a/data/2024/aaai/DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization b/data/2024/aaai/DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization new file mode 100644 index 0000000000..3454031b35 --- /dev/null +++ b/data/2024/aaai/DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization @@ -0,0 +1 @@ +Most reinforcement learning algorithms seek a single optimal strategy that solves a given task. However, it can often be valuable to learn a diverse set of solutions, for instance, to make an agent's interaction with users more engaging, or improve the robustness of a policy to an unexpected perturbation. We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm that discovers multiple strategies for solving a given task. Unlike prior work, it achieves this with a shared policy network trained over a single run. Specifically, we design an intrinsic reward based on an information-theoretic diversity objective. Our final objective alternately constrains the diversity of the strategies and the extrinsic reward. We solve the constrained optimization problem by casting it as a probabilistic inference task and use policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards, while discovering more diverse strategies, and often with better sample efficiency. 
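The information-theoretic diversity reward mentioned in the DGPO abstract is commonly implemented with a discriminator that predicts the latent strategy from visited states, as in the DIAYN-style sketch below. DGPO's exact objective and network shapes may differ; all names and dimensions here are illustrative assumptions.

# Illustrative intrinsic reward: log q(z|s) - log p(z), high when states reveal the strategy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StrategyDiscriminator(nn.Module):
    def __init__(self, state_dim, n_strategies):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_strategies))

    def forward(self, state):
        return self.net(state)                        # logits over latent strategies

def diversity_reward(disc, state, z, n_strategies):
    # gather the log-probability assigned to the strategy that actually generated the state
    logq = F.log_softmax(disc(state), dim=-1).gather(-1, z.unsqueeze(-1)).squeeze(-1)
    return logq - torch.log(torch.tensor(1.0 / n_strategies))   # subtract log of uniform prior

disc = StrategyDiscriminator(state_dim=8, n_strategies=4)
states = torch.randn(16, 8)
z = torch.randint(0, 4, (16,))
r_int = diversity_reward(disc, states, z, 4)          # added to the extrinsic reward signal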
\ No newline at end of file diff --git a/data/2024/aaai/DHGCN: Dynamic Hop Graph Convolution Network for Self-Supervised Point Cloud Learning b/data/2024/aaai/DHGCN: Dynamic Hop Graph Convolution Network for Self-Supervised Point Cloud Learning new file mode 100644 index 0000000000..515ccfdfe2 --- /dev/null +++ b/data/2024/aaai/DHGCN: Dynamic Hop Graph Convolution Network for Self-Supervised Point Cloud Learning @@ -0,0 +1 @@ +Recent works attempt to extend Graph Convolution Networks (GCNs) to point clouds for classification and segmentation tasks. These works tend to sample and group points to create smaller point sets locally and mainly focus on extracting local features through GCNs, while ignoring the relationship between point sets. In this paper, we propose the Dynamic Hop Graph Convolution Network (DHGCN) for explicitly learning the contextual relationships between the voxelized point parts, which are treated as graph nodes. Motivated by the intuition that the contextual information between point parts lies in the pairwise adjacent relationship, which can be depicted by the hop distance of the graph quantitatively, we devise a novel self-supervised part-level hop distance reconstruction task and design a novel loss function accordingly to facilitate training. In addition, we propose the Hop Graph Attention (HGA), which takes the learned hop distance as input for producing attention weights to allow edge features to contribute distinctively in aggregation. Eventually, the proposed DHGCN is a plug-and-play module that is compatible with point-based backbone networks. Comprehensive experiments on different backbones and tasks demonstrate that our self-supervised method achieves state-of-the-art performance. Our source codes are available at: https://github.com/Jinec98/DHGCN. \ No newline at end of file diff --git a/data/2024/aaai/DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection b/data/2024/aaai/DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection new file mode 100644 index 0000000000..358f412a75 --- /dev/null +++ b/data/2024/aaai/DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection @@ -0,0 +1 @@ +Vehicle-to-Everything (V2X) collaborative perception has recently gained significant attention due to its capability to enhance scene understanding by integrating information from various agents, e.g., vehicles, and infrastructure. However, current works often treat the information from each agent equally, ignoring the inherent domain gap caused by the utilization of different LiDAR sensors of each agent, thus leading to suboptimal performance. In this paper, we propose DI-V2X, that aims to learn Domain-Invariant representations through a new distillation framework to mitigate the domain discrepancy in the context of V2X 3D object detection. DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module. Specifically, DMA builds a domain-mixing 3D instance bank for the teacher and student models during training, resulting in aligned data representation. 
Next, PDD encourages the student models from different domains to gradually learn a domain-invariant feature representation towards the teacher, where the overlapping regions between agents are employed as guidance to facilitate the distillation process. Furthermore, DAF closes the domain gap between the students by incorporating calibration-aware domain-adaptive attention. Extensive experiments on the challenging DAIR-V2X and V2XSet benchmark datasets demonstrate that DI-V2X achieves remarkable performance, outperforming all the previous V2X models. Code is available at https://github.com/Serenos/DI-V2X. \ No newline at end of file diff --git a/data/2024/aaai/DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation b/data/2024/aaai/DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation new file mode 100644 index 0000000000..1431420b9d --- /dev/null +++ b/data/2024/aaai/DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation @@ -0,0 +1 @@ +Instruction-following is particularly crucial for large language models (LLMs) to support diverse user requests. While existing work has made progress in aligning LLMs with human preferences, evaluating their capabilities on instruction-following remains a challenge due to the complexity and diversity of real-world user instructions. While existing evaluation methods focus on general skills, they suffer from two main shortcomings, i.e., lack of fine-grained task-level evaluation and reliance on singular instruction expression. To address these problems, this paper introduces DINGO, a fine-grained and diverse instruction-following evaluation dataset that has two main advantages: (1) DINGO is based on a manually annotated, fine-grained and multi-level category tree with 130 nodes derived from real-world user requests; (2) DINGO includes diverse instructions, generated by both GPT-4 and human experts. Through extensive experiments, we demonstrate that DINGO can not only provide a more challenging and comprehensive evaluation for LLMs, but also provide task-level fine-grained directions to further improve LLMs. \ No newline at end of file diff --git a/data/2024/aaai/DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling b/data/2024/aaai/DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling new file mode 100644 index 0000000000..ba69b1cf36 --- /dev/null +++ b/data/2024/aaai/DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling @@ -0,0 +1 @@ +Many applications use computer vision to detect and count objects in massive image collections. However, automated methods may fail to deliver accurate counts, especially when the task is very difficult or requires a fast response time. For example, during disaster response, aid organizations aim to quickly count damaged buildings in satellite images to plan relief missions, but pre-trained building and damage detectors often perform poorly due to domain shifts. In such cases, there is a need for human-in-the-loop approaches to accurately count with minimal human effort. We propose DISCount -- a detector-based importance sampling framework for counting in large image collections. DISCount uses an imperfect detector and human screening to estimate low-variance unbiased counts. We propose techniques for counting over multiple spatial or temporal regions using a small amount of screening and estimate confidence intervals. 
This enables end-users to stop screening when estimates are sufficiently accurate, which is often the goal in real-world applications. We demonstrate our method with two applications: counting birds in radar imagery to understand responses to climate change, and counting damaged buildings in satellite imagery for damage assessment in regions struck by a natural disaster. On the technical side, we develop variance reduction techniques based on control variates and prove the (conditional) unbiasedness of the estimators. For the tasks we consider, DISCount leads to a 9-12x reduction in the labeling costs needed to obtain the same error rates as naive screening, and surpasses alternative covariate-based screening approaches. \ No newline at end of file diff --git a/data/2024/aaai/DIUSum: Dynamic Image Utilization for Multimodal Summarization b/data/2024/aaai/DIUSum: Dynamic Image Utilization for Multimodal Summarization new file mode 100644 index 0000000000..dff4a9d81c --- /dev/null +++ b/data/2024/aaai/DIUSum: Dynamic Image Utilization for Multimodal Summarization @@ -0,0 +1 @@ +Existing multimodal summarization approaches focus on fusing image features in the encoding process, ignoring the individualized needs for images when generating different summaries. However, whether intuitively or empirically, not all images can improve summary quality. Therefore, we propose a novel Dynamic Image Utilization framework for multimodal Summarization (DIUSum) to select and utilize valuable images for summarization. First, to predict whether an image helps produce a high-quality summary, we propose an image selector to score the usefulness of each image. Second, to dynamically utilize the multimodal information, we incorporate hard and soft guidance from the image selector. Under this guidance, the image information is plugged into the decoder to generate a summary. Experimental results show that DIUSum outperforms multiple strong baselines and achieves SOTA on two public multimodal summarization datasets. Further analysis demonstrates that the image selector can reflect how much the images improve summary quality. \ No newline at end of file diff --git a/data/2024/aaai/DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular Videos b/data/2024/aaai/DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular Videos new file mode 100644 index 0000000000..9b4f429380 --- /dev/null +++ b/data/2024/aaai/DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular Videos @@ -0,0 +1 @@ +Reconstructing a dynamic human with loose clothing is an important but difficult task. To address this challenge, we propose a method named DLCA-Recon to create human avatars from monocular videos. The distance from loose clothing to the underlying body rapidly changes in every frame when the human freely moves and acts. Previous methods lack effective geometric initialization and constraints for guiding the optimization of deformation to explain this dramatic change, resulting in discontinuous and incomplete reconstructed surfaces. To model the deformation more accurately, we propose to initialize an estimated 3D clothed human in the canonical space, as it is easier for deformation fields to learn from the clothed human than from SMPL. With both representations of explicit mesh and implicit SDF, we utilize the physical connection information between consecutive frames and propose a dynamic deformation field (DDF) to optimize deformation fields.
DDF accounts for contributive forces on loose clothing to enhance the interpretability of deformations and effectively capture the free movement of loose clothing. Moreover, we propagate SMPL skinning weights to each individual and refine pose and skinning weights during the optimization to improve the skinning transformation. Based on the more reasonable initialization and DDF, we can simulate real-world physics more accurately. Extensive experiments on public and our own datasets validate that our method can produce superior results for humans with loose clothing compared to SOTA methods. \ No newline at end of file diff --git a/data/2024/aaai/DME: Unveiling the Bias for Better Generalized Monocular Depth Estimation b/data/2024/aaai/DME: Unveiling the Bias for Better Generalized Monocular Depth Estimation new file mode 100644 index 0000000000..a3552de1c7 --- /dev/null +++ b/data/2024/aaai/DME: Unveiling the Bias for Better Generalized Monocular Depth Estimation @@ -0,0 +1 @@ +This paper aims to design monocular depth estimation models with better generalization abilities. To this end, we have conducted a quantitative analysis and discovered two important insights. First, the Simulation Correlation phenomenon, commonly seen in long-tailed classification problems, also exists in monocular depth estimation, indicating that the imbalanced depth distribution in training data may be the cause of limited generalization ability. Second, the imbalanced and long-tail distribution of depth values extends beyond the dataset scale, and also manifests within each individual image, further exacerbating the challenge of monocular depth estimation. Motivated by the above findings, we propose the Distance-aware Multi-Expert (DME) depth estimation model. Unlike prior methods that handle different depth ranges indiscriminately, DME adopts a divide-and-conquer philosophy where each expert is responsible for depth estimation of regions within a specific depth range. As such, the depth distribution seen by each expert is more uniform and can be more easily predicted. A pixel-level routing module is further designed and learned to stitch the predictions of all experts into the final depth map. Experiments show that DME achieves state-of-the-art performance on both NYU-Depth v2 and KITTI, and also delivers favorable zero-shot generalization capability on unseen datasets. \ No newline at end of file diff --git a/data/2024/aaai/DMMR: Cross-Subject Domain Generalization for EEG-Based Emotion Recognition via Denoising Mixed Mutual Reconstruction b/data/2024/aaai/DMMR: Cross-Subject Domain Generalization for EEG-Based Emotion Recognition via Denoising Mixed Mutual Reconstruction new file mode 100644 index 0000000000..18b5f23235 --- /dev/null +++ b/data/2024/aaai/DMMR: Cross-Subject Domain Generalization for EEG-Based Emotion Recognition via Denoising Mixed Mutual Reconstruction @@ -0,0 +1 @@ +Electroencephalography (EEG) has proven to be effective in emotion analysis. However, current methods struggle with individual variations, complicating the generalization of models trained on data from source subjects to unseen target subjects. To tackle this issue, we propose the Denoising Mixed Mutual Reconstruction (DMMR) model, which employs a two-stage approach of pre-training followed by fine-tuning.
During the pre-training phase, DMMR leverages self-supervised learning through a multi-decoder autoencoder, which encodes and reconstructs features of one subject, aiming to generate features resembling those from other subjects within the same category, thereby encouraging the encoder to learn subject-invariant features. We introduce a hidden-layer mixed data augmentation approach to mitigate the limitations posed by the scarcity of source data, thereby extending the method to a two-stage process. To bolster stability against noise, we incorporate a noise injection method, named “Time Steps Shuffling”, applied to the input data. During the fine-tuning phase, an emotion classifier is integrated to extract emotion-related features. Experimental accuracy on the SEED and SEED-IV datasets reached 88.27% (±5.62) and 72.70% (±8.01), respectively, demonstrating state-of-the-art and comparable performance, thereby showcasing the superiority of DMMR. The proposed data augmentation and noise injection methods were observed to complementarily enhance accuracy and stability, thus alleviating the aforementioned issues. \ No newline at end of file diff --git a/data/2024/aaai/DNIT: Enhancing Day-Night Image-to-Image Translation through Fine-Grained Feature Handling (Student Abstract) b/data/2024/aaai/DNIT: Enhancing Day-Night Image-to-Image Translation through Fine-Grained Feature Handling (Student Abstract) new file mode 100644 index 0000000000..730812bf13 --- /dev/null +++ b/data/2024/aaai/DNIT: Enhancing Day-Night Image-to-Image Translation through Fine-Grained Feature Handling (Student Abstract) @@ -0,0 +1 @@ +Existing image-to-image translation methods perform less satisfactorily in the "day-night" domain due to insufficient study of scene features. To address this problem, we propose DNIT, which performs fine-grained handling of features through a nighttime image preprocessing (NIP) module and an edge fusion detection (EFD) module. The NIP module enhances brightness while minimizing noise, facilitating the extraction of content and style features. Meanwhile, the EFD module utilizes two types of edge images as additional constraints to optimize the generator. Experimental results show that we can generate more realistic and higher-quality images compared to other methods, proving the effectiveness of our DNIT. \ No newline at end of file diff --git a/data/2024/aaai/DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding b/data/2024/aaai/DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding new file mode 100644 index 0000000000..ed46a3e770 --- /dev/null +++ b/data/2024/aaai/DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding @@ -0,0 +1 @@ +Point scene understanding is a challenging task on real-world scene point clouds, which aims at segmenting each object, estimating its pose, and reconstructing its mesh simultaneously. The recent state-of-the-art method first segments each object and then processes them independently with multiple stages for the different sub-tasks. This leads to a complex pipeline to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner.
Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries while taking their relationships into account. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to semantic information and geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to make full use of the supervision from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR. \ No newline at end of file diff --git a/data/2024/aaai/DOGE-Train: Discrete Optimization on GPU with End-to-End Training b/data/2024/aaai/DOGE-Train: Discrete Optimization on GPU with End-to-End Training new file mode 100644 index 0000000000..2a278fe073 --- /dev/null +++ b/data/2024/aaai/DOGE-Train: Discrete Optimization on GPU with End-to-End Training @@ -0,0 +1 @@ +We present a fast, scalable, data-driven approach for solving relaxations of 0-1 integer linear programs. We use a combination of graph neural networks (GNNs) and a Lagrange decomposition-based algorithm. We make the latter differentiable for end-to-end training and use GNNs to predict its algorithmic parameters. This allows us to retain the algorithm's theoretical properties, including dual feasibility and a guaranteed non-decrease in the lower bound, while improving it via training. We overcome suboptimal fixed points of the basic solver by additional non-parametric GNN update steps that maintain dual feasibility. For training, we use an unsupervised loss. We train on smaller problems and test on larger ones, showing strong generalization performance with a GNN comprising only around 10k parameters. Our solver achieves significantly faster performance and better dual objectives than its non-learned version, achieving close to optimal objective values of LP relaxations of very large structured prediction problems and on selected combinatorial ones. In particular, we achieve better objective values than specialized approximate solvers for specific problem classes while retaining their efficiency. Our solver has better any-time performance over a large time period compared to a commercial solver. \ No newline at end of file diff --git a/data/2024/aaai/DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction) b/data/2024/aaai/DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction) new file mode 100644 index 0000000000..df74839645 --- /dev/null +++ b/data/2024/aaai/DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction) @@ -0,0 +1 @@ +The Adam optimizer is a popular choice in contemporary deep learning due to its strong empirical performance. However, we observe that in privacy-sensitive scenarios, the traditional use of Differential Privacy (DP) with the Adam optimizer leads to sub-optimal performance on several tasks. We find that this performance degradation is due to a DP bias in Adam's second moment estimator, introduced by the addition of independent noise in the gradient computation to enforce DP guarantees. This DP bias leads to a different scaling for low-variance parameter updates that is inconsistent with the behavior of non-private Adam and with Adam's sign descent interpretation.
We propose the DP-AdamBC optimization algorithm, which corrects for the bias in the second moment estimation and retrieves the expected behaviour of Adam. Empirically, DP-AdamBC significantly improves the optimization performance of DP-Adam by up to 3.5% in final accuracy in image, text, and graph node classification tasks. \ No newline at end of file diff --git a/data/2024/aaai/DPA-P2PNet: Deformable Proposal-Aware P2PNet for Accurate Point-Based Cell Detection b/data/2024/aaai/DPA-P2PNet: Deformable Proposal-Aware P2PNet for Accurate Point-Based Cell Detection new file mode 100644 index 0000000000..fb60c067a0 --- /dev/null +++ b/data/2024/aaai/DPA-P2PNet: Deformable Proposal-Aware P2PNet for Accurate Point-Based Cell Detection @@ -0,0 +1 @@ +Point-based cell detection (PCD), which pursues high-performance cell sensing under low-cost data annotation, has garnered increased attention in computational pathology community. Unlike mainstream PCD methods that rely on intermediate density map representations, the Point-to-Point network (P2PNet) has recently emerged as an end-to-end solution for PCD, demonstrating impressive cell detection accuracy and efficiency. Nevertheless, P2PNet is limited to decoding from a single-level feature map due to the scale-agnostic property of point proposals, which is insufficient to leverage multi-scale information. Moreover, the spatial distribution of pre-set point proposals is biased from that of cells, leading to inaccurate cell localization. To lift these limitations, we present DPA-P2PNet in this work. The proposed method directly extracts multi-scale features for decoding according to the coordinates of point proposals on hierarchical feature maps. On this basis, we further devise deformable point proposals to mitigate the positional bias between proposals and potential cells to promote cell localization. Inspired by practical pathological diagnosis that usually combines high-level tissue structure and low-level cell morphology for accurate cell classification, we propose a multi-field-of-view (mFoV) variant of DPA-P2PNet to accommodate additional large FoV images with tissue information as model input. Finally, we execute the first self-supervised pre-training on immunohistochemistry histopathology image data and evaluate the suitability of four representative self-supervised methods on the PCD task. Experimental results on three benchmarks and a large-scale and real-world interval dataset demonstrate the superiority of our proposed models over the state-of-the-art counterparts. Codes and pre-trained weights are available at https://github.com/windygoo/DPA-P2PNet. \ No newline at end of file diff --git a/data/2024/aaai/DQSSA: A Quantum-Inspired Solution for Maximizing Influence in Online Social Networks (Student Abstract) b/data/2024/aaai/DQSSA: A Quantum-Inspired Solution for Maximizing Influence in Online Social Networks (Student Abstract) new file mode 100644 index 0000000000..3929908cdc --- /dev/null +++ b/data/2024/aaai/DQSSA: A Quantum-Inspired Solution for Maximizing Influence in Online Social Networks (Student Abstract) @@ -0,0 +1 @@ +Influence Maximization is the task of selecting optimal nodes maximising the influence spread in social networks. This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks. By discretizing meta-heuristic algorithms and infusing them with quantum-inspired enhancements, we address issues like premature convergence and low efficacy. 
The proposed method, guided by quantum principles, offers a promising solution for Influence Maximisation. Experiments on four real-world datasets reveal DQSSA's superior performance as compared to established cutting-edge algorithms. \ No newline at end of file diff --git a/data/2024/aaai/DR-Label: Label Deconstruction and Reconstruction of GNN Models for Catalysis Systems b/data/2024/aaai/DR-Label: Label Deconstruction and Reconstruction of GNN Models for Catalysis Systems new file mode 100644 index 0000000000..53f59e4dae --- /dev/null +++ b/data/2024/aaai/DR-Label: Label Deconstruction and Reconstruction of GNN Models for Catalysis Systems @@ -0,0 +1 @@ +Attaining the equilibrium geometry of a catalyst-adsorbate system is key to fundamentally assessing its effective properties, such as adsorption energy. While machine learning methods with advanced representation or supervision strategies have been applied to boost and guide the relaxation processes of catalysis systems, existing methods that produce linearly aggregated geometry predictions are susceptible to edge representations ambiguity, and are therefore vulnerable to graph variations. In this paper, we present a novel graph neural network (GNN) supervision and prediction strategy DR-Label. Our approach mitigates the multiplicity of solutions in edge representation and encourages model predictions that are independent of graph structural variations. DR-Label first Deconstructs finer-grained equilibrium state information to the model by projecting the node-level supervision signal to each edge. Reversely, the model Reconstructs a more robust equilibrium state prediction by converting edge-level predictions to node-level via a sphere-fitting algorithm. When applied to three fundamentally different models, DR-Label consistently enhanced performance. Leveraging the graph structure invariance of the DR-Label strategy, we further propose DRFormer, which applied explicit intermediate positional update and achieves a new state-of-the-art performance on the Open Catalyst 2020 (OC20) dataset and the Cu-based single-atom alloys CO adsorption (SAA) dataset. We expect our work to highlight vital principles for advancing geometric GNN models for catalysis systems and beyond. Our code is available at https://github.com/bowenwang77/DR-Label \ No newline at end of file diff --git a/data/2024/aaai/DRF: Improving Certified Robustness via Distributional Robustness Framework b/data/2024/aaai/DRF: Improving Certified Robustness via Distributional Robustness Framework new file mode 100644 index 0000000000..5abea6f378 --- /dev/null +++ b/data/2024/aaai/DRF: Improving Certified Robustness via Distributional Robustness Framework @@ -0,0 +1 @@ +Randomized smoothing (RS) has provided state-of-the-art (SOTA) certified robustness against adversarial perturbations for large neural networks. Among studies in this field, methods based on adversarial training (AT) achieve remarkably robust performance by applying adversarial examples to construct the smoothed classifier. These AT-based RS methods typically seek a pointwise adversary that generates the worst-case adversarial examples by perturbing each input independently. However, there are unexplored benefits to considering such adversarial robustness across the entire data distribution. To this end, we provide a novel framework called DRF, which connects AT-based RS methods with distributional robustness (DR), and show that these methods are special cases of their counterparts in our framework. 
Due to the advantages conferred by DR, our framework can control the trade-off between the clean accuracy and certified robustness of smoothed classifiers to a significant extent. Our experiments demonstrate that DRF can substantially improve the certified robustness of AT-based RS. \ No newline at end of file diff --git a/data/2024/aaai/DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning b/data/2024/aaai/DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning new file mode 100644 index 0000000000..adf8bc0147 --- /dev/null +++ b/data/2024/aaai/DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning @@ -0,0 +1 @@ +Class-incremental learning (CIL) under an exemplar-free constraint has presented a significant challenge. Existing methods adhering to this constraint are prone to catastrophic forgetting, far more so than replay-based techniques that retain access to past samples. In this paper, to solve the exemplar-free CIL problem, we propose a Dual-Stream Analytic Learning (DS-AL) approach. The DS-AL contains a main stream offering an analytical (i.e., closed-form) linear solution, and a compensation stream improving the inherent under-fitting limitation due to adopting linear mapping. The main stream redefines the CIL problem into a Concatenated Recursive Least Squares (C-RLS) task, allowing an equivalence between the CIL and its joint-learning counterpart. The compensation stream is governed by a Dual-Activation Compensation (DAC) module. This module re-activates the embedding with a different activation function from the main stream one, and seeks fitting compensation by projecting the embedding to the null space of the main stream's linear mapping. Empirical results demonstrate that the DS-AL, despite being an exemplar-free technique, delivers performance comparable with or better than that of replay-based methods across various datasets, including CIFAR-100, ImageNet-100 and ImageNet-Full. Additionally, the C-RLS' equivalent property allows the DS-AL to execute CIL in a phase-invariant manner. This is evidenced by a never-before-seen 500-phase CIL ImageNet task, which performs on a level identical to a 5-phase one. Our codes are available at https://github.com/ZHUANGHP/Analytic-continual-learning. \ No newline at end of file diff --git "a/data/2024/aaai/DSD\302\262: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?" "b/data/2024/aaai/DSD\302\262: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?" new file mode 100644 index 0000000000..c27b78065a --- /dev/null +++ "b/data/2024/aaai/DSD\302\262: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?" @@ -0,0 +1,3 @@ +Neoteric works have shown that modern deep learning models can exhibit a sparse double descent phenomenon. Indeed, as the sparsity of the model increases, the test performance first worsens since the model is overfitting the training data; then, the overfitting reduces, leading to an improvement in performance, and finally, the model begins to forget critical information, resulting in underfitting. Such a behavior prevents using traditional early stop criteria. + +In this work, we have three key contributions. First, we propose a learning framework that avoids such a phenomenon and improves generalization. Second, we introduce an entropy measure providing more insights into the insurgence of this phenomenon and enabling the use of traditional stop criteria. 
Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at https://github.com/VGCQ/DSD2. \ No newline at end of file diff --git a/data/2024/aaai/DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification b/data/2024/aaai/DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification new file mode 100644 index 0000000000..a2af54700f --- /dev/null +++ b/data/2024/aaai/DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification @@ -0,0 +1,3 @@ +Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks following their wide adoption in the computer vision domain. +Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored for the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer) that facilitates interactions across time, frequency, spatial, and channel dimensions. +The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new benchmarks for state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pretrained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available on https://github.com/ta012/DTFAT.git \ No newline at end of file diff --git a/data/2024/aaai/DTL: Disentangled Transfer Learning for Visual Recognition b/data/2024/aaai/DTL: Disentangled Transfer Learning for Visual Recognition new file mode 100644 index 0000000000..0c069f16e0 --- /dev/null +++ b/data/2024/aaai/DTL: Disentangled Transfer Learning for Visual Recognition @@ -0,0 +1 @@ +When pre-trained models become rapidly larger, the cost of fine-tuning on downstream tasks steadily increases, too. To economically fine-tune these models, parameter-efficient transfer learning (PETL) is proposed, which only tunes a tiny subset of trainable parameters to efficiently learn quality representations. However, current PETL methods are facing the dilemma that during training the GPU memory footprint is not effectively reduced as trainable parameters. PETL will likely fail, too, if the full fine-tuning encounters the out-of-GPU-memory issue. This phenomenon happens because trainable parameters from these methods are generally entangled with the backbone, such that a lot of intermediate states have to be stored in GPU memory for gradient propagation. To alleviate this problem, we introduce Disentangled Transfer Learning (DTL), which disentangles the trainable parameters from the backbone using a lightweight Compact Side Network (CSN). 
By progressively extracting task-specific information with a few low-rank linear mappings and appropriately adding the information back to the backbone, CSN effectively realizes knowledge transfer in various downstream tasks. We conducted extensive experiments to validate the effectiveness of our method. The proposed method not only reduces a large amount of GPU memory usage and trainable parameters, but also outperforms existing PETL methods by a significant margin in accuracy, achieving new state-of-the-art on several standard benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation b/data/2024/aaai/DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation new file mode 100644 index 0000000000..e5b5c26867 --- /dev/null +++ b/data/2024/aaai/DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation @@ -0,0 +1 @@ +Despite the great potential in capturing long-range dependency, one rarely-explored underlying issue of transformer in medical image segmentation is attention collapse, making it often degenerate into a bypass module in CNN-Transformer hybrid architectures. This is due to the high computational complexity of vision transformers requiring extensive training data while well-annotated medical image data is relatively limited, resulting in poor convergence. In this paper, we propose a plug-n-play transformer block with dynamic token merging, named DTMFormer, to avoid building long-range dependency on redundant and duplicated tokens and thus pursue better convergence. Specifically, DTMFormer consists of an attention-guided token merging (ATM) module to adaptively cluster tokens into fewer semantic tokens based on feature and dependency similarity and a light token reconstruction module to fuse ordinary and semantic tokens. In this way, as self-attention in ATM is calculated based on fewer tokens, DTMFormer is of lower complexity and more friendly to converge. Extensive experiments on publicly-available datasets demonstrate the effectiveness of DTMFormer working as a plug-n-play module for simultaneous complexity reduction and performance improvement. We believe it will inspire future work on rethinking transformers in medical image segmentation. Code: https://github.com/iam-nacl/DTMFormer. \ No newline at end of file diff --git a/data/2024/aaai/DUEL: Duplicate Elimination on Active Memory for Self-Supervised Class-Imbalanced Learning b/data/2024/aaai/DUEL: Duplicate Elimination on Active Memory for Self-Supervised Class-Imbalanced Learning new file mode 100644 index 0000000000..b6512814d1 --- /dev/null +++ b/data/2024/aaai/DUEL: Duplicate Elimination on Active Memory for Self-Supervised Class-Imbalanced Learning @@ -0,0 +1 @@ +Recent machine learning algorithms have been developed using well-curated datasets, which often require substantial cost and resources. On the other hand, the direct use of raw data often leads to overfitting towards frequently occurring class information. To address class imbalances cost-efficiently, we propose an active data filtering process during self-supervised pre-training in our novel framework, Duplicate Elimination (DUEL). This framework integrates an active memory inspired by human working memory and introduces distinctiveness information, which measures the diversity of the data in the memory, to optimize both the feature extractor and the memory. 
The DUEL policy, which replaces the most duplicated data with new samples, aims to enhance the distinctiveness information in the memory and thereby mitigate class imbalances. We validate the effectiveness of the DUEL framework in class-imbalanced environments, demonstrating its robustness and providing reliable results in downstream tasks. We also analyze the role of the DUEL policy in the training process through various metrics and visualizations. \ No newline at end of file diff --git a/data/2024/aaai/DVANet: Disentangling View and Action Features for Multi-View Action Recognition b/data/2024/aaai/DVANet: Disentangling View and Action Features for Multi-View Action Recognition new file mode 100644 index 0000000000..0bd2087d60 --- /dev/null +++ b/data/2024/aaai/DVANet: Disentangling View and Action Features for Multi-View Action Recognition @@ -0,0 +1 @@ +In this work, we present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video. When trying to classify action instances captured from multiple viewpoints, there is a higher degree of difficulty due to the difference in background, occlusion, and visibility of the captured action from different camera angles. To tackle the various problems introduced in multi-view action recognition, we propose a novel configuration of learnable transformer decoder queries, in conjunction with two supervised contrastive losses, to enforce the learning of action features that are robust to shifts in viewpoints. Our disentangled feature learning occurs in two stages: the transformer decoder uses separate queries to separately learn action and view information, which are then further disentangled using our two contrastive losses. We show that our model and method of training significantly outperforms all other uni-modal models on four multi-view action recognition datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD, and N-UCLA. Compared to previous RGB works, we see maximal improvements of 1.5%, 4.8%, 2.2%, and 4.8% on each dataset, respectively. Our code can be found here: https://github.com/NyleSiddiqui/MultiView_Actions \ No newline at end of file diff --git a/data/2024/aaai/DVSAI: Diverse View-Shared Anchors Based Incomplete Multi-View Clustering b/data/2024/aaai/DVSAI: Diverse View-Shared Anchors Based Incomplete Multi-View Clustering new file mode 100644 index 0000000000..0443e4e16d --- /dev/null +++ b/data/2024/aaai/DVSAI: Diverse View-Shared Anchors Based Incomplete Multi-View Clustering @@ -0,0 +1,2 @@ +In numerous real-world applications, it is quite common that sample information is partially available for some views due to machine breakdown or sensor failure, causing the problem of incomplete multi-view clustering (IMVC). While several IMVC approaches using view-shared anchors have successfully achieved pleasing performance improvement, (1) they generally construct anchors with only one dimension, which could deteriorate the multi-view diversity, bringing about serious information loss; (2) the constructed anchors are typically with a single size, which could not sufficiently characterize the distribution of the whole samples, leading to limited clustering performance. For generating view-shared anchors with multi-dimension and multi-size for IMVC, we design a novel framework called Diverse View-Shared Anchors based Incomplete multi-view clustering (DVSAI). Concretely, we associate each partial view with several potential spaces. 
+In each space, we enable anchors to communicate among views and generate the view-shared anchors with space-specific dimension and size. Consequently, spaces with various scales make the generated view-shared anchors enjoy diverse dimensions and sizes. Subsequently, we devise an integration scheme with linear computational and memory expenditures to integrate the outputted multi-scale unified anchor graphs such that running spectral algorithm generates the spectral embedding. Afterwards, we theoretically demonstrate that DVSAI owns linear time and space costs, thus well-suited for tackling large-size datasets. Finally, comprehensive experiments confirm the effectiveness and advantages of DVSAI. \ No newline at end of file diff --git a/data/2024/aaai/DanceAnyWay: Synthesizing Beat-Guided 3D Dances with Randomized Temporal Contrastive Learning b/data/2024/aaai/DanceAnyWay: Synthesizing Beat-Guided 3D Dances with Randomized Temporal Contrastive Learning new file mode 100644 index 0000000000..52e31f8713 --- /dev/null +++ b/data/2024/aaai/DanceAnyWay: Synthesizing Beat-Guided 3D Dances with Randomized Temporal Contrastive Learning @@ -0,0 +1 @@ +We present DanceAnyWay, a generative learning method to synthesize beat-guided dances of 3D human characters synchronized with music. Our method learns to disentangle the dance movements at the beat frames from the dance movements at all the remaining frames by operating at two hierarchical levels. At the coarser "beat" level, it encodes the rhythm, pitch, and melody information of the input music via dedicated feature representations only at the beat frames. It leverages them to synthesize the beat poses of the target dances using a sequence-to-sequence learning framework. At the finer "repletion" level, our method encodes similar rhythm, pitch, and melody information from all the frames of the input music via dedicated feature representations. It generates the full dance sequences by combining the synthesized beat and repletion poses and enforcing plausibility through an adversarial learning framework. Our training paradigm also enforces fine-grained diversity in the synthesized dances through a randomized temporal contrastive loss, which ensures different segments of the dance sequences have different movements and avoids motion freezing or collapsing to repetitive movements. We evaluate the performance of our approach through extensive experiments on the benchmark AIST++ dataset and observe improvements of about 7%-12% in motion quality metrics and 1.5%-4% in motion diversity metrics over the current baselines, respectively. We also conducted a user study to evaluate the visual quality of our synthesized dances. We noted that, on average, the samples generated by our method were about 9-48% more preferred by the participants and had a 4-27% better five-point Likert-scale score over the best available current baseline in terms of motion quality and synchronization. Our source code and project page are available at https://github.com/aneeshbhattacharya/DanceAnyWay. 
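As a hedged illustration of the randomized temporal contrastive idea described in the DanceAnyWay abstract above, the following minimal PyTorch-style sketch contrasts randomly sampled segments of a single synthesized motion sequence so that distinct segments remain distinguishable; the function name, the crop construction, and the InfoNCE formulation are our own assumptions for illustration, not the paper's exact loss.

import torch
import torch.nn.functional as F

def randomized_temporal_contrastive_loss(frame_emb, num_segments=4, crop_len=16, tau=0.1):
    # frame_emb: (T, D) per-frame pose embeddings of one synthesized dance,
    # with T assumed larger than crop_len + 2.
    # Two slightly shifted crops of the same random window form a positive pair;
    # crops from the other windows act as negatives, which discourages frozen
    # or repetitive motion across segments.
    T, _ = frame_emb.shape
    anchors, positives = [], []
    for _ in range(num_segments):
        s = torch.randint(0, T - crop_len - 2, (1,)).item()
        anchors.append(frame_emb[s:s + crop_len].mean(0))
        positives.append(frame_emb[s + 2:s + 2 + crop_len].mean(0))
    a = F.normalize(torch.stack(anchors), dim=-1)    # (num_segments, D)
    p = F.normalize(torch.stack(positives), dim=-1)  # (num_segments, D)
    logits = a @ p.t() / tau                         # diagonal entries are the positive pairs
    return F.cross_entropy(logits, torch.arange(num_segments))

In practice such a term would be added to the generator's other losses with a small weight; it only requires per-frame embeddings, so it is agnostic to the exact pose representation.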
\ No newline at end of file diff --git a/data/2024/aaai/DanceMVP: Self-Supervised Learning for Multi-Task Primitive-Based Dance Performance Assessment via Transformer Text Prompting b/data/2024/aaai/DanceMVP: Self-Supervised Learning for Multi-Task Primitive-Based Dance Performance Assessment via Transformer Text Prompting new file mode 100644 index 0000000000..ce52e65945 --- /dev/null +++ b/data/2024/aaai/DanceMVP: Self-Supervised Learning for Multi-Task Primitive-Based Dance Performance Assessment via Transformer Text Prompting @@ -0,0 +1 @@ +Dance is generally considered to be complex for most people, as it requires coordination of numerous body motions and accurate responses to the musical content and rhythm. Studies on automatic dance performance assessment could help people improve their sensorimotor skills and promote research in many fields, including human motion analysis and motion generation. Recent papers on dance performance assessment usually evaluate simple dance motions with a single task - estimating final performance scores. In this paper, we propose DanceMVP: multi-task dance performance assessment via text prompting that solves three related tasks - (i) dance vocabulary recognition, (ii) dance performance scoring and (iii) dance rhythm evaluation. In the pre-training phase, we contrastively learn the primitive-based features of complex dance motion and music using the InfoNCE loss. For the downstream task, we propose a transformer-based text prompter to perform multi-task evaluations for the three proposed assessment tasks. Also, we build a multimodal dance-music dataset named ImperialDance. The novelty of our ImperialDance is that it contains dance motions for diverse expertise levels and a significant amount of repeated dance sequences for the same choreography, allowing us to keep track of dance performance progression. Qualitative results show that our pre-trained feature representation can cluster dance pieces across different dance genres, choreographies, expertise levels and primitives, and generalizes well on both our dataset and other dance-music datasets. The downstream experiments demonstrate the robustness and improvement of our method over several ablations and baselines across all three tasks, as well as its ability to monitor users' dance level progression. \ No newline at end of file diff --git a/data/2024/aaai/Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification b/data/2024/aaai/Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification new file mode 100644 index 0000000000..222c8d5139 --- /dev/null +++ b/data/2024/aaai/Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification @@ -0,0 +1 @@ +Vision-language foundation models have been incredibly successful in a wide range of downstream computer vision tasks using adaptation methods. However, due to the high cost of obtaining pre-training datasets, pairs with weak image-text correlation exist in large numbers in the data. We call them weak-paired samples. Due to the limitations of these weak-paired samples, the pre-trained model is unable to mine all the knowledge from the pre-training data. The existing adaptation methods do not consider this missing knowledge, which may cause crucial task-related knowledge for the downstream tasks to be ignored. To address this issue, we propose a new adaptation framework called Data Adaptive Traceback (DAT).
Specifically, we utilize a zero-shot-based method to extract the subset of the pre-training data that is most relevant to the downstream task and use it to support that task. Furthermore, we adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning. We conduct extensive experiments showing that our proposed DAT approach meaningfully improves performance on various benchmark datasets over traditional adaptation methods. \ No newline at end of file diff --git a/data/2024/aaai/Data Augmented Graph Neural Networks for Personality Detection b/data/2024/aaai/Data Augmented Graph Neural Networks for Personality Detection new file mode 100644 index 0000000000..0095c929f3 --- /dev/null +++ b/data/2024/aaai/Data Augmented Graph Neural Networks for Personality Detection @@ -0,0 +1 @@ +Personality detection is a fundamental task for user psychology research. One of the biggest challenges in personality detection lies in the limited quantity of labeled data, which is collected by having users complete personality questionnaires, a process that is very time-consuming and labor-intensive. Most of the existing works are mainly devoted to learning rich representations of posts based on labeled data. However, they still suffer from the inherent weakness of limited labels, which potentially restricts the capability of the model to deal with unseen data. In this paper, we construct a heterogeneous personality graph for each labeled and unlabeled user and develop a novel psycholinguistic augmented graph neural network to detect personality in a semi-supervised manner, namely Semi-PerGCN. Specifically, our model first explores a supervised Personality Graph Neural Network (PGNN) to refine labeled user representations on the heterogeneous graph. For the remaining massive unlabeled users, we utilize the empirical psychological knowledge of the Linguistic Inquiry and Word Count (LIWC) lexicon for multi-view graph augmentation and apply unsupervised graph consistency constraints on the parameter-shared PGNN. During learning on the finite set of labeled users, noise-invariant learning on a large number of unlabeled users is combined to enhance the generalization ability. Extensive experiments on three real-world datasets, Youtube, PAN2015, and MyPersonality, demonstrate the effectiveness of our Semi-PerGCN in personality detection, especially in scenarios with limited labeled users. \ No newline at end of file diff --git a/data/2024/aaai/Data Disparity and Temporal Unavailability Aware Asynchronous Federated Learning for Predictive Maintenance on Transportation Fleets b/data/2024/aaai/Data Disparity and Temporal Unavailability Aware Asynchronous Federated Learning for Predictive Maintenance on Transportation Fleets new file mode 100644 index 0000000000..4a0326098d --- /dev/null +++ b/data/2024/aaai/Data Disparity and Temporal Unavailability Aware Asynchronous Federated Learning for Predictive Maintenance on Transportation Fleets @@ -0,0 +1 @@ +Predictive maintenance has emerged as a critical application in modern transportation, leveraging sensor data to proactively forecast potential damage using machine learning. However, privacy concerns limit data sharing, making federated learning an appealing approach to preserve data privacy. Nevertheless, challenges arise due to disparities in data distribution and temporal unavailability caused by individual usage patterns in transportation.
In this paper, we present a novel asynchronous federated learning approach to address system heterogeneity and facilitate machine learning for predictive maintenance on transportation fleets. The approach introduces a novel data-disparity-aware aggregation scheme and a federated early stopping method for training. To validate the effectiveness of our approach, we evaluate it on two independent real-world datasets from the transportation domain: 1) oil dilution prediction of car combustion engines and 2) remaining lifetime prediction of plane turbofan engines. Our experiments show that we reliably outperform five state-of-the-art baselines, including federated and classical machine learning models. Moreover, we show that our approach generalises to various prediction model architectures. \ No newline at end of file diff --git a/data/2024/aaai/Data Distribution Distilled Generative Model for Generalized Zero-Shot Recognition b/data/2024/aaai/Data Distribution Distilled Generative Model for Generalized Zero-Shot Recognition new file mode 100644 index 0000000000..f5cabf843f --- /dev/null +++ b/data/2024/aaai/Data Distribution Distilled Generative Model for Generalized Zero-Shot Recognition @@ -0,0 +1 @@ +In the realm of Zero-Shot Learning (ZSL), we address biases in Generalized Zero-Shot Learning (GZSL) models, which favor seen data. To counter this, we introduce an end-to-end generative GZSL framework called D3GZSL. This framework treats seen and synthesized unseen data as in-distribution and out-of-distribution data, respectively, for a more balanced model. D3GZSL comprises two core modules: in-distribution dual space distillation (ID2SD) and out-of-distribution batch distillation (O2DBD). ID2SD aligns teacher-student outcomes in embedding and label spaces, enhancing learning coherence. O2DBD introduces low-dimensional out-of-distribution representations per batch sample, capturing shared structures between seen and unseen categories. Our approach demonstrates its effectiveness across established GZSL benchmarks, seamlessly integrating into mainstream generative frameworks. Extensive experiments consistently showcase that D3GZSL elevates the performance of existing generative GZSL methods, underscoring its potential to refine zero-shot learning practices. The code is available at: https://github.com/PJBQ/D3GZSL.git \ No newline at end of file diff --git a/data/2024/aaai/Data Efficient Paradigms for Personalized Assessment of Black-Box Taskable AI Systems b/data/2024/aaai/Data Efficient Paradigms for Personalized Assessment of Black-Box Taskable AI Systems new file mode 100644 index 0000000000..0995e860cf --- /dev/null +++ b/data/2024/aaai/Data Efficient Paradigms for Personalized Assessment of Black-Box Taskable AI Systems @@ -0,0 +1 @@ +The vast diversity of internal designs of taskable black-box AI systems and their nuanced zones of safe functionality make it difficult for a layperson to use them without unintended side effects. My dissertation focuses on developing paradigms that enable a user to assess and understand the limits of an AI system's safe operability. We develop a personalized AI assessment module that lets an AI system execute instruction sequences in simulators and answer queries about these executions. Our results show that such a primitive query-response interface is sufficient to efficiently derive a user-interpretable model of a system's capabilities.
\ No newline at end of file diff --git a/data/2024/aaai/Data Poisoning to Fake a Nash Equilibria for Markov Games b/data/2024/aaai/Data Poisoning to Fake a Nash Equilibria for Markov Games new file mode 100644 index 0000000000..9701b0f778 --- /dev/null +++ b/data/2024/aaai/Data Poisoning to Fake a Nash Equilibria for Markov Games @@ -0,0 +1 @@ +We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Nash equilibrium. The unique Nash set is central to poisoning attacks because the attack is successful if and only if data poisoning pushes all plausible games inside it. The unique Nash set generalizes the reward polytope commonly used in inverse reinforcement learning to MARL. For zero-sum Markov games, both the unique Nash set and the set of plausible games induced by data are polytopes in the Q function space. We exhibit a linear program to efficiently compute the optimal poisoning attack. Our work sheds light on the structure of data poisoning attacks on offline MARL, a necessary step before one can design more robust MARL algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Data Roaming and Quality Assessment for Composed Image Retrieval b/data/2024/aaai/Data Roaming and Quality Assessment for Composed Image Retrieval new file mode 100644 index 0000000000..e3f5c89c8e --- /dev/null +++ b/data/2024/aaai/Data Roaming and Quality Assessment for Composed Image Retrieval @@ -0,0 +1,2 @@ +The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller than other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times larger than existing ones. Pre-training on our LaSCo shows a noteworthy improvement in performance, even in the zero-shot setting. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity in queries. +We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks like FashionIQ and CIRR. \ No newline at end of file diff --git a/data/2024/aaai/Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance b/data/2024/aaai/Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance new file mode 100644 index 0000000000..ddcbae200d --- /dev/null +++ b/data/2024/aaai/Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance @@ -0,0 +1,2 @@ +Pretrained large models, particularly large language models, have garnered increasing attention, as they have demonstrated remarkable abilities through contextual learning.
Pretrained large models are increasingly recognized as fundamental tools for solving various tasks. However, the substantial computational demands of large models have dissuaded most product teams and individuals from running them. In such scenarios, to leverage the exceptional performance of large models, one must solely depend on costly APIs, further burdening product teams and individuals. On the other hand, despite the overall inferior performance of small models compared to large models, there are certain distributions where small models can achieve comparable or even superior results. For instance, during training, small models may become trapped in a local optimum that is unique to certain distributions, leading to superior performance. Hence, we propose Data Shunt (DS), a general paradigm for the collaboration of small and large models. DS not only substantially reduces the cost associated with deploying large models but also effectively enhances overall performance. Specifically, DS determines the shunting direction by evaluating the confidence level of small models. When the confidence level falls below a specific threshold, the input data is forwarded to large models. To further leverage the advantages of the small and large models, we introduce Prompt Pruning (PP) and 2-Stage Confidence Distillation (2CD), which facilitate mutual collaboration, leading to better results and less cost. +The remarkable performance across diverse modalities and tasks demonstrates the superiority of the proposed DS over large models. For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, and DS achieves an accuracy of 95.64%, while the cost has been reduced to only 31.18%. The code for the proposed method is provided for research purposes at https://github.com/Anfeather/Data-Shunt. \ No newline at end of file diff --git a/data/2024/aaai/Data-Augmented Curriculum Graph Neural Architecture Search under Distribution Shifts b/data/2024/aaai/Data-Augmented Curriculum Graph Neural Architecture Search under Distribution Shifts new file mode 100644 index 0000000000..c40fb10951 --- /dev/null +++ b/data/2024/aaai/Data-Augmented Curriculum Graph Neural Architecture Search under Distribution Shifts @@ -0,0 +1 @@ +Graph neural architecture search (NAS) has achieved great success in designing architectures for graph data processing. However, distribution shifts pose great challenges for graph NAS, since the optimal searched architectures for the training graph data may fail to generalize to the unseen test graph data. The sole prior work tackles this problem by customizing architectures for each graph instance through learning graph structural information, but fails to consider data augmentation during training, which has been proven by existing works to improve generalization. In this paper, we propose Data-augmented Curriculum Graph Neural Architecture Search (DCGAS), which learns an architecture customizer with good generalizability to data under distribution shifts. Specifically, we design an embedding-guided data generator, which can generate sufficient graphs for training to help the model better capture graph structural information. In addition, we design a two-factor uncertainty-based curriculum weighting strategy, which can evaluate the importance of data in enabling the model to learn key information in the real-world distribution and reweight the data during training.
Experimental results on synthetic datasets and real datasets with distribution shifts demonstrate that our proposed method learns generalizable mappings and outperforms existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Data-Driven Discovery of Design Specifications (Student Abstract) b/data/2024/aaai/Data-Driven Discovery of Design Specifications (Student Abstract) new file mode 100644 index 0000000000..f94f19c275 --- /dev/null +++ b/data/2024/aaai/Data-Driven Discovery of Design Specifications (Student Abstract) @@ -0,0 +1 @@ +Ensuring a machine learning model’s trustworthiness is crucial to prevent potential harm. One way to foster trust is through the formal verification of the model’s adherence to essential design requirements. However, this approach relies on well-defined, application-domain-centric criteria with which to test the model, and such specifications may be cumbersome to collect in practice. We propose a data-driven approach for creating specifications to evaluate a trained model effectively. Implementing this framework allows us to prove that the model will exhibit safe behavior while minimizing the false-positive prediction rate. This strategy enhances predictive accuracy and safety, providing deeper insight into the model’s strengths and weaknesses, and promotes trust through a systematic approach. \ No newline at end of file diff --git a/data/2024/aaai/Data-Driven Knowledge-Aware Inference of Private Information in Continuous Double Auctions b/data/2024/aaai/Data-Driven Knowledge-Aware Inference of Private Information in Continuous Double Auctions new file mode 100644 index 0000000000..764050e06f --- /dev/null +++ b/data/2024/aaai/Data-Driven Knowledge-Aware Inference of Private Information in Continuous Double Auctions @@ -0,0 +1 @@ +Inferring the private information of humans from their strategic behavioral data is crucial and challenging. The main approach is first obtaining human behavior functions (which map public information and human private information to behavior), enabling subsequent inference of private information from observed behavior. Most existing studies rely on strong equilibrium assumptions to obtain behavior functions. Our work focuses on continuous double auctions, where multiple traders with heterogeneous rationalities and beliefs dynamically trade commodities and deriving equilibria is generally intractable. We develop a knowledge-aware machine learning-based framework to infer each trader's private cost vectors for producing different units of its commodity. Our key idea is to learn behavior functions by incorporating the statistical knowledge about private costs given the observed trader asking behavior across the population. Specifically, we first use a neural network to characterize each trader's behavior function. Second, we leverage the statistical knowledge to derive the posterior distribution of each trader's private costs given its observed asks. Third, through designing a novel loss function, we utilize the knowledge-based posterior distributions to guide the learning of the neural network. We conduct extensive experiments on a large experimental dataset, and demonstrate the superior performance of our framework over baselines in inferring the private information of humans. 
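The knowledge-aware inference framework above is described only at a high level, so the following rough sketch is a hypothetical illustration of how a population-level prior over private costs could guide the learning of a behavior function; the discretized cost grid, the Gaussian observation model, and the posterior-weighted squared error are assumptions of this sketch, not the paper's actual derivation.

import torch
import torch.nn.functional as F

def knowledge_guided_step(behavior_net, public_info, observed_ask, cost_grid, prior_logprobs, sigma=1.0):
    # behavior_net(public_info, cost) -> predicted ask, shape (B,).
    # cost_grid: (K,) candidate private-cost values; prior_logprobs: (K,) log prior over them.
    preds = torch.stack(
        [behavior_net(public_info, c.expand_as(observed_ask)) for c in cost_grid], dim=-1)  # (B, K)
    log_lik = -0.5 * ((preds - observed_ask.unsqueeze(-1)) / sigma) ** 2
    posterior = F.softmax(log_lik + prior_logprobs, dim=-1)  # (B, K) posterior over candidate costs
    # Posterior-weighted fitting error: the network is pulled toward candidate costs
    # that both explain the observed asks and agree with the statistical knowledge.
    loss = (posterior.detach() * (preds - observed_ask.unsqueeze(-1)) ** 2).sum(-1).mean()
    return loss, posterior

The returned posterior also serves as the inference output: its expectation over cost_grid gives a point estimate of a trader's private cost for the observed ask.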
\ No newline at end of file diff --git a/data/2024/aaai/Data-Driven Structural Fire Risk Prediction for City Properties b/data/2024/aaai/Data-Driven Structural Fire Risk Prediction for City Properties new file mode 100644 index 0000000000..d4563d6a8d --- /dev/null +++ b/data/2024/aaai/Data-Driven Structural Fire Risk Prediction for City Properties @@ -0,0 +1 @@ +Fire Departments conduct inspections to prevent fires, but it is unclear how to best allocate their limited inspection resources across the properties in a city. Currently, they use their intuition and experience to decide on which properties to inspect and lack a data-driven approach that could lead to a more principled use of inspection resources. The main contribution of this paper is to investigate such an approach, based on machine learning, which predicts a fire risk score for properties in a city from historical fire-incident data. These scores can then be used to help prioritize inspection resources toward higher-risk properties. We present a case study using data from a South Dakota fire department which contains information about properties in a city along with records of fire incidents. We use this data, consisting of more than 72,000 properties, to train a machine learning model to predict fire risk and evaluate its ability to rank the fire risk of properties in the city. We conduct and analyze experiments with variations of XGBoost, which is an algorithm well-suited to the challenges in this application, including missing data and a highly skewed class distribution. Our evaluation of the model-generated rankings, based on ranking metrics, shows that the model significantly outperforms random rankings and other natural baselines. We also analyze the feature importance computed for the models, which provides further insight into the model behavior. This model has been integrated into an interface for displaying the rankings across a city and is ready for beta testing. \ No newline at end of file diff --git a/data/2024/aaai/Data-Efficient Graph Learning b/data/2024/aaai/Data-Efficient Graph Learning new file mode 100644 index 0000000000..9902138c09 --- /dev/null +++ b/data/2024/aaai/Data-Efficient Graph Learning @@ -0,0 +1 @@ +My research strives to develop fundamental graph-centric learning algorithms to reduce the need for human supervision in low-resource scenarios. The focus is on achieving effective and reliable data-efficient learning on graphs, which can be summarized into three facets: (1) graph weakly-supervised learning; (2) graph few-shot learning; and (3) graph self-supervised learning. \ No newline at end of file diff --git a/data/2024/aaai/Data-Free Generalized Zero-Shot Learning b/data/2024/aaai/Data-Free Generalized Zero-Shot Learning new file mode 100644 index 0000000000..b6d673e7c5 --- /dev/null +++ b/data/2024/aaai/Data-Free Generalized Zero-Shot Learning @@ -0,0 +1 @@ +Deep learning models have the ability to extract rich knowledge from large-scale datasets. However, the sharing of data has become increasingly challenging due to concerns regarding data copyright and privacy. Consequently, this hampers the effective transfer of knowledge from existing data to novel downstream tasks and concepts. Zero-shot learning (ZSL) approaches aim to recognize new classes by transferring semantic knowledge learned from base classes. 
However, traditional generative ZSL methods often require access to real images from base classes and rely on manually annotated attributes, which presents challenges in terms of data restrictions and model scalability. To this end, this paper tackles a challenging and practical problem dubbed data-free zero-shot learning (DFZSL), where only a classifier pre-trained on CLIP features of the base classes is available for zero-shot classification. Specifically, we propose a generic framework for DFZSL, which consists of three main components. Firstly, to recover the virtual features of the base data, we model the CLIP features of base class images as samples from a von Mises-Fisher (vMF) distribution based on the pre-trained classifier. Secondly, we leverage the text features of CLIP as low-cost semantic information and propose a feature-language prompt tuning (FLPT) method to further align the virtual image features and textual features. Thirdly, we train a conditional generative model using the well-aligned virtual image features and corresponding semantic text features, enabling the generation of new-class features and achieving better zero-shot generalization. Our framework has been evaluated on five commonly used benchmarks for generalized ZSL, as well as 11 benchmarks for the base-to-new ZSL. The results demonstrate the superiority and effectiveness of our approach. Our code is available at https://github.com/ylong4/DFZSL. \ No newline at end of file diff --git a/data/2024/aaai/Data-Free Hard-Label Robustness Stealing Attack b/data/2024/aaai/Data-Free Hard-Label Robustness Stealing Attack new file mode 100644 index 0000000000..bc2f0e51ff --- /dev/null +++ b/data/2024/aaai/Data-Free Hard-Label Robustness Stealing Attack @@ -0,0 +1 @@ +The popularity of Machine Learning as a Service (MLaaS) has led to increased concerns about Model Stealing Attacks (MSA), which aim to craft a clone model by querying MLaaS. Currently, most research on MSA assumes that MLaaS can provide soft labels and that the attacker has a proxy dataset with a similar distribution. However, this fails to encapsulate the more practical scenario where only hard labels are returned by MLaaS and the data distribution remains elusive. Furthermore, most existing work focuses solely on stealing the model accuracy, neglecting the model robustness, while robustness is essential in security-sensitive scenarios, e.g., face-scan payment. Notably, improving model robustness often necessitates the use of expensive techniques such as adversarial training, thereby further making stealing robustness a more lucrative prospect. In response to these identified gaps, we introduce a novel Data-Free Hard-Label Robustness Stealing (DFHL-RS) attack in this paper, which enables the stealing of both model accuracy and robustness by simply querying hard labels of the target model without the help of any natural data. Comprehensive experiments demonstrate the effectiveness of our method. The clone model achieves a clean accuracy of 77.86% and a robust accuracy of 39.51% against AutoAttack, which are only 4.71% and 8.40% lower than the target model on the CIFAR-10 dataset, significantly exceeding the baselines. Our code is available at: https://github.com/LetheSec/DFHL-RS-Attack. 
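A toy sketch of the hard-label query-and-fit loop that such stealing attacks build on; the full DFHL-RS method additionally synthesizes query data and targets robustness, which is not shown here. The names target_model, clone, and x_synth are placeholders introduced only for illustration.
```python
import torch
import torch.nn.functional as F

def train_clone_step(clone, target_model, x_synth, optimizer):
    """One hard-label stealing step: query the target for labels only,
    then fit the clone with cross-entropy on those labels."""
    with torch.no_grad():
        # In a real MLaaS setting only the top-1 label is returned;
        # here argmax over logits stands in for that API response.
        hard_labels = target_model(x_synth).argmax(dim=1)
    optimizer.zero_grad()
    loss = F.cross_entropy(clone(x_synth), hard_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```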
\ No newline at end of file diff --git a/data/2024/aaai/DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models b/data/2024/aaai/DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models new file mode 100644 index 0000000000..c02de63678 --- /dev/null +++ b/data/2024/aaai/DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models @@ -0,0 +1 @@ +Dataset sanitization is a widely adopted proactive defense against poisoning-based backdoor attacks, aimed at filtering out and removing poisoned samples from training datasets. However, existing methods have shown limited efficacy in countering the ever-evolving trigger functions, and often lead to considerable degradation of benign accuracy. In this paper, we propose DataElixir, a novel sanitization approach tailored to purify poisoned datasets. We leverage diffusion models to eliminate trigger features and restore benign features, thereby turning the poisoned samples into benign ones. Specifically, with multiple iterations of the forward and reverse process, we extract intermediary images and their predicted labels for each sample in the original dataset. Then, we identify anomalous samples in terms of the presence of label transition of the intermediary images, detect the target label by quantifying distribution discrepancy, select their purified images considering pixel and feature distance, and determine their ground-truth labels by training a benign model. Experiments conducted on 9 popular attacks demonstrate that DataElixir effectively mitigates various complex attacks while exerting minimal impact on benign accuracy, surpassing the performance of baseline defense methods. \ No newline at end of file diff --git a/data/2024/aaai/De-biased Attention Supervision for Text Classification with Causality b/data/2024/aaai/De-biased Attention Supervision for Text Classification with Causality new file mode 100644 index 0000000000..2fab3cfa55 --- /dev/null +++ b/data/2024/aaai/De-biased Attention Supervision for Text Classification with Causality @@ -0,0 +1 @@ +In text classification models, while the unsupervised attention mechanism can enhance performance, it often produces attention distributions that are puzzling to humans, such as assigning high weight to seemingly insignificant conjunctions. Recently, numerous studies have explored Attention Supervision (AS) to guide the model toward more interpretable attention distributions. However, such AS can impact classification performance, especially in specialized domains. In this paper, we address this issue from a causality perspective. Firstly, we leverage the causal graph to reveal two biases in the AS: 1) bias caused by the label distribution of the dataset; 2) bias caused by words' different occurrence ranges, i.e., some words occur across labels while others occur only under a particular label. We then propose a novel De-biased Attention Supervision (DAS) method to eliminate these biases with causal techniques. Specifically, we adopt backdoor adjustment on the label-caused bias and reduce the word-caused bias by subtracting the direct causal effect of the word. Through extensive experiments on two professional text classification datasets (e.g., medicine and law), we demonstrate that our method achieves improved classification accuracy along with more coherent attention distributions. 
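The label-caused bias can be pictured with textbook backdoor adjustment, P(A | do(w)) = sum_y P(A | w, y) P(y). The sketch below is a generic numpy version of that marginalization, not the paper's exact estimator; the attention statistics and label prior are assumed inputs.
```python
import numpy as np

def backdoor_adjusted_attention(att_given_word_label, label_prior):
    """Generic backdoor adjustment: P(att | do(word)) = sum_y P(att | word, y) * P(y).

    att_given_word_label: (num_words, num_labels) mean attention of each word under each label.
    label_prior: (num_labels,) marginal label distribution of the dataset.
    Returns one debiased attention target per word."""
    return att_given_word_label @ label_prior
```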
\ No newline at end of file diff --git a/data/2024/aaai/DeRDaVa: Deletion-Robust Data Valuation for Machine Learning b/data/2024/aaai/DeRDaVa: Deletion-Robust Data Valuation for Machine Learning new file mode 100644 index 0000000000..bef7df4c7b --- /dev/null +++ b/data/2024/aaai/DeRDaVa: Deletion-Robust Data Valuation for Machine Learning @@ -0,0 +1 @@ +Data valuation is concerned with determining a fair valuation of data from data sources to compensate them or to identify training examples that are the most or least useful for predictions. With the rising interest in personal data ownership and data protection regulations, model owners will likely have to fulfil more data deletion requests. This raises issues that have not been addressed by existing works: Are the data valuation scores still fair with deletions? Must the scores be expensively recomputed? The answer is no. To avoid recomputations, we propose using our data valuation framework DeRDaVa upfront for valuing each data source's contribution to preserving robust model performance after anticipated data deletions. DeRDaVa can be efficiently approximated and will assign higher values to data that are more useful or less likely to be deleted. We further generalize DeRDaVa to Risk-DeRDaVa to cater to risk-averse/risk-seeking model owners who are concerned with the worst-case/best-case model utility. We also empirically demonstrate the practicality of our solutions. \ No newline at end of file diff --git a/data/2024/aaai/DeS3: Adaptive Attention-Driven Self and Soft Shadow Removal Using ViT Similarity b/data/2024/aaai/DeS3: Adaptive Attention-Driven Self and Soft Shadow Removal Using ViT Similarity new file mode 100644 index 0000000000..3d9ee4fb86 --- /dev/null +++ b/data/2024/aaai/DeS3: Adaptive Attention-Driven Self and Soft Shadow Removal Using ViT Similarity @@ -0,0 +1 @@ +Removing soft and self shadows that lack clear boundaries from a single image is still challenging. Self shadows are shadows that are cast on the object itself. Most existing methods rely on binary shadow masks, without considering the ambiguous boundaries of soft and self shadows. In this paper, we present DeS3, a method that removes hard, soft and self shadows based on adaptive attention and ViT similarity. Our novel ViT similarity loss utilizes features extracted from a pre-trained Vision Transformer. This loss helps guide the reverse sampling towards recovering scene structures. Our adaptive attention is able to differentiate shadow regions from the underlying objects, as well as shadow regions from the object casting the shadow. This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows. Different from existing methods that rely on constraints during the training phase, we incorporate the ViT similarity during the sampling stage. Our method outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows robustly. Specifically, our method outperforms the SOTA method by 16% in whole-image RMSE on the LRSS dataset. 
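A minimal sketch of a ViT-feature similarity loss of the kind the DeS3 abstract describes: features of the shadow-removed output and a reference image are compared with cosine similarity under a frozen ViT. The vit extractor is assumed to return one embedding per image; this is an illustrative stand-in, not the released DeS3 code.
```python
import torch
import torch.nn.functional as F

def vit_similarity_loss(vit, restored, reference):
    """Penalise dissimilarity between frozen-ViT features of the shadow-removed
    output and the reference image (assumed shape: (B, D) embeddings)."""
    with torch.no_grad():
        feat_ref = vit(reference)        # reference features, no gradient
    feat_out = vit(restored)             # gradients flow into the restoration model
    return 1.0 - F.cosine_similarity(feat_out, feat_ref, dim=-1).mean()
```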
\ No newline at end of file diff --git a/data/2024/aaai/Dealing with Numeric and Metric Time Constraints in PDDL3 via Compilation to Numeric Planning b/data/2024/aaai/Dealing with Numeric and Metric Time Constraints in PDDL3 via Compilation to Numeric Planning new file mode 100644 index 0000000000..2998378e4e --- /dev/null +++ b/data/2024/aaai/Dealing with Numeric and Metric Time Constraints in PDDL3 via Compilation to Numeric Planning @@ -0,0 +1,2 @@ +This paper studies an approach to planning with PDDL3 constraints involving mixed propositional and numeric conditions, as well as metric time constraints. +We show how the whole of PDDL3 with instantaneous actions can be compiled away into a numeric planning problem without PDDL3 constraints, enabling the use of any state-of-the-art numeric planner that is agnostic to the existence of PDDL3. Our solution exploits the concept of regression. In addition to a basic compilation, we present an optimized variant based on the observation that it is possible to make the compilation sensitive to the structure of the problem to solve; this can be done by reasoning on the interactions between the problem actions and the constraints. The resulting optimization substantially reduces the size of the planning task. We experimentally observe that our approach significantly outperforms existing state-of-the-art planners supporting the same class of constraints over known benchmark domains, establishing a new state-of-the-art planning system for PDDL3. \ No newline at end of file diff --git a/data/2024/aaai/Debiased Novel Category Discovering and Localization b/data/2024/aaai/Debiased Novel Category Discovering and Localization new file mode 100644 index 0000000000..c84bb27b7b --- /dev/null +++ b/data/2024/aaai/Debiased Novel Category Discovering and Localization @@ -0,0 +1 @@ +In recent years, object detection in deep learning has experienced rapid development. However, most existing object detection models perform well only on closed-set datasets, ignoring a large number of potential objects whose categories are not defined in the training set. These objects are often identified as background or incorrectly classified as pre-defined categories by the detectors. In this paper, we focus on the challenging problem of Novel Class Discovery and Localization (NCDL), aiming to train detectors that can detect the categories present in the training data, while also actively discovering, localizing, and clustering new categories. We analyze existing NCDL methods and identify the core issue: object detectors tend to be biased towards seen objects, and this leads to the neglect of unseen targets. To address this issue, we first propose a Debiased Region Mining (DRM) approach that combines a class-agnostic Region Proposal Network (RPN) and a class-aware RPN in a complementary manner. Additionally, we propose improving the representation network through semi-supervised contrastive learning by leveraging unlabeled data. Finally, we adopt a simple and efficient mini-batch K-means clustering method for novel class discovery. We conduct extensive experiments on the NCDL benchmark, and the results demonstrate that the proposed DRM approach significantly outperforms previous methods, establishing a new state-of-the-art. 
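The final clustering step of the NCDL pipeline can be illustrated with scikit-learn's MiniBatchKMeans over features of proposals that were not matched to known classes; the random feature array and the number of novel clusters below are assumptions made for the sketch.
```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in: embeddings of region proposals not assigned to any known class.
region_features = np.random.randn(10000, 256).astype(np.float32)

kmeans = MiniBatchKMeans(n_clusters=20, batch_size=1024, n_init=3, random_state=0)
novel_cluster_ids = kmeans.fit_predict(region_features)  # pseudo-labels for novel categories
```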
\ No newline at end of file diff --git a/data/2024/aaai/Debiasing Multimodal Sarcasm Detection with Contrastive Learning b/data/2024/aaai/Debiasing Multimodal Sarcasm Detection with Contrastive Learning new file mode 100644 index 0000000000..11ac0a9855 --- /dev/null +++ b/data/2024/aaai/Debiasing Multimodal Sarcasm Detection with Contrastive Learning @@ -0,0 +1 @@ +Despite commendable achievements made by existing work, prevailing multimodal sarcasm detection studies rely more on textual content than on visual information. This unavoidably induces spurious correlations between textual words and labels, thereby significantly hindering the models' generalization capability. To address this problem, we define the task of out-of-distribution (OOD) multimodal sarcasm detection, which aims to evaluate models' generalizability when the word distribution is different in training and testing settings. Moreover, we propose a novel debiasing multimodal sarcasm detection framework with contrastive learning, which aims to mitigate the harmful effect of biased textual factors for robust OOD generalization. In particular, we first design counterfactual data augmentation to construct the positive samples with dissimilar word biases and negative samples with similar word biases. Subsequently, we devise an adapted debiasing contrastive learning mechanism to empower the model to learn robust task-relevant features and alleviate the adverse effect of biased words. Extensive experiments show the superiority of the proposed framework. \ No newline at end of file diff --git a/data/2024/aaai/DeblurSR: Event-Based Motion Deblurring under the Spiking Representation b/data/2024/aaai/DeblurSR: Event-Based Motion Deblurring under the Spiking Representation new file mode 100644 index 0000000000..c0d77bf76a --- /dev/null +++ b/data/2024/aaai/DeblurSR: Event-Based Motion Deblurring under the Spiking Representation @@ -0,0 +1 @@ +We present DeblurSR, a novel motion deblurring approach that converts a blurry image into a sharp video. DeblurSR utilizes event data to compensate for motion ambiguities and exploits the spiking representation to parameterize the sharp output video as a mapping from time to intensity. Our key contribution, the Spiking Representation (SR), is inspired by the neuromorphic principles determining how biological neurons communicate with each other in living organisms. We discuss why the spikes can represent sharp edges and how the spiking parameters are interpreted from the neuromorphic perspective. DeblurSR has higher output quality and requires fewer computing resources than state-of-the-art event-based motion deblurring methods. We additionally show that our approach easily extends to video super-resolution when combined with recent advances in implicit neural representation. \ No newline at end of file diff --git a/data/2024/aaai/Decentralized Gradient-Free Methods for Stochastic Non-smooth Non-convex Optimization b/data/2024/aaai/Decentralized Gradient-Free Methods for Stochastic Non-smooth Non-convex Optimization new file mode 100644 index 0000000000..a776ec5e61 --- /dev/null +++ b/data/2024/aaai/Decentralized Gradient-Free Methods for Stochastic Non-smooth Non-convex Optimization @@ -0,0 +1 @@ +We consider decentralized gradient-free optimization of minimizing Lipschitz continuous functions that satisfy neither the smoothness nor the convexity assumption. 
We propose two novel gradient-free algorithms, the Decentralized Gradient-Free Method (DGFM) and its variant, the Decentralized Gradient-Free Method+ (DGFM+). Based on the techniques of randomized smoothing and gradient tracking, DGFM requires the computation of the zeroth-order oracle of a single sample in each iteration, making it less demanding in terms of computational resources for individual computing nodes. Theoretically, DGFM achieves a complexity of O(d^(3/2)δ^(-1)ε^(-4)) for obtaining a (δ,ε)-Goldstein stationary point. DGFM+, an advanced version of DGFM, incorporates variance reduction to further improve the convergence behavior. It samples a mini-batch at each iteration and periodically draws a larger batch of data, which improves the complexity to O(d^(3/2)δ^(-1)ε^(-3)). Moreover, experimental results underscore the empirical advantages of our proposed algorithms when applied to real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Decentralized Monte Carlo Tree Search for Partially Observable Multi-Agent Pathfinding b/data/2024/aaai/Decentralized Monte Carlo Tree Search for Partially Observable Multi-Agent Pathfinding new file mode 100644 index 0000000000..302e102a82 --- /dev/null +++ b/data/2024/aaai/Decentralized Monte Carlo Tree Search for Partially Observable Multi-Agent Pathfinding @@ -0,0 +1 @@ +The Multi-Agent Pathfinding (MAPF) problem involves finding a set of conflict-free paths for a group of agents confined to a graph. In typical MAPF scenarios, the graph and the agents' starting and ending vertices are known beforehand, allowing the use of centralized planning algorithms. However, in this study, we focus on the decentralized MAPF setting, where the agents may observe the other agents only locally and are restricted in communications with each other. Specifically, we investigate the lifelong variant of MAPF, where new goals are continually assigned to the agents upon completion of previous ones. Drawing inspiration from the successful AlphaZero approach, we propose a decentralized multi-agent Monte Carlo Tree Search (MCTS) method for MAPF tasks. Our approach utilizes the agent's observations to recreate the intrinsic Markov decision process, which is then used for planning with a version of neural MCTS tailored for multi-agent tasks. The experimental results show that our approach outperforms state-of-the-art learnable MAPF solvers. The source code is available at https://github.com/AIRI-Institute/mats-lp. \ No newline at end of file diff --git a/data/2024/aaai/Decentralized Scheduling with QoS Constraints: Achieving O(1) QoS Regret of Multi-Player Bandits b/data/2024/aaai/Decentralized Scheduling with QoS Constraints: Achieving O(1) QoS Regret of Multi-Player Bandits new file mode 100644 index 0000000000..d830e103ee --- /dev/null +++ b/data/2024/aaai/Decentralized Scheduling with QoS Constraints: Achieving O(1) QoS Regret of Multi-Player Bandits @@ -0,0 +1 @@ +We consider a decentralized multi-player multi-armed bandit (MP-MAB) problem where players cannot observe the actions and rewards of other players and no explicit communication or coordination between players is possible. Prior studies mostly focus on maximizing the sum of rewards of the players over time. However, maximizing the total reward may lead to imbalanced rewards among players, resulting in poor Quality of Service (QoS) for some players. 
In contrast, our objective is to let each player n achieve a predetermined expected average reward over time, i.e., achieving a predetermined level of QoS. We develop a novel decentralized MP-MAB algorithm to accomplish this objective by leveraging the methodology of randomized matching. We prove that our decentralized algorithm can ensure that all players have an O(1) QoS regret. We also reveal an analogy between our MP-MAB model and online wireless queuing systems, which builds a connection between QoS in MP-MAB learning and stability in queuing theory. \ No newline at end of file diff --git a/data/2024/aaai/Decentralized Sum-of-Nonconvex Optimization b/data/2024/aaai/Decentralized Sum-of-Nonconvex Optimization new file mode 100644 index 0000000000..2f1df0cc9c --- /dev/null +++ b/data/2024/aaai/Decentralized Sum-of-Nonconvex Optimization @@ -0,0 +1 @@ +We consider the optimization problem of minimizing the sum-of-nonconvex function, i.e., a convex function that is the average of nonconvex components. The existing stochastic algorithms for such a problem only focus on a single machine and the centralized scenario. In this paper, we study the sum-of-nonconvex optimization in the decentralized setting. We present a new theoretical analysis of the PMGT-SVRG algorithm for this problem and prove the linear convergence of this approach. However, the convergence rate of the PMGT-SVRG algorithm has a linear dependency on the condition number, which is undesirable for ill-conditioned problems. To remedy this issue, we propose an accelerated stochastic decentralized first-order algorithm by incorporating the techniques of acceleration, gradient tracking, and multi-consensus mixing into the SVRG algorithm. The convergence rate of the proposed method has a square-root dependency on the condition number. The numerical experiments validate the theoretical guarantee of our proposed algorithms on both synthetic and real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation b/data/2024/aaai/Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation new file mode 100644 index 0000000000..7fe87ae6ef --- /dev/null +++ b/data/2024/aaai/Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation @@ -0,0 +1 @@ +Understanding and accurately explaining compatibility relationships between fashion items is a challenging problem in the burgeoning domain of AI-driven outfit recommendations. Present models, while making strides in this area, still occasionally fall short, offering explanations that can be elementary and repetitive. This work aims to address these shortcomings by introducing the Pair Fashion Explanation (PFE) dataset, a unique resource that has been curated to illuminate these compatibility relationships. Furthermore, we propose an innovative two-stage pipeline model that leverages this dataset. Fine-tuning on this dataset allows the model to generate explanations that convey the compatibility relationships between items. Our experiments showcase the model's potential in crafting descriptions that are knowledgeable, aligned with ground-truth matching correlations, and that produce understandable and informative descriptions, as assessed by both automatic metrics and human evaluation. 
\ No newline at end of file diff --git a/data/2024/aaai/Deciphering Raw Data in Neuro-Symbolic Learning with Provable Guarantees b/data/2024/aaai/Deciphering Raw Data in Neuro-Symbolic Learning with Provable Guarantees new file mode 100644 index 0000000000..64db8168ee --- /dev/null +++ b/data/2024/aaai/Deciphering Raw Data in Neuro-Symbolic Learning with Provable Guarantees @@ -0,0 +1 @@ +Neuro-symbolic hybrid systems are promising for integrating machine learning and symbolic reasoning, where perception models are facilitated with information inferred from a symbolic knowledge base through logical reasoning. Despite empirical evidence showing the ability of hybrid systems to learn accurate perception models, the theoretical understanding of learnability is still lacking. Hence, it remains unclear why a hybrid system succeeds for a specific task and when it may fail given a different knowledge base. In this paper, we introduce a novel way of characterising supervision signals from a knowledge base, and establish a criterion for determining the knowledge’s efficacy in facilitating successful learning. This, for the first time, allows us to address the two questions above by inspecting the knowledge base under investigation. Our analysis suggests that many knowledge bases satisfy the criterion, thus enabling effective learning, while some fail to satisfy it, indicating potential failures. Comprehensive experiments confirm the utility of our criterion on benchmark tasks. \ No newline at end of file diff --git a/data/2024/aaai/Decision-Making for Land Conservation: A Derivative-Free Optimization Framework with Nonlinear Inputs b/data/2024/aaai/Decision-Making for Land Conservation: A Derivative-Free Optimization Framework with Nonlinear Inputs new file mode 100644 index 0000000000..a4140c38d4 --- /dev/null +++ b/data/2024/aaai/Decision-Making for Land Conservation: A Derivative-Free Optimization Framework with Nonlinear Inputs @@ -0,0 +1,6 @@ +Protected areas (PAs) are designated spaces where human activities are restricted to preserve critical habitats. Decision-makers are challenged with balancing a trade-off of financial feasibility with ecological benefit when establishing PAs. Given the long-term ramifications of these decisions and the constantly shifting environment, it is crucial that PAs are carefully selected with long-term viability in mind. + +Using AI tools like simulation and optimization is common for designating PAs, but current decision models are primarily linear. In this paper, we propose a derivative-free optimization framework paired with a nonlinear component, population viability analysis (PVA). Formulated as a mixed integer nonlinear programming (MINLP) problem, our model allows for linear and nonlinear inputs. Connectivity, competition, crowding, and other similar concerns are handled by the PVA software, rather than expressed as constraints of the optimization model. In addition, we present numerical results that serve as a proof of concept, showing our models yield PAs with similar expected risk to that of preserving every parcel in a habitat, but at a significantly lower cost. + +The overall goal is to promote interdisciplinary work by providing a new mathematical programming tool for conservationists that allows for nonlinear inputs and can be paired with existing ecological software. The code and data are available at +https://github.com/cassiebuhler/conservation-dfo. 
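A toy derivative-free search in the spirit of the conservation framework above: parcel selection is a binary vector, the population viability analysis is treated as a black box returning an extinction-risk score, and a simple random local search trades acquisition cost against risk. The pva_risk function, parcel costs, and the risk weighting are hypothetical stand-ins, not the authors' model or data.
```python
import numpy as np

rng = np.random.default_rng(0)
n_parcels = 50
parcel_costs = rng.uniform(1.0, 10.0, n_parcels)      # hypothetical acquisition costs

def pva_risk(selection):
    """Placeholder for a black-box population viability analysis call:
    returns an extinction-risk score for the protected-area configuration."""
    protected = selection.sum()
    return np.exp(-0.15 * protected) + 0.01 * rng.standard_normal()

def objective(selection, risk_weight=100.0):
    # derivative-free objective: money spent plus weighted black-box risk
    return parcel_costs @ selection + risk_weight * pva_risk(selection)

# simple random local search (one bit flip per iteration) as the DFO loop
best = rng.integers(0, 2, n_parcels)
best_val = objective(best)
for _ in range(2000):
    cand = best.copy()
    cand[rng.integers(n_parcels)] ^= 1                 # flip one parcel in or out
    val = objective(cand)
    if val < best_val:
        best, best_val = cand, val
print(best_val, int(best.sum()), "parcels selected")
```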
\ No newline at end of file diff --git a/data/2024/aaai/Decoding AI's Nudge: A Unified Framework to Predict Human Behavior in AI-Assisted Decision Making b/data/2024/aaai/Decoding AI's Nudge: A Unified Framework to Predict Human Behavior in AI-Assisted Decision Making new file mode 100644 index 0000000000..77be16c203 --- /dev/null +++ b/data/2024/aaai/Decoding AI's Nudge: A Unified Framework to Predict Human Behavior in AI-Assisted Decision Making @@ -0,0 +1,3 @@ +With the rapid development of AI-based decision aids, different forms of AI assistance have been increasingly integrated into the human decision making processes. To best support humans in decision making, it is essential to quantitatively understand how diverse forms of AI assistance influence humans' decision making behavior. To this end, much of the current research focuses on the end-to-end prediction of human behavior using ``black-box'' models, often lacking interpretations of the nuanced ways in which AI assistance impacts the human decision making process. +Meanwhile, methods that prioritize the interpretability of human behavior predictions are often tailored for one specific form of AI assistance, making adaptations to other forms of assistance difficult. In this paper, we propose a computational framework that can provide an interpretable characterization of the influence of different forms of AI assistance on decision makers in AI-assisted decision making. By conceptualizing AI assistance as the ``nudge'' in human decision making processes, our approach centers around modelling how different forms of AI assistance modify humans' strategy in weighing different information in making their decisions. Evaluations on behavior data collected from real human decision makers +show that the proposed framework outperforms various baselines in accurately predicting human behavior in AI-assisted decision making. Based on the proposed framework, we further provide insights into how individuals with different cognitive styles are nudged by AI assistance differently. \ No newline at end of file diff --git a/data/2024/aaai/Decoding Global Preferences: Temporal and Cooperative Dependency Modeling in Multi-Agent Preference-Based Reinforcement Learning b/data/2024/aaai/Decoding Global Preferences: Temporal and Cooperative Dependency Modeling in Multi-Agent Preference-Based Reinforcement Learning new file mode 100644 index 0000000000..f9073cec39 --- /dev/null +++ b/data/2024/aaai/Decoding Global Preferences: Temporal and Cooperative Dependency Modeling in Multi-Agent Preference-Based Reinforcement Learning @@ -0,0 +1,2 @@ +Designing accurate reward functions for reinforcement learning (RL) has long been challenging. Preference-based RL (PbRL) offers a promising approach by using human preferences +to train agents, eliminating the need for manual reward design. While successful in single-agent tasks, extending PbRL to complex multi-agent scenarios is nontrivial. Existing PbRL methods lack the capacity to comprehensively capture both temporal and cooperative aspects, leading to inadequate reward functions. This work introduces an advanced multi-agent preference learning framework that effectively addresses these limitations. Based on a cascading Transformer architecture, our approach captures both temporal and cooperative dependencies, alleviating issues related to reward uniformity and intricate interactions among agents. 
Experimental results demonstrate substantial performance improvements in multi-agent cooperative tasks, and the reconstructed reward function closely resembles expert-defined reward functions. The source code is available at https://github.com/catezi/MAPT. \ No newline at end of file diff --git a/data/2024/aaai/Decomposing Constraint Networks for Calculating c-Representations b/data/2024/aaai/Decomposing Constraint Networks for Calculating c-Representations new file mode 100644 index 0000000000..ce8c29c858 --- /dev/null +++ b/data/2024/aaai/Decomposing Constraint Networks for Calculating c-Representations @@ -0,0 +1 @@ +It is well-known from probability theory that network-based methods like Bayesian networks constitute remarkable frameworks for efficient probabilistic reasoning. In this paper, we focus on qualitative default reasoning based on Spohn’s ranking functions for which network-based methods have not yet been studied satisfactorily. With constraint networks, we develop a framework for iterative calculations of c-representations, a family of ranking models of conditional belief bases which show outstanding properties from a commonsense and formal point of view, that are characterized by assigning possible worlds a degree of implausibility via penalizing the falsification of conditionals. Constraint networks unveil the dependencies among these penalty points (and hence among the conditionals) and make it possible to compute the penalty points locally on so-called safe sub-bases. As an application of our framework, we show that skeptical c-inferences can be drawn locally from safe sub-bases without losing validity. \ No newline at end of file diff --git a/data/2024/aaai/Decomposing Semantic Shifts for Composed Image Retrieval b/data/2024/aaai/Decomposing Semantic Shifts for Composed Image Retrieval new file mode 100644 index 0000000000..5adf9af958 --- /dev/null +++ b/data/2024/aaai/Decomposing Semantic Shifts for Composed Image Retrieval @@ -0,0 +1 @@ +Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image. However, most existing methods focus on the composition learning of text and reference images and oversimplify the text as a description, neglecting the inherent structure and the user's shifting intention of the texts. As a result, these methods typically take shortcuts that disregard the visual cue of the reference images. To address this issue, we reconsider the text as instructions and propose a Semantic Shift Network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image. Specifically, SSN explicitly decomposes the instructions into two components: degradation and upgradation, where the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representations to retrieve the desired target image. The experimental results show that the proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance. The code is available at https://github.com/starxing-yuu/SSN. 
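A schematic sketch of the two-step shift described in the SSN abstract above: the instruction is split into a "degradation" part that strips reference-specific content down to a visual prototype and an "upgradation" part that enriches the prototype toward the target embedding. The encoders, dimensions, and the additive composition are assumptions made for illustration only.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512  # assumed embedding size

# hypothetical stand-ins for the image encoder and the two shift branches
img_encoder = nn.Linear(2048, dim)
degrade_net = nn.Linear(768, dim)   # maps "degradation" text features to a shift
upgrade_net = nn.Linear(768, dim)   # maps "upgradation" text features to a shift

def semantic_shift(ref_img_feat, degrade_text_feat, upgrade_text_feat):
    ref = img_encoder(ref_img_feat)
    prototype = ref - degrade_net(degrade_text_feat)            # step 1: strip reference-specific content
    target_query = prototype + upgrade_net(upgrade_text_feat)   # step 2: add the requested modification
    return F.normalize(target_query, dim=-1)                    # query used to retrieve the target image

# toy usage with random features
query = semantic_shift(torch.randn(1, 2048), torch.randn(1, 768), torch.randn(1, 768))
```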
\ No newline at end of file diff --git a/data/2024/aaai/Decomposing Temporal Equilibrium Strategy for Coordinated Distributed Multi-Agent Reinforcement Learning b/data/2024/aaai/Decomposing Temporal Equilibrium Strategy for Coordinated Distributed Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..f4acad47bd --- /dev/null +++ b/data/2024/aaai/Decomposing Temporal Equilibrium Strategy for Coordinated Distributed Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +The increasing demands for system complexity and robustness have prompted the integration of temporal logic into Multi-Agent Reinforcement Learning (MARL) to address tasks with non-Markovian properties. However, incorporating non-Markovian properties introduces additional computational complexities, as agents are required to integrate historical data into their decision-making process. Also, optimizing strategies within a multi-agent environment presents significant challenges due to the exponential growth of the state space with the number of agents. In this study, we introduce an innovative hierarchical MARL framework that synthesizes temporal equilibrium strategies through parity games and subsequently encodes them as individual reward machines for MARL coordination. More specifically, we reduce the strategy synthesis problem into an emptiness problem concerning parity games with optimized states and transitions. Following this synthesis step, the temporal equilibrium strategy is decomposed into individual reward machines for decentralized MARL. Theoretical proofs are provided to verify the consistency of the Nash equilibrium between the parallel composition of decomposed strategies and the original strategy. Empirical evidence confirms the efficacy of the proposed synthesis technique, showcasing its ability to reduce state space compared to the state-of-the-art tool. Furthermore, our study highlights the superior performance of the distributed MARL paradigm over centralized approaches when deploying decomposed strategies. \ No newline at end of file diff --git a/data/2024/aaai/Decompositions in Compositional Translation of LTLf to DFA (Student Abstract) b/data/2024/aaai/Decompositions in Compositional Translation of LTLf to DFA (Student Abstract) new file mode 100644 index 0000000000..562e459b5a --- /dev/null +++ b/data/2024/aaai/Decompositions in Compositional Translation of LTLf to DFA (Student Abstract) @@ -0,0 +1 @@ +Prior compositional methods in LTLf to DFA conversion have focussed on improving the composition phase. In this work, we examine improvements to the decomposition phase that result in overall improvements in LTLf to DFA translation. Our work is based on reducing the structure of the underlying Abstract Syntax Tree (AST) of a formula such that the new AST results in fewer composition operations. \ No newline at end of file diff --git a/data/2024/aaai/Decouple Content and Motion for Conditional Image-to-Video Generation b/data/2024/aaai/Decouple Content and Motion for Conditional Image-to-Video Generation new file mode 100644 index 0000000000..0ccf97575f --- /dev/null +++ b/data/2024/aaai/Decouple Content and Motion for Conditional Image-to-Video Generation @@ -0,0 +1 @@ +The goal of conditional image-to-video (cI2V) generation is to create a believable new video by beginning with the condition, i.e., one image and text. The previous cI2V generation methods conventionally perform in RGB pixel space, with limitations in modeling motion consistency and visual continuity. 
Additionally, the efficiency of generating videos in pixel space is quite low. In this paper, we propose a novel approach to address these challenges by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions. Specifically, we predict temporal motions, which include motion vectors and residuals, based on a 3D-UNet diffusion model. By explicitly modeling temporal motions and warping them to the starting image, we improve the temporal consistency of generated videos. This results in a reduction of spatial redundancy, emphasizing temporal details. Our proposed method achieves performance improvements by disentangling content and motion, all without introducing new structural complexities to the model. Extensive experiments on various datasets confirm our approach's superior performance over the majority of state-of-the-art methods in both effectiveness and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/Decoupled Contrastive Learning for Long-Tailed Recognition b/data/2024/aaai/Decoupled Contrastive Learning for Long-Tailed Recognition new file mode 100644 index 0000000000..de3d205789 --- /dev/null +++ b/data/2024/aaai/Decoupled Contrastive Learning for Long-Tailed Recognition @@ -0,0 +1,2 @@ +Supervised Contrastive Loss (SCL) is popular in visual representation learning. + Given an anchor image, SCL pulls together two types of positive samples, i.e., its augmentation and other images from the same class, while pushing negative images apart to optimize the learned embedding. In the scenario of long-tailed recognition, where the number of samples in each class is imbalanced, treating the two types of positive samples equally leads to biased optimization of the intra-category distance. In addition, the similarity relationships among negative samples, which are ignored by SCL, also present meaningful semantic cues. To improve the performance on long-tailed recognition, this paper addresses those two issues of SCL by decoupling the training objective. Specifically, it decouples the two types of positives in SCL and optimizes their relations toward different objectives to alleviate the influence of the imbalanced dataset. We further propose a patch-based self-distillation to transfer knowledge from head to tail classes to relieve the under-representation of tail classes. It uses patch-based features to mine shared visual patterns among different instances and leverages a self-distillation procedure to transfer such knowledge. Experiments on different long-tailed classification benchmarks demonstrate the superiority of our method. For instance, it achieves 57.7% top-1 accuracy on the ImageNet-LT dataset. Combined with the ensemble-based method, the performance can be further boosted to 59.7%, which substantially outperforms many recent works. Our code will be released. \ No newline at end of file diff --git a/data/2024/aaai/Decoupled Optimisation for Long-Tailed Visual Recognition b/data/2024/aaai/Decoupled Optimisation for Long-Tailed Visual Recognition new file mode 100644 index 0000000000..c3e68e5908 --- /dev/null +++ b/data/2024/aaai/Decoupled Optimisation for Long-Tailed Visual Recognition @@ -0,0 +1,2 @@ +When training on a long-tailed dataset, conventional learning algorithms tend to exhibit a bias towards classes with a larger sample size. 
Our investigation has revealed that this biased learning tendency originates from the model parameters, which are trained to disproportionately contribute to the classes characterised by their sample size (e.g., many, medium, and few classes). +To balance the overall parameter contribution across all classes, we investigate the importance of each model parameter to the learning of different class groups, and propose a multistage parameter Decouple and Optimisation (DO) framework that decouples parameters into different groups with each group learning a specific portion of classes. To optimise the parameter learning, we apply different training objectives with a collaborative optimisation step to learn complementary information about each class group. Extensive experiments on long-tailed datasets, including CIFAR100, Places-LT, ImageNet-LT, and iNaturalist 2018, show that our framework achieves competitive performance compared to the state-of-the-art. \ No newline at end of file diff --git a/data/2024/aaai/Decoupled Textual Embeddings for Customized Image Generation b/data/2024/aaai/Decoupled Textual Embeddings for Customized Image Generation new file mode 100644 index 0000000000..81567fe9f0 --- /dev/null +++ b/data/2024/aaai/Decoupled Textual Embeddings for Customized Image Generation @@ -0,0 +1 @@ +Customized text-to-image generation, which aims to learn user-specified concepts with a few images, has drawn significant attention recently. However, existing methods usually suffer from overfitting issues and entangle the subject-unrelated information (e.g., background and pose) with the learned concept, limiting the potential to compose the concept into new scenes. To address these issues, we propose DETEX, a novel approach that learns a disentangled concept embedding for flexible customized text-to-image generation. Unlike conventional methods that learn a single concept embedding from the given images, our DETEX represents each image using multiple word embeddings during training, i.e., a learnable image-shared subject embedding and several image-specific subject-unrelated embeddings. To decouple irrelevant attributes (i.e., background and pose) from the subject embedding, we further present several attribute mappers that encode each image as several image-specific subject-unrelated embeddings. To encourage these unrelated embeddings to capture the irrelevant information, we incorporate them with corresponding attribute words and propose a joint training strategy to facilitate the disentanglement. During inference, we only use the subject embedding for image generation, while selectively using image-specific embeddings to retain image-specific attributes. Extensive experiments demonstrate that the subject embedding obtained by our method can faithfully represent the target concept, while showing superior editability compared to the state-of-the-art methods. Our code will be available at https://github.com/PrototypeNx/DETEX. 
To tackle the challenges of dataset bias and domain domination, numerous MDL approaches have been proposed from the perspectives of seeking commonalities by aligning distributions to reduce the domain gap or reserving differences by implementing domain-specific towers, gates, and even experts. MDL models are becoming more and more complex with sophisticated network architectures or loss functions, introducing extra parameters and enlarging computation costs. In this paper, we propose a frustratingly easy and hyperparameter-free multi-domain learning method named Decoupled Training (D-Train). D-Train is a tri-phase general-to-specific training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multiple heads, and finally fine-tunes the heads by fixing the backbone, enabling decoupled training to achieve domain independence. Despite its extraordinary simplicity and efficiency, D-Train performs remarkably well in extensive evaluations of various datasets from standard benchmarks to applications of satellite imagery and recommender systems. \ No newline at end of file diff --git a/data/2024/aaai/Decoupling Degradations with Recurrent Network for Video Restoration in Under-Display Camera b/data/2024/aaai/Decoupling Degradations with Recurrent Network for Video Restoration in Under-Display Camera new file mode 100644 index 0000000000..d11914bddd --- /dev/null +++ b/data/2024/aaai/Decoupling Degradations with Recurrent Network for Video Restoration in Under-Display Camera @@ -0,0 +1 @@ +Under-display camera (UDC) systems are the foundation of full-screen display devices in which the lens mounts under the display. The pixel array of light-emitting diodes used for display diffracts and attenuates incident light, causing various degradations as the light intensity changes. Unlike general video restoration, which recovers video by treating different degradation factors equally, video restoration for UDC systems is more challenging in that it concerns removing diverse degradations over time while preserving temporal consistency. In this paper, we introduce a novel video restoration network, called D2RNet, specifically designed for UDC systems. It employs a set of Decoupling Attention Modules (DAM) that effectively separate the various video degradation factors. More specifically, a soft mask generation function is proposed to decompose each frame into flare and haze based on the diffraction arising from incident light of different intensities, followed by the proposed flare and haze removal components that leverage long- and short-term feature learning to handle the respective degradations. Such a design offers a targeted and effective solution to eliminating various types of degradation in UDC systems. We further extend our design to multiple scales to overcome the scale changes of degradation that often occur in long-range videos. To demonstrate the superiority of D2RNet, we propose a large-scale UDC video benchmark by gathering HDR videos and generating realistically degraded videos using the point spread function measured by a commercial UDC system. Extensive quantitative and qualitative evaluations demonstrate the superiority of D2RNet compared to other state-of-the-art video restoration and UDC image restoration methods. 
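One way to picture a soft mask that splits a UDC frame into flare-dominated and haze-dominated components is an intensity-driven sigmoid gate. The threshold, temperature, and crude luminance proxy below are illustrative guesses, not the learned mask generation function proposed in D2RNet.
```python
import torch

def soft_flare_haze_split(frame, threshold=0.7, temperature=0.05):
    """frame: (B, 3, H, W) in [0, 1]. Bright regions are attributed to flare,
    the remainder to haze, via a smooth (soft) mask rather than a hard cut."""
    luminance = frame.mean(dim=1, keepdim=True)                    # crude luminance proxy
    mask = torch.sigmoid((luminance - threshold) / temperature)    # soft mask in (0, 1)
    flare_component = mask * frame
    haze_component = (1.0 - mask) * frame
    return flare_component, haze_component, mask
```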
\ No newline at end of file diff --git a/data/2024/aaai/Decoupling Representation and Knowledge for Few-Shot Intent Classification and Slot Filling b/data/2024/aaai/Decoupling Representation and Knowledge for Few-Shot Intent Classification and Slot Filling new file mode 100644 index 0000000000..8a60933546 --- /dev/null +++ b/data/2024/aaai/Decoupling Representation and Knowledge for Few-Shot Intent Classification and Slot Filling @@ -0,0 +1 @@ +Few-shot intent classification and slot filling are important but challenging tasks due to the scarcity of finely labeled data. Therefore, current works first train a model on source domains with sufficiently labeled data, and then transfer the model to target domains where only rarely labeled data is available. However, transferring experience as a whole usually suffers from gaps that exist between source domains and target domains. For instance, transferring domain-specific-knowledge-related experience is difficult. To tackle this problem, we propose a new method that explicitly decouples the transferring of general-semantic-representation-related experience and domain-specific-knowledge-related experience. Specifically, for domain-specific-knowledge-related experience, we design two modules to capture the intent-slot relation and the slot-slot relation, respectively. Extensive experiments on the Snips and FewJoint datasets show that our method achieves state-of-the-art performance. The method improves the joint accuracy metric from 27.72% to 42.20% in the 1-shot setting, and from 46.54% to 60.79% in the 5-shot setting. \ No newline at end of file diff --git a/data/2024/aaai/Decoupling User Relationships Guides Information Diffusion Prediction (Student Abstract) b/data/2024/aaai/Decoupling User Relationships Guides Information Diffusion Prediction (Student Abstract) new file mode 100644 index 0000000000..b3b36897a4 --- /dev/null +++ b/data/2024/aaai/Decoupling User Relationships Guides Information Diffusion Prediction (Student Abstract) @@ -0,0 +1 @@ +Information diffusion prediction is a critical task for many social network applications. However, current methods are mainly limited in one key aspect: the user relationships behind resharing behaviors are complex and entangled. To address this issue, we propose MHGFormer, a novel multi-channel hypergraph transformer framework, to better decouple complex user relations and obtain fine-grained user representations. First, we employ designed triangular motifs to decouple user relations into three hypergraphs at different levels. Second, a position-aware hypergraph transformer is used to refine user relations and obtain high-quality user representations. Extensive experiments conducted on two social datasets demonstrate that MHGFormer outperforms state-of-the-art diffusion models across several settings. \ No newline at end of file diff --git a/data/2024/aaai/Deep Contrastive Graph Learning with Clustering-Oriented Guidance b/data/2024/aaai/Deep Contrastive Graph Learning with Clustering-Oriented Guidance new file mode 100644 index 0000000000..dcb734a962 --- /dev/null +++ b/data/2024/aaai/Deep Contrastive Graph Learning with Clustering-Oriented Guidance @@ -0,0 +1 @@ +Graph Convolutional Network (GCN) has exhibited remarkable potential in improving graph-based clustering. To handle the general clustering scenario without a prior graph, these models estimate an initial graph beforehand to apply GCN. 
Throughout the literature, we have witnessed that 1) most models focus on the initial graph while neglecting the original features, so the discriminability of the learned representation may be corrupted by a low-quality initial graph; 2) the training procedure lacks effective clustering guidance, which may lead to the incorporation of clustering-irrelevant information into the learned graph. To tackle these problems, the Deep Contrastive Graph Learning (DCGL) model is proposed for general data clustering. Specifically, we establish a pseudo-siamese network, which incorporates an auto-encoder with GCN to emphasize both the graph structure and the original features. On this basis, feature-level contrastive learning is introduced to enhance the discriminative capacity, and the relationship between samples and centroids is employed as the clustering-oriented guidance. Afterward, a two-branch graph learning mechanism is designed to extract the local and global structural relationships, which are further embedded into a unified graph under the cluster-level contrastive guidance. Experimental results on several benchmark datasets demonstrate the superiority of DCGL against state-of-the-art algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Deep Copula-Based Survival Analysis for Dependent Censoring with Identifiability Guarantees b/data/2024/aaai/Deep Copula-Based Survival Analysis for Dependent Censoring with Identifiability Guarantees new file mode 100644 index 0000000000..ff3d8f00c4 --- /dev/null +++ b/data/2024/aaai/Deep Copula-Based Survival Analysis for Dependent Censoring with Identifiability Guarantees @@ -0,0 +1 @@ +Censoring is the central problem in survival analysis, where either the time-to-event (for instance, death) or the time-to-censoring (such as loss of follow-up) is observed for each sample. The majority of existing machine learning-based survival analysis methods assume that survival is conditionally independent of censoring given a set of covariates; an assumption that cannot be verified since only marginal distributions are available from the data. The existence of dependent censoring, along with the inherent bias in current estimators, has been demonstrated in a variety of applications, accentuating the need for a more nuanced approach. However, existing methods that adjust for dependent censoring require practitioners to specify the ground truth copula. This requirement poses a significant challenge for practical applications, as model misspecification can lead to substantial bias. In this work, we propose a flexible deep learning-based survival analysis method that simultaneously accommodates dependent censoring and eliminates the requirement for specifying the ground truth copula. We theoretically prove the identifiability of our model under a broad family of copulas and survival distributions. Experimental results from a wide range of datasets demonstrate that our approach successfully discerns the underlying dependency structure and significantly reduces survival estimation bias when compared to existing methods. 
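The copula view of dependent censoring can be illustrated with a Clayton copula: the joint survival of the event time T and the censoring time C can be coupled through a copula applied to the marginal survival functions. The sketch below only evaluates that coupling for hypothetical marginal survival values; the paper's deep parameterization and likelihood are not reproduced.
```python
import numpy as np

def clayton_copula(u, v, theta=1.0):
    """Clayton copula C_theta(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Hypothetical marginal survival probabilities at some time t:
S_T, S_C = 0.7, 0.6
joint_survival = clayton_copula(S_T, S_C, theta=2.0)   # P(T > t, C > t) under the copula
print(joint_survival)
```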
\ No newline at end of file diff --git a/data/2024/aaai/Deep Hierarchical Video Compression b/data/2024/aaai/Deep Hierarchical Video Compression new file mode 100644 index 0000000000..b48612dc51 --- /dev/null +++ b/data/2024/aaai/Deep Hierarchical Video Compression @@ -0,0 +1 @@ +Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors to predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common testing videos and demonstrates computational friendliness with a much smaller memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding, which is favored in networked video applications with packet loss. \ No newline at end of file diff --git a/data/2024/aaai/Deep Homography Estimation for Visual Place Recognition b/data/2024/aaai/Deep Homography Estimation for Visual Place Recognition new file mode 100644 index 0000000000..3195abdca8 --- /dev/null +++ b/data/2024/aaai/Deep Homography Estimation for Visual Place Recognition @@ -0,0 +1 @@ +Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, hierarchical VPR methods have received considerable attention due to their trade-off between accuracy and efficiency. They usually first use global features to retrieve candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting a homography, which is time-consuming and non-differentiable. This forces existing methods to compromise by training the network only for global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits a homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels, which can also be jointly trained with the backbone network to help it extract features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. It is also more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
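To make concrete the geometric verification step that the Deep Homography Estimation (DHE) abstract above seeks to replace, here is a minimal sketch of the conventional RANSAC-based re-ranking used by hierarchical VPR pipelines: fit a homography to matched local features and score each retrieved candidate by its inlier count. This is the baseline idea only, not the paper's learnable DHE network; the matched keypoint arrays are hypothetical inputs, and OpenCV is assumed to be available.

```python
import numpy as np
import cv2

def inlier_count(query_pts, candidate_pts, reproj_thresh=3.0):
    """Score a candidate image by the number of RANSAC inliers of a fitted homography.

    query_pts, candidate_pts: (N, 2) arrays of matched local-feature coordinates
    (how the matches are obtained is out of scope for this sketch).
    """
    if len(query_pts) < 4:  # a homography needs at least 4 correspondences
        return 0
    _, mask = cv2.findHomography(
        query_pts.astype(np.float32),
        candidate_pts.astype(np.float32),
        cv2.RANSAC,
        reproj_thresh,
    )
    return 0 if mask is None else int(mask.sum())

def rerank(matches_per_candidate):
    """matches_per_candidate: dict mapping candidate name -> (query_pts, candidate_pts)."""
    scores = {name: inlier_count(q, c) for name, (q, c) in matches_per_candidate.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The abstract's point is that this RANSAC step is slow and non-differentiable, which is why DHE replaces it with a network that can be trained jointly with the backbone.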
\ No newline at end of file diff --git a/data/2024/aaai/Deep Incomplete Multi-View Learning Network with Insufficient Label Information b/data/2024/aaai/Deep Incomplete Multi-View Learning Network with Insufficient Label Information new file mode 100644 index 0000000000..82c1f3b8f1 --- /dev/null +++ b/data/2024/aaai/Deep Incomplete Multi-View Learning Network with Insufficient Label Information @@ -0,0 +1 @@ +Due to the effectiveness of integrating semantic consensus and complementary information across different views, multi-view classification methods have attracted much attention in recent years. However, multi-view data often suffers from both missing view features and insufficient label information, which significantly decreases the performance of traditional multi-view classification methods in practice. Learning under such a simultaneous lack of features and labels is crucial but rarely studied. To tackle these problems, in this paper we propose a novel Deep Incomplete Multi-view Learning Network (DIMvLN) that incorporates graph networks and semi-supervised learning. Specifically, DIMvLN first designs deep graph networks to effectively recover missing data, assigning pseudo-labels to large amounts of unlabeled instances and refining the incomplete feature information. Meanwhile, to enhance the label information, a novel pseudo-label generation strategy with similarity constraints on unlabeled instances is proposed to exploit additional supervisory information and guide the completion module to preserve more semantic information of absent multi-view data. Besides, we design view-specific representation extractors with the autoencoder structure and a contrastive loss to learn high-level semantic representations for each view, promote cross-view consistency and augment the separability between different categories. Finally, extensive experimental results demonstrate the effectiveness of our DIMvLN, attaining noteworthy performance improvements compared to state-of-the-art competitors on several public benchmark datasets. Code will be available at GitHub. \ No newline at end of file diff --git a/data/2024/aaai/Deep Learning for Style Transfer and Experimentation with Audio Effects and Music Creation b/data/2024/aaai/Deep Learning for Style Transfer and Experimentation with Audio Effects and Music Creation new file mode 100644 index 0000000000..cdf06c85e5 --- /dev/null +++ b/data/2024/aaai/Deep Learning for Style Transfer and Experimentation with Audio Effects and Music Creation @@ -0,0 +1 @@ +Recent advancements in deep learning have the potential to transform the process of writing and creating music. Models that can capture and analyze higher-level representations of music and audio can serve to change the field of digital signal processing. In this statement, I propose a set of Music+AI methods that serve to assist with the writing of melodies, the modelling and transferring of timbres, the application of a wide variety of audio effects, including research into experimental audio effects, and the production of audio samples using style transfers. Writing and producing music is a tedious task that is notably difficult to become proficient in, as many tools for creating music both cost significant sums of money and require long-term commitments to study. An all-encompassing framework for music processing would make the process much more accessible and simple, and would allow human art to advance alongside technology.
\ No newline at end of file diff --git a/data/2024/aaai/Deep Learning on Graphs: A Data-Centric Exploration b/data/2024/aaai/Deep Learning on Graphs: A Data-Centric Exploration new file mode 100644 index 0000000000..209ec08371 --- /dev/null +++ b/data/2024/aaai/Deep Learning on Graphs: A Data-Centric Exploration @@ -0,0 +1 @@ +Many learning tasks in Artificial Intelligence (AI) require dealing with graph data, ranging from biology and chemistry to finance and education. As powerful deep learning tools for graphs, graph neural networks (GNNs) have demonstrated remarkable performance in various graph-related applications. Despite the significant accomplishments of GNNs, recent studies have highlighted that their efficiency and effectiveness face significant challenges such as adversarial robustness and scalability, which are fundamentally linked to data. While major attention has been devoted to improving GNNs from the model perspective, the potential of directly enhancing data has often been overlooked. It underscores a critical gap in GNN research---while model improvements are undoubtedly important, we also need to recognize and address the data-related factors contributing to the challenges. Hence, my research is to investigate solutions for these challenges from the data perspective, employing strategies such as data characterization, reduction, augmentation, transformation, and detection. \ No newline at end of file diff --git a/data/2024/aaai/Deep Reinforcement Learning for Communication Networks b/data/2024/aaai/Deep Reinforcement Learning for Communication Networks new file mode 100644 index 0000000000..a67b603428 --- /dev/null +++ b/data/2024/aaai/Deep Reinforcement Learning for Communication Networks @@ -0,0 +1 @@ +This research explores optimizing communication tasks with (Multi-Agent) Reinforcement Learning (RL/MARL) in Point-to-Point and Group Communication (GC) networks. The study initially applied RL for Congestion Control in networks with dynamic link properties, yielding competitive results. Then, it focused on the challenge of effective message dissemination in GC networks, by framing a novel game-theoretic formulation and designing methods to solve the task based on MARL and Graph Convolution. Future research will deepen the exploration of MARL in GC. This will contribute to both academic knowledge and practical advancements in the next generation of communication protocols. \ No newline at end of file diff --git a/data/2024/aaai/Deep Reinforcement Learning for Early Diagnosis of Lung Cancer b/data/2024/aaai/Deep Reinforcement Learning for Early Diagnosis of Lung Cancer new file mode 100644 index 0000000000..dc8e7bccb6 --- /dev/null +++ b/data/2024/aaai/Deep Reinforcement Learning for Early Diagnosis of Lung Cancer @@ -0,0 +1 @@ +Lung cancer remains the leading cause of cancer-related death worldwide, and early diagnosis of lung cancer is critical for improving the survival rate of patients. Performing annual low-dose computed tomography (LDCT) screening among high-risk populations is the primary approach for early diagnosis. However, after each screening, whether to continue monitoring (with follow-up screenings) or to order a biopsy for diagnosis remains a challenging decision to make. Continuing with follow-up screenings may lead to delayed diagnosis but ordering a biopsy without sufficient evidence incurs unnecessary risk and cost. In this paper, we tackle the problem by an optimal stopping approach. 
Our proposed algorithm, called EarlyStop-RL, utilizes the structure of the Snell envelope for optimal stopping, and model-free deep reinforcement learning for making diagnosis decisions. Through evaluating our algorithm on a commonly used clinical trial dataset (the National Lung Screening Trial), we demonstrate that EarlyStop-RL has the potential to greatly enhance risk assessment and early diagnosis of lung cancer, surpassing the performance of two widely adopted clinical models, namely the Lung-RADS and the Brock model. \ No newline at end of file diff --git a/data/2024/aaai/Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation b/data/2024/aaai/Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation new file mode 100644 index 0000000000..7e0ddf1999 --- /dev/null +++ b/data/2024/aaai/Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation @@ -0,0 +1 @@ +Most Graph Convolutional Network-based 3D human pose estimation (HPE) methods address single-view 3D HPE and rely on fixed spatial graphs, suffering from key problems such as depth ambiguity, insufficient feature representation, or limited receptive fields. To address these issues, we propose a multi-view 3D HPE framework based on a deep semantic graph transformer, which adaptively learns and fuses significant multi-view semantic features of human nodes to improve 3D HPE performance. First, we propose a deep semantic graph transformer encoder to enrich spatial feature information. It deeply mines the position, spatial structure, and skeletal edge knowledge of joints and dynamically learns their correlations. Then, we build a progressive multi-view spatial-temporal feature fusion framework to mitigate joint depth uncertainty. To enhance the pose spatial representation, deep spatial semantic features are exchanged and fused across different viewpoints during monocular feature extraction. Furthermore, long-range temporal dependencies are modeled, and spatial-temporal information from all viewpoints is fused to intermediately supervise the depth. Extensive experiments on three 3D HPE benchmarks show that our method achieves state-of-the-art results. It can effectively enhance pose features, mitigate depth ambiguity in single-view 3D HPE, and improve 3D HPE performance without providing camera parameters. Codes and models are available at https://github.com/z0911k/SGraFormer. \ No newline at end of file diff --git a/data/2024/aaai/Deep Structural Knowledge Exploitation and Synergy for Estimating Node Importance Value on Heterogeneous Information Networks b/data/2024/aaai/Deep Structural Knowledge Exploitation and Synergy for Estimating Node Importance Value on Heterogeneous Information Networks new file mode 100644 index 0000000000..d9369c653e --- /dev/null +++ b/data/2024/aaai/Deep Structural Knowledge Exploitation and Synergy for Estimating Node Importance Value on Heterogeneous Information Networks @@ -0,0 +1 @@ +The classic problem of node importance estimation has been conventionally studied with homogeneous network topology analysis. To deal with practical network heterogeneity, a few recent methods employ graph neural models to automatically learn diverse sources of information. However, the major concern is that their fully adaptive learning process may lead to insufficient information exploration, effectively reducing the problem to isolated node value prediction with underperformance and limited interpretability.
In this work, we propose a novel learning framework namely SKES. Different from previous automatic learning designs, SKES exploits heterogeneous structural knowledge to enrich the informativeness of node representations. Then based on a sufficiently uninformative reference, SKES estimates the importance value for any input node, by quantifying its informativeness disparity against the reference. This establishes an interpretable node importance computation paradigm. Furthermore, SKES dives deep into the understanding that "nodes with similar characteristics are prone to have similar importance values" whilst guaranteeing that such informativeness disparity between any different nodes is orderly reflected by the embedding distance of their associated latent features. Extensive experiments on three widely-evaluated benchmarks demonstrate the performance superiority of SKES over several recent competing methods. \ No newline at end of file diff --git a/data/2024/aaai/Deep Unfolded Network with Intrinsic Supervision for Pan-Sharpening b/data/2024/aaai/Deep Unfolded Network with Intrinsic Supervision for Pan-Sharpening new file mode 100644 index 0000000000..35421a1658 --- /dev/null +++ b/data/2024/aaai/Deep Unfolded Network with Intrinsic Supervision for Pan-Sharpening @@ -0,0 +1 @@ +Existing deep pan-sharpening methods lack the learning of complementary information between PAN and MS modalities in the intermediate layers, and exhibit low interpretability due to their black-box designs. To this end, an interpretable deep unfolded network with intrinsic supervision for pan-sharpening is proposed. Building upon the observation degradation process, it formulates the pan-sharpening task as a variational model minimization with spatial consistency prior and spectral projection prior. The former prior requires a joint component decomposition of PAN and MS images to extract intrinsic features. By being supervised in the intermediate layers, it can selectively provide high-frequency information for spatial enhancement. The latter prior constrains the intensity correlation between MS and PAN images derived from physical observations, so as to improve spectral fidelity. To further enhance the transparency of network design, we develop an iterative solution algorithm following the half-quadratic splitting to unfold the deep model. It rigorously adheres to the variational model, significantly enhancing the interpretability behind network design and efficiently alternating the optimization of the network. Extensive experiments demonstrate the advantages of our method compared to state-of-the-arts, showcasing its remarkable generalization capability to real-world scenes. Our code is publicly available at https://github.com/Baixuzx7/DISPNet. \ No newline at end of file diff --git a/data/2024/aaai/Deep Variational Incomplete Multi-View Clustering: Exploring Shared Clustering Structures b/data/2024/aaai/Deep Variational Incomplete Multi-View Clustering: Exploring Shared Clustering Structures new file mode 100644 index 0000000000..f684640528 --- /dev/null +++ b/data/2024/aaai/Deep Variational Incomplete Multi-View Clustering: Exploring Shared Clustering Structures @@ -0,0 +1 @@ +Incomplete multi-view clustering (IMVC) aims to reveal shared clustering structures within multi-view data, where only partial views of the samples are available. 
Existing IMVC methods primarily suffer from two issues: 1) Imputation-based methods inevitably introduce inaccurate imputations, which in turn degrade clustering performance; 2) Imputation-free methods are susceptible to unbalanced information among views and fail to fully exploit shared information. To address these issues, we propose a novel method based on variational autoencoders. Specifically, we adopt multiple view-specific encoders to extract information from each view and utilize the Product-of-Experts approach to efficiently aggregate information to obtain the common representation. To enhance the shared information in the common representation, we introduce a coherence objective to mitigate the influence of information imbalance. By incorporating the Mixture-of-Gaussians prior information into the latent representation, our proposed method is able to learn the common representation with clustering-friendly structures. Extensive experiments on four datasets show that our method achieves competitive clustering performance compared with state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving b/data/2024/aaai/DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving new file mode 100644 index 0000000000..c214e8ebd8 --- /dev/null +++ b/data/2024/aaai/DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving @@ -0,0 +1 @@ +Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports the direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident dataset includes 57K annotated frames and 285K annotated samples, approximately 7 times more than the large-scale nuScenes dataset with 40k annotated samples. In addition, we propose a new task, end-to-end motion and accident prediction, which can be used to directly evaluate the accident prediction ability for different autonomous driving algorithms. Furthermore, for each scenario, we set four vehicles along with one infrastructure to record data, thus providing diverse viewpoints for accident scenarios and enabling V2X (vehicle-to-everything) research on perception and prediction tasks. Finally, we present a baseline V2X model named V2XFormer that demonstrates superior performance for motion and accident prediction and 3D object detection compared to the single-vehicle model. \ No newline at end of file diff --git a/data/2024/aaai/DeepBern-Nets: Taming the Complexity of Certifying Neural Networks Using Bernstein Polynomial Activations and Precise Bound Propagation b/data/2024/aaai/DeepBern-Nets: Taming the Complexity of Certifying Neural Networks Using Bernstein Polynomial Activations and Precise Bound Propagation new file mode 100644 index 0000000000..98b4f11092 --- /dev/null +++ b/data/2024/aaai/DeepBern-Nets: Taming the Complexity of Certifying Neural Networks Using Bernstein Polynomial Activations and Precise Bound Propagation @@ -0,0 +1 @@ +Formal certification of Neural Networks (NNs) is crucial for ensuring their safety, fairness, and robustness. Unfortunately, on the one hand, sound and complete certification algorithms of ReLU-based NNs do not scale to large-scale NNs. 
On the other hand, incomplete certification algorithms are easier to compute, but they result in loose bounds that deteriorate with the depth of the NN, which diminishes their effectiveness. In this paper, we ask the following question: can we replace the ReLU activation function with one that opens the door to incomplete certification algorithms that are easy to compute but can produce tight bounds on the NN's outputs? We introduce DeepBern-Nets, a class of NNs with activation functions based on Bernstein polynomials instead of the commonly used ReLU activation. Bernstein polynomials are smooth and differentiable functions with desirable properties such as the so-called range enclosure and subdivision properties. We design a novel Interval Bound Propagation (IBP) algorithm, called Bern-IBP, to efficiently compute tight bounds on DeepBern-Nets' outputs. Our approach leverages the properties of Bernstein polynomials to improve the tractability of neural network certification tasks while maintaining the accuracy of the trained networks. We conduct experiments in adversarial robustness and reachability analysis settings to assess the effectiveness of the approach. Our proposed framework achieves high certified accuracy for adversarially-trained NNs, which is often a challenging task for certifiers of ReLU-based NNs. This work establishes Bernstein polynomial activations as a promising alternative for improving NN certification tasks across various NN applications. \ No newline at end of file diff --git a/data/2024/aaai/DeepBranchTracer: A Generally-Applicable Approach to Curvilinear Structure Reconstruction Using Multi-Feature Learning b/data/2024/aaai/DeepBranchTracer: A Generally-Applicable Approach to Curvilinear Structure Reconstruction Using Multi-Feature Learning new file mode 100644 index 0000000000..254e9ab038 --- /dev/null +++ b/data/2024/aaai/DeepBranchTracer: A Generally-Applicable Approach to Curvilinear Structure Reconstruction Using Multi-Feature Learning @@ -0,0 +1 @@ +Curvilinear structures, which include line-like continuous objects, are fundamental geometrical elements in image-based applications. Reconstructing these structures from images constitutes a pivotal research area in computer vision. However, the complex topology and ambiguous image evidence render this process a challenging task. In this paper, we introduce DeepBranchTracer, a novel method that learns both external image features and internal geometric characteristics to reconstruct curvilinear structures. Firstly, we formulate curvilinear structure extraction as a geometric attribute estimation problem. Then, a curvilinear structure feature learning network is designed to extract essential branch attributes, including the image features of centerline and boundary, and the geometric features of direction and radius. Finally, utilizing a multi-feature fusion tracing strategy, our model iteratively traces the entire branch by integrating the extracted image and geometric features. We extensively evaluated our model on both 2D and 3D datasets, demonstrating its superior performance over existing segmentation and reconstruction methods in terms of accuracy and continuity.
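The DeepBern-Nets abstract above relies on the range enclosure property of Bernstein polynomials: on [0, 1], a polynomial written in the Bernstein basis is bounded by the minimum and maximum of its coefficients. The sketch below checks this property numerically; it illustrates the property only, not the Bern-IBP algorithm, and the coefficients are arbitrary example values.

```python
import numpy as np
from math import comb

def bernstein_eval(coeffs, x):
    """Evaluate a degree-n polynomial given by its Bernstein coefficients on [0, 1]."""
    n = len(coeffs) - 1
    basis = np.array([comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1)])
    return float(np.dot(coeffs, basis))

coeffs = np.array([0.2, -1.0, 0.7, 1.5])   # arbitrary Bernstein coefficients
lower, upper = coeffs.min(), coeffs.max()  # range enclosure: bounds come for free

xs = np.linspace(0.0, 1.0, 1001)
values = np.array([bernstein_eval(coeffs, x) for x in xs])
assert lower - 1e-9 <= values.min() and values.max() <= upper + 1e-9
print(f"enclosure [{lower}, {upper}] contains observed range "
      f"[{values.min():.3f}, {values.max():.3f}]")
```

This cheap coefficient-based bound is what makes tight interval bound propagation through Bernstein activations plausible, in contrast to ReLU networks where incomplete bounds loosen with depth.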
\ No newline at end of file diff --git a/data/2024/aaai/DeepCalliFont: Few-Shot Chinese Calligraphy Font Synthesis by Integrating Dual-Modality Generative Models b/data/2024/aaai/DeepCalliFont: Few-Shot Chinese Calligraphy Font Synthesis by Integrating Dual-Modality Generative Models new file mode 100644 index 0000000000..5adf07e72d --- /dev/null +++ b/data/2024/aaai/DeepCalliFont: Few-Shot Chinese Calligraphy Font Synthesis by Integrating Dual-Modality Generative Models @@ -0,0 +1 @@ +Few-shot font generation, especially for Chinese calligraphy fonts, is a challenging and ongoing problem. With the help of prior knowledge that is mainly based on glyph consistency assumptions, some recently proposed methods can synthesize high-quality Chinese glyph images. However, glyphs in calligraphy font styles often do not meet these assumptions. To address this problem, we propose a novel model, DeepCalliFont, for few-shot Chinese calligraphy font synthesis by integrating dual-modality generative models. Specifically, the proposed model consists of image synthesis and sequence generation branches, generating consistent results via a dual-modality representation learning strategy. The two modalities (i.e., glyph images and writing sequences) are properly integrated using a feature recombination module and a rasterization loss function. Furthermore, a new pre-training strategy is adopted to improve the performance by exploiting large amounts of uni-modality data. Both qualitative and quantitative experiments have been conducted to demonstrate the superiority of our method to other state-of-the-art approaches in the task of few-shot Chinese calligraphy font synthesis. The source code can be found at https://github.com/lsflyt-pku/DeepCalliFont. \ No newline at end of file diff --git a/data/2024/aaai/DeepSaDe: Learning Neural Networks That Guarantee Domain Constraint Satisfaction b/data/2024/aaai/DeepSaDe: Learning Neural Networks That Guarantee Domain Constraint Satisfaction new file mode 100644 index 0000000000..9e0ec41830 --- /dev/null +++ b/data/2024/aaai/DeepSaDe: Learning Neural Networks That Guarantee Domain Constraint Satisfaction @@ -0,0 +1 @@ +As machine learning models, specifically neural networks, are becoming increasingly popular, there are concerns regarding their trustworthiness, specially in safety-critical applications, e.g. actions of an autonomous vehicle must be safe. There are approaches that can train neural networks where such domain requirements are enforced as constraints, but they either cannot guarantee that the constraint will be satisfied by all possible predictions (even on unseen data) or they are limited in the type of constraints that can be enforced. In this paper, we present an approach to train neural networks which can enforce a wide variety of constraints and guarantee that the constraint is satisfied by all possible predictions. The approach builds on earlier work where learning linear models is formulated as a constraint satisfaction problem (CSP). To make this idea applicable to neural networks, two crucial new elements are added: constraint propagation over the network layers, and weight updates based on a mix of gradient descent and CSP solving. Evaluation on various machine learning tasks demonstrates that our approach is flexible enough to enforce a wide variety of domain constraints and is able to guarantee them in neural networks. 
\ No newline at end of file diff --git a/data/2024/aaai/DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing b/data/2024/aaai/DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing new file mode 100644 index 0000000000..074e17b0f5 --- /dev/null +++ b/data/2024/aaai/DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing @@ -0,0 +1 @@ +Recent advances in deep learning models come at the price of formidable training costs. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost ($3.7K if rented on Azure), while still maintaining 95% of model quality compared to the baseline with full data and cost ($46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and tune, enabling us to easily apply it and verify its benefit on additional tasks including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning. \ No newline at end of file diff --git a/data/2024/aaai/Defeasible Normative Reasoning: A Proof-Theoretic Integration of Logical Argumentation b/data/2024/aaai/Defeasible Normative Reasoning: A Proof-Theoretic Integration of Logical Argumentation new file mode 100644 index 0000000000..2e1da4df1a --- /dev/null +++ b/data/2024/aaai/Defeasible Normative Reasoning: A Proof-Theoretic Integration of Logical Argumentation @@ -0,0 +1 @@ +We present a novel computational approach to resolving conflicts among norms by nonmonotonic normative reasoning (in constrained I/O logics). Our approach extends standard sequent-based proof systems and makes them more adequate for nonmonotonic reasoning by adding to the sequents annotations that keep track of what is known about the defeasible status of the derived sequents. This makes transparent the reasons according to which norms should be applicable or inapplicable, and accordingly the sequents that make use of such norms are accepted or retracted. We also show that this proof-theoretic method has tight links to the semantics of formal argumentation frameworks.
The outcome of this paper is thus a threefold characterization result that relates, in the context of nonmonotonic normative reasoning, three traditional ingredients of AI-based reasoning methods: maximally consistent sets of premises (in constrained I/O logics), derived sequents (which are accepted in corresponding annotated sequent calculi), and logical arguments (that belong to the grounded extensions of the induced logical argumentation frameworks). \ No newline at end of file diff --git a/data/2024/aaai/Defog Artificial Intelligence Glasses: Neural Networks for the Imperfect Real World b/data/2024/aaai/Defog Artificial Intelligence Glasses: Neural Networks for the Imperfect Real World new file mode 100644 index 0000000000..a55a8bbe1b --- /dev/null +++ b/data/2024/aaai/Defog Artificial Intelligence Glasses: Neural Networks for the Imperfect Real World @@ -0,0 +1 @@ +This research investigates the generalization capabilities of neural networks in deep learning when applied to real-world scenarios where data often contains imperfections, focusing on their adaptability to both noisy and non-noisy scenarios for image retrieval tasks. Our study explores approaches to preserve all available data, regardless of quality, for diverse tasks. The evaluation of results varies per task, due to the ultimate goal of developing a technique to extract relevant information while disregarding noise in the final network design for each specific task. The aim is to enhance accessibility and efficiency of AI across diverse tasks, particularly for individuals or countries with limited resources, lacking access to high-quality data. The dedication is directed towards fostering inclusivity and unlocking the potential of AI for wide-spread societal benefit. \ No newline at end of file diff --git a/data/2024/aaai/Defying Imbalanced Forgetting in Class Incremental Learning b/data/2024/aaai/Defying Imbalanced Forgetting in Class Incremental Learning new file mode 100644 index 0000000000..8e6c8aa55b --- /dev/null +++ b/data/2024/aaai/Defying Imbalanced Forgetting in Class Incremental Learning @@ -0,0 +1 @@ +We observe a high level of imbalance in the accuracy of different learned classes in the same old task for the first time. This intriguing phenomenon, discovered in replay-based Class Incremental Learning (CIL), highlights the imbalanced forgetting of learned classes, as their accuracy is similar before the occurrence of catastrophic forgetting. This discovery remains previously unidentified due to the reliance on average incremental accuracy as the measurement for CIL, which assumes that the accuracy of classes within the same task is similar. However, this assumption is invalid in the face of catastrophic forgetting. Further empirical studies indicate that this imbalanced forgetting is caused by conflicts in representation between semantically similar old and new classes. These conflicts are rooted in the data imbalance present in replay-based CIL methods. Building on these insights, we propose CLass-Aware Disentanglement (CLAD) as a means to predict the old classes that are more likely to be forgotten and enhance their accuracy. Importantly, CLAD can be seamlessly integrated into existing CIL methods. Extensive experiments demonstrate that CLAD consistently improves current replay-based methods, resulting in performance gains of up to 2.56%. 
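As a small illustration of the measurement behind the Defying Imbalanced Forgetting abstract above, the sketch below computes per-class accuracy inside a single old task; a large spread between classes is the imbalance that average incremental accuracy hides. The labels are synthetic and the use of Python/NumPy is an assumption made for the example.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, classes):
    """Accuracy of each class separately within one task."""
    return {c: float((y_pred[y_true == c] == c).mean()) for c in classes}

# Synthetic predictions for one old task (classes 0 and 1) after learning new classes.
rng = np.random.default_rng(0)
y_true = np.array([0] * 100 + [1] * 100)
y_pred = np.concatenate([
    np.where(rng.uniform(size=100) < 0.9, 0, 2),  # class 0: mostly remembered
    np.where(rng.uniform(size=100) < 0.4, 1, 2),  # class 1: mostly forgotten
])

acc = per_class_accuracy(y_true, y_pred, classes=[0, 1])
print(acc, "spread:", max(acc.values()) - min(acc.values()))
```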
\ No newline at end of file diff --git a/data/2024/aaai/Delegation-Relegation for Boolean Matrix Factorization b/data/2024/aaai/Delegation-Relegation for Boolean Matrix Factorization new file mode 100644 index 0000000000..c0af286e7d --- /dev/null +++ b/data/2024/aaai/Delegation-Relegation for Boolean Matrix Factorization @@ -0,0 +1,2 @@ +The Boolean Matrix Factorization (BMF) problem aims to represent an n×m Boolean matrix as the Boolean product of two matrices of small rank k, where the product is computed using Boolean algebra operations. However, finding a BMF of minimum rank is known to be NP-hard, posing challenges for heuristic algorithms and exact approaches in terms of the rank found and computation time, particularly as matrix size or the number of entries equal to 1 grows. +In this paper, we present a new approach to simplifying the matrix to be factorized by reducing the number of 1-entries, which allows us to directly recover a Boolean factorization of the original matrix from its simplified version. We introduce two types of simplification: one that performs numerous simplifications without preserving the original rank and another that performs fewer simplifications but guarantees that an optimal BMF on the simplified matrix yields an optimal BMF on the original matrix. Furthermore, our experiments show that our approach outperforms existing exact BMF algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Deletion-Robust Submodular Maximization with Knapsack Constraints b/data/2024/aaai/Deletion-Robust Submodular Maximization with Knapsack Constraints new file mode 100644 index 0000000000..958d6e5e57 --- /dev/null +++ b/data/2024/aaai/Deletion-Robust Submodular Maximization with Knapsack Constraints @@ -0,0 +1 @@ +Submodular maximization algorithms have found wide applications in various fields such as data summarization, recommendation systems, and active learning. In recent years, deletion-robust submodular maximization algorithms have garnered attention due to their significant implications in scenarios where some data points may be removed due to user preferences or privacy concerns, such as in recommendation systems and influence maximization. In this paper, we study the fundamental problem of submodular maximization with knapsack constraints and propose a robust streaming algorithm for it. To the best of our knowledge, our algorithm is the first to solve this problem for non-monotone submodular functions and can achieve an approximation ratio of 1/(6.82+2.63d)-ϵ under a near-optimal summary size of O(k+r), where k denotes the maximum cardinality of any feasible solution, d denotes the number of knapsack constraints, and r is the robustness parameter. For monotone submodular functions, our algorithm can achieve an approximation ratio of 1/(2+2d)-ϵ under a near-optimal summary size of O(k+r), significantly improving upon the best-known ratio of Ω((1/d-ϵ)^2). The empirical performance of our algorithm is extensively evaluated in several applications including influence maximization and recommendation systems, and the experimental results demonstrate the effectiveness of our algorithm.
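To fix notation for the Delegation-Relegation abstract above, the Boolean product used in BMF is C[i, j] = OR over l of (A[i, l] AND B[l, j]), so a rank-k factorization represents an n×m Boolean matrix by an n×k and a k×m factor. The sketch below only illustrates this definition on a made-up example; it is not the paper's simplification algorithm.

```python
import numpy as np

def boolean_product(A, B):
    """Boolean matrix product over {0, 1}: C[i, j] = OR_l (A[i, l] AND B[l, j])."""
    return ((A.astype(int) @ B.astype(int)) > 0).astype(int)

# A toy rank-2 factorization of a 4x5 Boolean matrix (entries are made up).
A = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 0]])
B = np.array([[1, 1, 0, 0, 1],
              [0, 1, 1, 0, 0]])

X = boolean_product(A, B)   # the matrix represented exactly by the rank-2 factors
print(X)

# Exact BMF asks for the smallest k such that some n x k and k x m Boolean factors
# reproduce X under this product; deciding that minimum rank is NP-hard.
```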
\ No newline at end of file diff --git a/data/2024/aaai/Delivering Inflated Explanations b/data/2024/aaai/Delivering Inflated Explanations new file mode 100644 index 0000000000..f0b608d8b9 --- /dev/null +++ b/data/2024/aaai/Delivering Inflated Explanations @@ -0,0 +1,2 @@ +In the quest for Explainable Artificial Intelligence (XAI), one of the questions that frequently arises given a decision made by an AI system is, ``why was the decision made in this way?'' Formal approaches to explainability build a formal model of the AI system and use this to reason about the properties of the system. Given a set of feature values for an instance to be explained, and a resulting decision, a formal abductive explanation is a set of features such that, if they take their given values, the decision will always be the same. This explanation is useful: it shows that only some features were used in making the final decision. But it is narrow: it only shows that if the selected features take their given values, the decision is unchanged. It is possible that some features may change values and still lead to the same decision. In this paper we formally define inflated explanations, where an explanation is a set of features and, for each feature, a set of values (always including the value of the instance being explained), such that the decision will remain unchanged for any of the values allowed for any of the features in the (inflated) abductive explanation. +Inflated formal explanations are more informative than common abductive explanations since, e.g., they allow us to see whether the exact value of a feature is important or whether it could be any nearby value. Overall, they allow us to better understand the role of each feature in the decision. We show that we can compute inflated explanations at not much greater cost than abductive explanations, and that we can extend duality results for abductive explanations also to inflated explanations. \ No newline at end of file diff --git a/data/2024/aaai/Delving into Multimodal Prompting for Fine-Grained Visual Classification b/data/2024/aaai/Delving into Multimodal Prompting for Fine-Grained Visual Classification new file mode 100644 index 0000000000..016512d49b --- /dev/null +++ b/data/2024/aaai/Delving into Multimodal Prompting for Fine-Grained Visual Classification @@ -0,0 +1 @@ +Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompt scheme and a multimodal adaptation scheme. The former includes a Subcategory-specific Vision Prompt (SsVP) and a Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language.
The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC. \ No newline at end of file diff --git a/data/2024/aaai/Demystifying Algorithmic Fairness in an Uncertain World b/data/2024/aaai/Demystifying Algorithmic Fairness in an Uncertain World new file mode 100644 index 0000000000..dc8e641ea0 --- /dev/null +++ b/data/2024/aaai/Demystifying Algorithmic Fairness in an Uncertain World @@ -0,0 +1 @@ +Significant progress in the field of fair machine learning (ML) has been made to counteract algorithmic discrimination against marginalized groups. However, fairness remains an active research area that is far from settled. One key bottleneck is the implicit assumption that environments, where ML is developed and deployed, are certain and reliable. In a world that is characterized by volatility, uncertainty, complexity, and ambiguity, whether what has been developed in algorithmic fairness can still serve its purpose is far from obvious. In this talk, I will first discuss how to improve algorithmic fairness under two kinds of predictive uncertainties, i.e., aleatoric uncertainty (i.e., randomness and ambiguity in the data) and epistemic uncertainty (i.e., a lack of data or knowledge), respectively. The former regards historical bias reflected in the data and the latter corresponds to the bias perpetuated or amplified during model training due to lack of data or knowledge. In particular, the first work studies pushing the fairness-utility trade-off through aleatoric uncertainty, and the second work investigates fair few-shot learning. The last work introduces coverage-based fairness that ensures different groups enjoy identical treatment and receive equal coverage. \ No newline at end of file diff --git a/data/2024/aaai/DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning b/data/2024/aaai/DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning new file mode 100644 index 0000000000..0d21b8e593 --- /dev/null +++ b/data/2024/aaai/DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning @@ -0,0 +1 @@ +Contrastive-learning-based methods have dominated sentence representation learning. These methods regularize the representation space by pulling similar sentence representations closer and pushing away the dissimilar ones and have been proven effective in various NLP tasks, e.g., semantic textual similarity (STS) tasks. However, it is challenging for these methods to learn fine-grained semantics as they only learn from the inter-sentence perspective, i.e., their supervision signal comes from the relationship between data samples. In this work, we propose a novel denoising objective that inherits from another perspective, i.e., the intra-sentence perspective. By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form. 
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks, standing up well in comparison to contrastive-learning-based methods. Notably, the proposed intra-sentence denoising objective complements existing inter-sentence contrastive methodologies and can be integrated with them to further enhance performance. Our code is available at https://github.com/xinghaow99/DenoSent. \ No newline at end of file diff --git a/data/2024/aaai/Dense Projection for Anomaly Detection b/data/2024/aaai/Dense Projection for Anomaly Detection new file mode 100644 index 0000000000..8e3e003cc6 --- /dev/null +++ b/data/2024/aaai/Dense Projection for Anomaly Detection @@ -0,0 +1 @@ +This work presents a novel method called dense projection for unsupervised anomaly detection (DPAD). The main idea is maximizing the local density of (normal) training data and then determining whether a test sample is anomalous or not by evaluating its density. Specifically, DPAD uses a deep neural network to learn locally dense representations of normal data. Since density estimation is computationally expensive, we minimize the local distances of the representations in an iterative reweighting manner, where the weights are updated adaptively and the parameters are regularized to avoid model collapse (all representations collapsing to a single point). Compared with many state-of-the-art methods of anomaly detection, our DPAD does not rely on any assumption about the distribution or spatial structure of the normal data and representations. Moreover, we provide theoretical guarantees for the effectiveness of DPAD. The experiments show that our method DPAD is effective not only in traditional one-class classification problems but also in scenarios with complex normal data composed of multiple classes. \ No newline at end of file diff --git a/data/2024/aaai/Density Matters: Improved Core-Set for Active Domain Adaptive Segmentation b/data/2024/aaai/Density Matters: Improved Core-Set for Active Domain Adaptive Segmentation new file mode 100644 index 0000000000..ee2758eda9 --- /dev/null +++ b/data/2024/aaai/Density Matters: Improved Core-Set for Active Domain Adaptive Segmentation @@ -0,0 +1 @@ +Active domain adaptation has emerged as a solution to balance the expensive annotation cost and the performance of trained models in semantic segmentation. However, existing works usually ignore the correlation between selected samples and their local context in feature space, which leads to inferior usage of annotation budgets. In this work, we revisit the theoretical bound of the classical Core-set method and identify that the performance is closely related to the local sample distribution around selected samples. To estimate the density of local samples efficiently, we introduce a local proxy estimator with Dynamic Masked Convolution and develop a Density-aware Greedy algorithm to optimize the bound. Extensive experiments demonstrate the superiority of our approach. Moreover, with very few labels, our scheme achieves comparable performance to the fully supervised counterpart.
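For context on the Density Matters abstract above, the classical Core-set selection it revisits is usually implemented as k-center greedy: repeatedly pick the sample farthest from everything already selected. The sketch below shows that baseline strategy only (no density awareness, no Dynamic Masked Convolution); the random features are stand-ins for real embeddings.

```python
import numpy as np

def kcenter_greedy(features, budget, seed=0):
    """Classical Core-set (k-center) greedy selection; returns indices of chosen samples."""
    rng = np.random.default_rng(seed)
    n = len(features)
    selected = [int(rng.integers(n))]
    # Distance of every sample to its nearest selected sample so far.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        idx = int(min_dist.argmax())          # farthest-first choice
        selected.append(idx)
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[idx], axis=1))
    return selected

feats = np.random.default_rng(1).normal(size=(1000, 16))
print(kcenter_greedy(feats, budget=10))
```

The paper's observation is that this farthest-first rule ignores how densely populated the neighborhood of each selected sample is, which is what its density-aware variant corrects.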
\ No newline at end of file diff --git a/data/2024/aaai/Dependency Structure-Enhanced Graph Attention Networks for Event Detection b/data/2024/aaai/Dependency Structure-Enhanced Graph Attention Networks for Event Detection new file mode 100644 index 0000000000..c8b3c620c7 --- /dev/null +++ b/data/2024/aaai/Dependency Structure-Enhanced Graph Attention Networks for Event Detection @@ -0,0 +1,2 @@ +Existing models on event detection share three-fold limitations, including (1) insufficient consideration of the structures between dependency relations, (2) limited exploration of the directed-edge semantics, and (3) issues in strengthening the event core arguments. To tackle these problems, we propose a dependency structure-enhanced event detection framework. In addition to the traditional token dependency parsing tree, denoted as TDG, our model considers the dependency edges in it as new nodes and constructs a dependency relation graph (DRG). DRG allows the embedding representations of dependency relations to be updated as nodes rather than edges in a graph neural network. +Moreover, the levels of core argument nodes in the two graphs are adjusted by dependency relation types in TDG to enhance their status. Subsequently, the two graphs are further encoded and jointly trained in graph attention networks (GAT). Importantly, we design an interaction strategy of node embedding for the two graphs and refine the attention coefficient computational method to encode the semantic meaning of directed edges. Extensive experiments are conducted to validate the effectiveness of our method, and the results confirm its superiority over the state-of-the-art baselines. Our model outperforms the best benchmark with the F1 score increased by 3.5 and 3.4 percentage points on ACE2005 English and Chinese corpus. \ No newline at end of file diff --git a/data/2024/aaai/Deploying ADVISER: Impact and Lessons from Using Artificial Intelligence for Child Vaccination Uptake in Nigeria b/data/2024/aaai/Deploying ADVISER: Impact and Lessons from Using Artificial Intelligence for Child Vaccination Uptake in Nigeria new file mode 100644 index 0000000000..d3aa619a96 --- /dev/null +++ b/data/2024/aaai/Deploying ADVISER: Impact and Lessons from Using Artificial Intelligence for Child Vaccination Uptake in Nigeria @@ -0,0 +1 @@ +More than 5 million children under five years die from largely preventable or treatable medical conditions every year, with an overwhelmingly large proportion of deaths occurring in underdeveloped countries with low vaccination uptake. One of the United Nations' sustainable development goals (SDG 3) aims to end preventable deaths of newborns and children under five years of age. We focus on Nigeria, where the rate of infant mortality is appalling. In particular, low vaccination uptake in Nigeria is a major driver of more than 2,000 daily deaths of children under the age of five years. In this paper, we describe our collaboration with government partners in Nigeria to deploy ADVISER: AI-Driven Vaccination Intervention Optimiser. The framework, based on an integer linear program that seeks to maximize the cumulative probability of successful vaccination, is the first successful deployment of an AI-enabled toolchain for optimizing the allocation of health interventions in Nigeria. In this paper, we provide a background of the ADVISER framework and present results, lessons, and success stories of deploying ADVISER to more than 13,000 families in the state of Oyo, Nigeria. 
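The ADVISER abstract above describes the allocation as an integer linear program maximizing the cumulative probability of successful vaccination. A toy version of that kind of formulation is sketched below with the PuLP modeling library; the probabilities, costs, budget, and the at-most-one-intervention-per-family constraint are illustrative assumptions, not the deployed model.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

# Hypothetical success probabilities p[i][j] of intervention j for family i.
p = [[0.30, 0.55, 0.70],
     [0.20, 0.45, 0.65],
     [0.50, 0.60, 0.75]]
cost = [1.0, 2.5, 4.0]   # per-intervention cost (made-up units)
budget = 6.0
families, interventions = range(len(p)), range(len(cost))

prob = LpProblem("vaccination_intervention_allocation", LpMaximize)
x = {(i, j): LpVariable(f"x_{i}_{j}", cat="Binary") for i in families for j in interventions}

# Objective: maximize the cumulative probability of successful vaccination.
prob += lpSum(p[i][j] * x[i, j] for i in families for j in interventions)
# Each family receives at most one intervention.
for i in families:
    prob += lpSum(x[i, j] for j in interventions) <= 1
# Total cost must stay within the budget.
prob += lpSum(cost[j] * x[i, j] for i in families for j in interventions) <= budget

prob.solve()
print([(i, j) for (i, j), var in x.items() if var.value() == 1])
```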
\ No newline at end of file diff --git a/data/2024/aaai/Depression Detection via Capsule Networks with Contrastive Learning b/data/2024/aaai/Depression Detection via Capsule Networks with Contrastive Learning new file mode 100644 index 0000000000..48c23a66d2 --- /dev/null +++ b/data/2024/aaai/Depression Detection via Capsule Networks with Contrastive Learning @@ -0,0 +1 @@ +Depression detection is a challenging and crucial task in psychological illness diagnosis. Utilizing online user posts to predict whether a user suffers from depression seems an effective and promising direction. However, existing methods suffer from either poor interpretability brought by the black-box models or underwhelming performance caused by the completely separate two-stage model structure. To alleviate these limitations, we propose a novel capsule network integrated with contrastive learning for depression detection (DeCapsNet). The highlights of DeCapsNet can be summarized as follows. First, it extracts symptom capsules from user posts by leveraging meticulously designed symptom descriptions, and then distills them into class-indicative depression capsules. The overall workflow is in an explicit hierarchical reasoning manner and can be well interpreted by the Patient Health Questionnaire-9 (PHQ9), which is one of the most widely adopted questionnaires for depression diagnosis. Second, it integrates with contrastive learning, which can facilitate the embeddings from the same class to be pulled closer, while simultaneously pushing the embeddings from different classes apart. In addition, by adopting the end-to-end training strategy, it does not necessitate additional data annotation, and mitigates the potential adverse effects from the upstream task to the downstream task. Extensive experiments on three widely-used datasets show that in both within-dataset and cross-dataset scenarios our proposed method outperforms other strong baselines significantly. \ No newline at end of file diff --git a/data/2024/aaai/Depth-Guided Robust and Fast Point Cloud Fusion NeRF for Sparse Input Views b/data/2024/aaai/Depth-Guided Robust and Fast Point Cloud Fusion NeRF for Sparse Input Views new file mode 100644 index 0000000000..950d895a76 --- /dev/null +++ b/data/2024/aaai/Depth-Guided Robust and Fast Point Cloud Fusion NeRF for Sparse Input Views @@ -0,0 +1 @@ +Novel-view synthesis with sparse input views is important for real-world applications like AR/VR and autonomous driving. Recent methods have integrated depth information into NeRFs for sparse input synthesis, leveraging depth prior for geometric and spatial understanding. However, most existing works tend to overlook inaccuracies within depth maps and have low time efficiency. To address these issues, we propose a depth-guided robust and fast point cloud fusion NeRF for sparse inputs. We perceive radiance fields as an explicit voxel grid of features. A point cloud is constructed for each input view, characterized within the voxel grid using matrices and vectors. We accumulate the point cloud of each input view to construct the fused point cloud of the entire scene. Each voxel determines its density and appearance by referring to the point cloud of the entire scene. Through point cloud fusion and voxel grid fine-tuning, inaccuracies in depth values are refined or substituted by those from other views. Moreover, our method can achieve faster reconstruction and greater compactness through effective vector-matrix decomposition. 
Experimental results underline the superior performance and time efficiency of our approach compared to state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Descanning: From Scanned to the Original Images with a Color Correction Diffusion Model b/data/2024/aaai/Descanning: From Scanned to the Original Images with a Color Correction Diffusion Model new file mode 100644 index 0000000000..06f536a75b --- /dev/null +++ b/data/2024/aaai/Descanning: From Scanned to the Original Images with a Color Correction Diffusion Model @@ -0,0 +1 @@ +A significant volume of analog information, i.e., documents and images, has been digitized in the form of scanned copies for storing, sharing, and/or analyzing in the digital world. However, the quality of such content is severely degraded by various distortions caused by printing, storing, and scanning processes in the physical world. Although restoring high-quality content from scanned copies has become an indispensable task for many products, it has not been systematically explored, and to the best of our knowledge, no public datasets are available. In this paper, we define this problem as Descanning and introduce a new high-quality and large-scale dataset named DESCAN-18K. It contains 18K pairs of original and scanned images collected in the wild containing multiple complex degradations. In order to eliminate such complex degradations, we propose a new image restoration model called DescanDiffusion consisting of a color encoder that corrects the global color degradation and a conditional denoising diffusion probabilistic model (DDPM) that removes local degradations. To further improve the generalization ability of DescanDiffusion, we also design a synthetic data generation scheme by reproducing prominent degradations in scanned images. We demonstrate that our DescanDiffusion outperforms other baselines including commercial restoration products, objectively and subjectively, via comprehensive experiments and analyses. \ No newline at end of file diff --git a/data/2024/aaai/Designing Biological Sequences without Prior Knowledge Using Evolutionary Reinforcement Learning b/data/2024/aaai/Designing Biological Sequences without Prior Knowledge Using Evolutionary Reinforcement Learning new file mode 100644 index 0000000000..c182c4f04d --- /dev/null +++ b/data/2024/aaai/Designing Biological Sequences without Prior Knowledge Using Evolutionary Reinforcement Learning @@ -0,0 +1 @@ +Designing novel biological sequences with desired properties is a significant challenge in biological science because of the extremely large search space. The traditional design process usually involves multiple rounds of costly wet lab evaluations. To reduce the need for expensive wet lab experiments, machine learning methods are used to aid in designing biological sequences. However, the limited availability of biological sequences with known properties hinders the training of machine learning models, significantly restricting their applicability and performance. To fill this gap, we present ERLBioSeq, an Evolutionary Reinforcement Learning algorithm for BIOlogical SEQuence design. ERLBioSeq leverages the capability of reinforcement learning to learn without prior knowledge and the potential of evolutionary algorithms to enhance the exploration of reinforcement learning in the large search space of biological sequences.
Additionally, to enhance the efficiency of biological sequence design, we developed a predictor for sequence screening in the biological sequence design process, which incorporates both the local and global sequence information. We evaluated the proposed method on three main types of biological sequence design tasks, including the design of DNA, RNA, and protein. The results demonstrate that the proposed method achieves significant improvement compared to the existing state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Detect Any Keypoints: An Efficient Light-Weight Few-Shot Keypoint Detector b/data/2024/aaai/Detect Any Keypoints: An Efficient Light-Weight Few-Shot Keypoint Detector new file mode 100644 index 0000000000..06dcc5bbe8 --- /dev/null +++ b/data/2024/aaai/Detect Any Keypoints: An Efficient Light-Weight Few-Shot Keypoint Detector @@ -0,0 +1 @@ +Recently, prompt-based models have become popular across various language and vision tasks. Following that trend, we perform few-shot keypoint detection (FSKD) by detecting any keypoints in a query image, given the prompts formed by support images and keypoints. FSKD can be applied to detecting keypoints and poses of diverse animal species. In order to maintain the flexibility of detecting a varying number of keypoints, existing FSKD approaches modulate the query feature map per support keypoint, then detect the corresponding keypoint from each modulated feature via a detection head. Such a separation of modulation-detection makes the model heavy and slow when the number of keypoints increases. To overcome this issue, we design a novel light-weight detector which combines modulation and detection into one step, with the goal of reducing the computational cost without a drop in performance. Moreover, to bridge the large domain shift of keypoints between seen and unseen species, we further improve our model with mean feature based contrastive learning to align keypoint distributions, resulting in better keypoint representations for FSKD. Compared to the state of the art, our light-weight detector reduces the number of parameters by 50%, training/test time by 50%, and achieves a 5.62% accuracy gain on 1-shot novel keypoint detection on the Animal pose dataset. Our model is also robust to the number of keypoints and saves memory when evaluating a large number of keypoints (e.g., 1000) per episode. \ No newline at end of file diff --git a/data/2024/aaai/Detecting AI-Generated Code Assignments Using Perplexity of Large Language Models b/data/2024/aaai/Detecting AI-Generated Code Assignments Using Perplexity of Large Language Models new file mode 100644 index 0000000000..00e7451916 --- /dev/null +++ b/data/2024/aaai/Detecting AI-Generated Code Assignments Using Perplexity of Large Language Models @@ -0,0 +1 @@ +Large language models like ChatGPT can generate human-like code, posing challenges for programming education as students may be tempted to misuse them on assignments. However, there are currently no robust detectors designed specifically to identify AI-generated code. This is an issue that needs to be addressed to maintain academic integrity while allowing proper utilization of language models. Previous work has explored different approaches to detect AI-generated text, including watermarks, feature analysis, and fine-tuning language models. In this paper, we address the challenge of determining whether a student's code assignment was generated by a language model. 
First, our proposed method identifies AI-generated code by leveraging targeted masking perturbation paired with comprehensive scoring. Rather than applying a random mask, areas of the code with higher perplexity are more intensely masked. Second, we utilize a fine-tuned CodeBERT to fill in the masked portions, producing subtly modified samples. Then, we integrate the overall perplexity, variation of code line perplexity, and burstiness into a unified score. In this scoring scheme, a higher rank for the original code suggests it's more likely to be AI-generated. This approach stems from the observation that AI-generated code typically has lower perplexity. Therefore, perturbations often exert minimal influence on it. Conversely, sections of human-composed code that the model struggles to understand can see their perplexity reduced by such perturbations. Our method outperforms current open-source and commercial text detectors. Specifically, it improves detection of code submissions generated by OpenAI's text-davinci-003, raising the average AUC from 0.56 (GPTZero baseline) to 0.87 for our detector. \ No newline at end of file diff --git a/data/2024/aaai/Detecting and Preventing Hallucinations in Large Vision Language Models b/data/2024/aaai/Detecting and Preventing Hallucinations in Large Vision Language Models new file mode 100644 index 0000000000..d288561efd --- /dev/null +++ b/data/2024/aaai/Detecting and Preventing Hallucinations in Large Vision Language Models @@ -0,0 +1 @@ +Instruction tuned Large Vision Language Models (LVLMs) have significantly advanced in generalizing across a diverse set of multi-modal tasks, especially for Visual Question Answering (VQA). However, generating detailed responses that are visually grounded is still a challenging task for these models. We find that even the current state-of-the-art LVLMs (InstructBLIP) still contain a staggering 30 percent of hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a Multimodal Hallucination Detection Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained annotations on VQA examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. Unlike previous work that only considers object hallucination, we additionally annotate both entity descriptions and relationships that are unfaithful. To demonstrate the potential of this dataset for hallucination prevention, we optimize InstructBLIP through our novel Fine-grained Direct Preference Optimization (FDPO). We also train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling (RS). We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55% respectively. We also find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively, and has a strong correlation with human-evaluated accuracy scores. The dataset is available at https://github.com/hendryx-scale/mhal-detect. 
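The perplexity-based scoring described in the code-detection abstract above can be pictured with a small sketch. Everything below is illustrative only: it assumes per-line perplexities have already been computed with some code language model, and the equal weighting of the three signals and the use of a percentile rank are assumptions, not the paper's exact formulation.

```python
import numpy as np

def unified_score(line_perplexities):
    """Combine overall perplexity, per-line variation, and burstiness into one score.
    The equal weighting here is an illustrative choice, not the paper's."""
    ppl = np.asarray(line_perplexities, dtype=float)
    overall = ppl.mean()                       # overall perplexity
    variation = ppl.std()                      # variation of code-line perplexity
    burstiness = variation / (overall + 1e-8)  # spread relative to the mean
    return overall + variation + burstiness

def rank_of_original(original_ppl, perturbed_ppl_list):
    """Percentile rank of the original submission's score among its perturbed
    variants; the abstract uses the original's position in such a ranking as
    the detection signal."""
    original = unified_score(original_ppl)
    scores = [unified_score(p) for p in perturbed_ppl_list]
    return sum(original >= s for s in scores) / len(scores)
```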
\ No newline at end of file diff --git a/data/2024/aaai/Detection and Defense of Unlearnable Examples b/data/2024/aaai/Detection and Defense of Unlearnable Examples new file mode 100644 index 0000000000..bb33645df3 --- /dev/null +++ b/data/2024/aaai/Detection and Defense of Unlearnable Examples @@ -0,0 +1 @@ +Privacy preservation has become increasingly critical with the emergence of social media. Unlearnable examples have been proposed to avoid leaking personal information on the Internet by degrading the generalization abilities of deep learning models. However, our study reveals that unlearnable examples are easily detectable. We provide theoretical results on the linear separability of certain unlearnable poisoned datasets and simple network-based detection methods that can identify all existing unlearnable examples, as demonstrated by extensive experiments. Detectability of unlearnable examples with simple networks motivates us to design a novel defense method. We propose using stronger data augmentations coupled with adversarial noise generated by simple networks to degrade the detectability and thus provide effective defense against unlearnable examples at a lower cost. Adversarial training with large budgets is a widely used defense method against unlearnable examples. We establish quantitative criteria between the poison and adversarial budgets, which determine the existence of robust unlearnable examples or the failure of the adversarial defense. \ No newline at end of file diff --git a/data/2024/aaai/Detection-Based Intermediate Supervision for Visual Question Answering b/data/2024/aaai/Detection-Based Intermediate Supervision for Visual Question Answering new file mode 100644 index 0000000000..c78934080e --- /dev/null +++ b/data/2024/aaai/Detection-Based Intermediate Supervision for Visual Question Answering @@ -0,0 +1 @@ +Recently, neural module networks (NMNs) have yielded ongoing success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose the complex question into several sub-tasks using instance-modules from the reasoning paths of that question and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered due to sketchy modeling of intermediate supervisions. For instance, (1) the prior assumption that each instance-module refers to only one grounded object overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervisions may introduce noise signals as the bounding box overlap issue might guide the model's focus towards irrelevant objects. To address these issues, a novel method, Detection-based Intermediate Supervision (DIS), is proposed, which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances the consistency in answering compositional questions and their sub-questions. Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches. 
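As a rough illustration of the linear-separability observation in the unlearnable-examples abstract above, a plain linear probe on raw pixels can already serve as a detection signal. The sketch below is a minimal, generic probe under that assumption; it is not the authors' procedure, and the random arrays are only placeholders for a possibly poisoned dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_separability_probe(images, labels, max_iter=100):
    """Fit a simple logistic-regression classifier on raw pixels.
    Near-perfect training accuracy from such a weak model suggests the labels
    are predictable from shortcut perturbations rather than semantic content."""
    x = images.reshape(len(images), -1).astype(np.float32)
    clf = LogisticRegression(max_iter=max_iter).fit(x, labels)
    return clf.score(x, labels)  # training accuracy as the detection signal

# Placeholder usage with random data standing in for a candidate dataset.
rng = np.random.default_rng(0)
acc = linear_separability_probe(rng.normal(size=(1000, 8, 8)), rng.integers(0, 10, 1000))
print(f"probe training accuracy: {acc:.2f}")
```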
\ No newline at end of file diff --git a/data/2024/aaai/Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer with Adaptive Channel Expansion b/data/2024/aaai/Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer with Adaptive Channel Expansion new file mode 100644 index 0000000000..a941c563a2 --- /dev/null +++ b/data/2024/aaai/Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer with Adaptive Channel Expansion @@ -0,0 +1 @@ +Vignetting commonly occurs as a degradation in images resulting from factors such as lens design, improper lens hood usage, and limitations in camera sensors. This degradation affects image details and color accuracy, and presents challenges in computational photography. Existing vignetting removal algorithms predominantly rely on ideal physics assumptions and hand-crafted parameters, resulting in the ineffective removal of irregular vignetting and suboptimal results. Moreover, the substantial lack of real-world vignetting datasets hinders the objective and comprehensive evaluation of vignetting removal. To address these challenges, we present VigSet, a pioneering dataset for vignetting removal. VigSet includes 983 pairs of both vignetting and vignetting-free high-resolution (over 4k) real-world images under various conditions. In addition, we introduce DeVigNet, a novel frequency-aware Transformer architecture designed for vignetting removal. Through the Laplacian Pyramid decomposition, we propose the Dual Aggregated Fusion Transformer to handle global features and remove vignetting in the low-frequency domain. Additionally, we propose the Adaptive Channel Expansion Module to enhance details in the high-frequency domain. The experiments demonstrate that the proposed model outperforms existing state-of-the-art methods. The code, models, and dataset are available at https://github.com/CXH-Research/DeVigNet. \ No newline at end of file diff --git a/data/2024/aaai/DexFuncGrasp: A Robotic Dexterous Functional Grasp Dataset Constructed from a Cost-Effective Real-Simulation Annotation System b/data/2024/aaai/DexFuncGrasp: A Robotic Dexterous Functional Grasp Dataset Constructed from a Cost-Effective Real-Simulation Annotation System new file mode 100644 index 0000000000..35fceea764 --- /dev/null +++ b/data/2024/aaai/DexFuncGrasp: A Robotic Dexterous Functional Grasp Dataset Constructed from a Cost-Effective Real-Simulation Annotation System @@ -0,0 +1 @@ +A robot grasp dataset is the basis for designing a robot's grasp generation model. Compared with building grasp datasets for low-DOF grippers, it is harder to do so for high-DOF dexterous robot hands. Most current datasets meet the needs of generating stable grasps, but they are not suitable for dexterous hands to complete human-like functional grasps, such as grasping the handle of a cup or pressing the button of a flashlight, so as to enable robots to complete subsequent functional manipulation actions autonomously, and there is no dataset with functional grasp pose annotations at present. This paper develops a unique Cost-Effective Real-Simulation Annotation System by leveraging natural hand actions. The system is able to capture a functional grasp of a dexterous hand in a simulated environment assisted by human demonstration in the real world. By using this system, dexterous grasp data can be collected efficiently and cost-effectively. Finally, we construct the first dexterous functional grasp dataset with rich pose annotations. 
A Functional Grasp Synthesis Model is also provided to validate the effectiveness of the proposed system and dataset. Our project page is: https://hjlllll.github.io/DFG/. \ No newline at end of file diff --git a/data/2024/aaai/DiDA: Disambiguated Domain Alignment for Cross-Domain Retrieval with Partial Labels b/data/2024/aaai/DiDA: Disambiguated Domain Alignment for Cross-Domain Retrieval with Partial Labels new file mode 100644 index 0000000000..1a92cec0b4 --- /dev/null +++ b/data/2024/aaai/DiDA: Disambiguated Domain Alignment for Cross-Domain Retrieval with Partial Labels @@ -0,0 +1 @@ +Driven by generative AI and the Internet, there is an increasing availability of a wide variety of images, leading to the significant and popular task of cross-domain image retrieval. To reduce annotation costs and increase performance, this paper focuses on an untouched but challenging problem, i.e., cross-domain image retrieval with partial labels (PCIR). Specifically, PCIR faces great challenges due to the ambiguous supervision signal and the domain gap. To address these challenges, we propose a novel method called disambiguated domain alignment (DiDA) for cross-domain retrieval with partial labels. In detail, DiDA elaborates a novel prototype-score unitization learning mechanism (PSUL) to extract common discriminative representations by simultaneously disambiguating the partial labels and narrowing the domain gap. Additionally, DiDA proposes a prototype-based domain alignment mechanism (PBDA) to further bridge the inherent cross-domain discrepancy. Attributed to PSUL and PBDA, our DiDA effectively excavates domain-invariant discrimination for cross-domain image retrieval. We demonstrate the effectiveness of DiDA through comprehensive experiments on three benchmarks, comparing it to existing state-of-the-art methods. Code available: https://github.com/lhrrrrrr/DiDA. \ No newline at end of file diff --git a/data/2024/aaai/DiG-In-GNN: Discriminative Feature Guided GNN-Based Fraud Detector against Inconsistencies in Multi-Relation Fraud Graph b/data/2024/aaai/DiG-In-GNN: Discriminative Feature Guided GNN-Based Fraud Detector against Inconsistencies in Multi-Relation Fraud Graph new file mode 100644 index 0000000000..a80f780092 --- /dev/null +++ b/data/2024/aaai/DiG-In-GNN: Discriminative Feature Guided GNN-Based Fraud Detector against Inconsistencies in Multi-Relation Fraud Graph @@ -0,0 +1 @@ +Fraud detection on multi-relation graphs aims to identify fraudsters in graphs. Graph Neural Network (GNN) models leverage graph structures to pass messages from neighbors to the target nodes, thereby enriching the representations of those target nodes. However, feature and structural inconsistency in the graph, owing to fraudsters' camouflage behaviors, diminish the suspiciousness of fraud nodes which hinders the effectiveness of GNN-based models. In this work, we propose DiG-In-GNN, Discriminative Feature Guided GNN against Inconsistency, to dig into graphs for fraudsters. Specifically, we use multi-scale contrastive learning from the perspective of the neighborhood subgraph where the target node is located to generate guidance nodes to cope with the feature inconsistency. Then, guided by the guidance nodes, we conduct fine-grained neighbor selection through reinforcement learning for each neighbor node to precisely filter nodes that can enhance the message passing and therefore alleviate structural inconsistency. 
Finally, the two modules are integrated together to obtain discriminable representations of the nodes. Experiments on three fraud detection datasets demonstrate the superiority of the proposed method DiG-In-GNN, which obtains up to 20.73% improvement over previous state-of-the-art methods. Our code can be found at https://github.com/GraphBerry/DiG-In-GNN. \ No newline at end of file diff --git "a/data/2024/aaai/DiSCO: Diffusion Schr\303\266dinger Bridge for Molecular Conformer Optimization" "b/data/2024/aaai/DiSCO: Diffusion Schr\303\266dinger Bridge for Molecular Conformer Optimization" new file mode 100644 index 0000000000..bf1a516b56 --- /dev/null +++ "b/data/2024/aaai/DiSCO: Diffusion Schr\303\266dinger Bridge for Molecular Conformer Optimization" @@ -0,0 +1 @@ +The generation of energetically optimal 3D molecular conformers is crucial in cheminformatics and drug discovery. While deep generative models have been utilized for direct generation in Euclidean space, this approach encounters challenges, including the complexity of navigating a vast search space. Recent generative models that implement simplifications to circumvent these challenges have achieved state-of-the-art results, but this simplified approach unavoidably creates a gap between the generated conformers and the ground-truth conformational landscape. To bridge this gap, we introduce DiSCO: Diffusion Schrödinger Bridge for Molecular Conformer Optimization, a novel diffusion framework that enables direct learning of nonlinear diffusion processes in prior-constrained Euclidean space for the optimization of 3D molecular conformers. Through the incorporation of an SE(3)-equivariant Schrödinger bridge, we establish the roto-translational equivariance of the generated conformers. Our framework is model-agnostic and offers an easily implementable solution for the post hoc optimization of conformers produced by any generation method. Through comprehensive evaluations and analyses, we establish the strengths of our framework, substantiating the application of the Schrödinger bridge for molecular conformer optimization. First, our approach consistently outperforms four baseline approaches, producing conformers with higher diversity and improved quality. Then, we show that the intermediate conformers generated during our diffusion process exhibit valid and chemically meaningful characteristics. We also demonstrate the robustness of our method when starting from conformers of diverse quality, including those unseen during training. Lastly, we show that the precise generation of low-energy conformers via our framework helps in enhancing the downstream prediction of molecular properties. The code is available at https://github.com/Danyeong-Lee/DiSCO. \ No newline at end of file diff --git a/data/2024/aaai/Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach b/data/2024/aaai/Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach new file mode 100644 index 0000000000..a6cd16d7b3 --- /dev/null +++ b/data/2024/aaai/Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach @@ -0,0 +1 @@ +Invariant representation learning (IRL) encourages the prediction from invariant causal features to labels deconfounded from the environments, advancing the technical roadmap of out-of-distribution (OOD) generalization. 
Despite the attention it has received, a recent theoretical result verified that some causal features recovered by IRL methods merely appear domain-invariant in the training environments but fail in unseen domains. This fake invariance severely endangers OOD generalization, since the trustworthiness of the objective cannot be diagnosed and existing causal remedies cannot rectify it. In this paper, we review an IRL family (InvRat) under the Partially and Fully Informative Invariant Feature Structural Causal Models (PIIF SCM / FIIF SCM), respectively, to certify its weaknesses in representing fake invariant features, and then unify their causal diagrams to propose the ReStructured SCM (RS-SCM). RS-SCM can ideally rebuild the spurious and the fake invariant features simultaneously. Given this, we further develop an approach based on conditional mutual information with respect to the RS-SCM to rigorously rectify the spurious and fake invariant effects. It can be easily implemented by a small feature selection subnet introduced in the IRL family, which is alternately optimized to achieve our goal. Experiments verify the superiority of our approach in combating the fake invariance issue across a variety of OOD generalization benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Generation for Few-Shot Learning b/data/2024/aaai/Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Generation for Few-Shot Learning new file mode 100644 index 0000000000..e0ac9a0b10 --- /dev/null +++ b/data/2024/aaai/Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Generation for Few-Shot Learning @@ -0,0 +1 @@ +The prompt-based paradigm for pre-trained language models (PLMs) has succeeded substantially in few-shot natural language processing (NLP) tasks. However, prior discrete prompt optimization methods require expert knowledge to design the base prompt set and identify high-quality prompts, which is costly, inefficient, and subjective. Meanwhile, existing continuous prompt optimization methods improve the performance by learning the ideal prompts through the gradient information of PLMs, whose high computational cost and low readability and generalizability are often concerning. To address the research gap, we propose a Dialogue-comprised Policy-gradient-based Discrete Prompt Optimization (DP_2O) method. We first design a multi-round dialogue alignment strategy based on GPT-4 to generate a readable prompt set. Furthermore, we propose an efficient prompt screening metric to identify high-quality prompts with linear complexity. Finally, we construct a reinforcement learning (RL) framework based on policy gradients to match the prompts to inputs optimally. By training a policy network with only 0.62M parameters on the tasks in the few-shot setting, DP_2O outperforms the state-of-the-art (SOTA) method by 1.52% in accuracy on average on four open-source datasets. Moreover, subsequent experiments also demonstrate that DP_2O has good universality, robustness and generalization ability. 
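The policy-gradient prompt-matching step named in the DP_2O abstract above can be pictured with a toy REINFORCE loop. Everything in this sketch is a stand-in: the candidate prompts, the 64-dimensional input embedding, and the reward function are assumptions replacing the GPT-4-generated prompt set and the downstream PLM evaluation that the abstract only names.

```python
import torch
import torch.nn as nn

# Hypothetical candidate prompt set and a tiny policy network over it.
candidate_prompts = ["Classify the sentiment:", "Is this review positive?", "Sentiment:"]
policy = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, len(candidate_prompts)))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward_fn(prompt_idx, x):
    # Placeholder reward: in the paper's setting this would reflect whether the
    # PLM answers correctly when candidate_prompts[prompt_idx] is prepended to x.
    return torch.rand(())

for step in range(100):
    x = torch.randn(64)                          # stand-in for an input embedding
    dist = torch.distributions.Categorical(logits=policy(x))
    action = dist.sample()                       # pick a prompt for this input
    reward = reward_fn(action.item(), x)
    loss = -dist.log_prob(action) * reward       # REINFORCE objective (no baseline)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```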
\ No newline at end of file diff --git a/data/2024/aaai/Dialogues Are Not Just Text: Modeling Cognition for Dialogue Coherence Evaluation b/data/2024/aaai/Dialogues Are Not Just Text: Modeling Cognition for Dialogue Coherence Evaluation new file mode 100644 index 0000000000..2c9d967629 --- /dev/null +++ b/data/2024/aaai/Dialogues Are Not Just Text: Modeling Cognition for Dialogue Coherence Evaluation @@ -0,0 +1,2 @@ +The generation of logically coherent dialogues by humans relies on underlying cognitive abilities. Based on this, we redefine the dialogue coherence evaluation process, combining cognitive judgment with the basic text to achieve a more human-like evaluation. We propose a novel dialogue evaluation framework based on Dialogue Cognition Graph (DCGEval) to implement the fusion by in-depth interaction between cognition modeling and text modeling. The proposed Abstract Meaning Representation (AMR) based graph structure called DCG aims to uniformly model four dialogue cognitive abilities. Specifically, core-semantic cognition is modeled by converting the utterance into an AMR graph, which can extract essential semantic information without redundancy. The temporal and role cognition are modeled by establishing logical relationships among the different AMR graphs. Finally, the commonsense knowledge from ConceptNet is fused to express commonsense cognition. Experiments demonstrate the necessity of modeling human cognition for +dialogue evaluation, and our DCGEval presents stronger correlations with human judgments compared to other state-of-the-art evaluation metrics. \ No newline at end of file diff --git a/data/2024/aaai/DifAttack: Query-Efficient Black-Box Adversarial Attack via Disentangled Feature Space b/data/2024/aaai/DifAttack: Query-Efficient Black-Box Adversarial Attack via Disentangled Feature Space new file mode 100644 index 0000000000..d5d879882b --- /dev/null +++ b/data/2024/aaai/DifAttack: Query-Efficient Black-Box Adversarial Attack via Disentangled Feature Space @@ -0,0 +1 @@ +This work investigates efficient score-based black-box adversarial attacks with a high Attack Success Rate (\textbf{ASR}) and good generalizability. We design a novel attack method based on a hierarchical DIsentangled Feature space, called \textbf{DifAttack++}, which differs significantly from the existing ones operating over the entire feature space. Specifically, DifAttack++ firstly disentangles an image's latent feature into an Adversarial Feature (\textbf{AF}) and a Visual Feature (\textbf{VF}) via an autoencoder equipped with our specially designed Hierarchical Decouple-Fusion (\textbf{HDF}) module, where the AF dominates the adversarial capability of an image, while the VF largely determines its visual appearance. We train such two autoencoders for the clean and adversarial image domains (i.e., cross-domain) respectively to achieve image reconstructions and feature disentanglement, by using pairs of clean images and their Adversarial Examples (\textbf{AE}s) generated from available surrogate models via white-box attack methods. Eventually, in the black-box attack stage, DifAttack++ iteratively optimizes the AF according to the query feedback from the victim model until a successful AE is generated, while keeping the VF unaltered. Extensive experimental results demonstrate that our DifAttack++ leads to superior ASR and query efficiency than state-of-the-art methods, meanwhile exhibiting much better visual quality of AEs. The code is available at https://github.com/csjunjun/DifAttack.git. 
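The query-feedback loop in the DifAttack++ abstract above can be illustrated, very loosely, by a generic score-based random search over a latent adversarial feature. The decode and query_score callables below are hypothetical placeholders for the paper's autoencoder and victim-model interface, and the accept-if-better rule is a simplification, not the paper's optimizer.

```python
import numpy as np

def score_based_latent_attack(decode, query_score, z_adv, steps=200, sigma=0.05, seed=0):
    """Random-search sketch: perturb the latent adversarial feature and keep a
    candidate only when the (assumed) victim-model feedback improves."""
    rng = np.random.default_rng(seed)
    best_score = query_score(decode(z_adv))
    for _ in range(steps):
        candidate = z_adv + sigma * rng.standard_normal(z_adv.shape)
        score = query_score(decode(candidate))
        if score > best_score:        # higher score = more adversarial (assumption)
            z_adv, best_score = candidate, score
    return z_adv, best_score

# Toy usage in a 16-dimensional latent space with placeholder callables.
z0 = np.zeros(16)
decode = lambda z: z                         # identity decoder as a placeholder
query_score = lambda img: float(img.sum())   # placeholder victim feedback
z_star, s = score_based_latent_attack(decode, query_score, z0)
```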
\ No newline at end of file diff --git a/data/2024/aaai/DiffAIL: Diffusion Adversarial Imitation Learning b/data/2024/aaai/DiffAIL: Diffusion Adversarial Imitation Learning new file mode 100644 index 0000000000..c7c932335a --- /dev/null +++ b/data/2024/aaai/DiffAIL: Diffusion Adversarial Imitation Learning @@ -0,0 +1 @@ +Imitation learning aims to solve the problem of defining reward functions in real-world decision-making tasks. The current popular approach is the Adversarial Imitation Learning (AIL) framework, which matches expert state-action occupancy measures to obtain a surrogate reward for forward reinforcement learning. However, the traditional discriminator is a simple binary classifier and doesn't learn an accurate distribution, which may result in failing to identify expert-level state-action pairs induced by the policy interacting with the environment. To address this issue, we propose a method named diffusion adversarial imitation learning (DiffAIL), which introduces the diffusion model into the AIL framework. Specifically, DiffAIL models the state-action pairs as unconditional diffusion models and uses diffusion loss as part of the discriminator's learning objective, which enables the discriminator to capture better expert demonstrations and improve generalization. Experimentally, the results show that our method achieves state-of-the-art performance and significantly surpasses expert demonstration on two benchmark tasks, including the standard state-action setting and state-only settings. \ No newline at end of file diff --git a/data/2024/aaai/DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception b/data/2024/aaai/DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception new file mode 100644 index 0000000000..9822e126cb --- /dev/null +++ b/data/2024/aaai/DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception @@ -0,0 +1 @@ +BEV perception is of great importance in the field of autonomous driving, serving as the cornerstone of planning, controlling, and motion prediction. The quality of the BEV feature highly affects the performance of BEV perception. However, taking the noises in camera parameters and LiDAR scans into consideration, we usually obtain BEV representation with harmful noises. Diffusion models naturally have the ability to denoise noisy samples to the ideal data, which motivates us to utilize the diffusion model to get a better BEV representation. In this work, we propose an end-to-end framework, named DiffBEV, to exploit the potential of diffusion model to generate a more comprehensive BEV representation. To the best of our knowledge, we are the first to apply diffusion model to BEV perception. In practice, we design three types of conditions to guide the training of the diffusion model which denoises the coarse samples and refines the semantic feature in a progressive way. What's more, a cross-attention module is leveraged to fuse the context of BEV feature and the semantic content of conditional diffusion model. DiffBEV achieves a 25.9% mIoU on the nuScenes dataset, which is 6.2% higher than the best-performing existing approach. Quantitative and qualitative results on multiple benchmarks demonstrate the effectiveness of DiffBEV in BEV semantic segmentation and 3D object detection tasks. 
\ No newline at end of file diff --git a/data/2024/aaai/DiffRAW: Leveraging Diffusion Model to Generate DSLR-Comparable Perceptual Quality sRGB from Smartphone RAW Images b/data/2024/aaai/DiffRAW: Leveraging Diffusion Model to Generate DSLR-Comparable Perceptual Quality sRGB from Smartphone RAW Images new file mode 100644 index 0000000000..f96d356aea --- /dev/null +++ b/data/2024/aaai/DiffRAW: Leveraging Diffusion Model to Generate DSLR-Comparable Perceptual Quality sRGB from Smartphone RAW Images @@ -0,0 +1 @@ +Deriving DSLR-quality sRGB images from smartphone RAW images has become a compelling challenge due to discernible detail disparity, color mapping instability, and spatial misalignment in RAW-sRGB data pairs. We present DiffRAW, a novel method that incorporates the diffusion model for the first time in learning RAW-to-sRGB mappings. By leveraging the diffusion model, our approach effectively learns the high-quality detail distribution of DSLR images, thereby enhancing the details of output images. Simultaneously, we use the RAW image as a diffusion condition to maintain image structure information such as contours and textures. To mitigate the interference caused by the color and spatial misalignment in training data pairs, we embed a color-position preserving condition within DiffRAW, ensuring that the output images do not exhibit color biases and pixel shift issues. To accelerate the inference process of DiffRAW, we designed the Domain Transform Diffusion Method, an efficient diffusion process with its corresponding reverse process. The Domain Transform Diffusion Method can reduce the required inference steps for diffusion model-based image restoration/enhancement algorithms while enhancing the quality of the generated images. Through evaluations on the ZRR dataset, DiffRAW consistently demonstrates state-of-the-art performance across all perceptual quality metrics (e.g., LPIPS, FID, MUSIQ), while achieving comparable results in PSNR and SSIM. \ No newline at end of file diff --git a/data/2024/aaai/DiffSED: Sound Event Detection with Denoising Diffusion b/data/2024/aaai/DiffSED: Sound Event Detection with Denoising Diffusion new file mode 100644 index 0000000000..df0fad041b --- /dev/null +++ b/data/2024/aaai/DiffSED: Sound Event Detection with Denoising Diffusion @@ -0,0 +1 @@ +Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions in the elegant Transformer decoder framework. Doing so enables the model to generate accurate event boundaries from even noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with 40+% faster convergence in training. 
Code: https://github.com/Surrey-UPLab/DiffSED \ No newline at end of file diff --git a/data/2024/aaai/Differentiable Auxiliary Learning for Sketch Re-Identification b/data/2024/aaai/Differentiable Auxiliary Learning for Sketch Re-Identification new file mode 100644 index 0000000000..ed0470f24b --- /dev/null +++ b/data/2024/aaai/Differentiable Auxiliary Learning for Sketch Re-Identification @@ -0,0 +1 @@ +Sketch re-identification (Re-ID) seeks to match pedestrians' photos from surveillance videos with corresponding sketches. However, we observe that existing works still have two critical limitations: (i) cross- and intra-modality discrepancies hinder the extraction of modality-shared features, (ii) standard triplet loss fails to constrain latent feature distribution in each modality with inadequate samples. To overcome the above issues, we propose a differentiable auxiliary learning network (DALNet) to explore a robust auxiliary modality for Sketch Re-ID. Specifically, for (i) we construct an auxiliary modality by using a dynamic auxiliary generator (DAG) to bridge the gap between sketch and photo modalities. The auxiliary modality highlights the described person in photos to mitigate background clutter and learns sketch style through style refinement. Moreover, a modality interactive attention module (MIA) is presented to align the features and learn the invariant patterns of two modalities by auxiliary modality. To address (ii), we propose a multi-modality collaborative learning scheme (MMCL) to align the latent distribution of three modalities. An intra-modality circle loss in MMCL brings learned global and modality-shared features of the same identity closer in the case of insufficient samples within each modality. Extensive experiments verify the superior performance of our DALNet over the state-of-the-art methods for Sketch Re-ID, and the generalization in sketch-based image retrieval and sketch-photo face recognition tasks. \ No newline at end of file diff --git a/data/2024/aaai/Diffusion Language-Shapelets for Semi-supervised Time-Series Classification b/data/2024/aaai/Diffusion Language-Shapelets for Semi-supervised Time-Series Classification new file mode 100644 index 0000000000..83653c8950 --- /dev/null +++ b/data/2024/aaai/Diffusion Language-Shapelets for Semi-supervised Time-Series Classification @@ -0,0 +1 @@ +Semi-supervised time-series classification could effectively alleviate the issue of lacking labeled data. However, existing approaches usually ignore model interpretability, making it difficult for humans to understand the principles behind the predictions of a model. Shapelets are a set of discriminative subsequences that show high interpretability in time series classification tasks. Shapelet learning-based methods have demonstrated promising classification performance. Unfortunately, without enough labeled data, the shapelets learned by existing methods are often poorly discriminative, and even dissimilar to any subsequence of the original time series. To address this issue, we propose the Diffusion Language-Shapelets model (DiffShape) for semi-supervised time series classification. In DiffShape, a self-supervised diffusion learning mechanism is designed, which uses real subsequences as a condition. This helps to increase the similarity between the learned shapelets and real subsequences by using a large amount of unlabeled data. 
Furthermore, we introduce a contrastive language-shapelets learning strategy that improves the discriminability of the learned shapelets by incorporating the natural language descriptions of the time series. Experiments have been conducted on the UCR time series archive, and the results reveal that the proposed DiffShape method achieves state-of-the-art performance and exhibits superior interpretability over baselines. \ No newline at end of file diff --git a/data/2024/aaai/DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection b/data/2024/aaai/DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection new file mode 100644 index 0000000000..49cb413eea --- /dev/null +++ b/data/2024/aaai/DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection @@ -0,0 +1 @@ +Limited by the encoder-decoder architecture, learning-based edge detectors usually have difficulty predicting edge maps that satisfy both correctness and crispness. With the recent success of the diffusion probabilistic model (DPM), we found it is especially suitable for accurate and crisp edge detection since the denoising process is directly applied to the original image size. Therefore, we propose the first diffusion model for the task of general edge detection, which we call DiffusionEdge. To avoid expensive computational resources while retaining the final performance, we apply DPM in the latent space and enable the classic cross-entropy loss which is uncertainty-aware in pixel level to directly optimize the parameters in latent space in a distillation manner. We also adopt a decoupled architecture to speed up the denoising process and propose a corresponding adaptive Fourier filter to adjust the latent features of specific frequencies. With all the technical designs, DiffusionEdge can be stably trained with limited resources, predicting crisp and accurate edge maps with much fewer augmentation strategies. Extensive experiments on four edge detection benchmarks demonstrate the superiority of DiffusionEdge both in correctness and crispness. On the NYUDv2 dataset, compared to the second best, we increase the ODS, OIS (without post-processing) and AC by 30.2%, 28.1% and 65.1%, respectively. Code: https://github.com/GuHuangAI/DiffusionEdge. \ No newline at end of file diff --git a/data/2024/aaai/DiffusionTrack: Diffusion Model for Multi-Object Tracking b/data/2024/aaai/DiffusionTrack: Diffusion Model for Multi-Object Tracking new file mode 100644 index 0000000000..e475b59c8e --- /dev/null +++ b/data/2024/aaai/DiffusionTrack: Diffusion Model for Multi-Object Tracking @@ -0,0 +1 @@ +Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite the success of these approaches, they also suffer from common problems, such as harmful global or local inconsistency, poor trade-off between robustness and model complexity, and lack of flexibility in different scenes within the same video. In this paper we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker's effectiveness, enabling it to discriminate between various objects. 
During the training stage, paired object boxes diffuse from paired ground-truth boxes to random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. In inference, the model refines a set of paired randomly generated boxes to the detection and tracking results in a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods. Code is available at https://github.com/RainBowLuoCS/DiffusionTrack. \ No newline at end of file diff --git a/data/2024/aaai/Digital Twin-Driven Teat Localization and Shape Identification for Dairy Cow (Student Abstract) b/data/2024/aaai/Digital Twin-Driven Teat Localization and Shape Identification for Dairy Cow (Student Abstract) new file mode 100644 index 0000000000..abc700808f --- /dev/null +++ b/data/2024/aaai/Digital Twin-Driven Teat Localization and Shape Identification for Dairy Cow (Student Abstract) @@ -0,0 +1 @@ +Dairy owners invest heavily to keep their animals healthy. There is good reason to hope that technologies such as computer vision and artificial intelligence (AI) could reduce costs, yet obstacles arise when adapting these advanced tools to farming environments. In this work, we applied AI tools to dairy cow teat localization and teat shape classification, obtaining a model that achieves a mean average precision of 0.783. This digital twin-driven approach is intended as a first step towards automating and accelerating the detection and treatment of hyperkeratosis, mastitis, and other medical conditions that significantly burden the dairy industry. \ No newline at end of file diff --git a/data/2024/aaai/Direct Amortized Likelihood Ratio Estimation b/data/2024/aaai/Direct Amortized Likelihood Ratio Estimation new file mode 100644 index 0000000000..f412cdc002 --- /dev/null +++ b/data/2024/aaai/Direct Amortized Likelihood Ratio Estimation @@ -0,0 +1 @@ +We introduce a new amortized likelihood ratio estimator for likelihood-free simulation-based inference (SBI). Our estimator is simple to train and estimates the likelihood ratio using a single forward pass of the neural estimator. Our approach directly computes the likelihood ratio between two competing parameter sets which is different from the previous approach of comparing two neural network output values. We refer to our model as the direct neural ratio estimator (DNRE). As part of introducing the DNRE, we derive a corresponding Monte Carlo estimate of the posterior. We benchmark our new ratio estimator and compare to previous ratio estimators in the literature. We show that our new ratio estimator often outperforms these previous approaches. As a further contribution, we introduce a new derivative estimator for likelihood ratio estimators that enables us to compare likelihood-free Hamiltonian Monte Carlo (HMC) with random-walk Metropolis-Hastings (MH). We show that HMC is equally competitive, which has not been previously shown. Finally, we include a novel real-world application of SBI by using our neural ratio estimator to design a quadcopter. Code is available at https://github.com/SRI-CSL/dnre. 
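For context on the DNRE abstract above, the conventional amortized ratio estimator it contrasts with can be sketched as a binary classifier trained to separate joint samples (x, theta) from marginal samples (x, theta'), whose logit then approximates the log likelihood-to-evidence ratio. The toy Gaussian simulator and uniform prior below are assumptions chosen for illustration; this is the baseline formulation, not the paper's direct estimator.

```python
import torch
import torch.nn as nn

def simulate(theta):
    # Toy simulator: observation = parameter + Gaussian noise.
    return theta + 0.5 * torch.randn_like(theta)

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    theta = torch.rand(256, 1) * 4 - 2               # prior U(-2, 2)
    x = simulate(theta)
    theta_marg = theta[torch.randperm(len(theta))]   # shuffle to break the pairing
    joint = torch.cat([x, theta], dim=1)
    marginal = torch.cat([x, theta_marg], dim=1)
    logits = torch.cat([net(joint), net(marginal)])
    labels = torch.cat([torch.ones(len(joint), 1), torch.zeros(len(marginal), 1)])
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, net(torch.cat([x, theta], dim=1)) approximates log p(x|theta)/p(x).
```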
\ No newline at end of file diff --git a/data/2024/aaai/Direct May Not Be the Best: An Incremental Evolution View of Pose Generation b/data/2024/aaai/Direct May Not Be the Best: An Incremental Evolution View of Pose Generation new file mode 100644 index 0000000000..fb3b57ce9c --- /dev/null +++ b/data/2024/aaai/Direct May Not Be the Best: An Incremental Evolution View of Pose Generation @@ -0,0 +1 @@ +Pose diversity is an inherent representative characteristic of 2D images. Due to the 3D to 2D projection mechanism, there is evident content discrepancy among distinct pose images. This is the main obstacle hindering pose transformation research. To deal with this challenge, we propose a fine-grained, incremental-evolution-centered pose generation framework, rather than the traditional direct one-to-one generation. Since the proposed approach bypasses the theoretical difficulty of directly modeling dramatic non-linear variation, the incurred content distortion and blurring can be effectively constrained, while the various individual pose details, especially clothing texture, can be precisely maintained. To systematically guide the evolution course, both global and incremental evolution constraints are elaborately designed and merged into the overall framework, and a novel triple-path knowledge fusion structure is devised to take full advantage of all available valuable knowledge for high-quality pose synthesis. In addition, our framework can generate a series of valuable by-products, namely the various intermediate poses. Extensive experiments have been conducted to verify the effectiveness of the proposed approach. Code is available at https://github.com/Xiaofei-CN/Incremental-Evolution-Pose-Generation. \ No newline at end of file diff --git a/data/2024/aaai/Directed Diffusion: Direct Control of Object Placement through Attention Guidance b/data/2024/aaai/Directed Diffusion: Direct Control of Object Placement through Attention Guidance new file mode 100644 index 0000000000..e9f3b45d59 --- /dev/null +++ b/data/2024/aaai/Directed Diffusion: Direct Control of Object Placement through Attention Guidance @@ -0,0 +1 @@ +Text-guided diffusion models such as DALLE-2, Imagen, and Stable Diffusion are able to generate an effectively endless variety of images given only a short text prompt describing the desired image content. In many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects such as characters in specified positional relationships. The missing capability to ``direct'' the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of objects denoted by those words, we introduce an optimization objective that produces ``activation'' at desired positions in these cross-attention maps. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. Directed Diffusion provides easy high-level positional control over multiple objects, while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. 
Moreover, it requires only a few lines to implement. \ No newline at end of file diff --git "a/data/2024/aaai/Direction-Aware Video Demoir\303\251ing with Temporal-Guided Bilateral Learning" "b/data/2024/aaai/Direction-Aware Video Demoir\303\251ing with Temporal-Guided Bilateral Learning" new file mode 100644 index 0000000000..0a07818895 --- /dev/null +++ "b/data/2024/aaai/Direction-Aware Video Demoir\303\251ing with Temporal-Guided Bilateral Learning" @@ -0,0 +1 @@ +Moiré patterns occur when capturing images or videos on screens, severely degrading the quality of the captured images or videos. Despite recent progress, existing video demoiréing methods neglect the physical characteristics and formation process of moiré patterns, significantly limiting the effectiveness of video recovery. This paper presents a unified framework, DTNet, a direction-aware and temporal-guided bilateral learning network for video demoiréing. DTNet effectively incorporates the process of moiré pattern removal, alignment, color correction, and detail refinement. Our proposed DTNet comprises two primary stages: Frame-level Direction-aware Demoiréing and Alignment (FDDA) and Tone and Detail Refinement (TDR). In FDDA, we employ multiple directional DCT modes to perform the moiré pattern removal process in the frequency domain, effectively detecting the prominent moiré edges. Then, coarse- and fine-grained alignment is applied to the demoiréd features to facilitate the utilization of neighboring information. In TDR, we propose a temporal-guided bilateral learning pipeline to mitigate the degradation of color and details caused by the moiré patterns while preserving the restored frequency information in FDDA. Guided by the aligned temporal features from FDDA, the affine transformations for the recovery of the ultimate clean frames are learned in TDR. Extensive experiments demonstrate that our video demoiréing method outperforms state-of-the-art approaches by 2.3 dB in PSNR, and also delivers a superior visual experience. \ No newline at end of file diff --git a/data/2024/aaai/Dirichlet-Based Prediction Calibration for Learning with Noisy Labels b/data/2024/aaai/Dirichlet-Based Prediction Calibration for Learning with Noisy Labels new file mode 100644 index 0000000000..38ed44cd0f --- /dev/null +++ b/data/2024/aaai/Dirichlet-Based Prediction Calibration for Learning with Noisy Labels @@ -0,0 +1 @@ +Learning with noisy labels can significantly hinder the generalization performance of deep neural networks (DNNs). Existing approaches address this issue through loss correction or example selection methods. However, these methods often rely on the model's predictions obtained from the softmax function, which can be over-confident and unreliable. In this study, we identify the translation invariance of the softmax function as the underlying cause of this problem and propose the \textit{Dirichlet-based Prediction Calibration} (DPC) method as a solution. Our method introduces a calibrated softmax function that breaks the translation invariance by incorporating a suitable constant in the exponent term, enabling more reliable model predictions. To ensure stable model training, we leverage a Dirichlet distribution to assign probabilities to predicted labels and introduce a novel evidence deep learning (EDL) loss. 
The proposed loss function encourages positive and sufficiently large logits for the given label, while penalizing negative and small logits for other labels, leading to more distinct logits and facilitating better example selection based on a large-margin criterion. Through extensive experiments on diverse benchmark datasets, we demonstrate that DPC achieves state-of-the-art performance. The code is available at https://github.com/chenchenzong/DPC. \ No newline at end of file diff --git a/data/2024/aaai/Discerning Temporal Difference Learning b/data/2024/aaai/Discerning Temporal Difference Learning new file mode 100644 index 0000000000..fa9cf5e40f --- /dev/null +++ b/data/2024/aaai/Discerning Temporal Difference Learning @@ -0,0 +1 @@ +Temporal difference learning (TD) is a foundational concept in reinforcement learning (RL), aimed at efficiently assessing a policy's value function. TD(λ), a potent variant, incorporates a memory trace to distribute the prediction error into the historical context. However, this approach often neglects the significance of historical states and the relative importance of propagating the TD error, influenced by challenges such as visitation imbalance or outcome noise. To address this, we propose a novel TD algorithm named discerning TD learning (DTD), which allows flexible emphasis functions—predetermined or adapted during training—to allocate efforts effectively across states. We establish the convergence properties of our method within a specific class of emphasis functions and showcase its promising potential for adaptation to deep RL contexts. Empirical results underscore that employing a judicious emphasis function not only improves value estimation but also expedites learning across diverse scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Discovering Agents (Abstract Reprint) b/data/2024/aaai/Discovering Agents (Abstract Reprint) new file mode 100644 index 0000000000..24b4fa9495 --- /dev/null +++ b/data/2024/aaai/Discovering Agents (Abstract Reprint) @@ -0,0 +1 @@ +Causal models of agents have been used to analyse the safety aspects of machine learning systems. But identifying agents is non-trivial – often the causal model is just assumed by the modeller without much justification – and modelling failures can lead to mistakes in the safety analysis. This paper proposes the first formal causal definition of agents – roughly that agents are systems that would adapt their policy if their actions influenced the world in a different way. From this we derive the first causal discovery algorithm for discovering the presence of agents from empirical data, given a set of variables and under certain assumptions. We also provide algorithms for translating between causal models and game-theoretic influence diagrams. We demonstrate our approach by resolving some previous confusions caused by incorrect causal modelling of agents. \ No newline at end of file diff --git a/data/2024/aaai/Discovering Heterogeneous Causal Effects in Relational Data b/data/2024/aaai/Discovering Heterogeneous Causal Effects in Relational Data new file mode 100644 index 0000000000..fbc6a7779e --- /dev/null +++ b/data/2024/aaai/Discovering Heterogeneous Causal Effects in Relational Data @@ -0,0 +1 @@ +Causal inference in relational data should account for the non-IID nature of the data and the interference phenomenon, which occurs when a unit's outcome is influenced by the treatments or outcomes of others. 
Existing solutions to causal inference under interference consider either homogeneous influence from peers or specific heterogeneous influence contexts (e.g., local neighborhood structure). This thesis investigates causal reasoning in relational data and the automated discovery of heterogeneous causal effects under arbitrary heterogeneous peer influence contexts and effect modification. \ No newline at end of file diff --git a/data/2024/aaai/Discovering Sequential Patterns with Predictable Inter-event Delays b/data/2024/aaai/Discovering Sequential Patterns with Predictable Inter-event Delays new file mode 100644 index 0000000000..4def908941 --- /dev/null +++ b/data/2024/aaai/Discovering Sequential Patterns with Predictable Inter-event Delays @@ -0,0 +1,2 @@ +Summarizing sequential data with serial episodes allows non-trivial insight into the data generating process. Existing methods penalize gaps in pattern occurrences equally, regardless of where in the pattern these occur. This results in a strong bias against patterns with long inter-event delays, and, in addition, means that regularity in terms of delays is neither rewarded nor discovered---even though both aspects provide key insight. +In this paper we tackle both these problems by explicitly modeling inter-event delay distributions. That is, we are not only interested in discovering the patterns, but also in describing how many time steps typically occur between their individual events. We formalize the problem in terms of the Minimum Description Length principle, by which we say the best set of patterns is the one that compresses the data best. The resulting optimization problem does not lend itself to exact optimization, and hence we propose Hopper to heuristically mine high-quality patterns. Extensive experiments show that Hopper efficiently recovers the ground truth, discovers meaningful patterns from real-world data, and outperforms existing methods in discovering long-delay patterns. \ No newline at end of file diff --git a/data/2024/aaai/Discrepancy and Uncertainty Aware Denoising Knowledge Distillation for Zero-Shot Cross-Lingual Named Entity Recognition b/data/2024/aaai/Discrepancy and Uncertainty Aware Denoising Knowledge Distillation for Zero-Shot Cross-Lingual Named Entity Recognition new file mode 100644 index 0000000000..e32e6b6635 --- /dev/null +++ b/data/2024/aaai/Discrepancy and Uncertainty Aware Denoising Knowledge Distillation for Zero-Shot Cross-Lingual Named Entity Recognition @@ -0,0 +1,5 @@ +Knowledge distillation-based approaches have recently yielded state-of-the-art (SOTA) results for cross-lingual NER tasks in zero-shot scenarios. +These approaches typically employ a teacher network trained with the labelled source (rich-resource) language to infer pseudo-soft labels for the unlabelled target (zero-shot) language, and force a student network to approximate these pseudo labels to achieve knowledge transfer. +However, previous works have rarely discussed the issue of pseudo-label noise caused by the source-target language gap, which can mislead the training of the student network and result in negative knowledge transfer. +This paper proposes a discrepancy- and uncertainty-aware Denoising Knowledge Distillation model (DenKD) to tackle this issue. +Specifically, DenKD uses a discrepancy-aware denoising representation learning method to optimize the class representations of the target language produced by the teacher network, thus enhancing the quality of pseudo labels and reducing noisy predictions. 
Further, DenKD employs an uncertainty-aware denoising method to quantify the pseudo-label noise and adjust the focus of the student network on different samples during knowledge distillation, thereby mitigating the noise's adverse effects. We conduct extensive experiments on 28 languages including 4 languages not covered by the pre-trained models, and the results demonstrate the effectiveness of our DenKD. \ No newline at end of file diff --git a/data/2024/aaai/Discrete Cycle-Consistency Based Unsupervised Deep Graph Matching b/data/2024/aaai/Discrete Cycle-Consistency Based Unsupervised Deep Graph Matching new file mode 100644 index 0000000000..21b8d63a95 --- /dev/null +++ b/data/2024/aaai/Discrete Cycle-Consistency Based Unsupervised Deep Graph Matching @@ -0,0 +1 @@ +We contribute to the sparsely populated area of unsupervised deep graph matching with application to keypoint matching in images. Contrary to the standard supervised approach, our method does not require ground truth correspondences between keypoint pairs. Instead, it is self-supervised by enforcing consistency of matchings between images of the same object category. As the matching and the consistency loss are discrete, their derivatives cannot be straightforwardly used for learning. We address this issue in a principled way by building our method upon the recent results on black-box differentiation of combinatorial solvers. This makes our method exceptionally flexible, as it is compatible with arbitrary network architectures and combinatorial solvers. Our experimental evaluation suggests that our technique sets a new state-of-the-art for unsupervised graph matching. \ No newline at end of file diff --git a/data/2024/aaai/Discretionary Trees: Understanding Street-Level Bureaucracy via Machine Learning b/data/2024/aaai/Discretionary Trees: Understanding Street-Level Bureaucracy via Machine Learning new file mode 100644 index 0000000000..dae70e66cf --- /dev/null +++ b/data/2024/aaai/Discretionary Trees: Understanding Street-Level Bureaucracy via Machine Learning @@ -0,0 +1 @@ +Street-level bureaucrats interact directly with people on behalf of government agencies to perform a wide range of functions, including, for example, administering social services and policing. A key feature of street-level bureaucracy is that the civil servants, while tasked with implementing agency policy, are also granted significant discretion in how they choose to apply that policy in individual cases. Using that discretion could be beneficial, as it allows for exceptions to policies based on human interactions and evaluations, but it could also allow biases and inequities to seep into important domains of societal resource allocation. In this paper, we use machine learning techniques to understand street-level bureaucrats' behavior. We leverage a rich dataset that combines demographic and other information on households with information on which homelessness interventions they were assigned during a period when assignments were not formulaic. We find that caseworker decisions in this time are highly predictable overall, and some, but not all of this predictivity can be captured by simple decision rules. We theorize that the decisions not captured by the simple decision rules can be considered applications of caseworker discretion. These discretionary decisions are far from random in both the characteristics of such households and in terms of the outcomes of the decisions. 
Caseworkers typically only apply discretion to households that would be considered less vulnerable. When they do apply discretion to assign households to more intensive interventions, the marginal benefits to those households are significantly higher than would be expected if the households were chosen at random; there is no similar reduction in marginal benefit to households that are discretionarily allocated less intensive interventions, suggesting that caseworkers are using their knowledge and experience to improve outcomes for households experiencing homelessness. \ No newline at end of file diff --git a/data/2024/aaai/Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression b/data/2024/aaai/Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression new file mode 100644 index 0000000000..e5d1bdca5c --- /dev/null +++ b/data/2024/aaai/Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression @@ -0,0 +1 @@ +Uncertainty quantification is critical for deploying deep neural networks (DNNs) in real-world applications. An Auxiliary Uncertainty Estimator (AuxUE) is one of the most effective means to estimate the uncertainty of the main task prediction without modifying the main task model. To be considered robust, an AuxUE must be capable of maintaining its performance and triggering higher uncertainties while encountering Out-of-Distribution (OOD) inputs, i.e., to provide robust aleatoric and epistemic uncertainty. However, for vision regression tasks, current AuxUE designs are mainly adopted for aleatoric uncertainty estimates, and AuxUE robustness has not been explored. In this work, we propose a generalized AuxUE scheme for more robust uncertainty quantification on regression tasks. Concretely, to achieve a more robust aleatoric uncertainty estimation, different distribution assumptions are considered for heteroscedastic noise, and Laplace distribution is finally chosen to approximate the prediction error. For epistemic uncertainty, we propose a novel solution named Discretization-Induced Dirichlet pOsterior (DIDO), which models the Dirichlet posterior on the discretized prediction error. Extensive experiments on age estimation, monocular depth estimation, and super-resolution tasks show that our proposed method can provide robust uncertainty estimates in the face of noisy inputs and that it can be scalable to both image-level and pixel-wise tasks. \ No newline at end of file diff --git a/data/2024/aaai/Discriminative Forests Improve Generative Diversity for Generative Adversarial Networks b/data/2024/aaai/Discriminative Forests Improve Generative Diversity for Generative Adversarial Networks new file mode 100644 index 0000000000..8f45e7a79e --- /dev/null +++ b/data/2024/aaai/Discriminative Forests Improve Generative Diversity for Generative Adversarial Networks @@ -0,0 +1 @@ +Improving the diversity of Artificial Intelligence Generated Content (AIGC) is one of the fundamental problems in the theory of generative models such as generative adversarial networks (GANs). Previous studies have demonstrated that the discriminator in GANs should have high capacity and robustness to achieve the diversity of generated data. However, a discriminator with high capacity tends to overfit and guide the generator toward collapsed equilibrium. 
In this study, we propose a novel discriminative forest GAN, named Forest-GAN, that replaces the discriminator to improve the capacity and robustness for modeling statistics in real-world data distribution. A discriminative forest is composed of multiple independent discriminators built on bootstrapped data. We prove that a discriminative forest has a generalization error bound, which is determined by the strength of individual discriminators and the correlations among them. Hence, a discriminative forest can provide very large capacity without any risk of overfitting, which subsequently improves the generative diversity. With the discriminative forest framework, we significantly improved the performance of AutoGAN with a new record FID of 19.27 from 30.71 on STL10 and improved the performance of StyleGAN2-ADA with a new record FID of 6.87 from 9.22 on LSUN-cat. \ No newline at end of file diff --git a/data/2024/aaai/Discriminatively Fuzzy Multi-View K-means Clustering with Local Structure Preserving b/data/2024/aaai/Discriminatively Fuzzy Multi-View K-means Clustering with Local Structure Preserving new file mode 100644 index 0000000000..64e2407c2d --- /dev/null +++ b/data/2024/aaai/Discriminatively Fuzzy Multi-View K-means Clustering with Local Structure Preserving @@ -0,0 +1 @@ +Multi-view K-means clustering successfully generalizes K-means from single-view to multi-view, and obtains excellent clustering performance. In every view, it makes each data point close to the center of the corresponding cluster. However, multi-view K-means only considers the compactness of each cluster, but ignores the separability of different clusters, which is of great importance to producing a good clustering result. In this paper, we propose Discriminatively Fuzzy Multi-view K-means clustering with Local Structure Preserving (DFMKLS). On the basis of minimizing the distance between each data point and the center of the corresponding cluster, DFMKLS separates clusters by maximizing the distance between the centers of pairwise clusters. DFMKLS also relaxes its objective by introducing the idea of fuzzy clustering, which calculates the probability that a data point belongs to each cluster. Considering multi-view K-means mainly focuses on the global information of the data, to efficiently use the local information, we integrate the local structure preserving into the framework of DFMKLS. The effectiveness of DFMKLS is evaluated on benchmark multi-view datasets. It obtains superior performances than state-of-the-art multi-view clustering methods, including multi-view K-means. \ No newline at end of file diff --git a/data/2024/aaai/Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser b/data/2024/aaai/Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser new file mode 100644 index 0000000000..c88d1c9f47 --- /dev/null +++ b/data/2024/aaai/Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser @@ -0,0 +1 @@ +Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing the 3D joint coordinates from the 2D pose sequence. 
Although some methods decompose the task into bone length and bone direction prediction based on the human anatomical skeleton to explicitly incorporate more human body prior constraints, the performance of these methods is significantly lower than that of the SOTA diffusion-based methods. This can be attributed to the tree structure of the human skeleton. Direct application of the disentangled method could amplify the accumulation of hierarchical errors, propagating through each hierarchy. Meanwhile, the hierarchical information has not been fully explored by the previous methods. To address these problems, a Disentangled Diffusion-based 3D human Pose Estimation method with Hierarchical Spatial and Temporal Denoiser is proposed, termed DDHPose. In our approach: (1) We disentangle the 3d pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior. A disentanglement loss is proposed to supervise diffusion model learning. (2) For the reverse process, we propose Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modelling of each joint. Our HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT). HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT utilizes information from both the joint and its hierarchical adjacent joints to explore the hierarchical temporal correlations among joints. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the SOTA disentangled-based, non-disentangled based, and probabilistic approaches by 10.0%, 2.0%, and 1.3%, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Disentangled Partial Label Learning b/data/2024/aaai/Disentangled Partial Label Learning new file mode 100644 index 0000000000..93615c4035 --- /dev/null +++ b/data/2024/aaai/Disentangled Partial Label Learning @@ -0,0 +1 @@ +Partial label learning (PLL) induces a multi-class classifier from training examples each associated with a set of candidate labels, among which only one is valid. The formation of real-world data typically arises from heterogeneous entanglement of series latent explanatory factors, which are considered intrinsic properties for discriminating between different patterns. Though learning disentangled representation is expected to facilitate label disambiguation for partial-label (PL) examples, few existing works were dedicated to addressing this issue. In this paper, we make the first attempt towards disentangled PLL and propose a novel approach named TERIAL, which makes predictions according to derived disentangled representation of instances and label embeddings. The TERIAL approach formulates the PL examples as an undirected bipartite graph where instances are only connected with their candidate labels, and employs a tailored neighborhood routing mechanism to yield disentangled representation of nodes in the graph. Specifically, the proposed routing mechanism progressively infers the explanatory factors that contribute to the edge between adjacent nodes and augments the representation of the central node with factor-aware embedding information propagated from specific neighbors simultaneously via iteratively analyzing the promising subspace clusters formed by the node and its neighbors. 
The estimated labeling confidence matrix is also introduced to accommodate unreliable links owing to the inherent ambiguity of PLL. Moreover, we theoretically prove that the neighborhood routing mechanism will converge to the point estimate that maximizes the marginal likelihood of observed PL training examples. Comprehensive experiments over various datasets demonstrate that our approach outperforms state-of-the-art counterparts. \ No newline at end of file diff --git a/data/2024/aaai/Disentanglement-Guided Spatial-Temporal Graph Neural Network for Metro Flow Forecasting (Student Abstract) b/data/2024/aaai/Disentanglement-Guided Spatial-Temporal Graph Neural Network for Metro Flow Forecasting (Student Abstract) new file mode 100644 index 0000000000..8a29ed6435 --- /dev/null +++ b/data/2024/aaai/Disentanglement-Guided Spatial-Temporal Graph Neural Network for Metro Flow Forecasting (Student Abstract) @@ -0,0 +1 @@ +In recent intelligent transportation applications, metro flow forecasting has received much attention from researchers. Most prior works endeavor to explore spatial or temporal dependencies while ignoring the key characteristic patterns underlying historical flows, e.g., trend and periodicity. Although multiple-granularity distillation or spatial dependency correlation can promote flow estimation, the potential noise and spatial dynamics remain under-explored. To this end, we propose a novel Disentanglement-Guided Spatial-Temporal Graph Neural Network or DGST to address the above concerns. It contains a Disentanglement Pre-training procedure for characteristic pattern disentanglement learning, a Characteristic Pattern Prediction for different future characteristic explorations, and a Spatial-Temporal Correlation for spatial-temporal dynamic learning. Experiments on a real-world dataset demonstrate the superiority of our DGST. \ No newline at end of file diff --git a/data/2024/aaai/Disjoint Partial Enumeration without Blocking Clauses b/data/2024/aaai/Disjoint Partial Enumeration without Blocking Clauses new file mode 100644 index 0000000000..aca11e8da4 --- /dev/null +++ b/data/2024/aaai/Disjoint Partial Enumeration without Blocking Clauses @@ -0,0 +1,2 @@ +A basic algorithm for enumerating disjoint propositional models (disjoint AllSAT) is based on adding blocking clauses incrementally, ruling out previously found models. On the one hand, blocking clauses have the potential to reduce the number of generated models exponentially, as they can handle partial models. On the other hand, the introduction of a large number of blocking clauses affects memory consumption and drastically slows down unit propagation. + We propose a new approach that allows for enumerating disjoint partial models with no need for blocking clauses by integrating: Conflict-Driven Clause-Learning (CDCL), Chronological Backtracking (CB), and methods for shrinking models (Implicant Shrinking). Experiments clearly show the benefits of our novel approach. 
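For reference, the blocking-clause baseline that the Disjoint Partial Enumeration abstract above improves upon can be sketched in a few lines. This is only an illustration of that baseline, not the paper's CDCL/Chronological-Backtracking/Implicant-Shrinking method; it assumes the python-sat (pysat) package, and the solver choice and toy formula are hypothetical.

# Sketch of the blocking-clause AllSAT baseline (assumes: pip install python-sat).
# Each found model is ruled out by adding its negation as a blocking clause,
# which is exactly the clause growth the paper's approach avoids.
from pysat.solvers import Glucose3

def enumerate_models(cnf_clauses):
    models = []
    with Glucose3(bootstrap_with=cnf_clauses) as solver:
        while solver.solve():
            model = solver.get_model()                   # total assignment, e.g. [1, -2, 3]
            models.append(model)
            solver.add_clause([-lit for lit in model])   # block this exact model
    return models

# Toy formula: (x1 or x2) and (not x1 or x3)
print(enumerate_models([[1, 2], [-1, 3]]))

Because every blocking clause negates a full model, the clause database grows with the number of models found; this is the memory and unit-propagation cost that motivates enumerating disjoint partial models without blocking clauses.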
\ No newline at end of file diff --git a/data/2024/aaai/Dissenting Explanations: Leveraging Disagreement to Reduce Model Overreliance b/data/2024/aaai/Dissenting Explanations: Leveraging Disagreement to Reduce Model Overreliance new file mode 100644 index 0000000000..2989cc231a --- /dev/null +++ b/data/2024/aaai/Dissenting Explanations: Leveraging Disagreement to Reduce Model Overreliance @@ -0,0 +1 @@ +While modern explanation methods have been shown to be inconsistent and contradictory, the explainability of black-box models nevertheless remains desirable. When the role of explanations extends from understanding models to aiding decision making, the semantics of explanations is not always fully understood – to what extent do explanations ``explain” a decision and to what extent do they merely advocate for a decision? Can we help humans gain insights from explanations accompanying correct predictions and not over-rely on incorrect predictions advocated for by explanations? With this perspective in mind, we introduce the notion of dissenting explanations: conflicting predictions with accompanying explanations. We first explore the advantage of dissenting explanations in the setting of model multiplicity, where multiple models with similar performance may have different predictions. Through a human study on the task of identifying deceptive reviews, we demonstrate that dissenting explanations reduce overreliance on model predictions, without reducing overall accuracy. Motivated by the utility of dissenting explanations we present both global and local methods for their generation. \ No newline at end of file diff --git a/data/2024/aaai/DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition b/data/2024/aaai/DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition new file mode 100644 index 0000000000..45f44d270f --- /dev/null +++ b/data/2024/aaai/DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition @@ -0,0 +1 @@ +The utilization of multi-modal sensor data in visual place recognition (VPR) has demonstrated enhanced performance compared to single-modal counterparts. Nonetheless, integrating additional sensors comes with elevated costs and may not be feasible for systems that demand lightweight operation, thereby impacting the practical deployment of VPR. To address this issue, we resort to knowledge distillation, which empowers single-modal students to learn from cross-modal teachers without introducing additional sensors during inference. Despite the notable advancements achieved by current distillation approaches, the exploration of feature relationships remains an under-explored area. In order to tackle the challenge of cross-modal distillation in VPR, we present DistilVPR, a novel distillation pipeline for VPR. We propose leveraging feature relationships from multiple agents, including self-agents and cross-agents for teacher and student neural networks. Furthermore, we integrate various manifolds, characterized by different space curvatures for exploring feature relationships. This approach enhances the diversity of feature relationships, including Euclidean, spherical, and hyperbolic relationship modules, thereby enhancing the overall representational capacity. The experiments demonstrate that our proposed pipeline achieves state-of-the-art performance compared to other distillation baselines. We also conduct necessary ablation studies to show design effectiveness. 
The code is released at: https://github.com/sijieaaa/DistilVPR \ No newline at end of file diff --git a/data/2024/aaai/Distilling Autoregressive Models to Obtain High-Performance Non-autoregressive Solvers for Vehicle Routing Problems with Faster Inference Speed b/data/2024/aaai/Distilling Autoregressive Models to Obtain High-Performance Non-autoregressive Solvers for Vehicle Routing Problems with Faster Inference Speed new file mode 100644 index 0000000000..496f602a52 --- /dev/null +++ b/data/2024/aaai/Distilling Autoregressive Models to Obtain High-Performance Non-autoregressive Solvers for Vehicle Routing Problems with Faster Inference Speed @@ -0,0 +1 @@ +Neural construction models have shown promising performance for Vehicle Routing Problems (VRPs) by adopting either the Autoregressive (AR) or Non-Autoregressive (NAR) learning approach. While AR models produce high-quality solutions, they generally have a high inference latency due to their sequential generation nature. Conversely, NAR models generate solutions in parallel with a low inference latency but generally exhibit inferior performance. In this paper, we propose a generic Guided Non-Autoregressive Knowledge Distillation (GNARKD) method to obtain high-performance NAR models having a low inference latency. GNARKD removes the constraint of sequential generation in AR models while preserving the learned pivotal components in the network architecture to obtain the corresponding NAR models through knowledge distillation. We evaluate GNARKD by applying it to three widely adopted AR models to obtain NAR VRP solvers for both synthesized and real-world instances. The experimental results demonstrate that GNARKD significantly reduces the inference time (4-5 times faster) with an acceptable performance drop (2-3%). To the best of our knowledge, this study is the first of its kind to obtain NAR VRP solvers from AR ones through knowledge distillation. \ No newline at end of file diff --git a/data/2024/aaai/Distilling Reliable Knowledge for Instance-Dependent Partial Label Learning b/data/2024/aaai/Distilling Reliable Knowledge for Instance-Dependent Partial Label Learning new file mode 100644 index 0000000000..19c80b627e --- /dev/null +++ b/data/2024/aaai/Distilling Reliable Knowledge for Instance-Dependent Partial Label Learning @@ -0,0 +1 @@ +Partial label learning (PLL) refers to the classification task where each training instance is ambiguously annotated with a set of candidate labels. Despite substantial advancements in tackling this challenge, limited attention has been devoted to a more specific and realistic setting, denoted as instance-dependent partial label learning (IDPLL). Within this context, the assignment of partial labels depends on the distinct features of individual instances, rather than being random. In this paper, we initiate an exploration into a self-distillation framework for this problem, driven by the proven effectiveness and stability of this framework. Nonetheless, a crucial shortfall is identified: the foundational assumption central to IDPLL, involving what we term partial label knowledge, which stipulates that candidate labels should exhibit superior confidence compared to non-candidates, is not fully upheld within the distillation process. To address this challenge, we introduce DIRK, a novel distillation approach that leverages a rectification process to DIstill Reliable Knowledge, while concurrently preserving informative fine-grained label confidence. 
In addition, to harness the rectified confidence to its fullest potential, we propose a knowledge-based representation refinement module, seamlessly integrated into the DIRK framework. This module effectively transmits the essence of similarity knowledge from the label space to the feature space, thereby amplifying representation learning and subsequently engendering marked improvements in model performance. Experiments and analysis on multiple datasets validate the rationality and superiority of our proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Distributed Manifold Hashing for Image Set Classification and Retrieval b/data/2024/aaai/Distributed Manifold Hashing for Image Set Classification and Retrieval new file mode 100644 index 0000000000..f87cebcf69 --- /dev/null +++ b/data/2024/aaai/Distributed Manifold Hashing for Image Set Classification and Retrieval @@ -0,0 +1 @@ +Conventional image set methods typically learn from image sets stored in one location. However, in real-world applications, image sets are often distributed or collected across different positions. Learning from such distributed image sets presents a challenge that has not been studied thus far. Moreover, efficiency is seldom addressed in large-scale image set applications. To fulfill these gaps, this paper proposes Distributed Manifold Hashing (DMH), which models distributed image sets as a connected graph. DMH employs Riemannian manifold to effectively represent each image set and further suggests learning hash code for each image set to achieve efficient computation and storage. DMH is formally formulated as a distributed learning problem with local consistency constraint on global variables among neighbor nodes, and can be optimized in parallel. Extensive experiments on three benchmark datasets demonstrate that DMH achieves highly competitive accuracies in a distributed setting and provides faster classification and retrieval than state-of-the-arts. \ No newline at end of file diff --git a/data/2024/aaai/Distribution Matching for Multi-Task Learning of Classification Tasks: A Large-Scale Study on Faces & Beyond b/data/2024/aaai/Distribution Matching for Multi-Task Learning of Classification Tasks: A Large-Scale Study on Faces & Beyond new file mode 100644 index 0000000000..a87eeecb7d --- /dev/null +++ b/data/2024/aaai/Distribution Matching for Multi-Task Learning of Classification Tasks: A Large-Scale Study on Faces & Beyond @@ -0,0 +1 @@ +Multi-Task Learning (MTL) is a framework, where multiple related tasks are learned jointly and benefit from a shared representation space, or parameter transfer. To provide sufficient learning support, modern MTL uses annotated data with full, or sufficiently large overlap across tasks, i.e., each input sample is annotated for all, or most of the tasks. However, collecting such annotations is prohibitive in many real applications, and cannot benefit from datasets available for individual tasks. In this work, we challenge this setup and show that MTL can be successful with classification tasks with little, or non-overlapping annotations, or when there is big discrepancy in the size of labeled data per task. We explore task-relatedness for co-annotation and co-training, and propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching. 
To demonstrate the general applicability of our method, we conducted diverse case studies in the domains of affective computing, face recognition, species recognition, and shopping item classification using nine datasets. Our large-scale study of affective tasks for basic expression recognition and facial action unit detection illustrates that our approach is network agnostic and brings large performance improvements compared to the state-of-the-art in both tasks and across all studied databases. In all case studies, we show that co-training via task-relatedness is advantageous and prevents negative transfer (which occurs when MT model's performance is worse than that of at least one single-task model). \ No newline at end of file diff --git a/data/2024/aaai/Distribution-Conditioned Adversarial Variational Autoencoder for Valid Instrumental Variable Generation b/data/2024/aaai/Distribution-Conditioned Adversarial Variational Autoencoder for Valid Instrumental Variable Generation new file mode 100644 index 0000000000..8d00a718ab --- /dev/null +++ b/data/2024/aaai/Distribution-Conditioned Adversarial Variational Autoencoder for Valid Instrumental Variable Generation @@ -0,0 +1 @@ +Instrumental variables (IVs), widely applied in economics and healthcare, enable consistent counterfactual prediction in the presence of hidden confounding factors, effectively addressing endogeneity issues. The prevailing IV-based counterfactual prediction methods typically rely on the availability of valid IVs (satisfying Relevance, Exclusivity, and Exogeneity), a requirement which often proves elusive in real-world scenarios. Various data-driven techniques are being developed to create valid IVs (or representations of IVs) from a pool of IV candidates. However, most of these techniques still necessitate the inclusion of valid IVs within the set of candidates. This paper proposes a distribution-conditioned adversarial variational autoencoder to tackle this challenge. Specifically: 1) for Relevance and Exclusivity, we deduce the corresponding evidence lower bound following the Bayesian network structure and build the variational autoencoder; accordingly, 2) for Exogeneity , we design an adversarial game to encourage latent factors originating from the marginal distribution, compelling the independence between IVs and other outcome-related factors. Extensive experimental results validate the effectiveness, stability and generality of our proposed model in generating valid IV factors in the absence of valid IV candidates. \ No newline at end of file diff --git a/data/2024/aaai/Distributional Off-Policy Evaluation for Slate Recommendations b/data/2024/aaai/Distributional Off-Policy Evaluation for Slate Recommendations new file mode 100644 index 0000000000..c5044a1be2 --- /dev/null +++ b/data/2024/aaai/Distributional Off-Policy Evaluation for Slate Recommendations @@ -0,0 +1 @@ +Recommendation strategies are typically evaluated by using previously logged data, employing off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but the estimation of the entire performance distribution remains elusive. 
Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along the axes of risk and fairness that employ metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures. \ No newline at end of file diff --git a/data/2024/aaai/Divergence-Guided Simultaneous Speech Translation b/data/2024/aaai/Divergence-Guided Simultaneous Speech Translation new file mode 100644 index 0000000000..e9dc53bc4c --- /dev/null +++ b/data/2024/aaai/Divergence-Guided Simultaneous Speech Translation @@ -0,0 +1 @@ +To achieve high-quality translation with low latency, a Simultaneous Speech Translation (SimulST) system relies on a policy module to decide whether to translate immediately or wait for additional streaming input, along with a translation model capable of effectively handling partial speech input. Prior research has tackled these components separately, either using ``wait-k'' policies based on fixed-length segments or detected word boundaries, or dynamic policies based on different strategies (e.g., meaningful units), while employing offline models for prefix-to-prefix translation. In this paper, we propose Divergence-Guided Simultaneous Speech Translation (DiG-SST), a tightly integrated approach focusing on both translation quality and latency for streaming input. Specifically, we introduce a simple yet effective prefix-based strategy for training translation models with partial speech input, and develop an adaptive policy that makes read/write decisions for the translation model based on the expected divergence in translation distributions resulting from future input. Our experiments on multiple translation directions of the MuST-C benchmark demonstrate that our approach achieves a better trade-off between translation quality and latency compared to existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Diverse Person: Customize Your Own Dataset for Text-Based Person Search b/data/2024/aaai/Diverse Person: Customize Your Own Dataset for Text-Based Person Search new file mode 100644 index 0000000000..0d86a99e3d --- /dev/null +++ b/data/2024/aaai/Diverse Person: Customize Your Own Dataset for Text-Based Person Search @@ -0,0 +1 @@ +Text-based person search is a challenging task aimed at locating specific target pedestrians through text descriptions. Recent advancements have been made in this field, but there remains a deficiency in datasets tailored for text-based person search. The creation of new, real-world datasets is hindered by concerns such as the risk of pedestrian privacy leakage and the substantial costs of annotation. In this paper, we introduce a framework, named Diverse Person (DP), to achieve efficient and high-quality text-based person search data generation without involving privacy concerns. 
Specifically, we propose to leverage available images of clothing and accessories as reference attribute images to edit the original dataset images through diffusion models. Additionally, we employ a Large Language Model (LLM) to produce annotations that are both high in quality and stylistically consistent with those found in real-world datasets. Extensive experimental results demonstrate that the baseline models trained with our DP can achieve new state-of-the-art results on three public datasets, with performance improvements up to 4.82%, 2.15%, and 2.28% on CUHK-PEDES, ICFG-PEDES, and RSTPReid in terms of Rank-1 accuracy, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Diverse Yet Biased: Towards Mitigating Biases in Generative AI (Student Abstract) b/data/2024/aaai/Diverse Yet Biased: Towards Mitigating Biases in Generative AI (Student Abstract) new file mode 100644 index 0000000000..f5510e2f5b --- /dev/null +++ b/data/2024/aaai/Diverse Yet Biased: Towards Mitigating Biases in Generative AI (Student Abstract) @@ -0,0 +1 @@ +Generative Artificial Intelligence (AI) has garnered significant attention for its remarkable ability to generate text, images, and other forms of content. However, an inherent and increasingly concerning issue within generative AI systems is bias. These AI models often exhibit an Anglo-centric bias and tend to overlook the importance of diversity. This can be attributed to their training on extensive datasets sourced from the internet, which inevitably inherit the biases present in those data sources. Employing these datasets leads to AI-generated content that mirrors and perpetuates existing biases, encompassing various aspects such as gender, ethnic and cultural stereotypes. Addressing bias in generative AI is a complex challenge that necessitates substantial efforts. In order to tackle this issue, we propose a methodology for constructing moderately sized datasets with a social inclination. These datasets can be employed to rectify existing imbalances in datasets or to train models to generate socially inclusive material. Additionally, we present preliminary findings derived from training our model on these socially inclined datasets. \ No newline at end of file diff --git a/data/2024/aaai/Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation b/data/2024/aaai/Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation new file mode 100644 index 0000000000..69892d99fe --- /dev/null +++ b/data/2024/aaai/Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation @@ -0,0 +1 @@ +We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. 
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/. \ No newline at end of file diff --git a/data/2024/aaai/Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration b/data/2024/aaai/Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration new file mode 100644 index 0000000000..f73d824ccf --- /dev/null +++ b/data/2024/aaai/Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration @@ -0,0 +1 @@ +In recent years, following the success of text guided image generation, text guided 3D generation has gained increasing attention among researchers. Dreamfusion is a notable approach that enhances generation quality by utilizing 2D text guided diffusion models and introducing SDS loss, a technique for distilling 2D diffusion model information to train 3D models. However, the SDS loss has two major limitations that hinder its effectiveness. Firstly, when given a text prompt, the SDS loss struggles to produce diverse content. Secondly, during training, SDS loss may cause the generated content to overfit and collapse, limiting the model's ability to learn intricate texture details. To overcome these challenges, we propose a novel approach called Noise Recalibration algorithm. By incorporating this technique, we can generate 3D content with significantly greater diversity and stunning details. Our approach offers a promising solution to the limitations of SDS loss. \ No newline at end of file diff --git a/data/2024/aaai/Diversity-Authenticity Co-constrained Stylization for Federated Domain Generalization in Person Re-identification b/data/2024/aaai/Diversity-Authenticity Co-constrained Stylization for Federated Domain Generalization in Person Re-identification new file mode 100644 index 0000000000..5408d81008 --- /dev/null +++ b/data/2024/aaai/Diversity-Authenticity Co-constrained Stylization for Federated Domain Generalization in Person Re-identification @@ -0,0 +1 @@ +This paper tackles the problem of federated domain generalization in person re-identification (FedDG re-ID), aiming to learn a model generalizable to unseen domains with decentralized source domains. Previous methods mainly focus on preventing local overfitting. However, the direction of diversifying local data through stylization for model training is largely overlooked. This direction is popular in domain generalization but will encounter two issues under federated scenario: (1) Most stylization methods require the centralization of multiple domains to generate novel styles but this is not applicable under decentralized constraint. (2) The authenticity of generated data cannot be ensured especially given limited local data, which may impair the model optimization. 
To solve these two problems, we propose the Diversity-Authenticity Co-constrained Stylization (DACS), which can generate diverse and authentic data for learning robust local model. Specifically, we deploy a style transformation model on each domain to generate novel data with two constraints: (1) A diversity constraint is designed to increase data diversity, which enlarges the Wasserstein distance between the original and transformed data; (2) An authenticity constraint is proposed to ensure data authenticity, which enforces the transformed data to be easily/hardly recognized by the local-side global/local model. Extensive experiments demonstrate the effectiveness of the proposed DACS and show that DACS achieves state-of-the-art performance for FedDG re-ID. \ No newline at end of file diff --git a/data/2024/aaai/Divide and Conquer: Hybrid Pre-training for Person Search b/data/2024/aaai/Divide and Conquer: Hybrid Pre-training for Person Search new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Divide-and-Aggregate Learning for Evaluating Performance on Unlabeled Data b/data/2024/aaai/Divide-and-Aggregate Learning for Evaluating Performance on Unlabeled Data new file mode 100644 index 0000000000..8482afa0a5 --- /dev/null +++ b/data/2024/aaai/Divide-and-Aggregate Learning for Evaluating Performance on Unlabeled Data @@ -0,0 +1 @@ +Artificial Intelligence (AI) models have become an integral part of modern society, significantly improving human lives. However, ensuring the reliability and safety of these models is of paramount importance. One critical aspect is the continuous monitoring and verification of model performance to prevent any potential risks. Real-time online evaluation of AI models is necessary to maintain their effectiveness and mitigate any harm caused by performance degradation. The traditional approach to model evaluation involves supervised methods that rely on manual labeling to compare results with model predictions. Unfortunately, this method is not suitable for online model monitoring due to its inherent lag and high cost. While there have been attempts to explore free-label model evaluation, these approaches often consider only the global features of the entire dataset. Additionally, they can only perform model evaluation based on a single dimension of model confidence or features. In this paper, we propose a novel approach called Divide-and-Aggregate Learning (DAL) for unsupervised model evaluation. Our method addresses the limitations of previous approaches by dividing the output of the model into buckets, capturing local information of the distribution. We then aggregate this local information to obtain global information and further represent the relationship between the distribution and model performance. Importantly, our method can simultaneously handle the confidence distribution and feature distribution of the model output. Extensive experiments have been conducted to demonstrate the effectiveness of our DAL model. The results show that our approach outperforms previous methods on four widely used datasets. We will make our source code publicly available. 
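To make the divide-and-aggregate idea in the abstract above concrete, a rough sketch is given below. It is not the authors' DAL implementation; the use of max-softmax confidence, the bucket count, and the linear regressor are all assumptions for illustration, and the actual method also handles feature distributions alongside confidences.

# Illustrative divide-and-aggregate sketch: bucket confidences (local info),
# aggregate the normalized histogram (global info), and regress to accuracy.
import numpy as np
from sklearn.linear_model import LinearRegression

def divide_and_aggregate(probs, n_buckets=10):
    # probs: (N, C) softmax outputs on an unlabeled set -> (n_buckets,) summary vector
    confidences = probs.max(axis=1)
    hist, _ = np.histogram(confidences, bins=n_buckets, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def fit_evaluator(meta_sets):
    # meta_sets: list of (softmax outputs on a labeled set, measured accuracy) pairs
    X = np.stack([divide_and_aggregate(p) for p, _ in meta_sets])
    y = np.array([acc for _, acc in meta_sets])
    return LinearRegression().fit(X, y)

def estimate_accuracy(evaluator, probs_unlabeled):
    return float(evaluator.predict(divide_and_aggregate(probs_unlabeled)[None, :])[0])

The bucketing step captures local structure of the output distribution, and the learned regressor plays the role of the relationship between that distribution and model performance described in the abstract.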
\ No newline at end of file diff --git a/data/2024/aaai/DocFormerv2: Local Features for Document Understanding b/data/2024/aaai/DocFormerv2: Local Features for Document Understanding new file mode 100644 index 0000000000..80ef3bdd0b --- /dev/null +++ b/data/2024/aaai/DocFormerv2: Local Features for Document Understanding @@ -0,0 +1 @@ +We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions), e.g., extracting information from a form, VQA for documents, and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer which takes vision, language and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically, i.e., two novel document tasks on the encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed to ensure that the pre-training encourages local-feature alignment between multiple modalities. When evaluated on nine challenging datasets, DocFormerv2 shows state-of-the-art performance over strong baselines on all of them, e.g., TabFact (+4.3%), InfoVQA (+1.4%), FUNSD (+1.0%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene-text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLI and Flamingo) on these tasks. Extensive ablations show that due to its novel pre-training tasks, DocFormerv2 understands multiple modalities better than prior art in VDU. \ No newline at end of file diff --git a/data/2024/aaai/DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding b/data/2024/aaai/DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding new file mode 100644 index 0000000000..c9f0348aa7 --- /dev/null +++ b/data/2024/aaai/DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding @@ -0,0 +1,10 @@ +Multimodal Sarcasm Understanding (MSU) has a wide range of applications in the news field, such as public opinion analysis and forgery detection. +However, existing MSU benchmarks and approaches usually focus on sentence-level MSU. +In document-level news, sarcasm clues are sparse or small and are often concealed in long text. +Moreover, compared to sentence-level comments like tweets, which mainly focus on only a few trends or hot topics (e.g., sports events), content in the news is considerably diverse. +Models created for sentence-level MSU may fail to capture sarcasm clues in document-level news. +To fill this gap, we present a comprehensive benchmark for Document-level Multimodal Sarcasm Understanding (DocMSU). +Our dataset contains 102,588 pieces of news with text-image pairs, covering 9 diverse topics such as health, business, etc. +The proposed large-scale and diverse DocMSU significantly facilitates the research of document-level MSU in real-world scenarios. +To take on the new challenges posed by DocMSU, we introduce a fine-grained sarcasm comprehension method to properly align the pixel-level image features with word-level textual features in documents. +Experiments demonstrate the effectiveness of our method, showing that it can serve as a baseline approach to the challenging DocMSU. 
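The fine-grained alignment step mentioned in the DocMSU abstract above can be pictured as word-to-pixel cross-attention. The sketch below is a generic illustration of that idea rather than the authors' architecture; the feature dimension, the number of heads, and the single attention layer are assumptions.

# Generic word-to-pixel cross-attention sketch in PyTorch (dimensions are assumptions).
import torch
import torch.nn as nn

class WordPixelAligner(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, word_feats, pixel_feats):
        # word_feats:  (B, T, dim)    word-level textual features
        # pixel_feats: (B, H*W, dim)  flattened pixel-level image features
        aligned, weights = self.attn(query=word_feats, key=pixel_feats, value=pixel_feats)
        return aligned, weights      # each word attends to the image regions it aligns with

aligner = WordPixelAligner()
aligned, weights = aligner(torch.randn(2, 50, 256), torch.randn(2, 196, 256))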
\ No newline at end of file diff --git a/data/2024/aaai/DocNLC: A Document Image Enhancement Framework with Normalized and Latent Contrastive Representation for Multiple Degradations b/data/2024/aaai/DocNLC: A Document Image Enhancement Framework with Normalized and Latent Contrastive Representation for Multiple Degradations new file mode 100644 index 0000000000..430a287baf --- /dev/null +++ b/data/2024/aaai/DocNLC: A Document Image Enhancement Framework with Normalized and Latent Contrastive Representation for Multiple Degradations @@ -0,0 +1,2 @@ +Document Image Enhancement (DIE) remains challenging due to the prevalence of multiple degradations in document images captured by cameras. In this paper, we address an interesting question: can the performance of pre-trained models and downstream DIE models be improved if they are bootstrapped using different degradation types of the same semantic samples and their high-dimensional features with ambiguous inter-class distance? To this end, we propose an effective contrastive learning paradigm for DIE — a Document image enhancement framework with Normalization and Latent Contrast (DocNLC). While existing DIE methods focus on eliminating one type of degradation, DocNLC considers the relationship between different types of degradation while utilizing both direct and latent contrasts to constrain content consistency, thus achieving a unified treatment of multiple types of degradation. Specifically, we devise a latent contrastive learning module to enforce explicit decorrelation of the normalized representations of different degradation types and to minimize the redundancy between them. Comprehensive experiments show that our method outperforms state-of-the-art DIE models in both pre-training and fine-tuning stages +on four publicly available independent datasets. In addition, we discuss the potential benefits of DocNLC for downstream tasks. Our code is released at https://github.com/RylonW/DocNLC \ No newline at end of file diff --git a/data/2024/aaai/Does Any AI-Based Activity Contribute to Develop AI Conception? A Case Study with Italian Fifth and Sixth Grade Classes b/data/2024/aaai/Does Any AI-Based Activity Contribute to Develop AI Conception? A Case Study with Italian Fifth and Sixth Grade Classes new file mode 100644 index 0000000000..56090f9af5 --- /dev/null +++ b/data/2024/aaai/Does Any AI-Based Activity Contribute to Develop AI Conception? A Case Study with Italian Fifth and Sixth Grade Classes @@ -0,0 +1,13 @@ +Artificial Intelligence is undoubtedly becoming pervasive in everyone's everyday life. +In this setting, developing a correct conception of AI from childhood is not only a need to +be addressed in educational curricula, but also a children's right. + +Accordingly, several initiatives at national and international levels aim at promoting AI +and emerging technology literacy, supported also by a proliferation in the literature +of learning courses covering a variety of topics, learning objectives and targeted ages. +Schools are therefore pushed to introduce innovative activities for children in their +curricula. + +In this paper, we report the results of a case study where we tested the contribution +of an AI block-based course to developing computational thinking and an understanding of human +and AI minds in fifth and sixth grade children. \ No newline at end of file diff --git a/data/2024/aaai/Does Few-Shot Learning Suffer from Backdoor Attacks? b/data/2024/aaai/Does Few-Shot Learning Suffer from Backdoor Attacks? 
new file mode 100644 index 0000000000..adebc0cf1e --- /dev/null +++ b/data/2024/aaai/Does Few-Shot Learning Suffer from Backdoor Attacks? @@ -0,0 +1 @@ +The field of few-shot learning (FSL) has shown promising results in scenarios where training data is limited, but its vulnerability to backdoor attacks remains largely unexplored. We explore this topic by first evaluating the performance of existing backdoor attack methods in few-shot learning scenarios. Unlike in standard supervised learning, existing backdoor attack methods fail to perform an effective attack in FSL due to two main issues. Firstly, the model tends to overfit to either benign features or trigger features, causing a tough trade-off between attack success rate and benign accuracy. Secondly, due to the small number of training samples, the dirty label or visible trigger in the support set can be easily detected by victims, which reduces the stealthiness of attacks. It might seem that FSL can survive backdoor attacks. However, in this paper, we propose the Few-shot Learning Backdoor Attack (FLBA) to show that FSL can still be vulnerable to backdoor attacks. Specifically, we first generate a trigger to maximize the gap between poisoned and benign features. It enables the model to learn both benign and trigger features, which solves the problem of overfitting. To make it more stealthy, we hide the trigger by optimizing two types of imperceptible perturbation, namely attractive and repulsive perturbation, instead of attaching the trigger directly. Once we obtain the perturbations, we can poison all samples in the benign support set into a hidden poisoned support set and fine-tune the model on it. Our method demonstrates a high Attack Success Rate (ASR) in FSL tasks with different few-shot learning paradigms while preserving clean accuracy and maintaining stealthiness. This study reveals that few-shot learning still suffers from backdoor attacks, and its security should be given attention. \ No newline at end of file diff --git a/data/2024/aaai/Does Robin Hood Use a Lightsaber?: Automated Planning for Storytelling b/data/2024/aaai/Does Robin Hood Use a Lightsaber?: Automated Planning for Storytelling new file mode 100644 index 0000000000..db76a6e7b5 --- /dev/null +++ b/data/2024/aaai/Does Robin Hood Use a Lightsaber?: Automated Planning for Storytelling @@ -0,0 +1 @@ +Humans have been using stories to entertain, educate, and persuade audiences for centuries. The advent of modern AI tools in the form of Large Language Models (LLMs) such as ChatGPT continues to fulfill this purpose. However, while recent work has shown that LLMs can successfully be used for narrative generation, they lack coherence and can be prone to repetition and stilted language. Automated Planning can therefore be combined with Natural Language text generation to create narratives (stories) that are logical, coherent, and believable. A planning model provides scaffolding to an LLM so that the LLM's language generation is context-dependent, in order to allow users to create more coherent, logical, and believable stories in a variety of domains. 
\ No newline at end of file diff --git a/data/2024/aaai/Domain Engineering to Represent Human Behavior Using Multi-Agent Planning and Inductive Methodologies b/data/2024/aaai/Domain Engineering to Represent Human Behavior Using Multi-Agent Planning and Inductive Methodologies new file mode 100644 index 0000000000..b7f4e9912e --- /dev/null +++ b/data/2024/aaai/Domain Engineering to Represent Human Behavior Using Multi-Agent Planning and Inductive Methodologies @@ -0,0 +1 @@ +This research combines multi agent planning, the psycholinguistics of question asking, procedural grounded theory, and hierarchical task networks to represent domains for automated planning. \ No newline at end of file diff --git a/data/2024/aaai/Domain Generalizable Person Search Using Unreal Dataset b/data/2024/aaai/Domain Generalizable Person Search Using Unreal Dataset new file mode 100644 index 0000000000..2ab3c9d28b --- /dev/null +++ b/data/2024/aaai/Domain Generalizable Person Search Using Unreal Dataset @@ -0,0 +1,6 @@ +Collecting and labeling real datasets to train the person search networks not only requires a lot of time and effort, but also accompanies privacy issues. +The weakly-supervised and unsupervised domain adaptation methods have been proposed to alleviate the labeling burden for target datasets, however, their generalization capability is limited. +We introduce a novel person search method based on the domain generalization framework, that uses an automatically labeled unreal dataset only for training but is applicable to arbitrary unseen real datasets. +To alleviate the domain gaps when transferring the knowledge from the unreal source dataset to the real target datasets, we estimate the fidelity of person instances which is then used to train the end-to-end network adaptively. +Moreover, we devise a domain-invariant feature learning scheme to encourage the network to suppress the domain-related features. +Experimental results demonstrate that the proposed method provides the competitive performance to existing person search methods even though it is applicable to arbitrary unseen datasets without any prior knowledge and re-training burdens. \ No newline at end of file diff --git a/data/2024/aaai/Domain Generalization with Vital Phase Augmentation b/data/2024/aaai/Domain Generalization with Vital Phase Augmentation new file mode 100644 index 0000000000..38843ff516 --- /dev/null +++ b/data/2024/aaai/Domain Generalization with Vital Phase Augmentation @@ -0,0 +1 @@ +Deep neural networks have shown remarkable performance in image classification. However, their performance significantly deteriorates with corrupted input data. Domain generalization methods have been proposed to train robust models against out-of-distribution data. Data augmentation in the frequency domain is one of such approaches that enable a model to learn phase features to establish domain-invariant representations. This approach changes the amplitudes of the input data while preserving the phases. However, using fixed phases leads to susceptibility to phase fluctuations because amplitudes and phase fluctuations commonly occur in out-of-distribution. In this study, to address this problem, we introduce an approach using finite variation of the phases of input data rather than maintaining fixed phases. Based on the assumption that the degree of domain-invariant features varies for each phase, we propose a method to distinguish phases based on this degree. 
In addition, we propose a method called vital phase augmentation (VIPAug) that applies the variation to the phases differently according to the degree of domain-invariant features of the given phases. The model depends more on the vital phases that contain more domain-invariant features to attain robustness to amplitude and phase fluctuations. We present experimental evaluations of our proposed approach, which exhibited improved performance for both clean and corrupted data. VIPAug achieved SOTA performance on the benchmark CIFAR-10 and CIFAR-100 datasets, as well as near-SOTA performance on the ImageNet-100 and ImageNet datasets. Our code is available at https://github.com/excitedkid/vipaug. \ No newline at end of file diff --git a/data/2024/aaai/Domain Invariant Learning for Gaussian Processes and Bayesian Exploration b/data/2024/aaai/Domain Invariant Learning for Gaussian Processes and Bayesian Exploration new file mode 100644 index 0000000000..0469d6fde5 --- /dev/null +++ b/data/2024/aaai/Domain Invariant Learning for Gaussian Processes and Bayesian Exploration @@ -0,0 +1 @@ +Out-of-distribution (OOD) generalization has long been a challenging problem that remains largely unsolved. Gaussian processes (GP), as popular probabilistic model classes, especially in the small-data regime, are presumed to have strong OOD generalization abilities. Surprisingly, their OOD generalization abilities have been under-explored compared with other lines of GP research. In this paper, we identify that GP is not free from the problem and propose a domain invariant learning algorithm for Gaussian processes (DIL-GP) with a min-max optimization on the likelihood. DIL-GP discovers the heterogeneity in the data and forces invariance across partitioned subsets of data. We further extend DIL-GP to improve Bayesian optimization's adaptability to changing environments. Numerical experiments demonstrate the superiority of DIL-GP for predictions on several synthetic and real-world datasets. We further demonstrate the effectiveness of the DIL-GP Bayesian optimization method on a PID parameter tuning experiment for a quadrotor. The full version and source code are available at: https://github.com/Billzxl/DIL-GP. \ No newline at end of file diff --git a/data/2024/aaai/Domain-Controlled Prompt Learning b/data/2024/aaai/Domain-Controlled Prompt Learning new file mode 100644 index 0000000000..90d46a7a38 --- /dev/null +++ b/data/2024/aaai/Domain-Controlled Prompt Learning @@ -0,0 +1 @@ +Large pre-trained vision-language models, such as CLIP, have shown remarkable generalization capabilities across various tasks when appropriate text prompts are provided. However, adapting these models to specific domains, like remote sensing images (RSIs), medical images, etc., remains unexplored and challenging. Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms, leading to suboptimal performance due to the misinterpretation of specific images in natural image patterns. To tackle this dilemma, we propose Domain-Controlled Prompt Learning for these specific domains. Specifically, a large-scale specific domain foundation model (LSDM) is first introduced to provide essential specific-domain knowledge. Using lightweight neural networks, we transfer this knowledge into domain biases, which control both the visual and language branches to obtain domain-adaptive prompts through direct incorporation.
Simultaneously, to overcome the existing overfitting challenge, we propose a novel noise-adding strategy, without extra trainable parameters, to help the model escape suboptimal solutions through global domain oscillation. Experimental results show our method achieves state-of-the-art performance on specific-domain image recognition datasets. Our code is available at https://github.com/caoql98/DCPL. \ No newline at end of file diff --git a/data/2024/aaai/Domain-Hallucinated Updating for Multi-Domain Face Anti-spoofing b/data/2024/aaai/Domain-Hallucinated Updating for Multi-Domain Face Anti-spoofing new file mode 100644 index 0000000000..2c7d75f93e --- /dev/null +++ b/data/2024/aaai/Domain-Hallucinated Updating for Multi-Domain Face Anti-spoofing @@ -0,0 +1,10 @@ +Multi-Domain Face Anti-Spoofing (MD-FAS) is a practical setting that aims to update models on new domains using only novel data while ensuring that the knowledge acquired from previous domains is not forgotten. +Prior methods utilize the responses from models to represent the previous domain knowledge or map the different domains into separated feature spaces to prevent forgetting. +However, due to domain gaps, the responses of new data are not as accurate as those of previous data. +Also, without the supervision of previous data, separated feature spaces might be destroyed by new domains while updating, leading to catastrophic forgetting. +Inspired by the challenges posed by the lack of previous data, we solve this issue from a new standpoint that generates hallucinated previous data for updating the FAS model. +To this end, we propose a novel Domain-Hallucinated Updating (DHU) framework to facilitate the hallucination of data. +Specifically, a Domain Information Explorer learns representative domain information of the previous domains. +Then, a Domain Information Hallucination module transfers the new domain data to pseudo-previous domain ones. +Moreover, a Hallucinated Features Joint Learning module is proposed to asymmetrically align the new and pseudo-previous data for real samples via dual levels to learn more generalized features, promoting the results on all domains. +Our experimental results and visualizations demonstrate that the proposed method outperforms state-of-the-art competitors in terms of effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Double Auction on Diffusion Network b/data/2024/aaai/Double Auction on Diffusion Network new file mode 100644 index 0000000000..0c733c9e57 --- /dev/null +++ b/data/2024/aaai/Double Auction on Diffusion Network @@ -0,0 +1 @@ +Mechanism design on social networks has attracted extensive attention recently. The goal is to design mechanisms to incentivize participants to invite more participants via their social networks, and the challenge is that the participants are competitors. Various mechanisms have been proposed for single-/multiple-unit auctions, but it has been shown that it is challenging to design such mechanisms for more complex settings. We move this forward to investigate a double auction on a network where each trader (a buyer or a seller) can link to other buyers and sellers. Incentivizing invitation is more difficult than in multi-unit one-sided auctions, because there are two different roles: a buyer (seller) may seem happy to invite a seller (buyer), but the invited seller (buyer) may in turn invite another buyer (seller) to compete with the original buyer (seller).
To combat this, we propose a solution called dynamic trade reduction (DTR), which also guarantees a non-negative revenue for the market owner. Interestingly, our solution is also applicable to the multi-unit one-sided auction when there is only one seller, who links only to buyers on the network. We believe that the principle of our solution has the potential to be extended to design the multi-item one-sided auction. \ No newline at end of file diff --git a/data/2024/aaai/Double Buffers CEM-TD3: More Efficient Evolution and Richer Exploration b/data/2024/aaai/Double Buffers CEM-TD3: More Efficient Evolution and Richer Exploration new file mode 100644 index 0000000000..caeb77c99d --- /dev/null +++ b/data/2024/aaai/Double Buffers CEM-TD3: More Efficient Evolution and Richer Exploration @@ -0,0 +1 @@ +CEM-TD3 is a combination scheme using the simple cross-entropy method (CEM) and Twin Delayed Deep Deterministic policy gradient (TD3), and it achieves a satisfactory trade-off between performance and sample efficiency. However, we find that CEM-TD3 cannot fully address the low efficiency of policy search caused by CEM, and the policy gradient learning introduced by TD3 will weaken the diversity of individuals in the population. In this paper, we propose Double Buffers CEM-TD3 (DBCEM-TD3) that optimizes both CEM and TD3. For CEM, DBCEM-TD3 maintains an actor buffer to store the population required for evolution. In each iteration, it only needs to generate a small number of actors to replace the poor actors in the policy buffer to achieve more efficient evolution. The fitness of individuals in the actor buffer decreases exponentially with time, which can avoid premature convergence of the mean actor. For TD3, DBCEM-TD3 maintains a critic buffer with the same number of critics as the number of actors generated in each iteration, and each critic is trained independently by sampling from the shared replay buffer. In each iteration, each newly generated actor uses different critics to guide learning. This ensures more diverse behaviors among the learned actors, enabling richer experiences to be collected during the evaluation phase. We conduct experimental evaluations on five continuous control tasks provided by OpenAI Gym. DBCEM-TD3 outperforms CEM-TD3, TD3, and other classic off-policy reinforcement learning algorithms in terms of performance and sample efficiency. \ No newline at end of file diff --git a/data/2024/aaai/Double-Bounded Optimal Transport for Advanced Clustering and Classification b/data/2024/aaai/Double-Bounded Optimal Transport for Advanced Clustering and Classification new file mode 100644 index 0000000000..626dcd6632 --- /dev/null +++ b/data/2024/aaai/Double-Bounded Optimal Transport for Advanced Clustering and Classification @@ -0,0 +1 @@ +Optimal transport (OT) is attracting increasing attention in machine learning. It aims to transport a source distribution to a target one at minimal cost. In its vanilla form, the source and target distributions are predetermined, which contrasts with the real-world case involving undetermined targets. In this paper, we propose Doubly Bounded Optimal Transport (DB-OT), which assumes that the target distribution is restricted within two boundaries instead of a fixed one, thus giving more freedom for the transport to find solutions. Based on the entropic regularization of DB-OT, three scaling-based algorithms are devised for calculating the optimal solution.
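The DB-OT abstract above keeps the target marginal between two bounds and solves the entropically regularized problem with scaling iterations. Below is a generic Sinkhorn-style sketch of that idea, assuming a clipped column scaling as the way to respect the bounds; it illustrates the principle and is not necessarily one of the paper's three algorithms.

import numpy as np

def db_ot_sinkhorn(C, r, lo, hi, eps=0.05, iters=500):
    """Entropic OT where column (target) masses only need to lie in [lo, hi].

    Rows are matched to the source distribution r exactly; each column scaling is
    clipped so its mass stays inside the two bounds (an inactive bound corresponds
    to a unit scaling, as in standard Sinkhorn with an equality constraint removed).
    """
    K = np.exp(-C / eps)
    a = np.ones(C.shape[0])
    b = np.ones(C.shape[1])
    for _ in range(iters):
        a = r / (K @ b)                      # enforce exact row marginals
        s = K.T @ a                          # current (unscaled) column masses
        b = np.clip(1.0, lo / s, hi / s)     # keep column masses within [lo, hi]
    return a[:, None] * K * b[None, :]

rng = np.random.default_rng(0)
C = rng.random((4, 5))
r = np.full(4, 0.25)
P = db_ot_sinkhorn(C, r, lo=np.full(5, 0.1), hi=np.full(5, 0.3))
print(P.sum(axis=0))   # column masses lie within [0.1, 0.3]
print(P.sum(axis=1))   # row masses approach r as the iterations converge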
We also show that our DB-OT is helpful for barycenter-based clustering, which can avoid the excessive concentration of samples in a single cluster. Then we further develop DB-OT techniques for long-tailed classification, which is an emerging and open problem. We first propose a connection between OT and classification: in the classification task, training involves optimizing the Inverse OT to learn the representations, while testing involves optimizing the OT for predictions. With this OT perspective, we apply DB-OT to improve the loss, and the Balanced Softmax is shown as a special case. Then we apply DB-OT for inference in the testing process. Even with vanilla Softmax-trained features, our experiments show that our method can achieve good results with our improved inference scheme in the testing stage. \ No newline at end of file diff --git a/data/2024/aaai/Double-Descent Curves in Neural Networks: A New Perspective Using Gaussian Processes b/data/2024/aaai/Double-Descent Curves in Neural Networks: A New Perspective Using Gaussian Processes new file mode 100644 index 0000000000..f3f9a13912 --- /dev/null +++ b/data/2024/aaai/Double-Descent Curves in Neural Networks: A New Perspective Using Gaussian Processes @@ -0,0 +1 @@ +Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters which is less than the number of data points, but then descends again in the overparameterized regime. In this paper, we use techniques from random matrix theory to characterize the spectral distribution of the empirical feature covariance matrix as a width-dependent perturbation of the spectrum of the neural network Gaussian process (NNGP) kernel, thus establishing a novel connection between the NNGP literature and the random matrix theory literature in the context of neural networks. Our analytical expressions allow us to explore the generalisation behavior of the corresponding kernel and GP regression. Furthermore, they offer a new interpretation of double-descent in terms of the discrepancy between the width-dependent empirical kernel and the width-independent NNGP kernel. \ No newline at end of file diff --git a/data/2024/aaai/Double-Layer Hybrid-Label Identification Feature Selection for Multi-View Multi-Label Learning b/data/2024/aaai/Double-Layer Hybrid-Label Identification Feature Selection for Multi-View Multi-Label Learning new file mode 100644 index 0000000000..053ce1a412 --- /dev/null +++ b/data/2024/aaai/Double-Layer Hybrid-Label Identification Feature Selection for Multi-View Multi-Label Learning @@ -0,0 +1 @@ +Multi-view multi-label feature selection aims to select informative features where the data are collected from multiple sources with multiple interdependent class labels. For fully exploiting multi-view information, most prior works mainly focus on the common part in the ideal circumstance. However, the inconsistent part hidden in each view, including noises and specific elements, may affect the quality of mapping between labels and feature representations. Meanwhile, ignoring the specific part might lead to a suboptimal result, as each label is supposed to possess specific characteristics of its own.
To deal with these two problems in multi-view multi-label feature selection, we propose a unified loss function with a fully splitting structure that decomposes the observed labels into hybrid labels, that is, common labels, view-to-all specific labels, and noisy labels; the view-to-all specific labels are further split into several specific labels for each view. The proposed method simultaneously considers the consistency and complementarity of different views. Through exploring the feature weights of hybrid labels, the mapping relationships between labels and features can be established sequentially based on their attributes. Additionally, the interrelatedness among hybrid labels is also investigated and injected into the function. For the specific labels of each view, we construct a novel regularization paradigm incorporating logic operations. Finally, the convergence of the result is proved after applying the multiplicative update rules. Experiments on six datasets demonstrate the effectiveness and superiority of our method compared with the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Doubly Perturbed Task Free Continual Learning b/data/2024/aaai/Doubly Perturbed Task Free Continual Learning new file mode 100644 index 0000000000..88d8b1a326 --- /dev/null +++ b/data/2024/aaai/Doubly Perturbed Task Free Continual Learning @@ -0,0 +1 @@ +Task-free online continual learning (TF-CL) is a challenging problem where the model incrementally learns tasks without explicit task information. Although training with the entire data from the past, present, and future is considered the gold standard, naive approaches in TF-CL that use only the current samples may conflict with learning from future samples, leading to catastrophic forgetting and poor plasticity. Thus, a proactive consideration of unseen future samples in TF-CL becomes imperative. Motivated by this intuition, we propose a novel TF-CL framework considering future samples and show that injecting adversarial perturbations on both input data and decision-making is effective. Then, we propose a novel method named Doubly Perturbed Continual Learning (DPCL) to efficiently implement these input and decision-making perturbations. Specifically, for input perturbation, we propose an approximate perturbation method that injects noise into the input data as well as the feature vector and then interpolates the two perturbed samples. For decision-making process perturbation, we devise multiple stochastic classifiers. We also investigate a memory management scheme and learning rate scheduling reflecting our proposed double perturbations. We demonstrate that our proposed method outperforms the state-of-the-art baseline methods by large margins on various TF-CL benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Dr. R.O. Bott Will See You Now: Exploring AI for Wellbeing with Middle School Students b/data/2024/aaai/Dr. R.O. Bott Will See You Now: Exploring AI for Wellbeing with Middle School Students new file mode 100644 index 0000000000..1e9672ac97 --- /dev/null +++ b/data/2024/aaai/Dr. R.O. Bott Will See You Now: Exploring AI for Wellbeing with Middle School Students @@ -0,0 +1 @@ +Artificial Intelligence (AI) is permeating almost every area of society, reshaping how many people, including youth, navigate the world. Despite the increased presence of AI, most people lack a baseline knowledge of how AI works.
Moreover, social barriers often hinder equal access to AI courses, perpetuating disparities in participation in the field. To address this, it is crucial to design AI curricula that are effective, inclusive, and relevant, especially to learners from backgrounds that are historically excluded from working in tech. In this paper, we present AI for Wellbeing, a curriculum where students explore conversational AI and the ethical considerations around using it to promote wellbeing. We specifically designed content, educator materials, and educational technologies to meet the interests and needs of students and educators from diverse backgrounds. We piloted AI for Wellbeing in a 5-day virtual workshop with middle school teachers and students. Then, using a mixed-methods approach, we analyzed students' work and teachers' feedback. Our results suggest that the curriculum content and design effectively engaged students, enabling them to implement meaningful AI projects for wellbeing. We hope that the design of this curriculum and insights from our evaluation will inspire future efforts to create culturally relevant K-12 AI curricula. \ No newline at end of file diff --git a/data/2024/aaai/DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency b/data/2024/aaai/DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency new file mode 100644 index 0000000000..156085cc0a --- /dev/null +++ b/data/2024/aaai/DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency @@ -0,0 +1 @@ +The combination of electronic health records (EHR) and medical images is crucial for clinicians in making diagnoses and forecasting prognoses. Strategically fusing these two data modalities has great potential to improve the accuracy of machine learning models in clinical prediction tasks. However, the asynchronous and complementary nature of EHR and medical images presents unique challenges. Missing modalities due to clinical and administrative factors are inevitable in practice, and the significance of each data modality varies depending on the patient and the prediction target, resulting in inconsistent predictions and suboptimal model performance. To address these challenges, we propose DrFuse to achieve effective clinical multi-modal fusion. It tackles the missing modality issue by disentangling the features shared across modalities and those unique within each modality. Furthermore, we address the modal inconsistency issue via a disease-wise attention layer that produces the patient- and disease-wise weighting for each modality to make the final prediction. We validate the proposed method using real-world large-scale datasets, MIMIC-IV and MIMIC-CXR. Experimental results show that the proposed method significantly outperforms the state-of-the-art models. 
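The DrFuse abstract above describes a disease-wise attention layer that weighs the EHR and imaging modalities per patient and per prediction target. Below is a toy PyTorch sketch of such a gating scheme; the dimensions and the sigmoid gate are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class DiseaseWiseFusion(nn.Module):
    """For every target disease, a learned gate decides how much to trust the
    EHR representation versus the imaging representation of the current patient."""

    def __init__(self, dim: int, num_diseases: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, num_diseases)   # patient- and disease-wise weights
        self.ehr_head = nn.Linear(dim, num_diseases)
        self.img_head = nn.Linear(dim, num_diseases)

    def forward(self, ehr_feat, img_feat):
        w = torch.sigmoid(self.gate(torch.cat([ehr_feat, img_feat], dim=-1)))
        return w * self.ehr_head(ehr_feat) + (1 - w) * self.img_head(img_feat)

model = DiseaseWiseFusion(dim=64, num_diseases=14)
out = model(torch.randn(8, 64), torch.randn(8, 64))
print(out.shape)  # (8, 14) per-disease logits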
\ No newline at end of file diff --git a/data/2024/aaai/DreamIdentity: Enhanced Editability for Efficient Face-Identity Preserved Image Generation b/data/2024/aaai/DreamIdentity: Enhanced Editability for Efficient Face-Identity Preserved Image Generation new file mode 100644 index 0000000000..a8f22d4beb --- /dev/null +++ b/data/2024/aaai/DreamIdentity: Enhanced Editability for Efficient Face-Identity Preserved Image Generation @@ -0,0 +1 @@ +While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centric images, an intractable problem is how to preserve the face identity and follow the text prompts simultaneously for conditioned input face images and texts. Despite existing encoder-based methods achieving high efficiency and decent face similarity, the generated image often fails to follow the textual prompts. To ease this editability issue, we present DreamIdentity to learn edit-friendly and accurate face-identity representations in the word embedding space. Specifically, we propose self-augmented editability learning to enhance the editability of the projected embedding, which is achieved by constructing pairs of generated celebrity faces and edited celebrity images for training, aiming at transferring the mature editability of off-the-shelf text-to-image models on celebrities to unseen identities. Furthermore, we design a novel dedicated face-identity encoder to learn an accurate representation of human faces, which applies multi-scale ID-aware features followed by a multi-embedding projector to generate the pseudo words in the text embedding space directly. Extensive experiments show that our method can generate more text-coherent and ID-preserved images with negligible time overhead compared to the standard text-to-image generation process. \ No newline at end of file diff --git a/data/2024/aaai/DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models b/data/2024/aaai/DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models new file mode 100644 index 0000000000..e5565de52a --- /dev/null +++ b/data/2024/aaai/DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models @@ -0,0 +1,7 @@ +Recent progress in large-scale text-to-image models has yielded remarkable accomplishments, finding various applications in the art domain. +However, expressing unique characteristics of an artwork (e.g., brushwork, color tone, or composition) with text prompts alone may encounter limitations due to the inherent constraints of verbal description. +To this end, we introduce DreamStyler, a novel framework designed for artistic image synthesis, proficient in both text-to-image synthesis and style transfer. +DreamStyler optimizes a multi-stage textual embedding with a context-aware text prompt, resulting in prominent image quality. +In addition, with content and style guidance, DreamStyler exhibits flexibility to accommodate a range of style references. +Experimental results demonstrate its superior performance across multiple scenarios, suggesting its promising potential in artistic product creation.
+Project page: https://nmhkahn.github.io/dreamstyler/ \ No newline at end of file diff --git a/data/2024/aaai/Dual Mapping of 2D StyleGAN for 3D-Aware Image Generation and Manipulation (Student Abstract) b/data/2024/aaai/Dual Mapping of 2D StyleGAN for 3D-Aware Image Generation and Manipulation (Student Abstract) new file mode 100644 index 0000000000..8bd072a81f --- /dev/null +++ b/data/2024/aaai/Dual Mapping of 2D StyleGAN for 3D-Aware Image Generation and Manipulation (Student Abstract) @@ -0,0 +1 @@ +3D-aware GANs successfully solve the problem of 3D-consistent generation and furthermore provide a 3D shape of the generated object. However, the application of the volume renderer disturbs the disentanglement of the latent space, which makes it difficult to manipulate 3D-aware GANs and lowers the image quality of style-based generators. In this work, we devise a dual-mapping framework to make the generated images of a pretrained 2D StyleGAN consistent in 3D space. We utilize a tri-plane representation to estimate the 3D shape of the generated object and two mapping networks to bridge the latent space of StyleGAN and the 3D tri-plane space. Our method does not alter the parameters of the pretrained generator, which means the interpretability of the latent space is preserved for various image manipulations. Experiments show that our method lifts the 3D awareness of pretrained 2D StyleGAN to 3D-aware GANs and outperforms the 3D-aware GANs in controllability and image quality. \ No newline at end of file diff --git a/data/2024/aaai/Dual Self-Paced Cross-Modal Hashing b/data/2024/aaai/Dual Self-Paced Cross-Modal Hashing new file mode 100644 index 0000000000..eb147e404f --- /dev/null +++ b/data/2024/aaai/Dual Self-Paced Cross-Modal Hashing @@ -0,0 +1 @@ +Cross-modal hashing (CMH) is an efficient technique to retrieve relevant data across different modalities, such as images, texts, and videos, which has attracted more and more attention due to its low storage cost and fast query speed. Although existing CMH methods have achieved remarkable progress, almost all of them treat all samples of varying difficulty levels without discrimination, thus leaving them vulnerable to noise or outliers. Based on this observation, we reveal and study the dual difficulty levels implied in cross-modal hashing learning, i.e., instance-level and feature-level difficulty. To address this problem, we propose a novel Dual Self-Paced Cross-Modal Hashing (DSCMH) that mimics human cognitive learning to learn hashing from "easy" to "hard" at both the instance and feature levels, thereby embracing robustness against noise/outliers. Specifically, our DSCMH assigns weights to each instance and feature to measure their difficulty or reliability, and then uses these weights to automatically filter out the noisy and irrelevant data points in the original space. By gradually increasing the weights during training, our method can focus on more instances and features from "easy" to "hard" in training, thus mitigating the adverse effects of noise or outliers. Extensive experiments are conducted on three widely-used benchmark datasets to demonstrate the effectiveness and robustness of the proposed DSCMH over 12 state-of-the-art CMH methods.
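The DSCMH abstract above relies on self-paced weights that admit easy examples first and gradually include harder ones. Below is a minimal sketch of that weighting idea, assuming a quantile-based threshold schedule for illustration rather than the paper's actual self-paced regularizer.

import numpy as np

def self_paced_weights(losses, age, max_age):
    """Binary self-paced weights: an example is kept only if its current loss falls
    below a threshold that grows as training progresses, so learning moves from
    "easy" to "hard" examples."""
    threshold = np.quantile(losses, min(1.0, 0.3 + 0.7 * age / max_age))
    return (losses <= threshold).astype(float)

losses = np.array([0.2, 1.5, 0.4, 3.0, 0.1])
for epoch in (0, 5, 10):
    print(epoch, self_paced_weights(losses, age=epoch, max_age=10))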
\ No newline at end of file diff --git a/data/2024/aaai/Dual-Level Curriculum Meta-Learning for Noisy Few-Shot Learning Tasks b/data/2024/aaai/Dual-Level Curriculum Meta-Learning for Noisy Few-Shot Learning Tasks new file mode 100644 index 0000000000..b125f3f797 --- /dev/null +++ b/data/2024/aaai/Dual-Level Curriculum Meta-Learning for Noisy Few-Shot Learning Tasks @@ -0,0 +1,2 @@ +Few-shot learning (FSL) is essential in many practical applications. However, the limited training examples make the models more vulnerable to label noise, which can lead to poor generalization capability. To address this critical challenge, we propose a curriculum meta-learning model that employs a novel dual-level class-example sampling strategy to create a robust curriculum for adaptive task distribution formulation and robust model training. The dual-level framework proposes a heuristic class sampling criterion that measures pairwise class boundary complexity to form a class curriculum; it uses effective example sampling through an under-trained proxy model to form an example curriculum. By utilizing both class-level and example-level information, our approach is more robust to handle limited training data and noisy labels that commonly occur in few-shot learning tasks. +The model has efficient convergence behavior, which is verified through rigorous convergence analysis. Additionally, we establish a novel error bound through a hierarchical PAC-Bayesian analysis for curriculum meta-learning under noise. We conduct extensive experiments that demonstrate the effectiveness of our framework in outperforming existing noisy few-shot learning methods under various few-shot classification benchmarks. Our code is available at https://github.com/ritmininglab/DCML. \ No newline at end of file diff --git a/data/2024/aaai/Dual-Perspective Knowledge Enrichment for Semi-supervised 3D Object Detection b/data/2024/aaai/Dual-Perspective Knowledge Enrichment for Semi-supervised 3D Object Detection new file mode 100644 index 0000000000..f24a7f8c7d --- /dev/null +++ b/data/2024/aaai/Dual-Perspective Knowledge Enrichment for Semi-supervised 3D Object Detection @@ -0,0 +1 @@ +Semi-supervised 3D object detection is a promising yet under-explored direction to reduce data annotation costs, especially for cluttered indoor scenes. A few prior works, such as SESS and 3DIoUMatch, attempt to solve this task by utilizing a teacher model to generate pseudo-labels for unlabeled samples. However, the availability of unlabeled samples in the 3D domain is relatively limited compared to its 2D counterpart due to the greater effort required to collect 3D data. Moreover, the loose consistency regularization in SESS and restricted pseudo-label selection strategy in 3DIoUMatch lead to either low-quality supervision or a limited amount of pseudo labels. To address these issues, we present a novel Dual-Perspective Knowledge Enrichment approach named DPKE for semi-supervised 3D object detection. Our DPKE enriches the knowledge of limited training data, particularly unlabeled data, from two perspectives: data-perspective and feature-perspective. Specifically, from the data-perspective, we propose a class-probabilistic data augmentation method that augments the input data with additional instances based on the varying distribution of class probabilities. 
Our DPKE achieves feature-perspective knowledge enrichment by designing a geometry-aware feature matching method that regularizes feature-level similarity between object proposals from the student and teacher models. Extensive experiments on the two benchmark datasets demonstrate that our DPKE achieves superior performance over existing state-of-the-art approaches under various label ratio conditions. The source code and models will be made available to the public. \ No newline at end of file diff --git a/data/2024/aaai/Dual-Prior Augmented Decoding Network for Long Tail Distribution in HOI Detection b/data/2024/aaai/Dual-Prior Augmented Decoding Network for Long Tail Distribution in HOI Detection new file mode 100644 index 0000000000..00d8412856 --- /dev/null +++ b/data/2024/aaai/Dual-Prior Augmented Decoding Network for Long Tail Distribution in HOI Detection @@ -0,0 +1 @@ +Human object interaction detection aims at localizing human-object pairs and recognizing their interactions. Trapped by the long-tailed distribution of the data, existing HOI detection methods often have difficulty recognizing the tail categories. Many approaches try to improve the recognition of HOI tasks by utilizing external knowledge (e.g. pre-trained visual-language models). However, these approaches mainly utilize external knowledge at the HOI combination level and achieve limited improvement in the tail categories. In this paper, we propose a dual-prior augmented decoding network by decomposing the HOI task into two sub-tasks: human-object pair detection and interaction recognition. For each subtask, we leverage external knowledge to enhance the model's ability at a finer granularity. Specifically, we acquire the prior candidates from an external classifier and embed them to assist the subsequent decoding process. Thus, the long-tail problem is mitigated from a coarse-to-fine level with the corresponding external knowledge. Our approach outperforms existing state-of-the-art models in various settings and significantly boosts the performance on the tail HOI categories. The source code is available at https://github.com/PRIS-CV/DP-ADN. \ No newline at end of file diff --git a/data/2024/aaai/Dual-View Whitening on Pre-trained Text Embeddings for Sequential Recommendation b/data/2024/aaai/Dual-View Whitening on Pre-trained Text Embeddings for Sequential Recommendation new file mode 100644 index 0000000000..d031d2f649 --- /dev/null +++ b/data/2024/aaai/Dual-View Whitening on Pre-trained Text Embeddings for Sequential Recommendation @@ -0,0 +1 @@ +Recent advances in sequential recommendation models have demonstrated the efficacy of integrating pre-trained text embeddings with item ID embeddings to achieve superior performance. However, our study takes a unique perspective by exclusively focusing on the untapped potential of text embeddings, obviating the need for ID embeddings. We begin by implementing a pre-processing strategy known as whitening, which effectively transforms the anisotropic semantic space of pre-trained text embeddings into an isotropic Gaussian distribution. Comprehensive experiments reveal that applying whitening to pre-trained text embeddings in sequential recommendation models significantly enhances performance. Yet, a full whitening operation might break the potential manifold of items with similar text semantics. 
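The dual-view whitening abstract above maps anisotropic pre-trained text embeddings to an (approximately) isotropic distribution. Below is a small numpy sketch of standard PCA whitening; the keep parameter is an assumed, simplified stand-in for the "relaxed" whitening view, whose exact form in the paper may differ.

import numpy as np

def whiten(embeddings, eps=1e-8, keep=1.0):
    """Centre the embeddings and rescale along the principal directions so the
    covariance becomes (close to) the identity. keep < 1 softens the rescaling."""
    mu = embeddings.mean(axis=0, keepdims=True)
    x = embeddings - mu
    cov = x.T @ x / len(x)
    u, s, _ = np.linalg.svd(cov)
    scale = 1.0 / np.sqrt(s + eps)
    if keep < 1.0:
        scale = keep * scale + (1.0 - keep)   # partially relaxed whitening
    return x @ u @ np.diag(scale)

emb = np.random.randn(1000, 64) * np.linspace(0.1, 5.0, 64)   # anisotropic toy embeddings
white = whiten(emb)
print(np.round(np.cov(white.T)[:3, :3], 2))                    # close to the identity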
To retain the original semantics while benefiting from the isotropy of the whitened text features, we propose a Dual-view Whitening method for Sequential Recommendation (DWSRec), which leverages both fully whitened and relaxed whitened item representations as dual views for effective recommendations. We further examine the advantages of our approach through both empirical and theoretical analyses. Experiments on three public benchmark datasets show that DWSRec outperforms state-of-the-art methods for sequential recommendation. \ No newline at end of file diff --git a/data/2024/aaai/Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging b/data/2024/aaai/Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging new file mode 100644 index 0000000000..7c13d14a90 --- /dev/null +++ b/data/2024/aaai/Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging @@ -0,0 +1 @@ +The coded aperture snapshot spectral imaging (CASSI) system is an effective approach to hyperspectral snapshot compressive imaging. The core issue of CASSI is to solve the inverse problem for the reconstruction of the hyperspectral image (HSI). In recent years, Transformer-based methods have achieved promising performance in HSI reconstruction. However, capturing both long-range dependencies and local information while ensuring reasonable computational costs remains a challenging problem. In this paper, we propose a Transformer-based HSI reconstruction method called dual-window multiscale Transformer (DWMT), which is a coarse-to-fine process, reconstructing the global properties of HSI with the long-range dependencies. In our method, we propose a novel U-Net architecture using a dual-branch encoder to refine pixel information and full-scale skip connections to fuse different features, enhancing the extraction of fine-grained features. Meanwhile, we design a novel self-attention mechanism called dual-window multiscale multi-head self-attention (DWM-MSA), which utilizes two different-sized windows to compute self-attention, capturing the long-range dependencies in a local region at different scales to improve the reconstruction performance. We also propose a novel position embedding method for Transformers, named con-abs position embedding (CAPE), which effectively enhances the positional information of the HSIs. Extensive experiments on both simulated and real data are conducted to demonstrate the superior performance, stability, and generalization ability of our DWMT. Code of this project is at https://github.com/chenx2000/DWMT. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Budget Throttling in Repeated Second-Price Auctions b/data/2024/aaai/Dynamic Budget Throttling in Repeated Second-Price Auctions new file mode 100644 index 0000000000..766b45d9a0 --- /dev/null +++ b/data/2024/aaai/Dynamic Budget Throttling in Repeated Second-Price Auctions @@ -0,0 +1,10 @@ +In today's online advertising markets, a crucial requirement for an advertiser is to control her total expenditure within a time horizon under some budget. +Among various budget control methods, throttling has emerged as a popular choice, managing an advertiser's total expenditure by selecting only a subset of auctions to participate in. +This paper provides a theoretical panorama of a single advertiser's dynamic budget throttling process in repeated second-price auctions.
+We first establish a lower bound on the regret and an upper bound on the asymptotic competitive ratio for any throttling algorithm, respectively, when the advertiser's values are stochastic and adversarial. +Regarding the algorithmic side, we propose the OGD-CB algorithm, which guarantees a near-optimal expected regret with stochastic values. +On the other hand, when values are adversarial, we prove that this algorithm also reaches the upper bound on the asymptotic competitive ratio. +We further compare throttling with pacing, another widely adopted budget control method, in repeated second-price auctions. +In the stochastic case, we demonstrate that pacing is generally superior to throttling for the advertiser, supporting the well-known result that pacing is asymptotically optimal in this scenario. +However, in the adversarial case, we give an exciting result indicating that throttling is also an asymptotically optimal dynamic bidding strategy. +Our results bridge the gaps in theoretical research of throttling in repeated auctions and comprehensively reveal the ability of this popular budget-smoothing strategy. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Feature Pruning and Consolidation for Occluded Person Re-identification b/data/2024/aaai/Dynamic Feature Pruning and Consolidation for Occluded Person Re-identification new file mode 100644 index 0000000000..cdaab6c4cb --- /dev/null +++ b/data/2024/aaai/Dynamic Feature Pruning and Consolidation for Occluded Person Re-identification @@ -0,0 +1 @@ +Occluded person re-identification (ReID) is a challenging problem due to contamination from occluders. Existing approaches address the issue with prior knowledge cues, such as human body key points and semantic segmentations, which easily fail in the presence of heavy occlusion and other humans as occluders. In this paper, we propose a feature pruning and consolidation (FPC) framework to circumvent explicit human structure parsing. The framework mainly consists of a sparse encoder, a multi-view feature matching module, and a feature consolidation decoder. Specifically, the sparse encoder drops less important image tokens, mostly related to background noise and occluders, solely based on correlation within the class token attention. Subsequently, the matching stage relies on the preserved tokens produced by the sparse encoder to identify k-nearest neighbors in the gallery by measuring the image and patch-level combined similarity. Finally, we use the feature consolidation module to compensate for pruned features using the identified neighbors, recovering essential information while disregarding disturbance from noise and occlusion. Experimental results demonstrate the effectiveness of our proposed framework on occluded, partial, and holistic Re-ID datasets. In particular, our method outperforms state-of-the-art results by at least 8.6% mAP and 6.0% Rank-1 accuracy on the challenging Occluded-Duke dataset. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Knowledge Injection for AIXI Agents b/data/2024/aaai/Dynamic Knowledge Injection for AIXI Agents new file mode 100644 index 0000000000..dcb6075fb8 --- /dev/null +++ b/data/2024/aaai/Dynamic Knowledge Injection for AIXI Agents @@ -0,0 +1 @@ +Prior approximations of AIXI, a Bayesian optimality notion for general reinforcement learning, can only approximate AIXI's Bayesian environment model using an a-priori defined set of models.
This is a fundamental source of epistemic uncertainty for the agent in settings where the existence of systematic bias in the predefined model class cannot be resolved by simply collecting more data from the environment. We address this issue in the context of Human-AI teaming by considering a setup where additional knowledge for the agent, in the form of new candidate models, arrives from a human operator in an online fashion. We introduce a new agent called DynamicHedgeAIXI that maintains an exact Bayesian mixture over dynamically changing sets of models via a time-adaptive prior constructed from a variant of the Hedge algorithm. The DynamicHedgeAIXI agent is the richest direct approximation of AIXI known to date and comes with good performance guarantees. Experimental results on epidemic control on contact networks validate the agent's practical utility. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Reactive Spiking Graph Neural Network b/data/2024/aaai/Dynamic Reactive Spiking Graph Neural Network new file mode 100644 index 0000000000..33d3794583 --- /dev/null +++ b/data/2024/aaai/Dynamic Reactive Spiking Graph Neural Network @@ -0,0 +1 @@ +Spiking Graph Neural Networks are emerging tools for analyzing graph data with low energy consumption and a degree of biological fidelity. Existing methods directly integrate same-reactive spiking neurons into graph neural networks for processing propagated graphs. However, such same-reactive neurons lack the biological functionality of the brain's dynamic-reactive neurons, limiting the model's expressiveness. Meanwhile, only limited long-range neighbor information can be extracted from the few-step propagated graph, restricting the discrimination of graph spiking embeddings. Inspired by the dynamic cognition in the brain, we propose a Dynamic Reactive Spiking Graph Neural Network that enhances the model's expressive ability with higher biological fidelity. Specifically, we design dynamic reactive spiking neurons to process spiking graph inputs, which have unique optimizable thresholds to spontaneously explore dynamic reactive states between neurons. Moreover, discriminative graph positional spikes are learned and integrated adaptively into spiking outputs through our neurons, thereby exploring long-range neighbors more thoroughly. Finally, with the dynamic reactive mechanism and learnable positional integration, we obtain a powerful model with high biological fidelity and low energy consumption. Experiments on various domain-related datasets demonstrate the effectiveness of our model. Our code is available at https://github.com/hzhao98/DRSGNN. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation b/data/2024/aaai/Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation new file mode 100644 index 0000000000..e7130fb6f7 --- /dev/null +++ b/data/2024/aaai/Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation @@ -0,0 +1 @@ +We study reinforcement learning (RL) in episodic MDPs with adversarial full-information losses and an unknown transition. Instead of the classical static regret, we adopt dynamic regret as the performance measure, which benchmarks the learner's performance against changing policies, making it more suitable for non-stationary environments.
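The DynamicHedgeAIXI abstract above maintains a Bayesian mixture over a changing set of candidate models through a time-adaptive prior built from a Hedge variant. Below is a minimal sketch of the underlying Hedge weight update; the way mass is assigned to a newly injected model is an illustrative assumption, not the paper's construction.

import numpy as np

def hedge_update(weights, losses, eta=0.5):
    """One Hedge step: exponentially down-weight models by their loss and renormalise."""
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

weights = np.array([0.5, 0.5])
weights = hedge_update(weights, losses=np.array([1.0, 0.2]))

# a new candidate model arrives from the human operator: give it 10% of the mass
weights = np.append(0.9 * weights, 0.1)
print(weights, weights.sum())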
The primary challenge is to handle the uncertainties of the unknown transition and the unknown non-stationarity of environments simultaneously. We propose a general framework to decouple the two sources of uncertainties and show that the dynamic regret bound naturally decomposes into two terms, one due to constructing confidence sets to handle the unknown transition and the other due to choosing sub-optimal policies under the unknown non-stationarity. To this end, we first employ a two-layer online ensemble structure to handle the adaptation error due to the unknown non-stationarity, which is model-agnostic. Subsequently, we instantiate the framework for three fundamental MDP models, including tabular MDPs, linear MDPs, and linear mixture MDPs, and present corresponding approaches to control the exploration error due to the unknown transition. We provide corresponding dynamic regret guarantees and show they are optimal in terms of the number of episodes K and the non-stationarity P̄ᴋ by establishing matching lower bounds. To the best of our knowledge, this is the first work that achieves a dynamic regret with optimal dependence on K and P̄ᴋ without prior knowledge about the non-stationarity for adversarial MDPs with unknown transition. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Semantic-Based Spatial Graph Convolution Network for Skeleton-Based Human Action Recognition b/data/2024/aaai/Dynamic Semantic-Based Spatial Graph Convolution Network for Skeleton-Based Human Action Recognition new file mode 100644 index 0000000000..c675b88e36 --- /dev/null +++ b/data/2024/aaai/Dynamic Semantic-Based Spatial Graph Convolution Network for Skeleton-Based Human Action Recognition @@ -0,0 +1 @@ +Graph convolutional networks (GCNs) have attracted great attention and achieved remarkable performance in skeleton-based action recognition. However, most of the previous works are designed to refine skeleton topology without considering the types of different joints and edges, making them unable to represent the semantic information. In this paper, we propose a dynamic semantic-based graph convolution network (DS-GCN) for skeleton-based human action recognition, in which the joint and edge types are encoded in the skeleton topology in an implicit way. Specifically, two semantic modules, the joint type-aware adaptive topology and the edge type-aware adaptive topology, are proposed. Combining the proposed semantic modules with temporal convolution, a powerful framework named DS-GCN is developed for skeleton-based action recognition. Extensive experiments on two datasets, NTU-RGB+D and Kinetics-400, show that the proposed semantic modules are general enough to be utilized in various backbones for boosting recognition accuracy. Meanwhile, the proposed DS-GCN notably outperforms state-of-the-art methods. The code is released at https://github.com/davelailai/DS-GCN \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Spiking Graph Neural Networks b/data/2024/aaai/Dynamic Spiking Graph Neural Networks new file mode 100644 index 0000000000..bf442c7949 --- /dev/null +++ b/data/2024/aaai/Dynamic Spiking Graph Neural Networks @@ -0,0 +1 @@ +The integration of Spiking Neural Networks (SNNs) and Graph Neural Networks (GNNs) is gradually attracting attention due to the low power consumption and high efficiency in processing the non-Euclidean data represented by graphs.
However, as a common problem, dynamic graph representation learning faces challenges such as high complexity and large memory overheads. Current work often replaces Recurrent Neural Networks (RNNs) with SNNs, using binary features instead of continuous ones for efficient training, which overlooks graph structure information and leads to the loss of details during propagation. Additionally, optimizing dynamic spiking models typically requires the propagation of information across time steps, which increases memory requirements. To address these challenges, we present a framework named Dynamic Spiking Graph Neural Networks (Dy-SIGN). To mitigate the information loss problem, Dy-SIGN propagates early-layer information directly to the last layer for information compensation. To accommodate the memory requirements, we apply implicit differentiation on the equilibrium state, which does not rely on the exact reverse of the forward computation. While traditional implicit differentiation methods are usually used for static situations, Dy-SIGN extends them to the dynamic graph setting. Extensive experiments on three large-scale real-world dynamic graph datasets validate the effectiveness of Dy-SIGN on dynamic node classification tasks with lower computational costs. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning b/data/2024/aaai/Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning new file mode 100644 index 0000000000..2629a6f68e --- /dev/null +++ b/data/2024/aaai/Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning @@ -0,0 +1 @@ +Continual learning (CL) has shown promising results and comparable performance to learning at once in a fully supervised manner. However, CL strategies typically require a large number of labeled samples, making their real-life deployment challenging. In this work, we focus on semi-supervised continual learning (SSCL), where the model progressively learns from partially labeled data with unknown categories. We provide a comprehensive analysis of SSCL and demonstrate that unreliable distributions of unlabeled data lead to unstable training and refinement of the progressing stages. This problem severely impacts the performance of SSCL. To address the limitations, we propose a novel approach called Dynamic Sub-Graph Distillation (DSGD) for semi-supervised continual learning, which leverages both semantic and structural information to achieve more stable knowledge distillation on unlabeled data and exhibits robustness against distribution bias. Firstly, we formalize a general model of structural distillation and design a dynamic graph construction for the continual learning process. Next, we define a structure distillation vector and design a dynamic sub-graph distillation algorithm, which enables end-to-end training and adaptability to scale up tasks. The entire proposed method is adaptable to various CL methods and supervision settings. Finally, experiments conducted on three datasets, CIFAR10, CIFAR100, and ImageNet-100, with varying supervision ratios, demonstrate the effectiveness of our proposed approach in mitigating the catastrophic forgetting problem in semi-supervised continual learning scenarios. Our code is available: https://github.com/fanyan0411/DSGD.
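The DSGD abstract above distills structural information on unlabeled data by matching relations among samples rather than individual predictions. Below is a generic PyTorch sketch of such structure-level distillation over a batch-level similarity graph; it only illustrates the principle and is not the paper's dynamic sub-graph construction.

import torch
import torch.nn.functional as F

def structural_distillation_loss(student_feats, teacher_feats, tau=0.1):
    """Build a soft similarity graph over the batch for both student and teacher
    features and make the student's neighbourhood structure match the teacher's."""
    def sim_graph(x):
        x = F.normalize(x, dim=-1)
        return F.softmax(x @ x.t() / tau, dim=-1)

    s, t = sim_graph(student_feats), sim_graph(teacher_feats)
    return F.kl_div(s.log(), t, reduction="batchmean")

loss = structural_distillation_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())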
\ No newline at end of file diff --git a/data/2024/aaai/Dynamic Tangled Derivative Logic of Metric Spaces b/data/2024/aaai/Dynamic Tangled Derivative Logic of Metric Spaces new file mode 100644 index 0000000000..7c166a04ff --- /dev/null +++ b/data/2024/aaai/Dynamic Tangled Derivative Logic of Metric Spaces @@ -0,0 +1 @@ +Dynamical systems are abstract models of interaction between space and time. They are often used in fields such as physics and engineering to understand complex processes, but due to their general nature, they have also found applications in studying computational processes, interaction in multi-agent systems, machine learning algorithms, and other computer-science-related phenomena. In the vast majority of applications, a dynamical system consists of the action of a continuous 'transition function' on a metric space. In this work, we consider decidable formal systems for reasoning about such structures. Spatial logics can be traced back to the 1940s, but our work follows a more dynamic turn that these logics have taken due to two recent developments: the study of the topological mu-calculus, and the integration of linear temporal logic with logics based on the Cantor derivative. In this paper, we combine dynamic topological logics based on the Cantor derivative and the 'next point in time' operators with an expressively complete fixed point operator to produce a combination of the topological mu-calculus with linear temporal logic. We show that the resulting logics are decidable and have a natural axiomatisation. Moreover, we prove that these logics are complete for interpretations on the Cantor space, the rational numbers, and subspaces thereof. \ No newline at end of file diff --git a/data/2024/aaai/Dynamic Weighted Combiner for Mixed-Modal Image Retrieval b/data/2024/aaai/Dynamic Weighted Combiner for Mixed-Modal Image Retrieval new file mode 100644 index 0000000000..b47d9fe610 --- /dev/null +++ b/data/2024/aaai/Dynamic Weighted Combiner for Mixed-Modal Image Retrieval @@ -0,0 +1 @@ +Mixed-Modal Image Retrieval (MMIR), as a flexible search paradigm, has attracted wide attention. However, previous approaches always achieve limited performance because two critical factors are seriously overlooked. 1) The contribution of image and text modalities is different, but they are incorrectly treated equally. 2) There exist inherent labeling noises in describing users' intentions with text in web datasets from diverse real-world scenarios, giving rise to overfitting. We propose a Dynamic Weighted Combiner (DWC) to tackle the above challenges, which includes three merits. First, we propose an Editable Modality De-equalizer (EMD) that takes into account the contribution disparity between modalities, containing two modality feature editors and an adaptive weighted combiner. Second, to alleviate labeling noises and data bias, we propose a dynamic soft-similarity label generator (SSG) to implicitly improve noisy supervision. Finally, to bridge modality gaps and facilitate similarity learning, we propose a CLIP-based mutual enhancement module alternately trained by a mixed-modality contrastive loss. Extensive experiments verify that our proposed model significantly outperforms state-of-the-art methods on real-world datasets. The source code is available at https://github.com/fuxianghuang1/DWC.
\ No newline at end of file diff --git a/data/2024/aaai/E2E-AT: A Unified Framework for Tackling Uncertainty in Task-Aware End-to-End Learning b/data/2024/aaai/E2E-AT: A Unified Framework for Tackling Uncertainty in Task-Aware End-to-End Learning new file mode 100644 index 0000000000..38d88acf24 --- /dev/null +++ b/data/2024/aaai/E2E-AT: A Unified Framework for Tackling Uncertainty in Task-Aware End-to-End Learning @@ -0,0 +1 @@ +Successful machine learning involves a complete pipeline of data, model, and downstream applications. Instead of treating them separately, there has been a prominent increase of attention within the constrained optimization (CO) and machine learning (ML) communities towards combining prediction and optimization models. The so-called end-to-end (E2E) learning captures the task-based objective for which the predictions will ultimately be used in decision making. Although a large variety of E2E algorithms have been presented, it has not been fully investigated how to systematically address uncertainties involved in such models. Most of the existing work considers the uncertainties of ML in the input space and improves robustness through adversarial training. We extend this idea to E2E learning and prove that there is a robustness certification procedure by solving augmented integer programming. Furthermore, we show that neglecting the uncertainty of COs during training causes a new trigger for generalization errors. To include all these components, we propose a unified framework that covers the uncertainties emerging in both the input feature space of the ML models and the COs. The framework is described as a robust optimization problem and is practically solved via end-to-end adversarial training (E2E-AT). Finally, the performance of E2E-AT is evaluated by a real-world end-to-end power system operation problem, including load forecasting and sequential scheduling tasks. \ No newline at end of file diff --git a/data/2024/aaai/E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning b/data/2024/aaai/E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning new file mode 100644 index 0000000000..3196808f0c --- /dev/null +++ b/data/2024/aaai/E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning @@ -0,0 +1 @@ +Bio-inspired event cameras, or dynamic vision sensors, are capable of asynchronously capturing per-pixel brightness changes (called event-streams) with high temporal resolution and high dynamic range. However, the non-structural spatial-temporal event-streams make it challenging to provide intuitive visualization with rich semantic information for human vision. This calls for events-to-video (E2V) solutions that take event-streams as input and generate high-quality video frames for intuitive visualization. However, current solutions are predominantly data-driven, without considering the prior knowledge of the underlying statistics relating event-streams and video frames. They rely heavily on the non-linearity and generalization capability of deep neural networks and thus struggle to reconstruct detailed textures when the scenes are complex. In this work, we propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events. This approach leverages a model-aided deep learning framework, underpinned by a theory-inspired E2V model, which is meticulously derived from the fundamental imaging principles of event cameras.
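The E2HQV abstract above grounds its theory-inspired E2V model in the fundamental imaging principle of event cameras: an event signals that the log brightness at a pixel has changed by a fixed contrast threshold. Below is a minimal numpy sketch of this classical event-integration relation; the paper's actual model and its learned correction terms are not reproduced here.

import numpy as np

def integrate_events(log_ref, events, threshold=0.2):
    """Accumulate events on top of a reference log-image: each event at pixel (x, y)
    with polarity p in {-1, +1} contributes a log-brightness change of p * threshold."""
    log_img = log_ref.copy()
    for x, y, p in events:                 # events as (x, y, polarity)
        log_img[y, x] += p * threshold
    return np.exp(log_img)

log_ref = np.log(np.full((4, 4), 0.5))
events = [(0, 0, +1), (0, 0, +1), (3, 3, -1)]
print(np.round(integrate_events(log_ref, events), 3))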
To deal with the issue of state-reset in the recurrent components of E2HQV, we also design a temporal shift embedding module to further improve the quality of the video frames. Comprehensive evaluations on real-world event camera datasets validate our approach, with E2HQV notably outperforming state-of-the-art approaches, e.g., surpassing the second best by over 40% on some evaluation metrics. \ No newline at end of file diff --git a/data/2024/aaai/EAN: An Efficient Attention Module Guided by Normalization for Deep Neural Networks b/data/2024/aaai/EAN: An Efficient Attention Module Guided by Normalization for Deep Neural Networks new file mode 100644 index 0000000000..660f4a24a5 --- /dev/null +++ b/data/2024/aaai/EAN: An Efficient Attention Module Guided by Normalization for Deep Neural Networks @@ -0,0 +1,2 @@ +Deep neural networks (DNNs) have achieved remarkable success in various fields, and two powerful techniques, feature normalization and attention mechanisms, have been widely used to enhance model performance. However, they are usually considered as two separate approaches or combined in a simplistic manner. +In this paper, we investigate the intrinsic relationship between feature normalization and attention mechanisms and propose an Efficient Attention module guided by Normalization, dubbed EAN. Instead of using costly fully-connected layers for attention learning, EAN leverages the strengths of feature normalization and incorporates an Attention Generation (AG) unit to re-calibrate features. The proposed AG unit exploits the normalization component as a measure of the importance of distinct features and generates an attention mask using GroupNorm, L2 Norm, and Adaptation operations. By combining a grouping strategy, the AG unit, and an aggregation step, EAN offers a unified module that harnesses the advantages of both normalization and attention while maintaining minimal computational overhead. Furthermore, EAN serves as a plug-and-play module that can be seamlessly integrated with classic backbone architectures. Extensive quantitative evaluations on various visual tasks demonstrate that EAN achieves highly competitive performance compared to current state-of-the-art attention methods while sustaining lower model complexity. \ No newline at end of file diff --git a/data/2024/aaai/EAT: Towards Long-Tailed Out-of-Distribution Detection b/data/2024/aaai/EAT: Towards Long-Tailed Out-of-Distribution Detection new file mode 100644 index 0000000000..d410e97a2b --- /dev/null +++ b/data/2024/aaai/EAT: Towards Long-Tailed Out-of-Distribution Detection @@ -0,0 +1 @@ +Despite recent advancements in out-of-distribution (OOD) detection, most current studies assume a class-balanced in-distribution training dataset, which is rarely the case in real-world scenarios. This paper addresses the challenging task of long-tailed OOD detection, where the in-distribution data follows a long-tailed class distribution. The main difficulty lies in distinguishing OOD data from samples belonging to the tail classes, as the ability of a classifier to detect OOD instances is not strongly correlated with its accuracy on the in-distribution classes. To overcome this issue, we propose two simple ideas: (1) Expanding the in-distribution class space by introducing multiple abstention classes. This approach allows us to build a detector with clear decision boundaries by training on OOD data using virtual labels.
(2) Augmenting the context-limited tail classes by overlaying images onto the context-rich OOD data. This technique encourages the model to pay more attention to the discriminative features of the tail classes. We provide a clue for separating in-distribution and OOD data by analyzing gradient noise. Through extensive experiments, we demonstrate that our method outperforms the current state-of-the-art on various benchmark datasets. Moreover, our method can be used as an add-on for existing long-tail learning approaches, significantly enhancing their OOD detection performance. Code is available at: https://github.com/Stomach-ache/Long-Tailed-OOD-Detection. \ No newline at end of file diff --git a/data/2024/aaai/ECHO-GL: Earnings Calls-Driven Heterogeneous Graph Learning for Stock Movement Prediction b/data/2024/aaai/ECHO-GL: Earnings Calls-Driven Heterogeneous Graph Learning for Stock Movement Prediction new file mode 100644 index 0000000000..5006cb8eb4 --- /dev/null +++ b/data/2024/aaai/ECHO-GL: Earnings Calls-Driven Heterogeneous Graph Learning for Stock Movement Prediction @@ -0,0 +1 @@ +Stock movement prediction plays an important role in quantitative trading. Despite advances in models that enhance stock movement prediction by incorporating stock relations, these models face two limitations: they construct either insufficient or static stock relations, and thus fail to effectively capture the complex dynamic stock relations shaped by various factors in the ever-changing financial market. To tackle the above limitations, we propose ECHO-GL, a novel stock movement prediction model based on stock relations derived from earnings calls. ECHO-GL not only constructs comprehensive stock relations by exploiting the rich semantic information in the earnings calls but also captures the movement signals between related stocks based on multimodal and heterogeneous graph learning. Moreover, ECHO-GL customizes learnable stock stochastic processes based on the post-earnings announcement drift (PEAD) phenomenon to generate the temporal stock price trajectory, which can be easily plugged into any investment strategy with different time horizons to meet investment demands. Extensive experiments on two financial datasets demonstrate the effectiveness of ECHO-GL on stock price movement prediction tasks, with high prediction accuracy and trading profitability. \ No newline at end of file diff --git a/data/2024/aaai/EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction b/data/2024/aaai/EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction new file mode 100644 index 0000000000..0e85290ec1 --- /dev/null +++ b/data/2024/aaai/EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction @@ -0,0 +1,8 @@ +Motion prediction is a crucial task in autonomous driving, and one of its major challenges lies in the multimodality of future behaviors. +Many successful works have utilized mixture models, which require identification of positive mixture components and correspondingly fall into two main lines: prediction-based and anchor-based matching. +The prediction clustering phenomenon in prediction-based matching makes it difficult to pick representative trajectories for downstream tasks, while anchor-based matching suffers from limited regression capability.
+In this paper, we introduce a novel paradigm, named Evolving and Distinct Anchors (EDA), to define the positive and negative components for multimodal motion prediction based on mixture models. +We enable anchors to evolve and redistribute themselves under specific scenes for an enlarged regression capacity. +Furthermore, we select distinct anchors before matching them with the ground truth, which results in impressive scoring performance. +Our approach enhances all metrics compared to the baseline MTR, particularly with a notable relative reduction of 13.5% in Miss Rate, resulting in state-of-the-art performance on the Waymo Open Motion Dataset. +Appendix and code are available at https://github.com/Longzhong-Lin/EDA. \ No newline at end of file diff --git a/data/2024/aaai/EG-NAS: Neural Architecture Search with Fast Evolutionary Exploration b/data/2024/aaai/EG-NAS: Neural Architecture Search with Fast Evolutionary Exploration new file mode 100644 index 0000000000..e0a808dacc --- /dev/null +++ b/data/2024/aaai/EG-NAS: Neural Architecture Search with Fast Evolutionary Exploration @@ -0,0 +1 @@ +Differentiable Architecture Search (DARTS) has achieved a rapid search for excellent architectures by optimizing architecture parameters through gradient descent. However, this efficiency comes with a significant challenge: the risk of premature convergence to local optima, resulting in subpar performance. To address this issue, we propose a novel and effective method called Evolutionary Gradient-Based Neural Architecture Search (EG-NAS). Our approach combines the strengths of both gradient descent and evolutionary strategies, allowing for the exploration of various optimization directions during the architecture search process. To begin with, we continue to employ gradient descent for updating network parameters to ensure efficiency. Subsequently, to mitigate the risk of premature convergence, we introduce an evolutionary strategy with global search capabilities to optimize the architecture parameters. By leveraging the best of both worlds, our method strikes a balance between efficient exploration and exploitation of the search space. Moreover, we redefine the fitness function to not only consider accuracy but also account for individual similarity. This inclusion enhances the diversity and accuracy of the optimized directions identified by the evolutionary strategy. Extensive experiments on various datasets and search spaces demonstrate that EG-NAS achieves highly competitive performance at significantly lower search costs compared to state-of-the-art methods. The code is available at https://github.com/caicaicheng/EG-NAS. \ No newline at end of file diff --git a/data/2024/aaai/EMGAN: Early-Mix-GAN on Extracting Server-Side Model in Split Federated Learning b/data/2024/aaai/EMGAN: Early-Mix-GAN on Extracting Server-Side Model in Split Federated Learning new file mode 100644 index 0000000000..68c3d574dd --- /dev/null +++ b/data/2024/aaai/EMGAN: Early-Mix-GAN on Extracting Server-Side Model in Split Federated Learning @@ -0,0 +1 @@ +Split Federated Learning (SFL) is an emerging edge-friendly version of Federated Learning (FL), where clients process a small portion of the entire model. While SFL was considered resistant to Model Extraction Attacks (MEAs) by design, a recent work shows that this is not necessarily the case. In general, gradient-based MEAs are not effective on a target model that is changing, as is the case in training-from-scratch applications.
In this work, we propose a strong MEA during the SFL training phase. The proposed Early-Mix-GAN (EMGAN) attack effectively exploits gradient queries regardless of data assumptions. EMGAN adopts three key components to address the problem of inconsistent gradients. Specifically, it employs (i) an Early-learner approach for better adaptability, (ii) a Multi-GAN approach that introduces randomness into generator training to mitigate mode collapse, and (iii) ProperMix to effectively augment the limited amount of synthetic data for a better approximation of the target domain data distribution. EMGAN achieves excellent results in extracting server-side models. With only 50 training samples, EMGAN successfully extracts a 5-layer server-side model of VGG-11 on CIFAR-10, with only 7% lower accuracy than the target model. With zero training data, the extracted model achieves 81.3% accuracy, which is significantly better than the 45.5% accuracy of the model extracted by the SoTA method. The code is available at "https://github.com/zlijingtao/SFL-MEA". \ No newline at end of file diff --git a/data/2024/aaai/EPSD: Early Pruning with Self-Distillation for Efficient Model Compression b/data/2024/aaai/EPSD: Early Pruning with Self-Distillation for Efficient Model Compression new file mode 100644 index 0000000000..26d266096e --- /dev/null +++ b/data/2024/aaai/EPSD: Early Pruning with Self-Distillation for Efficient Model Compression @@ -0,0 +1 @@ +Neural network compression techniques, such as knowledge distillation (KD) and network pruning, have received increasing attention. Recent work `Prune, then Distill' reveals that a pruned student-friendly teacher network can benefit the performance of KD. However, the conventional teacher-student pipeline, which entails cumbersome pre-training of the teacher and complicated compression steps, makes pruning with KD less efficient. In addition to compressing models, recent compression techniques also emphasize the aspect of efficiency. Early pruning demands significantly less computational cost in comparison to conventional pruning methods, as it does not require a large pre-trained model. Likewise, a special case of KD, known as self-distillation (SD), is more efficient since it requires no pre-training or student-teacher pair selection. This inspires us to combine early pruning with SD for efficient model compression. In this work, we propose the framework named Early Pruning with Self-Distillation (EPSD), which identifies and preserves distillable weights in early pruning for a given SD task. EPSD efficiently combines early pruning and self-distillation in a two-step process, maintaining the pruned network's trainability for compression. Instead of a simple combination of pruning and SD, EPSD enables the pruned network to favor SD by keeping more distillable weights before training, ensuring better distillation of the pruned network. We demonstrate that EPSD improves the training of pruned networks, supported by visual and quantitative analyses. Our evaluation covers diverse benchmarks (CIFAR-10/100, Tiny-ImageNet, full ImageNet, CUB-200-2011, and Pascal VOC), with EPSD outperforming advanced pruning and SD techniques.
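To make the two ingredients in the EPSD abstract above concrete, the sketch below combines a generic early-pruning step (magnitude pruning at initialization, not EPSD's criterion for distillable weights) with a generic self-distillation loss (KL divergence against the network's own soft predictions from an earlier snapshot); all names and hyperparameters are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def magnitude_prune_at_init(model: nn.Module, sparsity: float = 0.8) -> None:
    """Generic early pruning: zero the smallest-magnitude weights before training.
    (A simple stand-in; EPSD instead selects 'distillable' weights.)"""
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            w = m.weight.data
            k = int(w.numel() * sparsity)
            if k < 1:
                continue
            thresh = w.abs().flatten().kthvalue(k).values
            mask = (w.abs() > thresh).float()
            m.register_buffer("prune_mask", mask)  # reapply after each optimizer step
            w.mul_(mask)

def self_distill_loss(logits, labels, own_soft_targets, alpha=0.5, temperature=4.0):
    """Cross-entropy plus KL against the model's own earlier soft predictions."""
    ce = F.cross_entropy(logits, labels)
    kd = F.kl_div(
        F.log_softmax(logits / temperature, dim=-1),
        F.softmax(own_soft_targets / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kd
```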
\ No newline at end of file diff --git a/data/2024/aaai/ERL-TD: Evolutionary Reinforcement Learning Enhanced with Truncated Variance and Distillation Mutation b/data/2024/aaai/ERL-TD: Evolutionary Reinforcement Learning Enhanced with Truncated Variance and Distillation Mutation new file mode 100644 index 0000000000..af4db615ad --- /dev/null +++ b/data/2024/aaai/ERL-TD: Evolutionary Reinforcement Learning Enhanced with Truncated Variance and Distillation Mutation @@ -0,0 +1 @@ +Recently, an emerging research direction called Evolutionary Reinforcement Learning (ERL) has been proposed, which combines evolutionary algorithms with reinforcement learning (RL) to tackle sequential decision-making tasks. However, recently proposed ERL algorithms often suffer from two challenges: the inaccuracy of policy estimation caused by the overestimation bias in RL and the insufficiency of exploration caused by inefficient mutations. To alleviate these problems, we propose an Evolutionary Reinforcement Learning algorithm enhanced with Truncated variance and Distillation mutation, called ERL-TD. We utilize multiple Q-networks to provide more accurate evaluations of state-action pairs, and the variance of these evaluations is used to control the overestimation bias in RL. Moreover, we propose a new distillation mutation to provide a promising mutation direction, unlike traditional mutations that generate a large number of random solutions. We evaluate ERL-TD on the continuous control benchmarks from the OpenAI Gym and DeepMind Control Suite. The experiments show that ERL-TD achieves excellent performance and outperforms all baseline RL algorithms on the test suites. \ No newline at end of file diff --git a/data/2024/aaai/ESG Accountability Made Easy: DocQA at Your Service b/data/2024/aaai/ESG Accountability Made Easy: DocQA at Your Service new file mode 100644 index 0000000000..b2c0566650 --- /dev/null +++ b/data/2024/aaai/ESG Accountability Made Easy: DocQA at Your Service @@ -0,0 +1 @@ +We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines, including document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io. \ No newline at end of file diff --git a/data/2024/aaai/ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation b/data/2024/aaai/ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation new file mode 100644 index 0000000000..a87631f16d --- /dev/null +++ b/data/2024/aaai/ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation @@ -0,0 +1 @@ +Applying Reinforcement Learning (RL) to sequence generation models enables the direct optimization of long-term rewards (e.g., BLEU and human feedback), but typically requires large-scale sampling over a space of action sequences.
This poses a computational challenge in practical sequence generation problems, such as machine translation, where we often deal with a large action space (e.g., a vocabulary) and long action sequences (e.g., translations). In this work, we introduce two-stage sampling and dynamic sampling approaches to improve sampling efficiency when training sequence generation models via RL. We experiment with our approaches on traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate our approaches in RL from human feedback (RLHF) by training a large language model with a reward model. Experimental results show that our efficient sampling-based RL, referred to as ESRL, outperforms all baselines in terms of both training efficiency and memory consumption. Notably, ESRL yields consistent performance gains over the strong REINFORCE, minimum risk training, and proximal policy optimization methods. The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/examples/esrl. \ No newline at end of file diff --git a/data/2024/aaai/ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations b/data/2024/aaai/ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations new file mode 100644 index 0000000000..b4eb63e688 --- /dev/null +++ b/data/2024/aaai/ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations @@ -0,0 +1 @@ +Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference and journal papers. Segmenting ETDs will allow researchers to study sectional content. Readers can navigate to particular pages of interest to discover and explore the content buried in these long documents. Most existing frameworks for document page classification are designed for general documents and perform poorly on ETDs. In this paper, we propose ETDPC. Its backbone is a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. To overcome the challenge of imbalanced labeled samples, we augmented data for minority categories and employed a hierarchical classifier. ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 of 0.84 -- 0.96 for 9 out of 13 categories. We also demonstrate its data efficiency. The code and data can be found on GitHub (https://github.com/lamps-lab/ETDMiner/tree/master/etd_segmentation). \ No newline at end of file diff --git a/data/2024/aaai/EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE b/data/2024/aaai/EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE new file mode 100644 index 0000000000..24f484f8bb --- /dev/null +++ b/data/2024/aaai/EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE @@ -0,0 +1 @@ +Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is a unified multimodal Transformer pre-trained solely with one unified pre-training task.
Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify the pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 4x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval. \ No newline at end of file diff --git a/data/2024/aaai/Early Detection of Extreme Storm Tide Events Using Multimodal Data Processing b/data/2024/aaai/Early Detection of Extreme Storm Tide Events Using Multimodal Data Processing new file mode 100644 index 0000000000..327808490c --- /dev/null +++ b/data/2024/aaai/Early Detection of Extreme Storm Tide Events Using Multimodal Data Processing @@ -0,0 +1 @@ +Sea-level rise is a well-known consequence of climate change. Several studies have estimated the social and economic impact of the increase in extreme flooding. An efficient way to mitigate its consequences is the development of a flood alert and prediction system based on high-resolution numerical models and robust sensing networks. However, current models use various simplifying assumptions that compromise accuracy to ensure solvability within a reasonable timeframe, hindering more regular and cost-effective forecasts for various locations along the shoreline. To address these issues, this work proposes a hybrid model for multimodal data processing that combines physics-based numerical simulations, data obtained from a network of sensors, and satellite images to provide refined wave and sea-surface height forecasts, with real results obtained in a critical location within the Port of Santos (the largest port in Latin America). Our approach exhibits faster convergence than data-driven models while achieving more accurate predictions. Moreover, the model handles irregularly sampled time series and missing data without the need for complex preprocessing mechanisms or data imputation while keeping low computational costs through a combination of time encoding, recurrent neural networks, and graph neural networks. Enabling raw sensor data to be easily combined with existing physics-based models opens up new possibilities for accurate extreme storm tide event forecasting systems that enhance community safety and aid policymakers in their decision-making processes.
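The storm-tide abstract above mentions handling irregularly sampled sensor series through time encoding combined with recurrent (and graph) networks. One common way to realize the time-encoding part, shown only as an assumed illustrative sketch rather than the paper's architecture (module names, sizes, and the single-target head are hypothetical), is to encode the gap between consecutive observations and feed it to a GRU alongside the sensor readings:

```python
import torch
import torch.nn as nn

class TimeGapEncoder(nn.Module):
    """Sinusoidal encoding of the (possibly irregular) gap between observations."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.register_buffer("freqs", torch.logspace(0, -3, dim // 2))

    def forward(self, delta_t: torch.Tensor) -> torch.Tensor:
        # delta_t: (batch, seq_len) time since the previous observation, e.g. in hours.
        angles = delta_t.unsqueeze(-1) * self.freqs
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class IrregularGRU(nn.Module):
    """GRU over sensor readings concatenated with their time-gap encodings."""
    def __init__(self, n_features: int, time_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.time_enc = TimeGapEncoder(time_dim)
        self.rnn = nn.GRU(n_features + time_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # e.g., a predicted sea-surface height

    def forward(self, x: torch.Tensor, delta_t: torch.Tensor) -> torch.Tensor:
        z = torch.cat([x, self.time_enc(delta_t)], dim=-1)
        _, h = self.rnn(z)
        return self.head(h[-1])

model = IrregularGRU(n_features=3)
x = torch.randn(2, 10, 3)   # 2 stations, 10 irregular readings, 3 sensor channels
dt = torch.rand(2, 10)      # hours since the previous reading
print(model(x, dt).shape)   # torch.Size([2, 1])
```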
\ No newline at end of file diff --git a/data/2024/aaai/EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading b/data/2024/aaai/EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading new file mode 100644 index 0000000000..157ccdd66e --- /dev/null +++ b/data/2024/aaai/EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading @@ -0,0 +1 @@ +High-frequency trading (HFT) uses computer algorithms to make trading decisions on short time scales (e.g., second-level) and is widely used in the Cryptocurrency (Crypto) market (e.g., Bitcoin). Reinforcement learning (RL) in financial research has shown stellar performance on many quantitative trading tasks. However, most methods focus on low-frequency trading, e.g., day-level, and cannot be directly applied to HFT because of two challenges. First, RL for HFT involves dealing with extremely long trajectories (e.g., 2.4 million steps per month), which are hard to optimize and evaluate. Second, the dramatic price fluctuations and market trend changes of Crypto make existing algorithms fail to maintain satisfactory performance. To tackle these challenges, we propose an Efficient hieArchical Reinforcement learNing method for High Frequency Trading (EarnHFT), a novel three-stage hierarchical RL framework for HFT. In stage I, we compute a Q-teacher, i.e., the optimal action value based on dynamic programming, to enhance the performance and training efficiency of second-level RL agents. In stage II, we construct a pool of diverse RL agents for different market trends, distinguished by return rates, where hundreds of RL agents are trained with different preferences of return rates and only a tiny fraction of them are selected into the pool based on their profitability. In stage III, we train a minute-level router that dynamically picks a second-level agent from the pool to achieve stable performance across different markets. Through extensive experiments in various market trends on Crypto markets in a high-fidelity simulation trading environment, we demonstrate that EarnHFT significantly outperforms 6 state-of-the-art baselines on 6 popular financial criteria, exceeding the runner-up by 30% in profitability. \ No newline at end of file diff --git a/data/2024/aaai/EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering b/data/2024/aaai/EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering new file mode 100644 index 0000000000..d5fdf6cafa --- /dev/null +++ b/data/2024/aaai/EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering @@ -0,0 +1 @@ +Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects and comprehensive reasoning. Based on city planning needs, we develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded. As objects are the basis for complex relational reasoning, we propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way. To preserve refined spatial locations and semantics, SOBA leverages a segmentation network for object semantics generation.
The object-guided attention aggregates object interior features via pseudo masks, and bidirectional cross-attention further models object external relations hierarchically. To optimize object counting, we propose a numerical difference loss that dynamically adds difference penalties, unifying the classification and regression tasks. Experimental results show that SOBA outperforms both advanced general and remote sensing methods. We believe this dataset and framework provide a strong benchmark for complex analysis in Earth vision. The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA. \ No newline at end of file diff --git a/data/2024/aaai/Earthfarsser: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model b/data/2024/aaai/Earthfarsser: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model new file mode 100644 index 0000000000..98bd966d0a --- /dev/null +++ b/data/2024/aaai/Earthfarsser: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model @@ -0,0 +1 @@ +Efficiently modeling spatio-temporal (ST) physical processes and observations presents a challenging problem for the deep learning community. Many recent studies have concentrated on meticulously reconciling various advantages, leading to models that are neither simple nor practical. To address this issue, this paper presents a systematic study of the existing shortcomings faced by off-the-shelf models, including lack of local fidelity, poor prediction performance over long time-steps, low scalability, and inefficiency. To systematically address the aforementioned problems, we propose EarthFarseer, a concise framework that combines parallel local convolutions and global Fourier-based transformer architectures, enabling it to dynamically capture local-global spatial interactions and dependencies. EarthFarseer also incorporates multi-scale fully convolutional and Fourier architectures to efficiently and effectively capture the temporal evolution. Our proposal demonstrates strong adaptability across various tasks and datasets, with fast convergence and better local fidelity in long time-step predictions. Extensive experiments and visualizations on eight human-society and natural physical datasets demonstrate the state-of-the-art performance of EarthFarseer. We release our code at https://github.com/easylearningscores/EarthFarseer. \ No newline at end of file diff --git a/data/2024/aaai/EasyTS: The Express Lane to Long Time Series Forecasting b/data/2024/aaai/EasyTS: The Express Lane to Long Time Series Forecasting new file mode 100644 index 0000000000..73d2441ff0 --- /dev/null +++ b/data/2024/aaai/EasyTS: The Express Lane to Long Time Series Forecasting @@ -0,0 +1 @@ +Responding to the escalating interest in long-term forecasting within the industry, we introduce EasyTS, a comprehensive toolkit engineered to streamline data collection, analysis, and model creation procedures. EasyTS acts as a unified solution, driving progress in long-term time series forecasting. The platform provides effortless access to various time series datasets, including a newly open-sourced multi-scenario dataset in the electricity domain. Integrated visualization and analysis tools help unveil inherent data features and relationships. EasyTS facilitates a user-friendly model validation approach with versatile evaluation criteria. This toolkit allows researchers to compare their models proficiently against renowned benchmarks.
With our ongoing commitment to expanding our dataset collection and enhancing toolkit functionalities, we aspire to contribute significantly to the time series forecasting domain. Code is available at this repository: https://github.com/EdgeBigBang/EasyTS.git. \ No newline at end of file diff --git a/data/2024/aaai/EcomGPT: Instruction-Tuning Large Language Models with Chain-of-Task Tasks for E-commerce b/data/2024/aaai/EcomGPT: Instruction-Tuning Large Language Models with Chain-of-Task Tasks for E-commerce new file mode 100644 index 0000000000..d78385219d --- /dev/null +++ b/data/2024/aaai/EcomGPT: Instruction-Tuning Large Language Models with Chain-of-Task Tasks for E-commerce @@ -0,0 +1,2 @@ +Recently, instruction-following Large Language Models (LLMs), represented by ChatGPT, have exhibited exceptional performance in general Natural Language Processing (NLP) tasks. However, the unique characteristics of E-commerce data pose significant challenges to general LLMs. An LLM tailored specifically for E-commerce scenarios, possessing robust cross-dataset/task generalization capabilities, is a pressing necessity. To solve this issue, in this work we propose the first E-commerce instruction dataset, EcomInstruct, with a total of 2.5 million instruction examples. EcomInstruct scales up the data size and task diversity by constructing atomic tasks from basic E-commerce data types, such as product information and user reviews. Atomic tasks are defined as intermediate tasks implicitly involved in solving a final task, which we also call Chain-of-Task tasks. We developed EcomGPT +with different parameter scales by training the backbone model BLOOMZ with EcomInstruct. Benefiting from the fundamental semantic understanding capabilities acquired from the Chain-of-Task tasks, EcomGPT exhibits excellent zero-shot generalization capabilities. Extensive experiments and human evaluations demonstrate that EcomGPT outperforms ChatGPT in terms of cross-dataset/task generalization on E-commerce tasks. EcomGPT will be publicly available at https://github.com/Alibaba-NLP/EcomGPT. \ No newline at end of file diff --git a/data/2024/aaai/Editing Language Model-Based Knowledge Graph Embeddings b/data/2024/aaai/Editing Language Model-Based Knowledge Graph Embeddings new file mode 100644 index 0000000000..9fa9fc0b1c --- /dev/null +++ b/data/2024/aaai/Editing Language Model-Based Knowledge Graph Embeddings @@ -0,0 +1 @@ +Recent decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, making them difficult to modify after deployment without re-training. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. This task is designed to facilitate rapid, data-efficient updates to KG embeddings without compromising the performance of other aspects. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines, demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hypernetwork to edit/add facts. Our comprehensive experimental results reveal that KGEditor excels in updating specific facts without impacting the overall performance, even when faced with limited training resources.
Code and datasets will be available at https://github.com/AnonymousForPapers/DeltaKG. \ No newline at end of file diff --git a/data/2024/aaai/Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches b/data/2024/aaai/Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches new file mode 100644 index 0000000000..4fe3ca5a06 --- /dev/null +++ b/data/2024/aaai/Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches @@ -0,0 +1 @@ +The selection of the assumed effect size (AES) critically determines the duration of an experiment, and hence its accuracy and efficiency. Traditionally, experimenters determine the AES based on domain knowledge. However, this method becomes impractical for online experimentation services managing numerous experiments, and a more automated approach is hence in great demand. We initiate the study of data-driven AES selection for online experimentation services by introducing two solutions. The first employs a three-layer Gaussian Mixture Model that accounts for the heteroskedasticity across experiments, and it seeks to estimate the true expected effect size among positive experiments. The second method, grounded in utility theory, aims to determine the optimal effect size by striking a balance between the experiment's cost and the precision of decision-making. Through comparisons with baseline methods using both simulated and real data, we showcase the superior performance of the proposed approaches. \ No newline at end of file diff --git a/data/2024/aaai/Effective Causal Discovery under Identifiable Heteroscedastic Noise Model b/data/2024/aaai/Effective Causal Discovery under Identifiable Heteroscedastic Noise Model new file mode 100644 index 0000000000..fbbb5987b1 --- /dev/null +++ b/data/2024/aaai/Effective Causal Discovery under Identifiable Heteroscedastic Noise Model @@ -0,0 +1 @@ +Capturing the underlying structural causal relations represented by Directed Acyclic Graphs (DAGs) has been a fundamental task in various AI disciplines. Causal DAG learning via the continuous optimization framework has recently achieved promising performance in terms of accuracy and efficiency. However, most methods make strong assumptions of homoscedastic noise, i.e., that exogenous noises have equal variances across variables, observations, or even both. The noise in real data usually violates both assumptions due to the biases introduced by different data collection processes. To address the heteroscedastic noise issue, we introduce relaxed, implementable sufficient conditions and prove the identifiability of a general class of SEMs subject to those conditions. Based on the identifiable general SEM, we propose a novel formulation for DAG learning that accounts for the variation in noise variance across variables and observations. We then propose an effective two-phase iterative DAG learning algorithm to address the increased optimization difficulty and to learn a causal DAG from data with heteroscedastic variable noise under varying variance. We show significant empirical gains of the proposed approaches over state-of-the-art methods on both synthetic data and real data.
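For background on the continuous-optimization DAG-learning framework that the causal-discovery abstract above builds on (this is standard machinery, not the paper's heteroscedastic formulation), the usual smooth acyclicity penalty h(W) = tr(exp(W ∘ W)) − d from NOTEARS (Zheng et al., 2018) is zero exactly when the weighted adjacency matrix W is acyclic, so it can serve as the constraint in continuous DAG learning:

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W: np.ndarray) -> float:
    """Smooth acyclicity measure h(W) = tr(exp(W * W)) - d.

    h(W) == 0 iff the weighted adjacency matrix W has no directed cycles, so it
    can be used as an equality constraint (or penalty) in continuous DAG learning.
    """
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)  # W * W is the Hadamard square

# A 3-node chain 0 -> 1 -> 2 is acyclic; adding the edge 2 -> 0 creates a cycle.
W_dag = np.array([[0.0, 1.5, 0.0],
                  [0.0, 0.0, 0.8],
                  [0.0, 0.0, 0.0]])
W_cyc = W_dag.copy()
W_cyc[2, 0] = 0.5
print(notears_acyclicity(W_dag))   # ~0.0
print(notears_acyclicity(W_cyc))   # > 0
```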
\ No newline at end of file diff --git a/data/2024/aaai/Effective Comparative Prototype Hashing for Unsupervised Domain Adaptation b/data/2024/aaai/Effective Comparative Prototype Hashing for Unsupervised Domain Adaptation new file mode 100644 index 0000000000..ace2019365 --- /dev/null +++ b/data/2024/aaai/Effective Comparative Prototype Hashing for Unsupervised Domain Adaptation @@ -0,0 +1 @@ +Unsupervised domain adaptive hashing is a highly promising research direction within the field of retrieval. It aims to transfer valuable insights from the source domain to the target domain while maintaining high storage and retrieval efficiency. Despite its potential, this field remains relatively unexplored. Previous methods usually lead to unsatisfactory retrieval performance, as they often directly apply slightly modified domain adaptation algorithms to the hash learning framework, or pursue domain alignment within the Hamming space, which is characterized by limited semantic information. In this paper, we propose a simple yet effective approach named Comparative Prototype Hashing (CPH) for unsupervised domain adaptive image retrieval. We establish a domain-shared unit hypersphere space through prototype contrastive learning and then obtain the Hamming hypersphere space via mapping from the shared hypersphere. This strategy achieves a cohesive synergy between learning uniformly distributed and category conflict-averse feature representations, eliminating domain discrepancies, and facilitating hash code learning. Moreover, by leveraging dual-domain information to supervise the entire hashing model training process, we can generate hash codes that retain inter-sample similarity relationships within both domains. Experimental results validate that our CPH significantly outperforms state-of-the-art counterparts across multiple cross-domain and single-domain retrieval tasks. Notably, on the Office-Home and Office-31 datasets, CPH achieves average performance improvements of 19.29% and 13.85% on cross-domain retrieval tasks compared to the second-best results, respectively. The source code of our method is available at: https://github.com/christinecui/CPH. \ No newline at end of file diff --git a/data/2024/aaai/Effective Data Distillation for Tabular Datasets (Student Abstract) b/data/2024/aaai/Effective Data Distillation for Tabular Datasets (Student Abstract) new file mode 100644 index 0000000000..da74b1ec2a --- /dev/null +++ b/data/2024/aaai/Effective Data Distillation for Tabular Datasets (Student Abstract) @@ -0,0 +1 @@ +Data distillation is a technique for reducing a large dataset into a smaller dataset. The smaller dataset can then be used to train a model that performs comparably to a model trained on the full dataset. Past works have examined this approach for image datasets, focusing on neural networks as target models. However, tabular datasets pose new challenges not seen in images. A sample in a tabular dataset is a one-dimensional vector, unlike the two- (or three-) dimensional pixel grid of images, and non-NN models such as XGBoost can often outperform neural network (NN) based models. Our contribution in this work is two-fold: 1) we show that data distillation methods from images do not translate directly to tabular data; 2) we propose a new distillation method that consistently outperforms the baseline for multiple different models, including non-NN models such as XGBoost.
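As a generic illustration of the tabular data distillation workflow discussed above, the sketch below condenses a dataset into per-class k-means centroids and trains an XGBoost model on the small set. This is a simple baseline under assumed synthetic data and hyperparameters, not the method proposed in the student abstract (it requires scikit-learn and xgboost).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def distill_by_centroids(X, y, per_class=50, seed=0):
    """Condense a tabular dataset into per-class k-means centroids."""
    Xs, ys = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        Xs.append(km.cluster_centers_)
        ys.append(np.full(k, c))
    return np.vstack(Xs), np.concatenate(ys)

# Synthetic stand-in for a large tabular dataset.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

X_small, y_small = distill_by_centroids(X_tr, y_tr, per_class=50)  # 100 rows total
model = XGBClassifier(n_estimators=200, max_depth=4).fit(X_small, y_small)
print("accuracy on held-out data:", model.score(X_te, y_te))
```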
\ No newline at end of file diff --git a/data/2024/aaai/Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference b/data/2024/aaai/Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference new file mode 100644 index 0000000000..64e23c9ec6 --- /dev/null +++ b/data/2024/aaai/Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference @@ -0,0 +1 @@ +In this paper, we study the effectiveness of using a constant stepsize in statistical inference via linear stochastic approximation (LSA) algorithms with Markovian data. After establishing a Central Limit Theorem (CLT), we outline an inference procedure that uses averaged LSA iterates to construct confidence intervals (CIs). Our procedure leverages the fast mixing property of constant-stepsize LSA for better covariance estimation and employs Richardson-Romberg (RR) extrapolation to reduce the bias induced by constant stepsize and Markovian data. We develop theoretical results for guiding stepsize selection in RR extrapolation, and identify several important settings where the bias provably vanishes even without extrapolation. We conduct extensive numerical experiments and compare against classical inference approaches. Our results show that using a constant stepsize enjoys easy hyperparameter tuning, fast convergence, and consistently better CI coverage, especially when data is limited. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Algorithms for Non-gaussian Single Index Models with Generative Priors b/data/2024/aaai/Efficient Algorithms for Non-gaussian Single Index Models with Generative Priors new file mode 100644 index 0000000000..086c724956 --- /dev/null +++ b/data/2024/aaai/Efficient Algorithms for Non-gaussian Single Index Models with Generative Priors @@ -0,0 +1 @@ +In this work, we focus on high-dimensional single index models with non-Gaussian sensing vectors and generative priors. More specifically, our goal is to estimate the underlying signal from i.i.d. realizations of the semi-parameterized single index model, where the underlying signal is contained in (up to a constant scaling) the range of a Lipschitz continuous generative model with bounded low-dimensional inputs, the sensing vector follows a non-Gaussian distribution, the noise is a random variable that is independent of the sensing vector, and the unknown non-linear link function is differentiable. Using the first- and second-order Stein's identity, we introduce efficient algorithms to obtain estimated vectors that achieve the near-optimal statistical rate. Experimental results on image datasets are provided to support our theory. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Asynchronous Federated Learning with Prospective Momentum Aggregation and Fine-Grained Correction b/data/2024/aaai/Efficient Asynchronous Federated Learning with Prospective Momentum Aggregation and Fine-Grained Correction new file mode 100644 index 0000000000..9ef9620fea --- /dev/null +++ b/data/2024/aaai/Efficient Asynchronous Federated Learning with Prospective Momentum Aggregation and Fine-Grained Correction @@ -0,0 +1 @@ +Asynchronous federated learning (AFL) is a distributed machine learning technique that allows multiple devices to collaboratively train deep learning models without sharing local data. However, AFL suffers from low efficiency due to poor client model training quality and slow server model convergence speed, which are a result of the heterogeneous nature of both data and devices. 
To address these issues, we propose Efficient Asynchronous Federated Learning with Prospective Momentum Aggregation and Fine-Grained Correction (FedAC). Our framework consists of three key components. The first component is client weight evaluation based on the temporal gradient, which scores each client according to the similarity between the client and server update directions. The second component is adaptive server update with prospective weighted momentum, which uses an asynchronous buffered update strategy and a prospective weighted momentum with an adaptive learning rate to update the global model on the server. The last component is client update with fine-grained gradient correction, which introduces a fine-grained gradient correction term to mitigate client drift and correct the client stochastic gradient. We conduct experiments on real and synthetic datasets and compare with existing federated learning methods. Experimental results demonstrate that our framework effectively improves model training efficiency and AFL performance. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Axiomatization of OWL 2 EL Ontologies from Data by Means of Formal Concept Analysis b/data/2024/aaai/Efficient Axiomatization of OWL 2 EL Ontologies from Data by Means of Formal Concept Analysis new file mode 100644 index 0000000000..eb6373e434 --- /dev/null +++ b/data/2024/aaai/Efficient Axiomatization of OWL 2 EL Ontologies from Data by Means of Formal Concept Analysis @@ -0,0 +1 @@ +We present an FCA-based axiomatization method that produces a complete EL TBox (the terminological part of an OWL 2 EL ontology) from a graph dataset in at most exponential time. We describe technical details that allow for efficient implementation, as well as variations that dispense with the computation of extremely large axioms, thereby rendering the approach applicable, albeit with some loss of completeness. Moreover, we evaluate the prototype on real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution b/data/2024/aaai/Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution new file mode 100644 index 0000000000..e99826723a --- /dev/null +++ b/data/2024/aaai/Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution @@ -0,0 +1 @@ +Image super-resolution is a fundamentally ill-posed problem because multiple valid high-resolution images exist for one low-resolution image. Super-resolution methods based on diffusion probabilistic models can deal with this ill-posed nature by learning the distribution of high-resolution images conditioned on low-resolution images, avoiding the problem of blurry images in PSNR-oriented methods. However, existing diffusion-based super-resolution methods suffer from high time consumption due to iterative sampling, while the quality and consistency of generated images are less than ideal due to problems like color shifting. In this paper, we propose Efficient Conditional Diffusion Model with Probability Flow Sampling (ECDP) for image super-resolution. To reduce the time consumption, we design a continuous-time conditional diffusion model for image super-resolution, which enables the use of probability flow sampling for efficient generation.
Additionally, to improve the consistency of generated images, we propose a hybrid parametrization for the denoiser network, which interpolates between the data-predicting parametrization and the noise-predicting parametrization for different noise scales. Moreover, we design an image quality loss as a complement to the score matching loss of diffusion models, further improving the consistency and quality of super-resolution. Extensive experiments on DIV2K, ImageNet, and CelebA demonstrate that our method achieves higher super-resolution quality than existing diffusion-based image super-resolution methods while having lower time consumption. Our code is available at https://github.com/Yuan-Yutao/ECDP. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Constraint Generation for Stochastic Shortest Path Problems b/data/2024/aaai/Efficient Constraint Generation for Stochastic Shortest Path Problems new file mode 100644 index 0000000000..84a4636f8e --- /dev/null +++ b/data/2024/aaai/Efficient Constraint Generation for Stochastic Shortest Path Problems @@ -0,0 +1 @@ +Current methods for solving Stochastic Shortest Path Problems (SSPs) find states’ costs-to-go by applying Bellman backups, where state-of-the-art methods employ heuristics to select states to back up and prune. A fundamental limitation of these algorithms is their need to compute the cost-to-go for every applicable action during each state backup, leading to unnecessary computation for actions identified as sub-optimal. We present new connections between planning and operations research and, using this framework, we address this issue of unnecessary computation by introducing an efficient version of constraint generation for SSPs. This technique allows algorithms to ignore sub-optimal actions and avoid computing their costs-to-go. We also apply our novel technique to iLAO*, resulting in a new algorithm, CG-iLAO*. Our experiments show that CG-iLAO* ignores up to 57% of iLAO*’s actions and solves problems up to 8x and 3x faster than LRTDP and iLAO*, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Deweahter Mixture-of-Experts with Uncertainty-Aware Feature-Wise Linear Modulation b/data/2024/aaai/Efficient Deweahter Mixture-of-Experts with Uncertainty-Aware Feature-Wise Linear Modulation new file mode 100644 index 0000000000..d3a171d556 --- /dev/null +++ b/data/2024/aaai/Efficient Deweahter Mixture-of-Experts with Uncertainty-Aware Feature-Wise Linear Modulation @@ -0,0 +1,6 @@ +The Mixture-of-Experts (MoE) approach has demonstrated outstanding scalability in multi-task learning, including low-level upstream tasks such as concurrent removal of multiple adverse weather effects. +However, the conventional MoE architecture with parallel Feed Forward Network (FFN) experts leads to significant parameter and computational overheads that hinder its efficient deployment. In addition, the naive MoE linear router is suboptimal in assigning task-specific features to multiple experts, which limits its further scalability. +In this work, we propose an efficient MoE architecture with weight sharing across the experts. Inspired by the idea of linear feature modulation (FM), our architecture implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block. +The proposed Feature Modulated Expert (FME) serves as a building block for the novel Mixture-of-Feature-Modulation-Experts (MoFME) architecture, which can scale up the number of experts with low overhead.
+We further propose an Uncertainty-aware Router (UaR) to assign task-specific features to different FM modules with well-calibrated weights. This enables MoFME to effectively learn diverse expert functions for multiple tasks. +Experiments on the multi-deweather task show that our MoFME outperforms the state-of-the-art in image restoration quality by 0.1-0.2 dB while saving more than 74% of parameters and 20% inference time over the conventional MoE counterpart. Experiments on the downstream segmentation and classification tasks further demonstrate the generalizability of MoFME to real open-world applications. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Learning in Polyhedral Games via Best-Response Oracles b/data/2024/aaai/Efficient Learning in Polyhedral Games via Best-Response Oracles new file mode 100644 index 0000000000..1f8d887231 --- /dev/null +++ b/data/2024/aaai/Efficient Learning in Polyhedral Games via Best-Response Oracles @@ -0,0 +1 @@ +We study online learning and equilibrium computation in games with polyhedral decision sets, a property shared by normal-form games (NFGs) and extensive-form games (EFGs), when the learning agent is restricted to utilizing a best-response oracle. We show how to achieve constant regret in zero-sum games and O(T^0.25) regret in general-sum games while using only O(log t) best-response queries at a given iteration t, thus improving over the best prior result, which required O(T) queries per iteration. Moreover, our framework yields the first last-iterate convergence guarantees for self-play with best-response oracles in zero-sum games. This convergence occurs at a linear rate, though with a condition-number dependence. We go on to show an O(T^(-0.5)) best-iterate convergence rate without such a dependence. Our results build on linear-rate convergence results for variants of the Frank-Wolfe (FW) algorithm for strongly convex and smooth minimization problems over polyhedral domains. These FW results depend on a condition number of the polytope, known as the facial distance. In order to enable application to settings such as EFGs, we show two broad new results: 1) the facial distance for polytopes in standard form is at least γ/k, where γ is the minimum value of a nonzero coordinate of a vertex of the polytope and k≤n is the number of tight inequality constraints in the optimal face, and 2) the facial distance for polytopes of the form Ax=b, Cx≤d, x≥0, where x∈R^n, C≥0 is a nonzero integral matrix, and d≥0, is at least 1/(c√n), where c is the infinity norm of C. This yields the first such results for several problems such as sequence-form polytopes, flow polytopes, and matching polytopes.
We propose Reel, which accelerates the learning of PDEs via random projection and has much broader applicability. Reel exploits sparsity by decomposing dense updates into sparse ones in both the value and frequency domains. This decomposition enables efficient learning when the source of the updates consists of gradually changing terms across large areas (sparse in the frequency domain) in addition to a few rapid updates concentrated in a small set of “interfacial” regions (sparse in the value domain). Random projection is then applied to compress the sparse signals for learning. To expand the model applicability, Taylor series expansion is used in Reel to approximate the nonlinear PDE updates with polynomials in the decomposable form. Theoretically, we derive a constant factor approximation between the projected loss function and the original one with a poly-logarithmic number of projected dimensions. Experimentally, we provide empirical evidence that our proposed Reel can lead to faster learning of PDE models (70-98% reduction in training time when the data is compressed to 1% of its original size) with quality comparable to the non-compressed models. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Lightweight Image Denoising with Triple Attention Transformer b/data/2024/aaai/Efficient Lightweight Image Denoising with Triple Attention Transformer new file mode 100644 index 0000000000..cf1acc050d --- /dev/null +++ b/data/2024/aaai/Efficient Lightweight Image Denoising with Triple Attention Transformer @@ -0,0 +1 @@ +Transformers have shown outstanding performance on image denoising, but existing Transformer methods for image denoising have large model sizes and high computational complexity, which is unfriendly to resource-constrained devices. In this paper, we propose a Lightweight Image Denoising Transformer method (LIDFormer) based on Triple Multi-Dconv Head Transposed Attention (TMDTA) to boost computational efficiency. LIDFormer first implements the Discrete Wavelet Transform (DWT), which transforms the input image into a low-frequency space, greatly reducing the computational complexity of image denoising. However, the low-frequency image lacks fine-feature information, which degrades the denoising performance. To handle this problem, we introduce the Complementary Periodic Feature Reusing (CPFR) scheme for aggregating the shallow-layer features and the deep-layer features. Furthermore, TMDTA is proposed to integrate global context along three dimensions, thereby enhancing the ability of global feature representation. Note that our method can be applied as a pipeline for both convolutional neural networks and Transformers. Extensive experiments on several benchmarks demonstrate that the proposed LIDFormer achieves a better trade-off between high performance and low computational complexity on real-world image denoising tasks. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Look-Up Table from Expanded Convolutional Network for Accelerating Image Super-resolution b/data/2024/aaai/Efficient Look-Up Table from Expanded Convolutional Network for Accelerating Image Super-resolution new file mode 100644 index 0000000000..1aa7bcbc74 --- /dev/null +++ b/data/2024/aaai/Efficient Look-Up Table from Expanded Convolutional Network for Accelerating Image Super-resolution @@ -0,0 +1 @@ +The look-up table (LUT) has recently shown its practicability and effectiveness in super-resolution (SR) tasks due to its low computational cost and hardware independence.
However, most existing methods focus on improving the performance of SR, neglecting the demand for high-speed SR on low-computational edge devices. In this paper, we propose an efficient expanded convolution (EC) layer, which expands the output size of regular convolution to enlarge the receptive field (RF) indirectly. It can increase the size of the LUT corresponding to the network linearly with the increase of RF. Additionally, after introducing the EC, multiple LUTs are merged into one LUT, achieving faster running speed while maintaining SR performance. More specifically, we expand the coverage of the convolutional output so that the output at the current position covers the target position and its surroundings, forming an overlapping sliding window at the output end. We sum up the overlapping parts of the sliding window as the output, thereby achieving the effect of enlarging the RF size. Moreover, by expanding the numerical range of the accumulated results and rescaling them to [0,255], the method can mitigate the error caused by quantization output. Experiments indicate that the proposed method performs better than the baseline method and is faster than other LUT-based SR methods. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Nonparametric Tensor Decomposition for Binary and Count Data b/data/2024/aaai/Efficient Nonparametric Tensor Decomposition for Binary and Count Data new file mode 100644 index 0000000000..617fb140e6 --- /dev/null +++ b/data/2024/aaai/Efficient Nonparametric Tensor Decomposition for Binary and Count Data @@ -0,0 +1 @@ +In numerous applications, binary reactions or event counts are observed and stored within high-order tensors. Tensor decompositions (TDs) serve as a powerful tool to handle such high-dimensional and sparse data. However, many traditional TDs are explicitly or implicitly designed based on the Gaussian distribution, which is unsuitable for discrete data. Moreover, most TDs rely on predefined multi-linear structures, such as CP and Tucker formats. Therefore, they may not be effective enough to handle complex real-world datasets. To address these issues, we propose ENTED, an Efficient Nonparametric TEnsor Decomposition for binary and count tensors. Specifically, we first employ a nonparametric Gaussian process (GP) to replace traditional multi-linear structures. Next, we utilize the Pólya-Gamma augmentation which provides a unified framework to establish conjugate models for binary and count distributions. Finally, to address the computational issue of GPs, we enhance the model by incorporating sparse orthogonal variational inference of inducing points, which offers a more effective covariance approximation within GPs and stochastic natural gradient updates for nonparametric models. We evaluate our model on several real-world tensor completion tasks, considering binary and count datasets. The results manifest both better performance and computational advantages of the proposed model. 
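For readers unfamiliar with the Pólya-Gamma augmentation that the ENTED abstract above relies on, the standard identity of Polson, Scott, and Windle (2013) rewrites a logistic-type likelihood as a Gaussian scale mixture, which is what makes conjugate updates possible for binary and count observations. The block below states only the general identity; how ENTED combines it with the GP prior and inducing points is not reproduced here.

```latex
% Polya-Gamma identity (Polson, Scott & Windle, 2013): for b > 0 and kappa = a - b/2,
\frac{(e^{\psi})^{a}}{(1+e^{\psi})^{b}}
  = 2^{-b}\, e^{\kappa \psi}
    \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega \mid b, 0)\, d\omega,
\qquad \omega \sim \mathrm{PG}(b, 0).
% Conditioned on omega, the likelihood is Gaussian in psi, so a Gaussian latent
% model (e.g., a GP) stays conjugate for Bernoulli (b = 1) and count observations.
```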
\ No newline at end of file diff --git a/data/2024/aaai/Efficient Representation Learning of Satellite Image Time Series and Their Fusion for Spatiotemporal Applications b/data/2024/aaai/Efficient Representation Learning of Satellite Image Time Series and Their Fusion for Spatiotemporal Applications new file mode 100644 index 0000000000..44634a5235 --- /dev/null +++ b/data/2024/aaai/Efficient Representation Learning of Satellite Image Time Series and Their Fusion for Spatiotemporal Applications @@ -0,0 +1 @@ +Satellite data bolstered by their increasing accessibility is leading to many endeavors of automated monitoring of the earth's surface for various applications. Such applications demand high spatial resolution images at a temporal resolution of a few days which entails the challenge of processing a huge volume of image time series data. To overcome this computing bottleneck, we present PatchNet, a bespoke adaptation of beam search and attention mechanism. PatchNet is an automated patch selection neural network that requires only a partial spatial traversal of an image time series and yet achieves impressive results. Satellite systems face a trade-off between spatial and temporal resolutions due to budget/technical constraints e.g., Landsat-8/9 or Sentinel-2 have high spatial resolution whereas, MODIS has high temporal resolution. To deal with the limitation of coarse temporal resolution, we propose FuSITSNet, a twofold feature-based generic fusion model with multimodal learning in a contrastive setting. It produces a learned representation after fusion of two satellite image time series leveraging finer spatial resolution of Landsat and finer temporal resolution of MODIS. The patch alignment module of FuSITSNet aligns the PatchNet processed patches of Landsat-8 with the corresponding MODIS regions to incorporate its finer resolution temporal features. The untraversed patches are handled by the cross-modality attention which highlights additional hot spot features from the two modalities. We conduct extensive experiments on more than 2000 counties of US for crop yield, snow cover, and solar energy prediction and show that even one-fourth spatial processing of image time series produces state-of-the-art results. FuSITSNet outperforms the predictions of single modality and data obtained using existing generative fusion models and allows for monitoring of dynamic phenomena using freely accessible images, thereby unlocking new opportunities. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Spiking Neural Networks with Sparse Selective Activation for Continual Learning b/data/2024/aaai/Efficient Spiking Neural Networks with Sparse Selective Activation for Continual Learning new file mode 100644 index 0000000000..ea40de081c --- /dev/null +++ b/data/2024/aaai/Efficient Spiking Neural Networks with Sparse Selective Activation for Continual Learning @@ -0,0 +1,2 @@ +The next generation of machine intelligence requires the capability of continual learning to acquire new knowledge without forgetting the old one while conserving limited computing resources. +Spiking neural networks (SNNs), compared to artificial neural networks (ANNs), have more characteristics that align with biological neurons, which may be helpful as a potential gating function for knowledge maintenance in neural networks. Inspired by the selective sparse activation principle of context gating in biological systems, we present a novel SNN model with selective activation to achieve continual learning. 
The trace-based K-Winner-Take-All (K-WTA) and variable threshold components are designed to induce sparse selective activation in the spatial and temporal dimensions of spiking neurons, which encourages subpopulations of neurons to specialize in specific tasks. As a result, continual learning can be maintained by routing different tasks via different populations of neurons in the network. The experiments are conducted on the MNIST and CIFAR10 datasets under the class-incremental setting. The results show that the proposed SNN model achieves competitive performance comparable to, and even surpassing, other regularization-based methods deployed under traditional ANNs. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Target Propagation by Deriving Analytical Solution b/data/2024/aaai/Efficient Target Propagation by Deriving Analytical Solution new file mode 100644 index 0000000000..87a4b49b96 --- /dev/null +++ b/data/2024/aaai/Efficient Target Propagation by Deriving Analytical Solution @@ -0,0 +1 @@ +Exploring biologically plausible algorithms as alternatives to error backpropagation (BP) is a challenging research topic in artificial intelligence. It also provides insights into the brain's learning methods. Recently, when combined with well-designed feedback loss functions such as Local Difference Reconstruction Loss (LDRL) and through hierarchical training of feedback pathway synaptic weights, Target Propagation (TP) has achieved performance comparable to BP in image classification tasks. However, with an increase in the number of network layers, the tuning and training cost of feedback weights escalates. Drawing inspiration from the work of Ernoult et al., we propose a training method that seeks the optimal solution for feedback weights. This method enhances the efficiency of feedback training by analytically minimizing feedback loss, allowing the feedback layer to skip certain local training iterations. More specifically, we introduce the Jacobian matching loss (JML) for feedback training. We also proactively implement layers designed to derive analytical solutions that minimize JML. Through experiments, we have validated the effectiveness of this approach. Using the CIFAR-10 dataset, our method showcases accuracy levels comparable to state-of-the-art TP methods. Furthermore, we have explored its effectiveness in more intricate network architectures. \ No newline at end of file diff --git a/data/2024/aaai/Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models b/data/2024/aaai/Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models new file mode 100644 index 0000000000..1a2ff0fa02 --- /dev/null +++ b/data/2024/aaai/Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models @@ -0,0 +1,3 @@ +Toxic content detection is crucial for online services to remove inappropriate content that violates community standards. To automate the detection process, prior works have proposed a variety of machine learning (ML) approaches to train Language Models (LMs) for toxic content detection. However, both their accuracy and transferability across datasets are limited. Recently, Large Language Models (LLMs) have shown promise in toxic content detection due to their superior zero-shot and few-shot in-context learning ability as well as broad transferability on ML tasks. +However, efficiently designing prompts for LLMs remains challenging.
Moreover, the high run-time cost of LLMs may hinder their deployment in production. To address these challenges, in this work, we propose BD-LLM, a novel and efficient approach to bootstrapping and distilling LLMs for toxic content detection. +Specifically, we design a novel prompting method named Decision-Tree-of-Thought (DToT) to bootstrap LLMs' detection performance and extract high-quality rationales. DToT can automatically select more fine-grained context to re-prompt LLMs when their responses lack confidence. Additionally, we use the rationales extracted via DToT to fine-tune student LMs. Our experimental results on various datasets demonstrate that DToT can improve the accuracy of LLMs by up to 4.6%. Furthermore, student LMs fine-tuned with rationales extracted via DToT outperform baselines on all datasets with up to 16.9% accuracy improvement, while being more than 60x smaller than conventional LLMs. Finally, we observe that student LMs fine-tuned with rationales exhibit better cross-dataset transferability. \ No newline at end of file diff --git a/data/2024/aaai/Electron Microscopy Images as Set of Fragments for Mitochondrial Segmentation b/data/2024/aaai/Electron Microscopy Images as Set of Fragments for Mitochondrial Segmentation new file mode 100644 index 0000000000..0167fce5ce --- /dev/null +++ b/data/2024/aaai/Electron Microscopy Images as Set of Fragments for Mitochondrial Segmentation @@ -0,0 +1 @@ +Automatic mitochondrial segmentation has gained great popularity with the development of deep learning. However, the coarse predictions caused by the presence of regular 3D grids in previous methods, whether based on 3D CNNs or vision transformers, suggest a possibly sub-optimal feature arrangement. To mitigate this limitation, we attempt to interpret the 3D EM image stacks as a set of interrelated 3D fragments for a better solution. However, it is non-trivial to model the 3D fragments without introducing excessive computational overhead. In this paper, we design a coherent fragment vision transformer (FragViT) combined with affinity learning to manipulate features on 3D fragments yet explore mutual relationships to model fragment-wise context, enjoying a locality prior without sacrificing global reception. The proposed FragViT includes a fragment encoder and a hierarchical fragment aggregation module. The fragment encoder is equipped with affinity heads to transform the tokens into fragments with homogeneous semantics, and multi-layer self-attention is used to explicitly learn inter-fragment relations with long-range dependencies. The hierarchical fragment aggregation module is responsible for hierarchically aggregating fragment-wise predictions back to the final voxel-wise prediction in a progressive manner. Extensive experimental results on the challenging MitoEM, Lucchi, and AC3/AC4 benchmarks demonstrate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Elijah: Eliminating Backdoors Injected in Diffusion Models via Distribution Shift b/data/2024/aaai/Elijah: Eliminating Backdoors Injected in Diffusion Models via Distribution Shift new file mode 100644 index 0000000000..1a02435fa7 --- /dev/null +++ b/data/2024/aaai/Elijah: Eliminating Backdoors Injected in Diffusion Models via Distribution Shift @@ -0,0 +1 @@ +Diffusion models (DM) have become state-of-the-art generative models because of their capability of generating high-quality images from noise without adversarial training.
However, they are vulnerable to backdoor attacks as reported by recent studies. When a data input (e.g., some Gaussian noise) is stamped with a trigger (e.g., a white patch), the backdoored model always generates the target image (e.g., an improper photo). However, effective defense strategies to mitigate backdoors from DMs are underexplored. To bridge this gap, we propose the first backdoor detection and removal framework for DMs. We evaluate our framework Elijah on over hundreds of DMs of 3 types including DDPM, NCSN and LDM, with 13 samplers against 3 existing backdoor attacks. Extensive experiments show that our approach can have close to 100% detection accuracy and reduce the backdoor effects to close to zero without significantly sacrificing the model utility. \ No newline at end of file diff --git a/data/2024/aaai/EmFORE: Learning Email Folder Classification Rules by Demonstration b/data/2024/aaai/EmFORE: Learning Email Folder Classification Rules by Demonstration new file mode 100644 index 0000000000..3103dd2794 --- /dev/null +++ b/data/2024/aaai/EmFORE: Learning Email Folder Classification Rules by Demonstration @@ -0,0 +1 @@ +Tools that help with email folder management are limited, as users have to manually write rules to assign emails to folders. We present EMFORE, an iterative learning system that automatically learns and updates such rules from observations. EMFORE is fast enough to suggest and update rules in real time and suppresses mails with low confidence to reduce the number of false positives. EMFORE can use different rule grammars, and thus be adapted to different clients, without changing the user experience. Previous methods do not learn rules, require complete retraining or multiple new examples after making a mistake, and do not distinguish between inbox and other folders. EMFORE learns rules incrementally and can make the neutral decision of leaving emails in the inbox, making it an ideal candidate for integration in email clients. \ No newline at end of file diff --git a/data/2024/aaai/Embedded Feature Selection on Graph-Based Multi-View Clustering b/data/2024/aaai/Embedded Feature Selection on Graph-Based Multi-View Clustering new file mode 100644 index 0000000000..c63242c1f4 --- /dev/null +++ b/data/2024/aaai/Embedded Feature Selection on Graph-Based Multi-View Clustering @@ -0,0 +1 @@ +Recently, anchor graph-based multi-view clustering has been proven to be highly efficient for large-scale data processing. However, most existing anchor graph-based clustering methods necessitate post-processing to obtain clustering labels and are unable to effectively utilize the information within anchor graphs. To solve these problems, we propose an Embedded Feature Selection on Graph-Based Multi-View Clustering (EFSGMC) approach to improve the clustering performance. Our method decomposes anchor graphs, taking advantage of memory efficiency, to obtain clustering labels in a single step without the need for post-processing. Furthermore, we introduce the l2,p-norm for graph-based feature selection, which selects the most relevant data for efficient graph factorization. Lastly, we employ the tensor Schatten p-norm as a tensor rank approximation function to capture the complementary information between different views, ensuring similarity between cluster assignment matrices. Experimental results on five real-world datasets demonstrate that our proposed method outperforms state-of-the-art approaches. 
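For readers unfamiliar with the l2,p-norm used for feature selection in the EFSGMC abstract above, a minimal illustration follows. The norm sums the p-th powers of row-wise l2 norms, so penalizing it drives entire rows (features) of a projection matrix toward zero. This is a generic sketch of the regularizer only, not the authors' full objective; the variable names are hypothetical.

```python
import numpy as np

def l2p_norm(W: np.ndarray, p: float = 0.5) -> float:
    """Compute ||W||_{2,p}^p = sum_i (||w_i||_2)^p over the rows w_i of W.

    With 0 < p <= 1 this promotes row sparsity: penalized rows of a
    feature-selection matrix are pushed to all-zero, which de-selects
    the corresponding features.
    """
    row_norms = np.linalg.norm(W, axis=1)   # ||w_i||_2 for each row
    return float(np.sum(row_norms ** p))

# Toy usage: a projection matrix whose second feature (row) is inactive.
W = np.array([[0.9, 0.1], [0.0, 0.0], [0.4, 0.7]])
print(l2p_norm(W, p=0.5))
```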
\ No newline at end of file diff --git a/data/2024/aaai/Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning b/data/2024/aaai/Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning new file mode 100644 index 0000000000..58bf729aa8 --- /dev/null +++ b/data/2024/aaai/Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning @@ -0,0 +1 @@ +While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery in a few languages like English restricts their applicability in broader communities. To this end, there is an increasing interest in developing multilingual VL models via a joint-learning setup, which, however, could be unrealistic due to expensive costs and data availability. In this work, we propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It solely trains token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF raised by covariate shift and lexical overlap, we further propose a novel approach that ensures the identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and improve various state-of-the-art methods consistently. Our code and data are available at https://github.com/yangbang18/CLFM. \ No newline at end of file diff --git a/data/2024/aaai/Emergent Communication for Numerical Concepts Generalization b/data/2024/aaai/Emergent Communication for Numerical Concepts Generalization new file mode 100644 index 0000000000..cb7ffbd31c --- /dev/null +++ b/data/2024/aaai/Emergent Communication for Numerical Concepts Generalization @@ -0,0 +1 @@ +Research on emergent communication has recently gained significant traction as a promising avenue for the linguistic community to unravel human language's origins and explore artificial intelligence's generalization capabilities. Current research has predominantly concentrated on recognizing qualitative patterns of object attributes(e.g., shape and color) and paid little attention to the quantitative relationship among object quantities which is known as the part of numerical concepts. The ability to generalize numerical concepts, i.e., counting and calculations with unseen quantities, is essential, as it mirrors humans' foundational abstract reasoning abilities. In this work, we introduce the NumGame, leveraging the referential game framework, forcing agents to communicate and generalize the numerical concepts effectively. 
Inspired by the human learning process of numbers, we present a two-stage training approach that sequentially fosters a rudimentary numerical sense and then the ability to perform arithmetic calculations, ultimately aiding agents in generating semantically stable and unambiguous language for numerical concepts. The experimental results indicate impressive generalization capabilities to unseen quantities and the regularity of the language that emerges from communication. \ No newline at end of file diff --git a/data/2024/aaai/Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling b/data/2024/aaai/Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling new file mode 100644 index 0000000000..f047642cb3 --- /dev/null +++ b/data/2024/aaai/Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling @@ -0,0 +1 @@ +Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. While recognizing the significance of the CSS task, prior studies have not thoroughly investigated emotional expressiveness problems due to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling. In this paper, we propose a novel emotional CSS model, termed ECSS, that includes two main components: 1) to enhance emotion understanding, we introduce a heterogeneous graph-based emotional context modeling mechanism, which takes the multi-source dialogue history as input to model the dialogue context and learn the emotion cues from the context; 2) to achieve emotion rendering, we employ a contrastive learning-based emotion renderer module to infer the accurate emotion style for the target utterance. To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity, and annotate additional emotional information on the existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in understanding and rendering emotions. These evaluations also underscore the importance of comprehensive emotional annotations. Code and audio samples can be found at: https://github.com/walker-hyf/ECSS. \ No newline at end of file diff --git a/data/2024/aaai/Empowering CAM-Based Methods with Capability to Generate Fine-Grained and High-Faithfulness Explanations b/data/2024/aaai/Empowering CAM-Based Methods with Capability to Generate Fine-Grained and High-Faithfulness Explanations new file mode 100644 index 0000000000..a1b5b9e258 --- /dev/null +++ b/data/2024/aaai/Empowering CAM-Based Methods with Capability to Generate Fine-Grained and High-Faithfulness Explanations @@ -0,0 +1 @@ +Recently, the explanation of neural network models has garnered considerable research attention. In computer vision, CAM (Class Activation Map)-based methods and the LRP (Layer-wise Relevance Propagation) method are two common explanation methods. However, since most CAM-based methods can only generate global weights, they can only generate coarse-grained explanations at a deep layer. LRP and its variants, on the other hand, can generate fine-grained explanations, but the faithfulness of these explanations is too low.
To address these challenges, in this paper, we propose FG-CAM (Fine-Grained CAM), which extends CAM-based methods to enable the generation of fine-grained and high-faithfulness explanations. FG-CAM uses the relationship between two adjacent layers of feature maps with resolution differences to gradually increase the explanation resolution, while finding the contributing pixels and filtering out the pixels that do not contribute. Our method not only addresses the shortcoming of CAM-based methods without changing their characteristics, but also generates fine-grained explanations that have higher faithfulness than LRP and its variants. We also present FG-CAM with denoising, which is a variant of FG-CAM and is able to generate less noisy explanations with almost no change in explanation faithfulness. Experimental results show that the performance of FG-CAM is almost unaffected by the explanation resolution. FG-CAM outperforms existing CAM-based methods significantly in both shallow and intermediate layers, and outperforms LRP and its variants significantly in the input layer. Our code is available at https://github.com/dongmo-qcq/FG-CAM. \ No newline at end of file diff --git a/data/2024/aaai/EnColor: Improving Visual Accessibility with a Deep Encoder-Decoder Image Corrector for Color Vision Deficient Individuals b/data/2024/aaai/EnColor: Improving Visual Accessibility with a Deep Encoder-Decoder Image Corrector for Color Vision Deficient Individuals new file mode 100644 index 0000000000..652f4542c1 --- /dev/null +++ b/data/2024/aaai/EnColor: Improving Visual Accessibility with a Deep Encoder-Decoder Image Corrector for Color Vision Deficient Individuals @@ -0,0 +1 @@ +Individuals with color vision deficiencies (CVDs) often face significant challenges in accessing vital information for decision-making. In response, we introduce EnColor, a deep Encoder-decoder Color corrector for images, enabling individuals with CVDs to perceive content in its originally intended colors. Our network architecture is designed to effectively capture essential visual features for reconstructing standard images into color-corrected versions. In particular, our training pipeline is integrated with a CVD simulator so as to ensure the fidelity of the output through the lens of individuals with impaired color vision. For evaluation, we focus primarily on tomato images, considering the profound impact of color vision deficiencies on practical domains like agri-food systems. Our quantitative results demonstrate that the EnColor model achieves over 16.8% improvement over previously introduced algorithms in terms of color retention, supporting our design choices. Furthermore, a survey of 43 participants provides subjective assessments, with our method receiving the highest scores. Additionally, specific visual examples are presented to highlight accurately restored colors. We also publicly share all code for EnColor as well as the baseline methods to ensure reproducibility and facilitate more studies in CVD correction.
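To make the idea of training "through the lens" of a CVD simulator concrete, here is a minimal PyTorch-style sketch. It assumes a differentiable simulator (here a stand-in 3x3 linear map; the real simulator and loss used by EnColor may differ) and penalizes the discrepancy between what a CVD viewer would perceive in the corrected image and the original colors. All names and values are hypothetical placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Stand-in for a differentiable CVD simulator: a fixed 3x3 color transform.
# The actual matrix depends on the deficiency type and color space; this one
# is a placeholder, not a calibrated protanopia/deuteranopia model.
CVD_MATRIX = torch.tensor([[0.567, 0.433, 0.0],
                           [0.558, 0.442, 0.0],
                           [0.0,   0.242, 0.758]])

def cvd_simulate(img: torch.Tensor) -> torch.Tensor:
    """Apply the placeholder CVD transform to an (N, 3, H, W) image batch."""
    return torch.einsum("ij,njhw->nihw", CVD_MATRIX, img)

corrector = nn.Sequential(              # toy encoder-decoder stand-in
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid())
opt = torch.optim.Adam(corrector.parameters(), lr=1e-3)

def training_step(img: torch.Tensor) -> torch.Tensor:
    corrected = corrector(img)
    # Perceived-fidelity term: the CVD view of the corrected image should
    # resemble the original; a second term keeps the correction mild.
    loss = nn.functional.l1_loss(cvd_simulate(corrected), img) \
         + 0.1 * nn.functional.l1_loss(corrected, img)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss
```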
\ No newline at end of file diff --git a/data/2024/aaai/EnMatch: Matchmaking for Better Player Engagement via Neural Combinatorial Optimization b/data/2024/aaai/EnMatch: Matchmaking for Better Player Engagement via Neural Combinatorial Optimization new file mode 100644 index 0000000000..affe77182d --- /dev/null +++ b/data/2024/aaai/EnMatch: Matchmaking for Better Player Engagement via Neural Combinatorial Optimization @@ -0,0 +1 @@ +Matchmaking is a core task in e-sports and online games, as it contributes to player engagement and further influences the game's lifecycle. Previous methods focus on creating fair games at all times. They divide players into different tiers based on skill levels and only select players from the same tier for each game. Though this strategy can ensure fair matchmaking, it is not always good for player engagement. In this paper, we propose a novel Engagement-oriented Matchmaking (EnMatch) framework to ensure fair games and simultaneously enhance player engagement. Two main issues need to be addressed. First, it is unclear how to measure the impact of different team compositions and confrontations on player engagement during the game considering the variety of player characteristics. Second, such a detailed consideration on every single player during matchmaking will result in an NP-hard combinatorial optimization problem with non-linear objectives. In light of these challenges, we turn to real-world data analysis to reveal engagement-related factors. The resulting insights guide the development of engagement modeling, enabling the estimation of quantified engagement before a match is completed. To handle the combinatorial optimization problem, we formulate the problem into a reinforcement learning framework, in which a neural combinatorial optimization problem is built and solved. The performance of EnMatch is finally demonstrated through the comparison with other state-of-the-art methods based on several real-world datasets and online deployments on two games. \ No newline at end of file diff --git a/data/2024/aaai/Encoding Constraints as Binary Constraint Networks Satisfying BTP b/data/2024/aaai/Encoding Constraints as Binary Constraint Networks Satisfying BTP new file mode 100644 index 0000000000..b23cff783c --- /dev/null +++ b/data/2024/aaai/Encoding Constraints as Binary Constraint Networks Satisfying BTP @@ -0,0 +1 @@ +Recently, the Binary Constraint Tree (BCT), a tree structured Binary Constraint Network (BCN), has been shown to be more succinct than various ad-hoc constraints. In this paper, we investigate the modelling power of a well-known tractable hybrid class generalizing BCT, i.e. the class of BCNs satisfying Broken Triangle Property (BTP) called BTP Networks (BTPNs). We show that the consistency checker of BTPN can be computed by polysize monotone circuit, thus, some global constraints cannot be encoded as polysize BTPN, such as the AllDifferent and Linear constraints. Then our study reveals that BTPN is strictly more succinct than the DNNF constraint and all 14 ad-hoc constraints discussed in (Wang and Yap 2023), such as the context-free grammar, BCT and smart table constraints. Furthermore, we also show that BTPN is as powerful as DNNF in terms of computing various operations and queries. In addition, we prove that it is NP-hard to determine the minimum sized BTPN encoding a constraint. 
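As background for the BTPN abstract above, the Broken Triangle Property itself has a simple operational reading: under a variable ordering, no pair of compatible values for two earlier variables may see a "broken triangle" at a later variable. The sketch below is a direct, unoptimized check of this standard definition (it is not the paper's polysize-circuit construction), and the data representation is hypothetical.

```python
from itertools import combinations

def compatible(cons, x, a, y, b) -> bool:
    """True if value a for variable x is compatible with value b for y.
    `cons` maps an ordered variable pair to the set of allowed value pairs;
    variable pairs without an explicit constraint are fully compatible."""
    if (x, y) in cons:
        return (a, b) in cons[(x, y)]
    if (y, x) in cons:
        return (b, a) in cons[(y, x)]
    return True

def satisfies_btp(variables, domains, cons) -> bool:
    """Check the Broken Triangle Property w.r.t. the order of `variables`.

    A broken triangle on compatible values (vi, vj) at a later variable xk:
    some vk works with vi but not vj, and some vk2 works with vj but not vi.
    """
    for idx_k, xk in enumerate(variables):
        for xi, xj in combinations(variables[:idx_k], 2):
            for vi in domains[xi]:
                for vj in domains[xj]:
                    if not compatible(cons, xi, vi, xj, vj):
                        continue
                    only_with_vi = any(
                        compatible(cons, xi, vi, xk, vk)
                        and not compatible(cons, xj, vj, xk, vk)
                        for vk in domains[xk])
                    only_with_vj = any(
                        compatible(cons, xj, vj, xk, vk)
                        and not compatible(cons, xi, vi, xk, vk)
                        for vk in domains[xk])
                    if only_with_vi and only_with_vj:
                        return False  # broken triangle found
    return True
```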
\ No newline at end of file diff --git a/data/2024/aaai/EncryIP: A Practical Encryption-Based Framework for Model Intellectual Property Protection b/data/2024/aaai/EncryIP: A Practical Encryption-Based Framework for Model Intellectual Property Protection new file mode 100644 index 0000000000..5c0e2b0f8f --- /dev/null +++ b/data/2024/aaai/EncryIP: A Practical Encryption-Based Framework for Model Intellectual Property Protection @@ -0,0 +1 @@ +In the rapidly growing digital economy, protecting intellectual property (IP) associated with digital products has become increasingly important. Within this context, machine learning (ML) models, being highly valuable digital assets, have gained significant attention for IP protection. This paper introduces a practical encryption-based framework called EncryIP, which seamlessly integrates a public-key encryption scheme into the model learning process. This approach enables the protected model to generate randomized and confused labels, ensuring that only individuals with accurate secret keys, signifying authorized users, can decrypt and reveal authentic labels. Importantly, the proposed framework not only facilitates the protected model to multiple authorized users without requiring repetitive training of the original ML model with IP protection methods but also maintains the model's performance without compromising its accuracy. Compared to existing methods like watermark-based, trigger-based, and passport-based approaches, EncryIP demonstrates superior effectiveness in both training protected models and efficiently detecting the unauthorized spread of ML models. \ No newline at end of file diff --git a/data/2024/aaai/End-to-End Learning of LTLf Formulae by Faithful LTLf Encoding b/data/2024/aaai/End-to-End Learning of LTLf Formulae by Faithful LTLf Encoding new file mode 100644 index 0000000000..d90326e27c --- /dev/null +++ b/data/2024/aaai/End-to-End Learning of LTLf Formulae by Faithful LTLf Encoding @@ -0,0 +1 @@ +It is important to automatically discover the underlying tree-structured formulae from large amounts of data. In this paper, we examine learning linear temporal logic on finite traces (LTLf) formulae, which is a tree structure syntactically and characterizes temporal properties semantically. Its core challenge is to bridge the gap between the concise tree-structured syntax and the complex LTLf semantics. Besides, the learning quality is endangered by explosion of the search space and wrong search bias guided by imperfect data. We tackle these challenges by proposing an LTLf encoding method to parameterize a neural network so that the neural computation is able to simulate the inference of LTLf formulae. We first identify faithful LTLf encoding, a subclass of LTLf encoding, which has a one-to-one correspondence to LTLf formulae. Faithful encoding guarantees that the learned parameter assignment of the neural network can directly be interpreted to an LTLf formula. With such an encoding method, we then propose an end-to-end approach, TLTLf, to learn LTLf formulae through neural networks parameterized by our LTLf encoding method. Experimental results demonstrate that our approach achieves state-of-the-art performance with up to 7% improvement in accuracy, highlighting the benefits of introducing the faithful LTLf encoding. 
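Since the abstract above hinges on the gap between LTLf's tree-structured syntax and its finite-trace semantics, a compact reference evaluator of the standard LTLf semantics may help; it is not the paper's neural encoding, just the textbook recursive definition over a finite trace (formulas as nested tuples are a hypothetical representation).

```python
def holds(phi, trace, i=0):
    """Standard LTLf semantics: does formula `phi` hold on `trace` at position i?

    trace: list of sets of atomic propositions, e.g. [{"p"}, {"p", "q"}, set()]
    phi:   nested tuples, e.g. ("U", ("ap", "p"), ("ap", "q")) for p U q
    """
    op = phi[0]
    if op == "ap":                       # atomic proposition
        return phi[1] in trace[i]
    if op == "not":
        return not holds(phi[1], trace, i)
    if op == "and":
        return holds(phi[1], trace, i) and holds(phi[2], trace, i)
    if op == "X":                        # strong next: needs a successor position
        return i + 1 < len(trace) and holds(phi[1], trace, i + 1)
    if op == "U":                        # until: phi[2] eventually, phi[1] meanwhile
        return any(holds(phi[2], trace, j)
                   and all(holds(phi[1], trace, k) for k in range(i, j))
                   for j in range(i, len(trace)))
    if op == "F":                        # eventually
        return any(holds(phi[1], trace, j) for j in range(i, len(trace)))
    if op == "G":                        # always, up to the end of the finite trace
        return all(holds(phi[1], trace, j) for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")

# Example: "p holds until q" on a 3-step trace.
trace = [{"p"}, {"p"}, {"q"}]
print(holds(("U", ("ap", "p"), ("ap", "q")), trace))   # True
```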
\ No newline at end of file diff --git a/data/2024/aaai/End-to-End Phase Field Model Discovery Combining Experimentation, Crowdsourcing, Simulation and Learning b/data/2024/aaai/End-to-End Phase Field Model Discovery Combining Experimentation, Crowdsourcing, Simulation and Learning new file mode 100644 index 0000000000..a6514c116a --- /dev/null +++ b/data/2024/aaai/End-to-End Phase Field Model Discovery Combining Experimentation, Crowdsourcing, Simulation and Learning @@ -0,0 +1 @@ +The availability of tera-byte scale experiment data calls for AI driven approaches which automatically discover scientific models from data. Nonetheless, significant challenges present in AI-driven scientific discovery: (i) The annotation of large scale datasets requires fundamental re-thinking in developing scalable crowdsourcing tools. (ii) The learning of scientific models from data calls for innovations beyond black-box neural nets. (iii) Novel visualization & diagnosis tools are needed for the collaboration of experimental and theoretical physicists, and computer scientists. We present Phase-Field-Lab platform for end-to-end phase field model discovery, which automatically discovers phase field physics models from experiment data, integrating experimentation, crowdsourcing, simulation and learning. Phase-Field-Lab combines (i) a streamlined annotation tool which reduces the annotation time (by ~50-75%), while increasing annotation accuracy compared to baseline; (ii) an end-to-end neural model which automatically learns phase field models from data by embedding phase field simulation and existing domain knowledge into learning; and (iii) novel interfaces and visualizations to integrate our platform into the scientific discovery cycle of domain scientists. Our platform is deployed in the analysis of nano-structure evolution in materials under extreme conditions (high temperature and irradiation). Our approach reveals new properties of nano-void defects, which otherwise cannot be detected via manual analysis. \ No newline at end of file diff --git a/data/2024/aaai/End-to-End RGB-D Image Compression via Exploiting Channel-Modality Redundancy b/data/2024/aaai/End-to-End RGB-D Image Compression via Exploiting Channel-Modality Redundancy new file mode 100644 index 0000000000..2fd3e5506d --- /dev/null +++ b/data/2024/aaai/End-to-End RGB-D Image Compression via Exploiting Channel-Modality Redundancy @@ -0,0 +1 @@ +As a kind of 3D data, RGB-D images have been extensively used in object tracking, 3D reconstruction, remote sensing mapping, and other tasks. In the realm of computer vision, the significance of RGB-D images is progressively growing. However, the existing learning-based image compression methods usually process RGB images and depth images separately, which cannot entirely exploit the redundant information between the modalities, limiting the further improvement of the Rate-Distortion performance. With the goal of overcoming the defect, in this paper, we propose a learning-based dual-branch RGB-D image compression framework. Compared with traditional RGB domain compression scheme, a YUV domain compression scheme is presented for spatial redundancy removal. In addition, Intra-Modality Attention (IMA) and Cross-Modality Attention (CMA) are introduced for modal redundancy removal. For the sake of benefiting from cross-modal prior information, Context Prediction Module (CPM) and Context Fusion Module (CFM) are raised in the conditional entropy model which makes the context probability prediction more accurate. 
The experimental results demonstrate our method outperforms existing image compression methods in two RGB-D image datasets. Compared with BPG, our proposed framework can achieve up to 15% bit rate saving for RGB images. \ No newline at end of file diff --git a/data/2024/aaai/End-to-End Real-Time Vanishing Point Detection with Transformer b/data/2024/aaai/End-to-End Real-Time Vanishing Point Detection with Transformer new file mode 100644 index 0000000000..abc3348884 --- /dev/null +++ b/data/2024/aaai/End-to-End Real-Time Vanishing Point Detection with Transformer @@ -0,0 +1 @@ +In this paper, we propose a novel transformer-based end-to-end real-time vanishing point detection method, which is named Vanishing Point TRansformer (VPTR). The proposed method can directly regress the locations of vanishing points from given images. To achieve this goal, we pose vanishing point detection as a point object detection task on the Gaussian hemisphere with region division. Considering low-level features always provide more geometric information which can contribute to accurate vanishing point prediction, we propose a clear architecture where vanishing point queries in the decoder can directly gather multi-level features from CNN backbone with deformable attention in VPTR. Our method does not rely on line detection or Manhattan world assumption, which makes it more flexible to use. VPTR runs at an inferring speed of 140 FPS on one NVIDIA 3090 card. Experimental results on synthetic and real-world datasets demonstrate that our method can be used in both natural and structural scenes, and is superior to other state-of-the-art methods on the balance of accuracy and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/End-to-End Verification for Subgraph Solving b/data/2024/aaai/End-to-End Verification for Subgraph Solving new file mode 100644 index 0000000000..b8cd0d45ef --- /dev/null +++ b/data/2024/aaai/End-to-End Verification for Subgraph Solving @@ -0,0 +1,3 @@ +Modern subgraph-finding algorithm implementations consist of thousands of lines of highly optimized code, and this complexity raises questions about their trustworthiness. Recently, some state-of-the-art subgraph solvers have been enhanced to output machine-verifiable proofs that their results are correct. While this significantly improves reliability, it is not a fully satisfactory solution, since end-users have to trust both the proof checking algorithms and the translation of the high-level graph problem into a low-level 0-1 integer linear program (ILP) used for the proofs. + +In this work, we present the first formally verified toolchain capable of full end-to-end verification for subgraph solving, which closes both of these trust gaps. We have built encoder frontends for various graph problems together with a 0-1 ILP (a.k.a. pseudo-Boolean) proof checker, all implemented and formally verified in the CakeML ecosystem. This toolchain is flexible and extensible, and we use it to build verified proof checkers for both decision and optimization graph problems, namely, subgraph isomorphism, maximum clique, and maximum common (connected) induced subgraph. Our experimental evaluation shows that end-to-end formal verification is now feasible for a wide range of hard graph problems. 
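To illustrate the kind of high-level-to-low-level translation whose trustworthiness the verified toolchain above targets, here is the textbook 0-1 ILP (pseudo-Boolean) encoding of maximum clique: one 0/1 variable per vertex, an at-most-one constraint for every non-edge, and the vertex count as the objective. The helper below just emits those constraints; it is a generic sketch, not the verified encoder frontend from the paper.

```python
from itertools import combinations

def max_clique_pb(num_vertices, edges):
    """Encode maximum clique as pseudo-Boolean (0-1 ILP) constraints.

    Variables x_v in {0,1} select the clique; for every non-adjacent pair
    (u, v) at most one endpoint may be selected; the objective maximizes
    the number of selected vertices.
    """
    edge_set = {frozenset(e) for e in edges}
    constraints = [f"x{u} + x{v} <= 1"
                   for u, v in combinations(range(num_vertices), 2)
                   if frozenset((u, v)) not in edge_set]
    objective = "maximize " + " + ".join(f"x{v}" for v in range(num_vertices))
    return objective, constraints

# Toy graph: a triangle 0-1-2 plus a pendant vertex 3 attached to 2.
obj, cons = max_clique_pb(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
print(obj)
for c in cons:
    print(c)   # "x0 + x3 <= 1" and "x1 + x3 <= 1"
```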
\ No newline at end of file diff --git a/data/2024/aaai/Energy Efficient Streaming Time Series Classification with Attentive Power Iteration b/data/2024/aaai/Energy Efficient Streaming Time Series Classification with Attentive Power Iteration new file mode 100644 index 0000000000..57cf9ccf98 --- /dev/null +++ b/data/2024/aaai/Energy Efficient Streaming Time Series Classification with Attentive Power Iteration @@ -0,0 +1 @@ +Efficiently processing time series data streams in real-time on resource-constrained devices offers significant advantages in terms of enhanced computational energy efficiency and reduced time-related risks. We introduce an innovative streaming time series classification network that utilizes attentive power iteration, enabling real-time processing on resource-constrained devices. Our model continuously updates a compact representation of the entire time series, enhancing classification accuracy while conserving energy and processing time. Notably, it excels in streaming scenarios without requiring complete time series access, enabling swift decisions. Experimental results show that our approach excels in classification accuracy and energy efficiency, with over 70% less consumption and threefold faster task completion than benchmarks. This work advances real-time responsiveness, energy conservation, and operational effectiveness for constrained devices, contributing to optimizing various applications. \ No newline at end of file diff --git a/data/2024/aaai/Engineering an Exact Pseudo-Boolean Model Counter b/data/2024/aaai/Engineering an Exact Pseudo-Boolean Model Counter new file mode 100644 index 0000000000..23604070e3 --- /dev/null +++ b/data/2024/aaai/Engineering an Exact Pseudo-Boolean Model Counter @@ -0,0 +1,3 @@ +Model counting, a fundamental task in computer science, involves determining the number of satisfying assignments to a Boolean formula, typically represented in conjunctive normal form (CNF). While model counting for CNF formulas has received extensive attention with a broad range of applications, the study of model counting for Pseudo-Boolean (PB) formulas has been relatively overlooked. Pseudo-Boolean formulas, being more succinct than propositional Boolean formulas, offer greater flexibility in representing real-world problems. Consequently, there is a crucial need to investigate efficient techniques for model counting for PB formulas. + +In this work, we propose the first exact Pseudo-Boolean model counter, PBCount , that relies on knowledge compilation approach via algebraic decision diagrams. Our extensive empirical evaluation shows that PBCount can compute counts for 1513 instances while the current state-of-the-art approach could only handle 1013 instances. Our work opens up several avenues for future work in the context of model counting for PB formulas, such as the development of preprocessing techniques and exploration of approaches other than knowledge compilation. 
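As a concrete reminder of what PB model counting computes, the brute-force sketch below counts the satisfying assignments of a tiny pseudo-Boolean formula by enumeration; an exact counter such as PBCount must agree with this count on small inputs, but reaches it through knowledge compilation rather than enumeration. The constraint format here is a hypothetical illustration.

```python
from itertools import product

def count_models(num_vars, constraints):
    """Count assignments in {0,1}^n satisfying every PB constraint.

    Each constraint is (coeffs, bound) and means  sum_i coeffs[i]*x_i >= bound.
    Exponential in num_vars; only a reference for tiny formulas.
    """
    return sum(
        all(sum(c * x for c, x in zip(coeffs, assignment)) >= bound
            for coeffs, bound in constraints)
        for assignment in product((0, 1), repeat=num_vars))

# Example: 2*x1 + x2 + x3 >= 2  and  x1 + x2 + x3 <= 2 (written in >= form).
constraints = [([2, 1, 1], 2), ([-1, -1, -1], -2)]
print(count_models(3, constraints))  # 4 models: 100, 101, 110, 011
```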
\ No newline at end of file diff --git a/data/2024/aaai/Enhance Diversified Top-k MaxSAT Solving by Incorporating New Strategy for Generating Diversified Initial Assignments (Student Abstract) b/data/2024/aaai/Enhance Diversified Top-k MaxSAT Solving by Incorporating New Strategy for Generating Diversified Initial Assignments (Student Abstract) new file mode 100644 index 0000000000..2906899f0f --- /dev/null +++ b/data/2024/aaai/Enhance Diversified Top-k MaxSAT Solving by Incorporating New Strategy for Generating Diversified Initial Assignments (Student Abstract) @@ -0,0 +1 @@ +The Diversified Top-k MaxSAT (DTKMS) problem is an extension of MaxSAT. The objective of DTKMS is to find k feasible assignments of a given formula, such that each assignment satisfies all hard clauses and the k assignments together satisfy the maximum number of soft clauses. This paper presents a local search algorithm, DTKMS-DIA, which incorporates a new approach to generating initial assignments. Experimental results indicate that DTKMS-DIA can achieve attractive performance on 826 instances compared with state-of-the-art solvers. \ No newline at end of file diff --git a/data/2024/aaai/Enhance Sketch Recognition's Explainability via Semantic Component-Level Parsing b/data/2024/aaai/Enhance Sketch Recognition's Explainability via Semantic Component-Level Parsing new file mode 100644 index 0000000000..96a05bcb00 --- /dev/null +++ b/data/2024/aaai/Enhance Sketch Recognition's Explainability via Semantic Component-Level Parsing @@ -0,0 +1 @@ +Free-hand sketches are appealing for humans as a universal tool to depict the visual world. Humans can recognize varied sketches of a category easily by identifying the concurrence and layout of the intrinsic semantic components of the category, since humans draw free-hand sketches based a common consensus that which types of semantic components constitute each sketch category. For example, an airplane should at least have a fuselage and wings. Based on this analysis, a semantic component-level memory module is constructed and embedded in the proposed structured sketch recognition network in this paper. The memory keys representing semantic components of each sketch category can be self-learned and enhance the recognition network's explainability. Our proposed networks can deal with different situations of sketch recognition, i.e., with or without semantic components labels of strokes. Experiments on the SPG and SketchIME datasets demonstrate the memory module's flexibility and the recognition network's explainability. The code and data are available at https://github.com/GuangmingZhu/SketchESC. \ No newline at end of file diff --git a/data/2024/aaai/Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis b/data/2024/aaai/Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis new file mode 100644 index 0000000000..cd8251994d --- /dev/null +++ b/data/2024/aaai/Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis @@ -0,0 +1 @@ +The emergence of text-driven motion synthesis technique provides animators with great potential to create efficiently. However, in most cases, textual expressions only contain general and qualitative motion descriptions, while lack fine depiction and sufficient intensity, leading to the synthesized motions that either (a) semantically compliant but uncontrollable over specific pose details, or (b) even deviates from the provided descriptions, bringing animators with undesired cases. 
In this paper, we propose DiffKFC, a conditional diffusion model for text-driven motion synthesis with KeyFrames Collaborated, enabling realistic generation with collaborative and efficient dual-level control: coarse guidance at the semantic level, with only a few keyframes for direct and fine-grained depiction down to the body-posture level. Unlike existing inference-editing diffusion models that incorporate conditions without training, our conditional diffusion model is explicitly trained and can fully exploit correlations among texts, keyframes and the diffused target frames. To preserve the control capability of discrete and sparse keyframes, we customize dilated mask attention modules where only partial valid tokens participate in local-to-global attention, indicated by the dilated keyframe mask. Additionally, we develop a simple yet effective smoothness prior, which steers the generated frames towards seamless keyframe transitions at inference. Extensive experiments show that our model not only achieves state-of-the-art performance in terms of semantic fidelity, but more importantly, is able to satisfy animator requirements through fine-grained guidance without tedious labor. \ No newline at end of file diff --git a/data/2024/aaai/Enhanced Optical Character Recognition by Optical Sensor Combined with BERT and Cosine Similarity Scoring (Student Abstract) b/data/2024/aaai/Enhanced Optical Character Recognition by Optical Sensor Combined with BERT and Cosine Similarity Scoring (Student Abstract) new file mode 100644 index 0000000000..ac89d528cf --- /dev/null +++ b/data/2024/aaai/Enhanced Optical Character Recognition by Optical Sensor Combined with BERT and Cosine Similarity Scoring (Student Abstract) @@ -0,0 +1 @@ +Optical character recognition (OCR) is the technology for identifying text characters embedded within images. Conventional OCR models exhibit performance degradation when processing noisy images. To solve this problem, we propose a novel model, which combines computer vision using an optical sensor with natural language processing based on bidirectional encoder representations from transformers (BERT) and cosine similarity scoring. The proposed model uses a confidence rate to determine whether to utilize the optical sensor alone or BERT/cosine similarity scoring combined with the optical sensor. Experimental results show that the proposed model performs approximately 4.34 times better than conventional OCR. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Bilingual Lexicon Induction via Bi-directional Translation Pair Retrieving b/data/2024/aaai/Enhancing Bilingual Lexicon Induction via Bi-directional Translation Pair Retrieving new file mode 100644 index 0000000000..40c4a30781 --- /dev/null +++ b/data/2024/aaai/Enhancing Bilingual Lexicon Induction via Bi-directional Translation Pair Retrieving @@ -0,0 +1 @@ +Most Bilingual Lexicon Induction (BLI) methods retrieve word translation pairs by finding the closest target word for a given source word based on cross-lingual word embeddings (WEs). However, we find that solely retrieving translations from the source-to-target perspective leads to some false positive translation pairs, which significantly harm the precision of BLI. To address this problem, we propose a novel and effective method to improve translation pair retrieval in cross-lingual WEs.
Specifically, we consider both source-side and target-side perspectives throughout the retrieval process to alleviate false positive word pairings that emanate from a single perspective. On a benchmark dataset of BLI, our proposed method achieves competitive performance compared to existing state-of-the-art (SOTA) methods. It demonstrates effectiveness and robustness across six experimental languages, including similar language pairs and distant language pairs, under both supervised and unsupervised settings. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Cognitive Diagnosis Using Un-interacted Exercises: A Collaboration-Aware Mixed Sampling Approach b/data/2024/aaai/Enhancing Cognitive Diagnosis Using Un-interacted Exercises: A Collaboration-Aware Mixed Sampling Approach new file mode 100644 index 0000000000..dba2c0e763 --- /dev/null +++ b/data/2024/aaai/Enhancing Cognitive Diagnosis Using Un-interacted Exercises: A Collaboration-Aware Mixed Sampling Approach @@ -0,0 +1 @@ +Cognitive diagnosis is a crucial task in computer-aided education, aimed at evaluating students' proficiency levels across various knowledge concepts through exercises. Current models, however, primarily rely on students' answered exercises, neglecting the complex and rich information contained in un-interacted exercises. While recent research has attempted to leverage the data within un-interacted exercises linked to interacted knowledge concepts, aiming to address the long-tail issue, these studies fail to fully explore the informative, un-interacted exercises related to broader knowledge concepts. This oversight results in diminished performance when these models are applied to comprehensive datasets. In response to this gap, we present the Collaborative-aware Mixed Exercise Sampling (CMES) framework, which can effectively exploit the information present in un-interacted exercises linked to un-interacted knowledge concepts. Specifically, we introduce a novel universal sampling module where the training samples comprise not merely raw data slices, but enhanced samples generated by combining weight-enhanced attention mixture techniques. Given the necessity of real response labels in cognitive diagnosis, we also propose a ranking-based pseudo feedback module to regulate students' responses on generated exercises. The versatility of the CMES framework bolsters existing models and improves their adaptability. Finally, we demonstrate the effectiveness and interpretability of our framework through comprehensive experiments on real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Ensemble Clustering with Adaptive High-Order Topological Weights b/data/2024/aaai/Enhancing Ensemble Clustering with Adaptive High-Order Topological Weights new file mode 100644 index 0000000000..140b109769 --- /dev/null +++ b/data/2024/aaai/Enhancing Ensemble Clustering with Adaptive High-Order Topological Weights @@ -0,0 +1 @@ +Ensemble clustering learns more accurate consensus results from a set of weak base clustering results. This technique is more challenging than other clustering algorithms due to the base clustering result set's randomness and the inaccessibility of data features. Existing ensemble clustering methods rely on the Co-association (CA) matrix quality but lack the capability to handle missing connections in base clustering. 
Inspired by the neighborhood high-order and topological similarity theories, this paper proposes a topological ensemble model based on high-order information. Specifically, this paper compensates for missing connections by mining neighborhood high-order connection information in the CA matrix and learning optimal connections with adaptive weights. Afterward, the learned high-quality connections are embedded into topology learning to capture the topology of the base clustering. Finally, we incorporate adaptive high-order connection representation and topology learning into a unified learning framework. To our knowledge, this is the first ensemble clustering work based on topological similarity and high-order connectivity relations. Extensive experiments on multiple datasets demonstrate the effectiveness of the proposed method. The source code of the proposed approach is available at https://github.com/ltyong/awec. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Healthcare Predictions with Deep Learning Models b/data/2024/aaai/Enhancing Healthcare Predictions with Deep Learning Models new file mode 100644 index 0000000000..e2a848aa3b --- /dev/null +++ b/data/2024/aaai/Enhancing Healthcare Predictions with Deep Learning Models @@ -0,0 +1 @@ +This study leverages Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to enhance diagnostics and predictions in healthcare. By training on extensive healthcare datasets, this project aims to improve early disease detection and health risk assessments. Evaluation emphasizes accuracy, reliability, and ethical considerations, including bias mitigation. This research promises to bridge AI advancements and clinical applications, offering significant improvements in diagnostic capabilities and healthcare accessibility. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks b/data/2024/aaai/Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks new file mode 100644 index 0000000000..dc5692d896 --- /dev/null +++ b/data/2024/aaai/Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks @@ -0,0 +1,3 @@ +Recommending suitable jobs to users is a critical task in online recruitment platforms. However, existing job recommendation methods encounter challenges such as the low quality of users' resumes, which hampers their accuracy and practical effectiveness. With the rapid development of large language models (LLMs), utilizing the rich external knowledge encapsulated within them, as well as their powerful reasoning capabilities, is a promising way to complete users' resumes for more accurate recommendations. However, directly leveraging LLMs to enhance recommendation results is not a one-size-fits-all solution, as LLMs may suffer from fabricated generation and few-shot problems, which degrade the quality of resume completion. + +In this paper, we propose a novel LLM-based approach for job recommendation. To alleviate the limitation of fabricated generation for LLMs, we extract accurate and valuable information beyond users' self-description, which helps the LLMs better profile users for resume completion. Specifically, we not only extract users' explicit properties (e.g., skills, interests) from their self-description but also infer users' implicit characteristics from their behaviors for more accurate and meaningful resume completion.
Nevertheless, some users still suffer from few-shot problems, which arise due to scarce interaction records, leading to limited guidance for high-quality resume generation. To address this issue, we propose aligning unpaired low-quality resumes with high-quality generated resumes using Generative Adversarial Networks (GANs), which can refine the resume representations for better recommendation results. Extensive experiments on three large real-world recruitment datasets demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Low-Resource Relation Representations through Multi-View Decoupling b/data/2024/aaai/Enhancing Low-Resource Relation Representations through Multi-View Decoupling new file mode 100644 index 0000000000..1340771ff7 --- /dev/null +++ b/data/2024/aaai/Enhancing Low-Resource Relation Representations through Multi-View Decoupling @@ -0,0 +1,6 @@ +Recently, prompt-tuning with pre-trained language models (PLMs) has demonstrated a significant ability to enhance relation extraction (RE) tasks. +However, in low-resource scenarios, where the available training data is scarce, previous prompt-based methods may still perform poorly for prompt-based representation learning due to a superficial understanding of the relation. +To this end, we highlight the importance of learning high-quality relation representation in low-resource scenarios for RE, and propose a novel prompt-based relation representation method, named MVRE (Multi-View Relation Extraction), to better leverage the capacity of PLMs to improve the performance of RE within the low-resource prompt-tuning paradigm. Specifically, MVRE decouples each relation into different perspectives to encompass multi-view relation representations for maximizing the likelihood during relation inference. +Furthermore, we also design a Global-Local loss and a Dynamic-Initialization method for better alignment of the multi-view relation-representing virtual words, containing the semantics of relation labels during the optimization learning process and initialization. Extensive experiments on +three benchmark datasets show that our method can achieve +state-of-the-art performance in low-resource settings. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Machine Translation Experiences with Multilingual Knowledge Graphs b/data/2024/aaai/Enhancing Machine Translation Experiences with Multilingual Knowledge Graphs new file mode 100644 index 0000000000..f1ea2f7b3b --- /dev/null +++ b/data/2024/aaai/Enhancing Machine Translation Experiences with Multilingual Knowledge Graphs @@ -0,0 +1 @@ +Translating entity names, especially when a literal translation is not correct, poses a significant challenge. Although Machine Translation (MT) systems have achieved impressive results, they still struggle to translate cultural nuances and language-specific context. In this work, we show that the integration of multilingual knowledge graphs into MT systems can address this problem and bring two significant benefits: i) improving the translation of utterances that contain entities by leveraging their human-curated aliases from a multilingual knowledge graph, and, ii) increasing the interpretability of the translation process by providing the user with information from the knowledge graph.
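To make benefit (i) above concrete, the following is a minimal, illustrative sketch (not the authors' system) of how curated target-language aliases from a multilingual knowledge graph could be attached as hints to an MT input; the alias table, entity names, and hint format are hypothetical.

```python
# Hypothetical alias table: entity mention -> {target language: human-curated alias}.
KG_ALIASES = {
    "The Hague": {"it": "L'Aia", "de": "Den Haag"},
}

def annotate_with_aliases(source: str, target_lang: str) -> str:
    """Append knowledge-graph aliases as inline hints so the MT system can copy
    the curated entity rendering instead of translating it literally."""
    hints = [
        f"{mention} = {aliases[target_lang]}"
        for mention, aliases in KG_ALIASES.items()
        if mention in source and target_lang in aliases
    ]
    return source if not hints else source + " | entities: " + "; ".join(hints)

print(annotate_with_aliases("She moved to The Hague in 2019.", "it"))
# She moved to The Hague in 2019. | entities: The Hague = L'Aia
```

The same hint could also be surfaced to the user, which is one way the interpretability benefit (ii) might be exposed.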
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing Multi-Label Classification via Dynamic Label-Order Learning b/data/2024/aaai/Enhancing Multi-Label Classification via Dynamic Label-Order Learning new file mode 100644 index 0000000000..d6dc70b6ee --- /dev/null +++ b/data/2024/aaai/Enhancing Multi-Label Classification via Dynamic Label-Order Learning @@ -0,0 +1 @@ +Generative methods tackle Multi-Label Classification (MLC) by autoregressively generating label sequences. These methods excel at modeling label correlations and have achieved outstanding performance. However, a key challenge is determining the order of labels, as empirical findings indicate the significant impact of different orders on model learning and inference. Previous works adopt static label-ordering methods, assigning a unified label order for all samples based on label frequencies or co-occurrences. Nonetheless, such static methods neglect the unique semantics of each sample. More critically, these methods can cause the model to rigidly memorize training order, resulting in missing labels during inference. In light of these limitations, this paper proposes a dynamic label-order learning approach that adaptively learns a label order for each sample. Specifically, our approach adopts a difficulty-prioritized principle and iteratively constructs the label sequence based on the sample's semantics. To reduce the additional cost incurred by label-order learning, we use the same SEQ2SEQ model for label-order learning and MLC learning and introduce a unified loss function for joint optimization. Extensive experiments on public datasets reveal that our approach greatly outperforms previous methods. We will release our code at https://github.com/KagamiBaka/DLOL. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Multi-Scale Diffusion Prediction via Sequential Hypergraphs and Adversarial Learning b/data/2024/aaai/Enhancing Multi-Scale Diffusion Prediction via Sequential Hypergraphs and Adversarial Learning new file mode 100644 index 0000000000..ecf66e3aae --- /dev/null +++ b/data/2024/aaai/Enhancing Multi-Scale Diffusion Prediction via Sequential Hypergraphs and Adversarial Learning @@ -0,0 +1 @@ +Information diffusion prediction plays a crucial role in understanding the propagation of information in social networks, encompassing both macroscopic and microscopic prediction tasks. Macroscopic prediction estimates the overall impact of information diffusion, while microscopic prediction focuses on identifying the next user to be influenced. While prior research often concentrates on one of these aspects, only a few tackle both concurrently. These two tasks provide complementary insights into the diffusion process at different levels, revealing common traits and unique attributes. The exploration of leveraging common features across these tasks to enhance information prediction remains an underexplored avenue. In this paper, we propose an intuitive and effective model that addresses both macroscopic and microscopic prediction tasks. Our approach considers the interactions and dynamics among cascades at the macro level and incorporates the social homophily of users in social networks at the micro level. Additionally, we introduce adversarial training and orthogonality constraints to ensure the integrity of shared features. Experimental results on four datasets demonstrate that our model significantly outperforms state-of-the-art methods.
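The orthogonality constraint on shared features mentioned above is commonly implemented as a penalty on the correlation between shared and task-specific representations; the snippet below is a generic sketch of one such penalty (an assumption about the form, not the authors' exact loss).

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of the cross-correlation between a batch of shared
    features and task-specific features; minimizing it pushes the two subspaces apart."""
    s = F.normalize(shared, dim=1)    # (batch, dim)
    p = F.normalize(specific, dim=1)  # (batch, dim)
    return (s.t() @ p).pow(2).sum()

# Typical (assumed) use:
# total_loss = task_loss + adversarial_loss + lambda_ortho * orthogonality_penalty(h_shared, h_private)
```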
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing Neural Radiance Fields with Adaptive Multi-Exposure Fusion: A Bilevel Optimization Approach for Novel View Synthesis b/data/2024/aaai/Enhancing Neural Radiance Fields with Adaptive Multi-Exposure Fusion: A Bilevel Optimization Approach for Novel View Synthesis new file mode 100644 index 0000000000..5b25790318 --- /dev/null +++ b/data/2024/aaai/Enhancing Neural Radiance Fields with Adaptive Multi-Exposure Fusion: A Bilevel Optimization Approach for Novel View Synthesis @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have made significant strides in the modeling and rendering of 3D scenes. However, due to the complexity of luminance information, existing NeRF methods often struggle to produce satisfactory renderings when dealing with high and low exposure images. To address this issue, we propose an innovative approach capable of effectively modeling and rendering images under multiple exposure conditions. Our method adaptively learns the characteristics of images under different exposure conditions through an unsupervised evaluator-simulator structure for HDR (High Dynamic Range) fusion. This approach enhances NeRF's comprehension and handling of light variations, leading to the generation of images with appropriate brightness. Simultaneously, we present a bilevel optimization method tailored for novel view synthesis, aiming to harmonize the luminance information of input images while preserving their structural and content consistency. This approach facilitates the concurrent optimization of multi-exposure correction and novel view synthesis, in an unsupervised manner. Through comprehensive experiments conducted on the LOM and LOL datasets, our approach surpasses existing methods, markedly enhancing the task of novel view synthesis for multi-exposure environments and attaining state-of-the-art results. The source code can be found at https://github.com/Archer-204/AME-NeRF. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Off-Policy Constrained Reinforcement Learning through Adaptive Ensemble C Estimation b/data/2024/aaai/Enhancing Off-Policy Constrained Reinforcement Learning through Adaptive Ensemble C Estimation new file mode 100644 index 0000000000..6c91ed68bd --- /dev/null +++ b/data/2024/aaai/Enhancing Off-Policy Constrained Reinforcement Learning through Adaptive Ensemble C Estimation @@ -0,0 +1 @@ +In the domain of real-world agents, the application of Reinforcement Learning (RL) remains challenging due to the necessity for safety constraints. Previously, Constrained Reinforcement Learning (CRL) has predominantly focused on on-policy algorithms. Although these algorithms exhibit a degree of efficacy, their interactivity efficiency in real-world settings is sub-optimal, highlighting the demand for more efficient off-policy methods. However, off-policy CRL algorithms grapple with challenges in precise estimation of the C-function, particularly due to the fluctuations in the constrained Lagrange multiplier. Addressing this gap, our study focuses on the nuances of C-value estimation in off-policy CRL and introduces the Adaptive Ensemble C-learning (AEC) approach to reduce these inaccuracies. Building on state-of-the-art off-policy algorithms, we propose AEC-based CRL algorithms designed for enhanced task optimization. Extensive experiments on nine constrained robotics tasks reveal the superior interaction efficiency and performance of our algorithms in comparison to preceding methods. 
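As a rough illustration of ensemble C-value estimation, the sketch below keeps several C-function (cost-value) heads and combines their mean with a spread-dependent term; the architecture, sizes, and the specific aggregation rule are illustrative assumptions, not the AEC algorithm itself.

```python
import torch
import torch.nn as nn

class EnsembleC(nn.Module):
    """Toy ensemble of C-function heads for constrained RL."""
    def __init__(self, obs_dim: int, act_dim: int, n_heads: int = 5, hidden: int = 64):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(n_heads)
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
        x = torch.cat([obs, act], dim=-1)
        cs = torch.stack([h(x) for h in self.heads], dim=0)  # (n_heads, batch, 1)
        # beta trades accuracy against conservatism; it could itself be adapted
        # (e.g., alongside the Lagrange multiplier) during training.
        return cs.mean(dim=0) + beta * cs.std(dim=0)
```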
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing RAW-to-sRGB with Decoupled Style Structure in Fourier Domain b/data/2024/aaai/Enhancing RAW-to-sRGB with Decoupled Style Structure in Fourier Domain new file mode 100644 index 0000000000..2ddfc6cf9e --- /dev/null +++ b/data/2024/aaai/Enhancing RAW-to-sRGB with Decoupled Style Structure in Fourier Domain @@ -0,0 +1 @@ +RAW to sRGB mapping, which aims to convert RAW images from smartphones into RGB form equivalent to that of Digital Single-Lens Reflex (DSLR) cameras, has become an important area of research. However, current methods often ignore the difference between cell phone RAW images and DSLR camera RGB images, a difference that goes beyond the color matrix and extends to spatial structure due to resolution variations. Recent methods directly rebuild color mapping and spatial structure via shared deep representation, limiting optimal performance. Inspired by the Image Signal Processing (ISP) pipeline, which distinguishes between image restoration and enhancement, we present a novel Neural ISP framework, named FourierISP. This approach breaks the image down into style and structure within the frequency domain, allowing for independent optimization. FourierISP comprises three subnetworks: Phase Enhance Subnet for structural refinement, Amplitude Refine Subnet for color learning, and Color Adaptation Subnet for blending them in a smooth manner. This approach sharpens both color and structure, and extensive evaluations across varied datasets confirm that our approach realizes state-of-the-art results. Code will be available at https://github.com/alexhe101/FourierISP. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Representation of Spiking Neural Networks via Similarity-Sensitive Contrastive Learning b/data/2024/aaai/Enhancing Representation of Spiking Neural Networks via Similarity-Sensitive Contrastive Learning new file mode 100644 index 0000000000..c76536f093 --- /dev/null +++ b/data/2024/aaai/Enhancing Representation of Spiking Neural Networks via Similarity-Sensitive Contrastive Learning @@ -0,0 +1 @@ +Spiking neural networks (SNNs) have recently attracted intensive attention as a promising energy-efficient alternative to conventional artificial neural networks (ANNs): they transmit information in the form of binary spikes rather than continuous activations, so the multiplication of activation and weight can be replaced by addition to save energy. However, the binary spike representation sacrifices the expressive power of SNNs and leads to accuracy degradation compared with ANNs. Considering that improving feature representation is beneficial to training an accurate SNN model, this paper focuses on enhancing the feature representation of the SNN. To this end, we establish a similarity-sensitive contrastive learning framework, where the SNN can capture significantly more information from its ANN counterpart to improve representation by Mutual Information (MI) maximization with layer-wise sensitivity to similarity. Specifically, it enriches the SNN's feature representation by pulling the positive pairs of SNN's and ANN's feature representation of each layer from the same input samples closer together while pushing the negative pairs from different samples further apart. Experimental results show that our method consistently outperforms the current state-of-the-art algorithms on both popular non-spiking static and neuromorphic datasets.
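The pull-together/push-apart objective described above can be illustrated with a standard InfoNCE-style loss over paired SNN and ANN features of the same layer; this is a generic sketch under assumed tensor shapes, not the paper's exact similarity-sensitive objective.

```python
import torch
import torch.nn.functional as F

def layerwise_contrastive_loss(snn_feat: torch.Tensor,
                               ann_feat: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """Positives: SNN/ANN features of the same sample. Negatives: other samples in the batch."""
    z_s = F.normalize(snn_feat.flatten(1), dim=1)   # (batch, dim)
    z_a = F.normalize(ann_feat.flatten(1), dim=1)   # (batch, dim)
    logits = z_s @ z_a.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(z_s.size(0), device=z_s.device)
    return F.cross_entropy(logits, targets)
```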
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing Robotics with Cognitive Capabilities b/data/2024/aaai/Enhancing Robotics with Cognitive Capabilities new file mode 100644 index 0000000000..3b5f486653 --- /dev/null +++ b/data/2024/aaai/Enhancing Robotics with Cognitive Capabilities @@ -0,0 +1 @@ +In the pursuit of creating more effective and adaptable robots, the flourishing field of cognitive robotics has arisen to infuse machines with human-like cognitive functions. This paper delves into the significance of cognitive robotics and charts a course for empowering robots with advanced cognitive capabilities. Drawing inspiration from current research in cognitive architectures, the paper underscores the importance of refined perception, language processing, complex decision-making, emotional intelligence, and cognitive synergy. By integrating these cognitive functions into robotic systems, the goal is to equip robots to operate intelligently in dynamic environments, collaborate seamlessly with humans, and adeptly handle diverse tasks. The proposed enhancements mark crucial strides towards the development of more versatile and capable intelligent robots. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Semi-supervised Domain Adaptation via Effective Target Labeling b/data/2024/aaai/Enhancing Semi-supervised Domain Adaptation via Effective Target Labeling new file mode 100644 index 0000000000..cdcfaa8276 --- /dev/null +++ b/data/2024/aaai/Enhancing Semi-supervised Domain Adaptation via Effective Target Labeling @@ -0,0 +1 @@ +Existing semi-supervised domain adaptation (SSDA) models have exhibited impressive performance on the target domain by effectively utilizing few labeled target samples per class (e.g., 3 samples per class). To guarantee an equal number of labeled target samples for each class, however, they require domain experts to manually recognize a considerable amount of the unlabeled target data. Moreover, as the target samples are not equally informative for shaping the decision boundaries of the learning models, it is crucial to select the most informative target samples for labeling, which is, however, impossible for human selectors. As a remedy, we propose an EFfective Target Labeling (EFTL) framework that harnesses active learning and pseudo-labeling strategies to automatically select some informative target samples to annotate. Concretely, we introduce a novel sample query strategy, called non-maximal degree node suppression (NDNS), that iteratively performs maximal degree node query and non-maximal degree node removal to select representative and diverse target samples for labeling. To learn target-specific characteristics, we propose a novel pseudo-labeling strategy that attempts to label low-confidence target samples accurately via clustering consistency (CC), and then inject information of the model uncertainty into our query process. CC enhances the utilization of the annotation budget and increases the number of “labeled” target samples while requiring no additional manual effort. 
Our proposed EFTL framework can be easily coupled with existing SSDA models, showing significant improvements on three benchmarks \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Student Performance Prediction on Learnersourced Questions with SGNN-LLM Synergy b/data/2024/aaai/Enhancing Student Performance Prediction on Learnersourced Questions with SGNN-LLM Synergy new file mode 100644 index 0000000000..1698f00758 --- /dev/null +++ b/data/2024/aaai/Enhancing Student Performance Prediction on Learnersourced Questions with SGNN-LLM Synergy @@ -0,0 +1,2 @@ +Learnersourcing offers great potential for scalable education through student content creation. However, predicting student performance on learnersourced questions, which is essential for personalizing the learning experience, is challenging due to the inherent noise in student-generated data. Moreover, while conventional graph-based methods can capture the complex network of student and question interactions, they often fall short under cold start conditions where limited student engagement with questions yields sparse data. To address both challenges, we introduce an innovative strategy that synergizes the potential of integrating Signed Graph Neural Networks (SGNNs) and Large Language Model (LLM) embeddings. Our methodology employs a signed bipartite graph to comprehensively model student answers, complemented by a contrastive learning framework that enhances noise resilience. Furthermore, LLM's contribution lies in generating foundational question embeddings, proving especially advantageous in addressing cold start scenarios characterized by limited graph data. +Validation across five real-world datasets sourced from the PeerWise platform underscores our approach's effectiveness. Our method outperforms baselines, showcasing enhanced predictive accuracy and robustness. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Training of Spiking Neural Network with Stochastic Latency b/data/2024/aaai/Enhancing Training of Spiking Neural Network with Stochastic Latency new file mode 100644 index 0000000000..26264f44d3 --- /dev/null +++ b/data/2024/aaai/Enhancing Training of Spiking Neural Network with Stochastic Latency @@ -0,0 +1 @@ +Spiking neural networks (SNNs) have garnered significant attention for their low power consumption when deployed on neuromorphic hardware that operates in orders of magnitude lower power than general-purpose hardware. Direct training methods for SNNs come with an inherent latency for which the SNNs are optimized, and in general, the higher the latency, the better the predictive powers of the models, but at the same time, the higher the energy consumption during training and inference. Furthermore, an SNN model optimized for one particular latency does not necessarily perform well in lower latencies, which becomes relevant in scenarios where it is necessary to switch to a lower latency because of the depletion of onboard energy or other operational requirements. In this work, we propose Stochastic Latency Training (SLT), a direct training method for SNNs that optimizes the model for the given latency but simultaneously offers a minimum reduction of predictive accuracy when shifted to lower inference latencies. We provide heuristics for our approach with partial theoretical justification and experimental evidence showing the state-of-the-art performance of our models on datasets such as CIFAR-10, DVS-CIFAR-10, CIFAR-100, and DVS-Gesture. 
Our code is available at https://github.com/srinuvaasu/SLT \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Transcription Factor Prediction through Multi-Task Learning (Student Abstract) b/data/2024/aaai/Enhancing Transcription Factor Prediction through Multi-Task Learning (Student Abstract) new file mode 100644 index 0000000000..18be3cb3fd --- /dev/null +++ b/data/2024/aaai/Enhancing Transcription Factor Prediction through Multi-Task Learning (Student Abstract) @@ -0,0 +1 @@ +Transcription factors (TFs) play a fundamental role in gene regulation by selectively binding to specific DNA sequences. Understanding the nature and behavior of these TFs is essential for insights into gene regulation dynamics. In this study, we introduce a robust multi-task learning framework specifically tailored to harness both TF-specific annotations and TF-related domain annotations, thereby enhancing the accuracy of TF predictions. Notably, we incorporate cutting-edge language models that have recently garnered attention for their outstanding performance across various fields, particularly in biological computations like protein sequence modeling. Comparative experimental analysis with existing models, DeepTFactor and TFpredict, reveals that our multi-task learning framework achieves an accuracy exceeding 92% across four evaluation metrics on the TF prediction task, surpassing both competitors. Our work marks a significant leap in the domain of TF prediction, enriching our comprehension of gene regulatory mechanisms and paving the way for the discovery of novel regulatory motifs. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations b/data/2024/aaai/Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations new file mode 100644 index 0000000000..f468604a79 --- /dev/null +++ b/data/2024/aaai/Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations @@ -0,0 +1 @@ +Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations at adapting to new speakers of out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose an innovative negation feature learning paradigm that models decoupled speaker attributes as deviations from the complete audio representation by utilizing the subtraction operation. By eliminating superfluous content information from the speaker representation, our negation scheme not only mitigates content leakage, thereby enhancing synthesis robustness, but also improves speaker fidelity. In addition, to facilitate the learning of diverse speaker attributes, we leverage multi-stream Transformers, which retain multiple hypotheses and instigate a training paradigm akin to ensemble learning. To unify these hypotheses and realize the final speaker representation, we employ attention pooling. Finally, in light of the imperative to generate target text utterances in the desired voice, we adopt adaptive layer normalizations to effectively fuse the previously generated speaker representation with the target text representations, as opposed to mere concatenation of the text and audio modalities. Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-à-vis alternative baseline models. 
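A toy sketch of the negation idea follows: the speaker embedding is modeled as the full utterance representation minus a content-focused representation, so residual content is subtracted out. The encoders, feature sizes, and mean pooling are illustrative assumptions, not the proposed architecture (which uses multi-stream Transformers and attention pooling).

```python
import torch
import torch.nn as nn

class NegationSpeakerEncoder(nn.Module):
    """Speaker representation as (full audio representation) - (content representation)."""
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.audio_enc = nn.GRU(n_mels, dim, batch_first=True)    # whole-utterance encoder
        self.content_enc = nn.GRU(n_mels, dim, batch_first=True)  # content-focused encoder

    def forward(self, mel: torch.Tensor) -> torch.Tensor:         # mel: (batch, frames, n_mels)
        full, _ = self.audio_enc(mel)
        content, _ = self.content_enc(mel)
        return full.mean(dim=1) - content.mean(dim=1)              # (batch, dim)
```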
\ No newline at end of file diff --git a/data/2024/aaai/Enhancing the Efficiency of Altruism and Taxes in Affine Congestion Games through Signalling b/data/2024/aaai/Enhancing the Efficiency of Altruism and Taxes in Affine Congestion Games through Signalling new file mode 100644 index 0000000000..a554adbebc --- /dev/null +++ b/data/2024/aaai/Enhancing the Efficiency of Altruism and Taxes in Affine Congestion Games through Signalling @@ -0,0 +1 @@ +We address the problem of improving the worst-case efficiency of pure Nash equilibria (aka, the price of anarchy) in affine congestion games, through a novel use of signalling. We assume that, for each player in the game, a most preferred strategy is publicly signalled. This can be done either distributedly by the players themselves, or be the outcome of some centralized algorithm. We apply this signalling scheme to two well-studied scenarios: games with partially altruistic players and games with resource taxation. We show a significant improvement in the price of anarchy of these games, whenever the aggregate signalled strategy profile is a good approximation of the game social optimum. \ No newline at end of file diff --git a/data/2024/aaai/Enhancing the Robustness of Spiking Neural Networks with Stochastic Gating Mechanisms b/data/2024/aaai/Enhancing the Robustness of Spiking Neural Networks with Stochastic Gating Mechanisms new file mode 100644 index 0000000000..ab499e4d01 --- /dev/null +++ b/data/2024/aaai/Enhancing the Robustness of Spiking Neural Networks with Stochastic Gating Mechanisms @@ -0,0 +1 @@ +Spiking neural networks (SNNs) exploit neural spikes to provide solutions for low-power intelligent applications on neuromorphic hardware. Although SNNs have high computational efficiency due to spiking communication, they still lack resistance to adversarial attacks and noise perturbations. In the brain, neuronal responses generally possess stochasticity induced by ion channels and synapses, while the role of stochasticity in computing tasks is poorly understood. Inspired by this, we elaborate a stochastic gating spiking neural model for layer-by-layer spike communication, introducing stochasticity to SNNs. Through theoretical analysis, our gating model can be viewed as a regularizer that prevents error amplification under attacks. Meanwhile, our work can explain the robustness of Poisson coding. Experimental results prove that our method can be used alone or with existing robust enhancement algorithms to improve SNN robustness and reduce SNN energy consumption. We hope our work will shed new light on the role of stochasticity in the computation of SNNs. Our code is available at https://github.com/DingJianhao/StoG-meets-SNN/. \ No newline at end of file diff --git a/data/2024/aaai/Entropic Open-Set Active Learning b/data/2024/aaai/Entropic Open-Set Active Learning new file mode 100644 index 0000000000..08a367aeab --- /dev/null +++ b/data/2024/aaai/Entropic Open-Set Active Learning @@ -0,0 +1 @@ +Active Learning (AL) aims to enhance the performance of deep models by selecting the most informative samples for annotation from a pool of unlabeled data. Despite impressive performance in closed-set settings, most AL methods fail in real-world scenarios where the unlabeled data contains unknown categories. Recently, a few studies have attempted to tackle the AL problem for the open-set setting. However, these methods focus more on selecting known samples and do not efficiently utilize unknown samples obtained during AL rounds. 
In this work, we propose an Entropic Open-set AL (EOAL) framework which leverages both known and unknown distributions effectively to select informative samples during AL rounds. Specifically, our approach employs two different entropy scores. One measures the uncertainty of a sample with respect to the known-class distributions. The other measures the uncertainty of the sample with respect to the unknown-class distributions. By utilizing these two entropy scores we effectively separate the known and unknown samples from the unlabeled data resulting in better sampling. Through extensive experiments, we show that the proposed method outperforms existing state-of-the-art methods on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Code is available at https://github.com/bardisafa/EOAL. \ No newline at end of file diff --git a/data/2024/aaai/Entropy Induced Pruning Framework for Convolutional Neural Networks b/data/2024/aaai/Entropy Induced Pruning Framework for Convolutional Neural Networks new file mode 100644 index 0000000000..110e541d94 --- /dev/null +++ b/data/2024/aaai/Entropy Induced Pruning Framework for Convolutional Neural Networks @@ -0,0 +1 @@ +Structured pruning techniques have achieved great compression performance on convolutional neural networks for image classification tasks. However, the majority of existing methods are sensitive with respect to the model parameters, and their pruning results may be unsatisfactory when the original model is trained poorly. That is, they need the original model to be fully trained, to obtain useful weight information. This is time-consuming, and makes the effectiveness of the pruning results dependent on the degree of model optimization. To address the above issue, we propose a novel metric named Average Filter Information Entropy (AFIE). It decomposes the weight matrix of each layer into a low-rank space, and quantifies the filter importance based on the distribution of the normalized eigenvalues. Intuitively, the eigenvalues capture the covariance among filters, and therefore could be a good guide for pruning. Since the distribution of eigenvalues is robust to the updating of parameters, AFIE can yield a stable evaluation for the importance of each filter no matter whether the original model is trained fully. We implement our AFIE-based pruning method for three popular CNN models of AlexNet, VGG-16, and ResNet-50, and test them on three widely-used image datasets MNIST, CIFAR-10, and ImageNet, respectively. The experimental results are encouraging. We surprisingly observe that for our methods, even when the original model is trained with only one epoch, the AFIE score of each filter keeps identical to the results when the model is fully-trained. This fully indicates the effectiveness of the proposed pruning method. \ No newline at end of file diff --git a/data/2024/aaai/Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees b/data/2024/aaai/Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees new file mode 100644 index 0000000000..b39007cf36 --- /dev/null +++ b/data/2024/aaai/Enumerating Safe Regions in Deep Neural Networks with Provable Probabilistic Guarantees @@ -0,0 +1 @@ +Identifying safe areas is a key point to guarantee trust for systems that are based on Deep Neural Networks (DNNs). 
To this end, we introduce the AllDNN-Verification problem: given a safety property and a DNN, enumerate the set of all the regions of the property input domain which are safe, i.e., where the property does hold. Due to the #P-hardness of the problem, we propose an efficient approximation method called ε-ProVe. Our approach exploits a controllable underestimation of the output reachable sets obtained via statistical prediction of tolerance limits, and can provide a tight —with provable probabilistic guarantees— lower estimate of the safe areas. Our empirical evaluation on different standard benchmarks shows the scalability and effectiveness of our method, offering valuable insights for this new type of verification of DNNs. \ No newline at end of file diff --git a/data/2024/aaai/Envy-Free House Allocation under Uncertain Preferences b/data/2024/aaai/Envy-Free House Allocation under Uncertain Preferences new file mode 100644 index 0000000000..fa8e8ca215 --- /dev/null +++ b/data/2024/aaai/Envy-Free House Allocation under Uncertain Preferences @@ -0,0 +1 @@ +Envy-freeness is one of the most important fairness concerns when allocating items. We study envy-free house allocation when agents have uncertain preferences over items and consider several well-studied preference uncertainty models. The central problem that we focus on is computing an allocation that has the highest probability of being envy-free. We show that each model leads to a distinct set of algorithmic and complexity results, including detailed results on (in-)approximability. En route, we consider two related problems of checking whether there exists an allocation that is possibly or necessarily envy-free. We give a complete picture of the computational complexity of these two problems for all the uncertainty models we consider. \ No newline at end of file diff --git a/data/2024/aaai/Episodic Return Decomposition by Difference of Implicitly Assigned Sub-trajectory Reward b/data/2024/aaai/Episodic Return Decomposition by Difference of Implicitly Assigned Sub-trajectory Reward new file mode 100644 index 0000000000..b007ad2326 --- /dev/null +++ b/data/2024/aaai/Episodic Return Decomposition by Difference of Implicitly Assigned Sub-trajectory Reward @@ -0,0 +1 @@ +Real-world decision-making problems are usually accompanied by delayed rewards, which affects the sample efficiency of Reinforcement Learning, especially in the extremely delayed case where the only feedback is the episodic reward obtained at the end of an episode. Episodic return decomposition is a promising way to deal with the episodic-reward setting. Several corresponding algorithms have shown remarkable effectiveness of the learned step-wise proxy rewards from return decomposition. However, these existing methods lack either attribution or representation capacity, leading to inefficient decomposition in the case of long-term episodes. In this paper, we propose a novel episodic return decomposition method called Diaster (Difference of implicitly assigned sub-trajectory reward). Diaster decomposes any episodic reward into credits of two divided sub-trajectories at any cut point, and the step-wise proxy rewards come from differences in expectation. We theoretically and empirically verify that the decomposed proxy reward function can guide the policy to be nearly optimal. Experimental results show that our method outperforms previous state-of-the-art methods in terms of both sample efficiency and performance. The code is available at https://github.com/HxLyn3/Diaster. 
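The cut-point decomposition can be pictured with a small credit network that scores sub-trajectories: the episodic return is regressed onto credit(prefix) + credit(suffix) at a random cut, and the step-wise proxy reward is the difference of consecutive prefix credits. The sketch below makes assumptions about the encoder and shapes and is not the authors' implementation.

```python
import torch
import torch.nn as nn

class SubTrajCredit(nn.Module):
    """Toy credit model for return decomposition in the spirit of Diaster."""
    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def credit(self, states: torch.Tensor) -> torch.Tensor:   # states: (batch, steps, obs_dim)
        h, _ = self.encoder(states)
        return self.head(h[:, -1]).squeeze(-1)                # (batch,)

    def decomposition_loss(self, states, episodic_return, cut: int) -> torch.Tensor:
        """Fit credit(prefix) + credit(suffix) to the episodic return at a given cut point."""
        pred = self.credit(states[:, :cut]) + self.credit(states[:, cut:])
        return (pred - episodic_return).pow(2).mean()

    def proxy_reward(self, states: torch.Tensor, t: int) -> torch.Tensor:
        """r_t approximated by credit(s_0..s_{t+1}) - credit(s_0..s_t)."""
        return self.credit(states[:, : t + 2]) - self.credit(states[:, : t + 1])
```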
\ No newline at end of file diff --git a/data/2024/aaai/Equity-Transformer: Solving NP-Hard Min-Max Routing Problems as Sequential Generation with Equity Context b/data/2024/aaai/Equity-Transformer: Solving NP-Hard Min-Max Routing Problems as Sequential Generation with Equity Context new file mode 100644 index 0000000000..f7af5e5dd0 --- /dev/null +++ b/data/2024/aaai/Equity-Transformer: Solving NP-Hard Min-Max Routing Problems as Sequential Generation with Equity Context @@ -0,0 +1 @@ +Min-max routing problems aim to minimize the maximum tour length among multiple agents as they collaboratively visit all cities, i.e., the completion time. These problems include impactful real-world applications but are known as NP-hard. Existing methods are facing challenges, particularly in large-scale problems that require the coordination of numerous agents to cover thousands of cities. This paper proposes Equity-Transformer to solve large-scale min-max routing problems. First, we model min-max routing problems into sequential planning, reducing the complexity and enabling the use of a powerful Transformer architecture. Second, we propose key inductive biases that ensure equitable workload distribution among agents. The effectiveness of Equity-Transformer is demonstrated through its superior performance in two representative min-max routing tasks: the min-max multi-agent traveling salesman problem (min-max mTSP) and the min-max multi-agent pick-up and delivery problem (min-max mPDP). Notably, our method achieves significant reductions of runtime, approximately 335 times, and cost values of about 53% compared to a competitive heuristic (LKH3) in the case of 100 vehicles with 1,000 cities of mTSP. We provide reproducible source code: https://github.com/kaist-silab/equity-transformer. \ No newline at end of file diff --git a/data/2024/aaai/Equivalence between Graph Spectral Clustering and Column Subset Selection (Student Abstract) b/data/2024/aaai/Equivalence between Graph Spectral Clustering and Column Subset Selection (Student Abstract) new file mode 100644 index 0000000000..db53c5f2a6 --- /dev/null +++ b/data/2024/aaai/Equivalence between Graph Spectral Clustering and Column Subset Selection (Student Abstract) @@ -0,0 +1 @@ +The common criteria for evaluating spectral clustering are NCut and RatioCut. The seemingly unrelated column subset selection (CSS) problem aims to compute a column subset that linearly approximates the entire matrix. A common criterion is the approximation error in the Frobenius norm (ApproxErr). We show that any algorithm for CSS can be viewed as a clustering algorithm that minimizes NCut by applying it to a matrix formed from graph edges. Conversely, any clustering algorithm can be seen as identifying a column subset from that matrix. In both cases, ApproxErr and NCut have the same value. Analogous results hold for RatioCut with a slightly different matrix. Therefore, established results for CSS can be mapped to spectral clustering. We use this to obtain new clustering algorithms, including an optimal one that is similar to A*. This is the first nontrivial clustering algorithm with such an optimality guarantee. A variant of the weighted A* runs much faster and provides bounds on the accuracy. Finally, we use the results from spectral clustering to prove the NP-hardness of CSS from sparse matrices. 
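For reference, the NCut criterion mentioned above is, for a clustering A_1, ..., A_k of a weighted graph, the sum over clusters of cut(A_i, V \ A_i) / vol(A_i); a small helper computing it on a toy graph follows (this is the standard definition, independent of the paper's particular matrix construction).

```python
import numpy as np

def ncut(W: np.ndarray, labels: np.ndarray) -> float:
    """Normalized cut of a clustering given a symmetric weighted adjacency matrix W."""
    value = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        cut = W[np.ix_(in_c, ~in_c)].sum()   # weight leaving the cluster
        vol = W[in_c].sum()                  # total degree inside the cluster
        value += cut / vol if vol > 0 else 0.0
    return value

W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
print(ncut(W, np.array([0, 0, 1, 1])))   # two natural clusters -> small NCut
```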
\ No newline at end of file diff --git a/data/2024/aaai/Estimating On-Road Transportation Carbon Emissions from Open Data of Road Network and Origin-Destination Flow Data b/data/2024/aaai/Estimating On-Road Transportation Carbon Emissions from Open Data of Road Network and Origin-Destination Flow Data new file mode 100644 index 0000000000..2676c2f58c --- /dev/null +++ b/data/2024/aaai/Estimating On-Road Transportation Carbon Emissions from Open Data of Road Network and Origin-Destination Flow Data @@ -0,0 +1 @@ +Accounting for over 20% of the total carbon emissions, the precise estimation of on-road transportation carbon emissions is crucial for carbon emission monitoring and efficient mitigation policy formulation. However, existing estimation methods typically depend on hard-to-collect individual statistics of vehicle miles traveled to calculate emissions, thereby suffering from high data collection difficulty. To relieve this issue by utilizing the strong pattern recognition of artificial intelligence, we incorporate two sources of open data representative of the transportation demand and capacity factors, the origin-destination (OD) flow data and the road network data, to build a hierarchical heterogeneous graph learning method for on-road carbon emission estimation (HENCE). Specifically, a hierarchical graph consisting of the road network level, community level, and region level is constructed to model the multi-scale road network-based connectivity and travel connection between spatial areas. Heterogeneous graphs consisting of OD links and spatial links are further built at both the community level and region level to capture the intrinsic interactions between travel demand and road network accessibility. Extensive experiments on two large-scale real-world datasets demonstrate HENCE's effectiveness and superiority with R-squared exceeding 0.75 and outperforming baselines by 9.60% on average, validating its success in pioneering the use of artificial intelligence to empower carbon emission management and sustainability development. The implementation codes are available at this link: https://github.com/tsinghua-fib-lab/HENCE. \ No newline at end of file diff --git a/data/2024/aaai/EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer b/data/2024/aaai/EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer new file mode 100644 index 0000000000..b6b5411234 --- /dev/null +++ b/data/2024/aaai/EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer @@ -0,0 +1 @@ +Video Motion Magnification (VMM) aims to break the resolution limit of human visual perception capability and reveal the imperceptible minor motion that contains valuable information in the macroscopic domain. However, challenges arise in this task due to photon noise inevitably introduced by photographic devices and spatial inconsistency in amplification, leading to flickering artifacts in static fields and motion blur and distortion in dynamic fields in the video. Existing methods focus on explicit motion modeling without emphasizing prioritized denoising during the motion magnification process. This paper proposes a novel dynamic filtering strategy to achieve static-dynamic field adaptive denoising. Specifically, based on Eulerian theory, we separate texture and shape to extract motion representation through inter-frame shape differences, expecting to leverage these subdivided features to solve this task finely. 
Then, we introduce a novel dynamic filter that eliminates noise cues and preserves critical features in the motion magnification and amplification generation phases. Overall, our unified framework, EulerMormer, is a pioneering effort that is the first to equip learning-based VMM with a Transformer. The core of the dynamic filter lies in a global dynamic sparse cross-covariance attention mechanism that explicitly removes noise while preserving vital information, coupled with a multi-scale dual-path gating mechanism that selectively regulates the dependence on different frequency features to reduce spatial attenuation and complement motion boundaries. We demonstrate through extensive experiments that EulerMormer achieves more robust video motion magnification from the Eulerian perspective, significantly outperforming state-of-the-art methods. The source code is available at https://github.com/VUT-HFUT/EulerMormer. \ No newline at end of file diff --git a/data/2024/aaai/Evaluate Geometry of Radiance Fields with Low-Frequency Color Prior b/data/2024/aaai/Evaluate Geometry of Radiance Fields with Low-Frequency Color Prior new file mode 100644 index 0000000000..835f85e013 --- /dev/null +++ b/data/2024/aaai/Evaluate Geometry of Radiance Fields with Low-Frequency Color Prior @@ -0,0 +1,2 @@ +A radiance field is an effective representation of 3D scenes, which has been widely adopted in novel-view synthesis and 3D reconstruction. It is still an open and challenging problem to evaluate the geometry, i.e., the density field, as the ground-truth is almost impossible to obtain. One alternative indirect solution is to transform the density field into a point-cloud and compute its Chamfer Distance with the scanned ground-truth. However, many widely-used datasets have no point-cloud ground-truth since the scanning process along with the equipment is expensive and complicated. +To this end, we propose a novel metric, named Inverse Mean Residual Color (IMRC), which can evaluate the geometry only with the observation images. Our key insight is that the better the geometry, the lower-frequency the computed color field. From this insight, given a reconstructed density field and observation images, we design a closed-form method to approximate the color field with low-frequency spherical harmonics, and compute the inverse mean residual color. Then the higher the IMRC, the better the geometry. Qualitative and quantitative experimental results verify the effectiveness of our proposed IMRC metric. We also benchmark several state-of-the-art methods using IMRC to promote future related research. Our code is available at https://github.com/qihangGH/IMRC. \ No newline at end of file diff --git a/data/2024/aaai/Evaluating AI Red Teaming's Readiness to Address Environmental Harms: A Thematic Analysis of LLM Discourse b/data/2024/aaai/Evaluating AI Red Teaming's Readiness to Address Environmental Harms: A Thematic Analysis of LLM Discourse new file mode 100644 index 0000000000..8f095983b4 --- /dev/null +++ b/data/2024/aaai/Evaluating AI Red Teaming's Readiness to Address Environmental Harms: A Thematic Analysis of LLM Discourse @@ -0,0 +1 @@ +This research explores the discourse surrounding red teaming and aims to identify any themes in the online discussion of potential environmental harms stemming from Large Language Models (LLMs).
Focusing on the AI Red Teaming event at DEFCON 31, this study employs reflexive thematic analysis on diverse social networking site sources to extract insights into public discussion of LLM red teaming and its environmental implications. The findings intend to inform future research, highlighting the need for responsible AI development that addresses environmental concerns. \ No newline at end of file diff --git a/data/2024/aaai/Evaluating Pre-trial Programs Using Interpretable Machine Learning Matching Algorithms for Causal Inference b/data/2024/aaai/Evaluating Pre-trial Programs Using Interpretable Machine Learning Matching Algorithms for Causal Inference new file mode 100644 index 0000000000..d4c7c2dcc4 --- /dev/null +++ b/data/2024/aaai/Evaluating Pre-trial Programs Using Interpretable Machine Learning Matching Algorithms for Causal Inference @@ -0,0 +1 @@ +After a person is arrested and charged with a crime, they may be released on bail and required to participate in a community supervision program while awaiting trial. These 'pre-trial programs' are common throughout the United States, but very little research has demonstrated their effectiveness. Researchers have emphasized the need for more rigorous program evaluation methods, which we introduce in this article. We describe a program evaluation pipeline that uses recent interpretable machine learning techniques for observational causal inference, and demonstrate these techniques in a study of a pre-trial program in Durham, North Carolina. Our findings show no evidence that the program either significantly increased or decreased the probability of new criminal charges. If these findings replicate, the criminal-legal system needs to either improve pre-trial programs or consider alternatives to them. The simplest option is to release low-risk individuals back into the community without subjecting them to any restrictions or conditions. Another option is to assign individuals to pre-trial programs that incentivize pro-social behavior. We believe that the techniques introduced here can provide researchers the rigorous tools they need to evaluate these programs. \ No newline at end of file diff --git a/data/2024/aaai/Evaluating the Effectiveness of Explainable Artificial Intelligence Approaches (Student Abstract) b/data/2024/aaai/Evaluating the Effectiveness of Explainable Artificial Intelligence Approaches (Student Abstract) new file mode 100644 index 0000000000..b3e46e49ac --- /dev/null +++ b/data/2024/aaai/Evaluating the Effectiveness of Explainable Artificial Intelligence Approaches (Student Abstract) @@ -0,0 +1 @@ +Explainable Artificial Intelligence (XAI), a promising future technology in the field of healthcare, has attracted significant interest. Despite ongoing efforts in the development of XAI approaches, there has been inadequate evaluation of explanation effectiveness and no standardized framework for the evaluation has been established. This study aims to examine the relationship between subjective interpretability and perceived plausibility for various XAI explanations and to determine the factors affecting users' acceptance of the XAI explanation. 
\ No newline at end of file diff --git a/data/2024/aaai/Evaluating the Efficacy of Prompting Techniques for Debiasing Language Model Outputs (Student Abstract) b/data/2024/aaai/Evaluating the Efficacy of Prompting Techniques for Debiasing Language Model Outputs (Student Abstract) new file mode 100644 index 0000000000..3211073722 --- /dev/null +++ b/data/2024/aaai/Evaluating the Efficacy of Prompting Techniques for Debiasing Language Model Outputs (Student Abstract) @@ -0,0 +1 @@ +Achieving fairness in Large Language Models (LLMs) continues to pose a persistent challenge, as these models are prone to inheriting biases from their training data, which can subsequently impact their performance in various applications. There is a need to systematically explore whether structured prompting techniques can offer opportunities for debiased text generation by LLMs. In this work, we designed an evaluative framework to test the efficacy of different prompting techniques for debiasing text along different dimensions. We aim to devise a general structured prompting approach to achieve fairness that generalizes well to different texts and LLMs. \ No newline at end of file diff --git a/data/2024/aaai/Evaluation of Large Language Models on Code Obfuscation (Student Abstract) b/data/2024/aaai/Evaluation of Large Language Models on Code Obfuscation (Student Abstract) new file mode 100644 index 0000000000..45cfe94f62 --- /dev/null +++ b/data/2024/aaai/Evaluation of Large Language Models on Code Obfuscation (Student Abstract) @@ -0,0 +1 @@ +Obfuscation intends to decrease interpretability of code and identification of code behavior. Large Language Models(LLMs) have been proposed for code synthesis and code analysis. This paper attempts to understand how well LLMs can analyse code and identify code behavior. Specifically, this paper systematically evaluates several LLMs’ capabilities to detect obfuscated code and identify behavior across a variety of obfuscation techniques with varying levels of complexity. LLMs proved to be better at detecting obfuscations that changed identifiers, even to misleading ones, compared to obfuscations involving code insertions (unused variables, as well as variables that replace constants with expressions that evaluate to those constants). Hardest to detect were obfuscations that layered multiple simple transformations. For these, only 20-40% of the LLMs’ responses were correct. Adding misleading documentation was also successful in misleading LLMs. We provide all our code to replicate results at https://github.com/SwindleA/LLMCodeObfuscation. Overall, our results suggest a gap in LLMs’ ability to understand code. \ No newline at end of file diff --git a/data/2024/aaai/Every Node Is Different: Dynamically Fusing Self-Supervised Tasks for Attributed Graph Clustering b/data/2024/aaai/Every Node Is Different: Dynamically Fusing Self-Supervised Tasks for Attributed Graph Clustering new file mode 100644 index 0000000000..11f0f51032 --- /dev/null +++ b/data/2024/aaai/Every Node Is Different: Dynamically Fusing Self-Supervised Tasks for Attributed Graph Clustering @@ -0,0 +1 @@ +Attributed graph clustering is an unsupervised task that partitions nodes into different groups. Self-supervised learning (SSL) shows great potential in handling this task, and some recent studies simultaneously learn multiple SSL tasks to further boost performance. Currently, different SSL tasks are assigned the same set of weights for all graph nodes. 
However, we observe that some graph nodes whose neighbors are in different groups require significantly different emphases on SSL tasks. In this paper, we propose to dynamically learn the weights of SSL tasks for different nodes and fuse the embeddings learned from different SSL tasks to boost performance. We design an innovative graph clustering approach, namely Dynamically Fusing Self-Supervised Learning (DyFSS). Specifically, DyFSS fuses features extracted from diverse SSL tasks using distinct weights derived from a gating network. To effectively learn the gating network, we design a dual-level self-supervised strategy that incorporates pseudo labels and the graph structure. Extensive experiments on five datasets show that DyFSS outperforms the state-of-the-art multi-task SSL methods by up to 8.66% on the accuracy metric. The code of DyFSS is available at: https://github.com/q086/DyFSS. \ No newline at end of file diff --git a/data/2024/aaai/Everything2Motion: Synchronizing Diverse Inputs via a Unified Framework for Human Motion Synthesis b/data/2024/aaai/Everything2Motion: Synchronizing Diverse Inputs via a Unified Framework for Human Motion Synthesis new file mode 100644 index 0000000000..33c6eb8400 --- /dev/null +++ b/data/2024/aaai/Everything2Motion: Synchronizing Diverse Inputs via a Unified Framework for Human Motion Synthesis @@ -0,0 +1 @@ +In the dynamic field of film and game development, the emergence of human motion synthesis methods has revolutionized avatar animation. Traditional methodologies, typically reliant on single modality inputs like text or audio, employ modality-specific model frameworks, posing challenges for unified model deployment and application. To address this, we propose Everything2Motion, a unified model framework. Everything2Motion consists of three key modules. The Input-Output Modality Modulation module tailors structures for specific multimodal inputs, eliminating the need for modality-specific frameworks. The Query-aware Autoencoder, based on the transformer encoder-decoder architecture, enables efficient latent motion generation. Lastly, the Prior Motion Distillation Decoder, a pretrained module, enhances the final skeleton sequence's naturalness and fluidity. Comprehensive experiments on several public datasets demonstrate the effectiveness of Everything2Motion, highlighting its potential for practical applications and setting a new benchmark in human motion synthesis. \ No newline at end of file diff --git a/data/2024/aaai/Evidential Uncertainty-Guided Mitochondria Segmentation for 3D EM Images b/data/2024/aaai/Evidential Uncertainty-Guided Mitochondria Segmentation for 3D EM Images new file mode 100644 index 0000000000..2590598812 --- /dev/null +++ b/data/2024/aaai/Evidential Uncertainty-Guided Mitochondria Segmentation for 3D EM Images @@ -0,0 +1 @@ +Recent advances in deep learning have greatly improved the segmentation of mitochondria from Electron Microscopy (EM) images. However, suffering from variations in mitochondrial morphology, imaging conditions, and image noise, existing methods still exhibit high uncertainty in their predictions. Moreover, in view of our findings, predictions with high levels of uncertainty are often accompanied by inaccuracies such as ambiguous boundaries and amount of false positive segments. 
To deal with the above problems, we propose a novel approach for mitochondria segmentation in 3D EM images that leverages evidential uncertainty estimation, which for the first time integrates evidential uncertainty to enhance the performance of segmentation. To be more specific, our proposed method not only provides accurate segmentation results, but also estimates associated uncertainty. Then, the estimated uncertainty is used to help improve the segmentation performance by an uncertainty rectification module, which leverages uncertainty maps and multi-scale information to refine the segmentation. Extensive experiments conducted on four challenging benchmarks demonstrate the superiority of our proposed method over existing approaches. \ No newline at end of file diff --git a/data/2024/aaai/Evolving Parameterized Prompt Memory for Continual Learning b/data/2024/aaai/Evolving Parameterized Prompt Memory for Continual Learning new file mode 100644 index 0000000000..cabf6c3625 --- /dev/null +++ b/data/2024/aaai/Evolving Parameterized Prompt Memory for Continual Learning @@ -0,0 +1 @@ +Recent studies have demonstrated the potency of leveraging prompts in Transformers for continual learning (CL). Nevertheless, employing a discrete key-prompt bottleneck can lead to selection mismatches and inappropriate prompt associations during testing. Furthermore, this approach hinders adaptive prompting due to the lack of shareability among nearly identical instances at more granular level. To address these challenges, we introduce the Evolving Parameterized Prompt Memory (EvoPrompt), a novel method involving adaptive and continuous prompting attached to pre-trained Vision Transformer (ViT), conditioned on specific instance. We formulate a continuous prompt function as a neural bottleneck and encode the collection of prompts on network weights. We establish a paired prompt memory system consisting of a stable reference and a flexible working prompt memory. Inspired by linear mode connectivity, we progressively fuse the working prompt memory and reference prompt memory during inter-task periods, resulting in continually evolved prompt memory. This fusion involves aligning functionally equivalent prompts using optimal transport and aggregating them in parameter space with an adjustable bias based on prompt node attribution. Additionally, to enhance backward compatibility, we propose compositional classifier initialization, which leverages prior prototypes from pre-trained models to guide the initialization of new classifiers in a subspace-aware manner. Comprehensive experiments validate that our approach achieves state-of-the-art performance in both class and domain incremental learning scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Exact ASP Counting with Compact Encodings b/data/2024/aaai/Exact ASP Counting with Compact Encodings new file mode 100644 index 0000000000..4e9c5cc070 --- /dev/null +++ b/data/2024/aaai/Exact ASP Counting with Compact Encodings @@ -0,0 +1,26 @@ +Answer Set Programming (ASP) has emerged as a promising +paradigm in knowledge representation and automated reason- +ing owing to its ability to model hard combinatorial problems +from diverse domains in a natural way. Building on advances +in propositional SAT solving, the past two decades have wit- +nessed the emergence of well-engineered systems for solv- +ing the answer set satisfiability problem, i.e., finding mod- +els or answer sets for a given answer set program. 
In recent
+years, there has been growing interest in problems beyond
+satisfiability, such as model counting, in the context of
+ASP. Akin to the early days of propositional model counting,
+state-of-the-art exact answer set counters do not scale
+well beyond small instances. Exact ASP counters struggle
+with handling large input formulas. The primary contribution
+of this paper is a new ASP counting framework, called
+sharpASP, which counts answer sets while avoiding large input
+formulas. This relies on an alternative way of defining answer
+sets that allows lifting of key techniques developed in the context
+of propositional model counting. Our extensive empirical
+analysis over 1470 benchmarks demonstrates a significant performance
+gain over current state-of-the-art exact answer set
+counters. Specifically, by using sharpASP, we were able to
+solve 1062 benchmarks with a PAR2 score of 3082 whereas,
+using the prior state of the art, we could only solve 895 benchmarks
+with a PAR2 score of 4205, all other experimental conditions
+being the same. \ No newline at end of file diff --git a/data/2024/aaai/Exact Algorithms and Lowerbounds for Multiagent Path Finding: Power of Treelike Topology b/data/2024/aaai/Exact Algorithms and Lowerbounds for Multiagent Path Finding: Power of Treelike Topology new file mode 100644 index 0000000000..62aea62d4b --- /dev/null +++ b/data/2024/aaai/Exact Algorithms and Lowerbounds for Multiagent Path Finding: Power of Treelike Topology @@ -0,0 +1,11 @@ +In the Multiagent Path Finding (MAPF for short) problem, we focus on efficiently finding non-colliding paths for a set of k agents on a given graph G, where each agent seeks a path from its source vertex to a target.
+An important measure of the quality of the solution is the length of the proposed schedule l, that is, the length of a longest path (including the waiting time).
+In this work, we propose a systematic study under the parameterized complexity framework. The hardness results we provide align with many heuristics used for this problem, whose running time could potentially be improved based on our Fixed-Parameter Tractability (FPT) results.
+
+We show that MAPF is W[1]-hard with respect to k (even if k is combined with the maximum degree of the input graph).
+The problem remains NP-hard in planar graphs even if the maximum degree and the makespan l are fixed constants.
+On the positive side, we show an FPT algorithm for k+l.
+
+As we continue, the structure of G comes into play.
+We give an FPT algorithm for parameter k plus the diameter of the graph G.
+The MAPF problem is W[1]-hard for cliquewidth of G plus l while it is FPT for treewidth of G plus l. \ No newline at end of file diff --git a/data/2024/aaai/Exact Inference for Continuous-Time Gaussian Process Dynamics b/data/2024/aaai/Exact Inference for Continuous-Time Gaussian Process Dynamics new file mode 100644 index 0000000000..d8290004b4 --- /dev/null +++ b/data/2024/aaai/Exact Inference for Continuous-Time Gaussian Process Dynamics @@ -0,0 +1 @@ +Many physical systems can be described as continuous-time dynamical systems. In practice, the true system is often unknown and has to be learned from measurement data. Since data is typically collected in discrete time, e.g. by sensors, most methods in Gaussian process (GP) dynamics model learning are trained on one-step ahead predictions. While this scheme is mathematically tempting, it can become problematic in several scenarios, e.g.
if measurements are provided at irregularly-sampled time steps or physical system properties have to be conserved. Thus, we aim for a GP model of the true continuous-time dynamics. We tackle this task by leveraging higher-order numerical integrators. These integrators provide the necessary tools to discretize dynamical systems with arbitrary accuracy. However, most higher-order integrators require dynamics evaluations at intermediate time steps, making exact GP inference intractable. In previous work, this problem is often addressed by approximate inference techniques. However, exact GP inference is preferable in many scenarios, e.g. due to its mathematical guarantees. In order to enable direct inference, we propose to leverage multistep and Taylor integrators. We demonstrate how exact inference schemes can be derived for these types of integrators. Further, we derive tailored sampling schemes that allow one to draw consistent dynamics functions from the posterior. The learned model can thus be integrated with arbitrary integrators, just like a standard dynamical system. We show empirically and theoretically that our approach yields an accurate representation of the continuous-time system. \ No newline at end of file diff --git a/data/2024/aaai/Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption b/data/2024/aaai/Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption new file mode 100644 index 0000000000..bc29556ddd --- /dev/null +++ b/data/2024/aaai/Exact Policy Recovery in Offline RL with Both Heavy-Tailed Rewards and Data Corruption @@ -0,0 +1 @@ +We study offline reinforcement learning (RL) with heavy-tailed reward distribution and data corruption: (i) Moving beyond subGaussian reward distribution, we allow the rewards to have infinite variances; (ii) We allow corruptions where an attacker can arbitrarily modify a small fraction of the rewards and transitions in the dataset. We first derive a sufficient optimality condition for generalized Pessimistic Value Iteration (PEVI), which allows various estimators with proper confidence bounds and can be applied to multiple learning settings. In order to handle the data corruption and heavy-tailed reward setting, we prove that the trimmed-mean estimation achieves the minimax optimal error rate for robust mean estimation under heavy-tailed distributions. In the PEVI algorithm, we plug in the trimmed mean estimation and the confidence bound to solve the robust offline RL problem. Standard analysis reveals that data corruption induces a bias term in the suboptimality gap, which gives the false impression that any data corruption prevents optimal policy learning. By using the optimality condition for the generalized PEVI, we show that as long as the bias term is less than the ``action gap'', the policy returned by PEVI achieves the optimal value given sufficient data. \ No newline at end of file diff --git a/data/2024/aaai/Exact, Fast and Expressive Poisson Point Processes via Squared Neural Families b/data/2024/aaai/Exact, Fast and Expressive Poisson Point Processes via Squared Neural Families new file mode 100644 index 0000000000..186396c1d6 --- /dev/null +++ b/data/2024/aaai/Exact, Fast and Expressive Poisson Point Processes via Squared Neural Families @@ -0,0 +1,7 @@ +We introduce squared neural Poisson point processes (SNEPPPs) by parameterising the intensity function by the squared norm of a two layer neural network. 
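As a rough, self-contained illustration of the parameterisation in the preceding sentence (not code from the paper; the ReLU feature map, layer sizes, and function name are editorial assumptions):

```python
import numpy as np

def snepp_intensity(x, W1, b1, W2):
    """Illustrative sketch: lambda(x) = || W2 @ phi(W1 @ x + b1) ||^2,
    i.e. the squared Euclidean norm of a two-layer network's output.
    The ReLU feature map and the layer shapes are assumptions."""
    hidden = np.maximum(0.0, W1 @ x + b1)  # hidden layer, phi = ReLU (assumed)
    out = W2 @ hidden                      # second (linear) layer
    return float(out @ out)                # squared norm -> non-negative intensity

# Tiny usage example with random parameters (purely illustrative).
rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(16, 2)), rng.normal(size=16), rng.normal(size=(4, 16))
print(snepp_intensity(np.array([0.3, -1.2]), W1, b1, W2))  # always >= 0
```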
+
+When the hidden layer is fixed and the second layer has a single neuron, our approach resembles previous uses of squared Gaussian process or kernel methods, but allowing the hidden layer to be learnt allows for additional flexibility.
+In many cases of interest, the integrated intensity function admits a closed form and can be computed in quadratic time in the number of hidden neurons.
+We enumerate a far more extensive number of such cases than has previously been discussed.
+Our approach is more memory and time efficient than naive implementations of squared or exponentiated kernel methods or Gaussian processes.
+Maximum likelihood and maximum a posteriori estimates in a reparameterisation of the final layer of the intensity function can be obtained by solving a (strongly) convex optimisation problem using projected gradient descent.
+We demonstrate SNEPPPs on real and synthetic benchmarks, and provide a software implementation. \ No newline at end of file diff --git a/data/2024/aaai/Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration b/data/2024/aaai/Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration new file mode 100644 index 0000000000..52525821cb --- /dev/null +++ b/data/2024/aaai/Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration @@ -0,0 +1,2 @@ +Human motion prediction consists of forecasting future body poses from historically observed sequences. It is a longstanding challenge due to motion's complex dynamics and uncertainty. Existing methods focus on building up complicated neural networks to model the motion dynamics. The predicted results are required to be strictly similar to the training samples with an L2 loss in the current training pipeline. However, little attention has been paid to the uncertainty property, which is crucial to the prediction task. We argue that the recorded motion in training data could be an observation of a possible future, rather than a predetermined result. In addition, existing works calculate the prediction error on each future frame equally during training, while recent work indicated that different frames could play different roles. In this work, a novel computationally efficient encoder-decoder model with uncertainty consideration is proposed, which could learn proper characteristics for future frames by a dynamic function. Experimental results on benchmark datasets demonstrate that our uncertainty consideration approach has obvious advantages both quantitatively and qualitatively. Moreover, the proposed method could produce motion sequences with much better quality that avoid the intractable shaking artefacts. We believe our work could provide a novel perspective for considering uncertainty in the general motion prediction task and encourage further studies in this field. The code will be available at
+https://github.com/Motionpre/Adaptive-Salient-Loss-SAGGB. \ No newline at end of file diff --git a/data/2024/aaai/ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment b/data/2024/aaai/ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment new file mode 100644 index 0000000000..41b515ff9c --- /dev/null +++ b/data/2024/aaai/ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment @@ -0,0 +1,5 @@ +The objective of stylized speech-driven facial animation is to create animations that encapsulate specific emotional expressions.
Existing methods often depend on pre-established emotional labels or facial expression templates, which may limit the necessary flexibility for accurately conveying user intent.
+In this research, we introduce a technique that enables the control of arbitrary styles by leveraging natural language as emotion prompts. This technique presents benefits in terms of both flexibility and user-friendliness.
+To realize this objective, we initially construct a Text-Expression Alignment Dataset (TEAD), wherein each facial expression is paired with several prompt-like descriptions. We propose an innovative automatic annotation method, supported by ChatGPT, to expedite the dataset construction, thereby eliminating the substantial expense of manual annotation.
+Following this, we utilize TEAD to train a CLIP-based model, termed ExpCLIP, which encodes text and facial expressions into semantically aligned style embeddings. The embeddings are subsequently integrated into the facial animation generator to yield expressive and controllable facial animations. Given the limited diversity of facial emotions in existing speech-driven facial animation training data, we further introduce an effective Expression Prompt Augmentation (EPA) mechanism to enable the animation generator to support unprecedented richness in style control.
+Comprehensive experiments illustrate that our method accomplishes expressive facial animation generation and offers enhanced flexibility in effectively conveying the desired style. \ No newline at end of file diff --git a/data/2024/aaai/Expand-and-Quantize: Unsupervised Semantic Segmentation Using High-Dimensional Space and Product Quantization b/data/2024/aaai/Expand-and-Quantize: Unsupervised Semantic Segmentation Using High-Dimensional Space and Product Quantization new file mode 100644 index 0000000000..17dc5bcb95 --- /dev/null +++ b/data/2024/aaai/Expand-and-Quantize: Unsupervised Semantic Segmentation Using High-Dimensional Space and Product Quantization @@ -0,0 +1,6 @@ +Unsupervised semantic segmentation (USS) aims to discover and recognize meaningful categories without any labels.
+For a successful USS, two key abilities are required: 1) information compression and 2) clustering capability.
+Previous methods have relied on feature dimension reduction for information compression; however, this approach may hinder the process of clustering.
+In this paper, we propose a novel USS framework called Expand-and-Quantize Unsupervised Semantic Segmentation (EQUSS), which combines the benefits of high-dimensional spaces for better clustering and product quantization for effective information compression.
+Our extensive experiments demonstrate that EQUSS achieves state-of-the-art results on three standard benchmarks.
+In addition, we analyze the entropy of USS features, which is the first step towards understanding USS from the perspective of information theory. \ No newline at end of file diff --git a/data/2024/aaai/ExpeL: LLM Agents Are Experiential Learners b/data/2024/aaai/ExpeL: LLM Agents Are Experiential Learners new file mode 100644 index 0000000000..cfe6b19f5d --- /dev/null +++ b/data/2024/aaai/ExpeL: LLM Agents Are Experiential Learners @@ -0,0 +1 @@ +The recent surge in research interest in applying large language models (LLMs) to decision-making tasks has flourished by leveraging the extensive world knowledge embedded in LLMs.
While there is a growing demand to tailor LLMs for custom decision-making tasks, finetuning them for specific tasks is resource-intensive and may diminish the model's generalization capabilities. Moreover, state-of-the-art language models like GPT-4 and Claude are primarily accessible through API calls, with their parametric weights remaining proprietary and unavailable to the public. This scenario emphasizes the growing need for new methodologies that allow learning from agent experiences without requiring parametric updates. To address these problems, we introduce the Experiential Learning (ExpeL) agent. Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks. At inference, the agent recalls its extracted insights and past experiences to make informed decisions. Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments. \ No newline at end of file diff --git a/data/2024/aaai/Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders b/data/2024/aaai/Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders new file mode 100644 index 0000000000..a9ddc2ed99 --- /dev/null +++ b/data/2024/aaai/Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders @@ -0,0 +1 @@ +Recent advances in vision language pretraining (VLP) have been largely attributed to the large-scale data collected from the web. However, uncurated datasets contain weakly correlated image-text pairs, causing data inefficiency. To address the issue, knowledge distillation has been explored at the expense of extra image and text momentum encoders to generate teaching signals for misaligned image-text pairs. In this paper, our goal is to resolve the misalignment problem with an efficient distillation framework. To this end, we propose ECLIPSE: Expediting Contrastive Language-Image Pretraining with Self-distilled Encoders. ECLIPSE features a distinctive distillation architecture wherein a shared text encoder is utilized between an online image encoder and a momentum image encoder. This strategic design choice enables the distillation to operate within a unified projected space of text embedding, resulting in better performance. Based on the unified text embedding space, ECLIPSE compensates for the additional computational cost of the momentum image encoder by expediting the online image encoder. Through our extensive experiments, we validate that there is a sweet spot between expedition and distillation where the partial view from the expedited online image encoder interacts complementarily with the momentum teacher. As a result, ECLIPSE outperforms its counterparts while achieving substantial acceleration in inference speed. \ No newline at end of file diff --git a/data/2024/aaai/Explainable Earnings Call Representation Learning (Student Abstract) b/data/2024/aaai/Explainable Earnings Call Representation Learning (Student Abstract) new file mode 100644 index 0000000000..a769ea61de --- /dev/null +++ b/data/2024/aaai/Explainable Earnings Call Representation Learning (Student Abstract) @@ -0,0 +1 @@ +Earnings call transcripts hold valuable insights that are vital for investors and analysts when making informed decisions.
However, extracting these insights from lengthy and complex transcripts can be a challenging task. The traditional manual examination is not only time-consuming but also prone to errors and biases. Deep learning-based representation learning methods have emerged as promising and automated approaches to tackle this problem. Nevertheless, they may encounter significant challenges, such as the unreliability of the representation encoding process and certain domain-specific requirements in the context of finance. To address these issues, we propose a novel transcript representation learning model. Our model leverages the structural information of transcripts to effectively extract key insights, while endowing the model with explainability via a variational information bottleneck. Extensive experiments on two downstream financial tasks demonstrate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Explainable Origin-Destination Crowd Flow Interpolation via Variational Multi-Modal Recurrent Graph Auto-Encoder b/data/2024/aaai/Explainable Origin-Destination Crowd Flow Interpolation via Variational Multi-Modal Recurrent Graph Auto-Encoder new file mode 100644 index 0000000000..6492a23ee2 --- /dev/null +++ b/data/2024/aaai/Explainable Origin-Destination Crowd Flow Interpolation via Variational Multi-Modal Recurrent Graph Auto-Encoder @@ -0,0 +1 @@ +Origin-destination (OD) crowd flow, if more accurately inferred at a fine-grained level, has the potential to enhance the efficacy of various urban applications. In practice, however, mining OD crowd flow effectively requires spatially interpolating it, owing to the inevitable missing values. This problem is further complicated by the inherently scarce and noisy nature of OD crowd flow data. In this paper, we propose an uncertainty-aware interpolative and explainable framework, namely UApex, for realizing reliable and trustworthy OD crowd flow interpolation. Specifically, we first design a Variational Multi-modal Recurrent Graph Auto-Encoder (VMR-GAE) for uncertainty-aware OD crowd flow interpolation. A key idea here is to formulate the problem as semi-supervised learning on directed graphs. Next, to mitigate the data scarcity, we incorporate a distribution alignment mechanism that can introduce supplementary modalities into variational inference. Then, a dedicated decoder with a Poisson prior is proposed for OD crowd flow interpolation. Moreover, to make VMR-GAE more trustworthy, we develop an efficient and uncertainty-aware explainer that can provide explanations from the spatiotemporal topology perspective via the Shapley value. Extensive experiments on two real-world datasets validate that VMR-GAE outperforms the state-of-the-art baselines. Also, an exploratory empirical study shows that the proposed explainer can generate meaningful spatiotemporal explanations. \ No newline at end of file diff --git a/data/2024/aaai/Explaining Generalization Power of a DNN Using Interactive Concepts b/data/2024/aaai/Explaining Generalization Power of a DNN Using Interactive Concepts new file mode 100644 index 0000000000..e0d6128774 --- /dev/null +++ b/data/2024/aaai/Explaining Generalization Power of a DNN Using Interactive Concepts @@ -0,0 +1 @@ +This paper explains the generalization power of a deep neural network (DNN) from the perspective of interactions.
Although there is no universally accepted definition of the concepts encoded by a DNN, the sparsity of interactions in a DNN has been proved, i.e., the output score of a DNN can be well explained by a small number of interactions between input variables. In this way, to some extent, we can consider such interactions as interactive concepts encoded by the DNN. Therefore, in this paper, we derive an analytic explanation of the inconsistency of concepts of different complexities. This may shed new light on using the generalization power of concepts to explain the generalization power of the entire DNN. Besides, we discover that a DNN with stronger generalization power usually learns simple concepts more quickly and encodes fewer complex concepts. We also discover the detouring dynamics of learning complex concepts, which explains both the high learning difficulty and the low generalization power of complex concepts. The code will be released when the paper is accepted. \ No newline at end of file diff --git a/data/2024/aaai/Explicit Visual Prompts for Visual Object Tracking b/data/2024/aaai/Explicit Visual Prompts for Visual Object Tracking new file mode 100644 index 0000000000..407604d441 --- /dev/null +++ b/data/2024/aaai/Explicit Visual Prompts for Visual Object Tracking @@ -0,0 +1,3 @@ +How to effectively exploit spatio-temporal information is crucial to capturing target appearance changes in visual tracking. However, most deep learning-based trackers mainly focus on designing a complicated appearance model or template updating strategy, while lacking the exploitation of context between consecutive frames and thus entailing the when-and-how-to-update dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed EVPTrack. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without focusing on updating templates. As a result, we can not only alleviate the challenge of when-to-update, but also avoid the hyper-parameters associated with updating strategies. Then, we utilize the spatio-temporal tokens to generate explicit visual prompts that facilitate inference in the current frame. The prompts are fed into a transformer encoder together with the image tokens without additional processing.
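A minimal sketch of the token handling just described, i.e. prompt tokens concatenated with image tokens and passed jointly to a transformer encoder; the dimensions, module choices, and names are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

embed_dim, n_img_tokens, n_prompt_tokens = 256, 196, 4  # assumed sizes
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)

image_tokens = torch.randn(1, n_img_tokens, embed_dim)      # stand-in for patch embeddings
prompt_tokens = torch.randn(1, n_prompt_tokens, embed_dim)  # stand-in for spatio-temporal prompts

# Prompts are simply concatenated with the image tokens and encoded jointly,
# with no extra processing applied to the prompt tokens.
fused = encoder(torch.cat([prompt_tokens, image_tokens], dim=1))
print(fused.shape)  # torch.Size([1, 200, 256])
```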
\ No newline at end of file diff --git a/data/2024/aaai/Explicitly Perceiving and Preserving the Local Geometric Structures for 3D Point Cloud Attack b/data/2024/aaai/Explicitly Perceiving and Preserving the Local Geometric Structures for 3D Point Cloud Attack new file mode 100644 index 0000000000..acd175df1e --- /dev/null +++ b/data/2024/aaai/Explicitly Perceiving and Preserving the Local Geometric Structures for 3D Point Cloud Attack @@ -0,0 +1 @@ +Deep learning models for point clouds have shown to be vulnerable to adversarial attacks, which have received increasing attention in various safety-critical applications such as autonomous driving, robotics, and surveillance. Existing 3D attack methods generally employ global distance losses to implicitly constrain the point-wise perturbations for optimization. However, these simple losses are quite difficult to accurately measure and restrict the proper 3D geometry as point clouds are highly structured. Although few recent works try to exploit additional shape-aware surface knowledge to globally constrain the point position, they still fail to preserve the detailed point-to-point geometric dependency in different local regions. To this end, in this paper, we propose a novel Multi-grained Geometry-aware Attack (MGA), which explicitly captures the local topology characteristics in different 3D regions for adversarial constraint. Specifically, we first develop multi-scale spectral local filter banks adapting to different 3D object shapes to explore potential geometric structures in local regions. Considering that objects may contain complex geometries, we then extend each filter bank into multi-layer ones to gradually capture the topology contexts of the same region in a coarse-to-fine manner. Hence, the focused local geometric structures will be highlighted in the coefficients calculated by the filtering process. At last, by restricting these coefficients between benign and adversarial samples, our MGA is able to properly measure and preserve the detailed geometry contexts in the whole 3D object with trivial perturbations. Extensive experiments demonstrate that our attack can achieve superior performance on various 3D classification models, with satisfying adversarial imperceptibility and strong resistance to different defense methods. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning (Abstract Reprint) b/data/2024/aaai/Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning (Abstract Reprint) new file mode 100644 index 0000000000..20c392d90a --- /dev/null +++ b/data/2024/aaai/Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning (Abstract Reprint) @@ -0,0 +1 @@ +Offline reinforcement learning—learning a policy from a batch of data—is known to be hard for general MDPs. These results motivate the need to look at specific classes of MDPs where offline reinforcement learning might be feasible. In this work, we explore a restricted class of MDPs to obtain guarantees for offline reinforcement learning. The key property, which we call Action Impact Regularity (AIR), is that actions primarily impact a part of the state (an endogenous component) and have limited impact on the remaining part of the state (an exogenous component). AIR is a strong assumption, but it nonetheless holds in a number of real-world domains including financial markets. 
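To make the AIR property above concrete, here is a toy sketch of a state split into an endogenous component (moved by the action) and an exogenous component (evolving on its own); the dynamics and reward below are invented purely for illustration and are not from the paper:

```python
import numpy as np

class ToyAIREnv:
    """Toy sketch: the action changes only the endogenous component,
    while the exogenous component (e.g. a market price) evolves independently."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.endo = 0.0    # e.g. the agent's inventory (endogenous)
        self.exo = 100.0   # e.g. an exogenous price process

    def step(self, action):
        self.endo += action                            # action impacts endogenous state only
        self.exo *= np.exp(0.01 * self.rng.normal())   # exogenous state ignores the action
        reward = -action * self.exo                    # cost of buying `action` units at the current price
        return (self.endo, self.exo), reward
```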
We discuss algorithms that exploit the AIR property, and provide a theoretical analysis for an algorithm based on Fitted-Q Iteration. Finally, we demonstrate that the algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in simulated and real world environments where the regularity holds. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Auxiliary Caption for Video Grounding b/data/2024/aaai/Exploiting Auxiliary Caption for Video Grounding new file mode 100644 index 0000000000..032a4fab30 --- /dev/null +++ b/data/2024/aaai/Exploiting Auxiliary Caption for Video Grounding @@ -0,0 +1 @@ +Video grounding aims to locate a moment of interest matching the given query sentence from an untrimmed video. Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset. In this paper, we contend that exploiting easily available captions which describe general actions, i.e., auxiliary captions defined in our paper, will significantly boost the performance. To this end, we propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS). To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and query sentences into temporal space and fuse them into visual representations. Considering the gap between auxiliary captions and ground truth, we propose Asymmetric Cross-modal Contrastive Learning (ACCL) for constructing more negative pairs to maximize cross-modal mutual information. Extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Data Geometry in Machine Learning b/data/2024/aaai/Exploiting Data Geometry in Machine Learning new file mode 100644 index 0000000000..a66ee454a1 --- /dev/null +++ b/data/2024/aaai/Exploiting Data Geometry in Machine Learning @@ -0,0 +1 @@ +A key challenge in Machine Learning (ML) is the identification of geometric structure in high-dimensional data. Most algorithms assume that data lives in a high-dimensional vector space; however, many applications involve non-Euclidean data, such as graphs, strings and matrices, or data whose structure is determined by symmetries in the underlying system. Here, we discuss methods for identifying geometric structure in data and how leveraging data geometry can give rise to efficient ML algorithms with provable guarantees. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Discrepancy in Feature Statistic for Out-of-Distribution Detection b/data/2024/aaai/Exploiting Discrepancy in Feature Statistic for Out-of-Distribution Detection new file mode 100644 index 0000000000..643f80ac90 --- /dev/null +++ b/data/2024/aaai/Exploiting Discrepancy in Feature Statistic for Out-of-Distribution Detection @@ -0,0 +1,2 @@ +Recent studies on out-of-distribution (OOD) detection focus on designing models or scoring functions that can effectively distinguish between unseen OOD data and in-distribution (ID) data.
In this paper, we propose a simple yet novel approach
+to OOD detection by leveraging the phenomenon that the average of feature vector elements from a convolutional neural network (CNN) is typically larger for ID data than for OOD data. Specifically, the average of feature vector elements is used as part of the scoring function to further separate OOD data from ID data. We also provide mathematical analysis to explain this phenomenon. Experimental evaluations demonstrate that, when combined with a strong baseline, our method can achieve state-of-the-art performance on several OOD detection benchmarks. Furthermore, our method can be easily integrated into various CNN architectures and requires less computation. Source code address: https://github.com/SYSU-MIA-GROUP/statistical_discrepancy_ood. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Geometry for Treatment Effect Estimation via Optimal Transport b/data/2024/aaai/Exploiting Geometry for Treatment Effect Estimation via Optimal Transport new file mode 100644 index 0000000000..761cd9be18 --- /dev/null +++ b/data/2024/aaai/Exploiting Geometry for Treatment Effect Estimation via Optimal Transport @@ -0,0 +1 @@ +Estimating treatment effects from observational data suffers from the issue of confounding bias, which is induced by the imbalanced confounder distributions between the treated and control groups. As an effective approach, re-weighting learns a group of sample weights to balance the confounder distributions. Existing methods of re-weighting highly rely on a propensity score model or moment alignment. However, for complex real-world data, it is difficult to obtain an accurate propensity score prediction. Although moment alignment is free of learning a propensity score model, accurate estimation for high-order moments is computationally difficult and still remains an open challenge, and first- and second-order moments are insufficient to align the distributions and are easily misled by outliers. In this paper, we exploit geometry to capture the intrinsic structure involved in data for balancing the confounder distributions, so that confounding bias can be reduced even with outliers. To achieve this, we construct a connection between treatment effect estimation and optimal transport, a powerful tool to capture geometric information. After that, we propose an optimal transport model to learn sample weights by extracting geometry from confounders, in which geometric information between groups and within groups is leveraged for better confounder balancing. A projected mirror descent algorithm is employed to solve the derived optimization problem. Experimental studies on both synthetic and real-world datasets demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Label Skews in Federated Learning with Model Concatenation b/data/2024/aaai/Exploiting Label Skews in Federated Learning with Model Concatenation new file mode 100644 index 0000000000..59f918d5b3 --- /dev/null +++ b/data/2024/aaai/Exploiting Label Skews in Federated Learning with Model Concatenation @@ -0,0 +1 @@ +Federated Learning (FL) has emerged as a promising solution to perform deep learning on different data owners without exchanging raw data. However, non-IID data has been a key challenge in FL, which could significantly degrade the accuracy of the final model. Among different non-IID types, label skews have been challenging and common in image classification and other tasks.
Instead of averaging the local models in most previous studies, we propose FedConcat, a simple and effective approach that concatenates these local models as the base of the global model to effectively aggregate the local knowledge. To reduce the size of the global model, we adopt the clustering technique to group the clients by their label distributions and collaboratively train a model inside each cluster. We theoretically analyze the advantage of concatenation over averaging by analyzing the information bottleneck of deep neural networks. Experimental results demonstrate that FedConcat achieves significantly higher accuracy than previous state-of-the-art FL methods in various heterogeneous label skew distribution settings and meanwhile has lower communication costs. Our code is publicly available at https://github.com/sjtudyq/FedConcat. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting Polarized Material Cues for Robust Car Detection b/data/2024/aaai/Exploiting Polarized Material Cues for Robust Car Detection new file mode 100644 index 0000000000..de5d9145fc --- /dev/null +++ b/data/2024/aaai/Exploiting Polarized Material Cues for Robust Car Detection @@ -0,0 +1 @@ +Car detection is an important task that serves as a crucial prerequisite for many automated driving functions. The large variations in lighting/weather conditions and vehicle densities of the scenes pose significant challenges to existing car detection algorithms to meet the highly accurate perception demand for safety, due to the unstable/limited color information, which impedes the extraction of meaningful/discriminative features of cars. In this work, we present a novel learning-based car detection method that leverages trichromatic linear polarization as an additional cue to disambiguate such challenging cases. A key observation is that polarization, characteristic of the light wave, can robustly describe intrinsic physical properties of the scene objects in various imaging conditions and is strongly linked to the nature of materials for cars (e.g., metal and glass) and their surrounding environment (e.g., soil and trees), thereby providing reliable and discriminative features for robust car detection in challenging scenes. To exploit polarization cues, we first construct a pixel-aligned RGB-Polarization car detection dataset, which we subsequently employ to train a novel multimodal fusion network. Our car detection network dynamically integrates RGB and polarization features in a request-and-complement manner and can explore the intrinsic material properties of cars across all learning samples. We extensively validate our method and demonstrate that it outperforms state-of-the-art detection methods. Experimental results show that polarization is a powerful cue for car detection. Our code is available at https://github.com/wind1117/AAAI24-PCDNet. \ No newline at end of file diff --git a/data/2024/aaai/Exploiting the Social-Like Prior in Transformer for Visual Reasoning b/data/2024/aaai/Exploiting the Social-Like Prior in Transformer for Visual Reasoning new file mode 100644 index 0000000000..9d8d45edc0 --- /dev/null +++ b/data/2024/aaai/Exploiting the Social-Like Prior in Transformer for Visual Reasoning @@ -0,0 +1 @@ +Benefiting from instrumental global dependency modeling of self-attention (SA), transformer-based approaches have become the pivotal choices for numerous downstream visual reasoning tasks, such as visual question answering (VQA) and referring expression comprehension (REC). 
However, some studies have recently suggested that SA tends to suffer from rank collapse, thereby inevitably leading to representation degradation as the transformer layer goes deeper. Inspired by social network theory, we attempt to make an analogy between social behavior and regional information interaction in SA, and harness two crucial notions of structural hole and degree centrality in social networks to explore possible optimizations of SA learning, which naturally yields two plug-and-play social-like modules. Based on the structural hole notion, the former module makes information interaction in SA more structured, which effectively avoids redundant information aggregation and global feature homogenization for better rank remedy, followed by the latter module to comprehensively characterize and refine the representation discrimination via considering the degree centrality of regions and the transitivity of relations. Without bells and whistles, our model outperforms a range of baselines by a noticeable margin when considering our social-like prior on five benchmarks in VQA and REC tasks, and a series of explanatory results are showcased to sufficiently reveal the social-like behaviors in SA. \ No newline at end of file diff --git a/data/2024/aaai/Explore 3D Dance Generation via Reward Model from Automatically-Ranked Demonstrations b/data/2024/aaai/Explore 3D Dance Generation via Reward Model from Automatically-Ranked Demonstrations new file mode 100644 index 0000000000..00f12ceb32 --- /dev/null +++ b/data/2024/aaai/Explore 3D Dance Generation via Reward Model from Automatically-Ranked Demonstrations @@ -0,0 +1 @@ +This paper presents an Exploratory 3D Dance generation framework, E3D2, designed to address the exploration capability deficiency in existing music-conditioned 3D dance generation models. Current models often generate monotonous and simplistic dance sequences that misalign with human preferences because they lack exploration capabilities. The E3D2 framework involves a reward model trained from automatically-ranked dance demonstrations, which then guides the reinforcement learning process. This approach encourages the agent to explore and generate high-quality and diverse dance movement sequences. The soundness of the reward model is both theoretically and experimentally validated. Empirical experiments demonstrate the effectiveness of E3D2 on the AIST++ dataset. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Base-Class Suppression with Prior Guidance for Bias-Free One-Shot Object Detection b/data/2024/aaai/Exploring Base-Class Suppression with Prior Guidance for Bias-Free One-Shot Object Detection new file mode 100644 index 0000000000..e8de13f12c --- /dev/null +++ b/data/2024/aaai/Exploring Base-Class Suppression with Prior Guidance for Bias-Free One-Shot Object Detection @@ -0,0 +1 @@ +One-shot object detection (OSOD) aims to detect all object instances towards the given category specified by a query image. Most existing studies in OSOD endeavor to establish effective cross-image correlation with limited query information, while ignoring the problems of the model bias towards the base classes and the generalization degradation on the novel classes. Observing this, we propose a novel algorithm, namely the Base-class Suppression with Prior Guidance (BSPG) network, to achieve bias-free OSOD. Specifically, the objects of base categories can be detected by a base-class predictor and eliminated by a base-class suppression module (BcS).
Moreover, a prior guidance module (PG) is designed to calculate the correlation of high-level features in a non-parametric manner, producing a class-agnostic prior map with unbiased semantic information to guide the subsequent detection process. Equipped with the proposed two modules, we endow the model with a strong discriminative ability to distinguish the target objects from distractors belonging to the base classes. Extensive experiments show that our method outperforms the previous techniques by a large margin and achieves new state-of-the-art performance under various evaluation settings. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Channel-Aware Typical Features for Out-of-Distribution Detection b/data/2024/aaai/Exploring Channel-Aware Typical Features for Out-of-Distribution Detection new file mode 100644 index 0000000000..41a76cb809 --- /dev/null +++ b/data/2024/aaai/Exploring Channel-Aware Typical Features for Out-of-Distribution Detection @@ -0,0 +1 @@ +Detecting out-of-distribution (OOD) data is essential to ensure the reliability of machine learning models when deployed in real-world scenarios. Different from most previous test-time OOD detection methods that focus on designing OOD scores, we delve into the challenges in OOD detection from the perspective of typicality and regard the feature’s high-probability region as the feature’s typical set. However, the existing typical-feature-based OOD detection method implies an assumption: the proportion of typical feature sets for each channel is fixed. According to our experimental analysis, each channel contributes differently to OOD detection. Adopting a fixed proportion for all channels results in several channels losing too many typical features or incorporating too many abnormal features, resulting in low performance. Therefore, exploring the channel-aware typical features is crucial to better-separating ID and OOD data. Driven by this insight, we propose expLoring channel-Aware tyPical featureS (LAPS). Firstly, LAPS obtains the channel-aware typical set by calibrating the channel-level typical set with the global typical set from the mean and standard deviation. Then, LAPS rectifies the features into channel-aware typical sets to obtain channel-aware typical features. Finally, LAPS leverages the channel-aware typical features to calculate the energy score for OOD detection. Theoretical and visual analyses verify that LAPS achieves a better bias-variance trade-off. Experiments verify the effectiveness and generalization of LAPS under different architectures and OOD scores. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Domain Incremental Video Highlights Detection with the LiveFood Benchmark b/data/2024/aaai/Exploring Domain Incremental Video Highlights Detection with the LiveFood Benchmark new file mode 100644 index 0000000000..1f65916a72 --- /dev/null +++ b/data/2024/aaai/Exploring Domain Incremental Video Highlights Detection with the LiveFood Benchmark @@ -0,0 +1 @@ +Video highlights detection (VHD) is an active research field in computer vision, aiming to locate the most user-appealing clips given raw video inputs. However, most VHD methods are based on the closed world assumption, i.e., a fixed number of highlight categories is defined in advance and all training data are available beforehand. Consequently, existing methods have poor scalability with respect to increasing highlight domains and training data. 
To address the above issues, we propose a novel video highlights detection method named Global Prototype Encoding (GPE) to learn incrementally for adapting to new domains via parameterized prototypes. To facilitate this new research direction, we collect a finely annotated dataset termed LiveFood, including over 5,100 live gourmet videos that consist of four domains: ingredients, cooking, presentation, and eating. To the best of our knowledge, this is the first work to explore video highlights detection in the incremental learning setting, opening up new ground for applying VHD in practical scenarios where both the concerned highlight domains and training data increase over time. We demonstrate the effectiveness of GPE through extensive experiments. Notably, GPE surpasses popular domain incremental learning methods on LiveFood, achieving significant mAP improvements on all domains. Concerning the classic datasets, GPE also yields performance comparable to prior art. The code is available at: https://github.com/ForeverPs/IncrementalVHD_GPE. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning of Large Language Models b/data/2024/aaai/Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning of Large Language Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Exploring Gradient Explosion in Generative Adversarial Imitation Learning: A Probabilistic Perspective b/data/2024/aaai/Exploring Gradient Explosion in Generative Adversarial Imitation Learning: A Probabilistic Perspective new file mode 100644 index 0000000000..8a09c4ec3a --- /dev/null +++ b/data/2024/aaai/Exploring Gradient Explosion in Generative Adversarial Imitation Learning: A Probabilistic Perspective @@ -0,0 +1 @@ +Generative Adversarial Imitation Learning (GAIL) stands as a cornerstone approach in imitation learning. This paper investigates the gradient explosion in two types of GAIL: GAIL with deterministic policy (DE-GAIL) and GAIL with stochastic policy (ST-GAIL). We begin with the observation that the training can be highly unstable for DE-GAIL at the beginning of the training phase and end up diverging. Conversely, the ST-GAIL training trajectory remains consistent, reliably converging. To shed light on these disparities, we provide an explanation from a theoretical perspective. By establishing a probabilistic lower bound for GAIL, we demonstrate that gradient explosion is an inevitable outcome for DE-GAIL due to occasionally large expert-imitator policy disparity, whereas ST-GAIL does not suffer from this issue. To substantiate our assertion, we illustrate how modifications in the reward function can mitigate the gradient explosion challenge. Finally, we propose CREDO, a simple yet effective strategy that clips the reward function during the training phase, allowing GAIL to enjoy high data efficiency and stable trainability.
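CREDO is only characterized above as clipping the reward during training; as a loose, generic illustration of that idea (the discriminator-derived reward form and the clipping bounds are assumptions, not the paper's recipe):

```python
import numpy as np

def clipped_gail_reward(d_prob, r_min=-5.0, r_max=5.0, eps=1e-8):
    """Generic sketch: turn a discriminator probability into an imitation reward
    and clip it to a bounded range so occasional large expert-imitator
    disparities cannot blow up the gradient. Bounds are illustrative assumptions."""
    raw = np.log(d_prob + eps) - np.log(1.0 - d_prob + eps)  # one common GAIL-style reward
    return float(np.clip(raw, r_min, r_max))

print(clipped_gail_reward(0.999999))  # very confident discriminator, reward clipped to 5.0
```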
\ No newline at end of file diff --git a/data/2024/aaai/Exploring One-Shot Semi-supervised Federated Learning with Pre-trained Diffusion Models b/data/2024/aaai/Exploring One-Shot Semi-supervised Federated Learning with Pre-trained Diffusion Models new file mode 100644 index 0000000000..2a81a6ba7d --- /dev/null +++ b/data/2024/aaai/Exploring One-Shot Semi-supervised Federated Learning with Pre-trained Diffusion Models @@ -0,0 +1 @@ +Recently, semi-supervised federated learning (semi-FL) has been proposed to handle the commonly seen real-world scenarios with labeled data on the server and unlabeled data on the clients. However, existing methods face several challenges such as communication costs, data heterogeneity, and training pressure on client devices. To address these challenges, we introduce the powerful diffusion models (DM) into semi-FL and propose FedDISC, a Federated Diffusion-Inspired Semi-supervised Co-training method. Specifically, we first extract prototypes of the labeled server data and use these prototypes to predict pseudo-labels of the client data. For each category, we compute the cluster centroids and domain-specific representations to signify the semantic and stylistic information of their distributions. After adding noise, these representations are sent back to the server, which uses the pre-trained DM to generate synthetic datasets complying with the client distributions and trains a global model on them. With the assistance of the vast knowledge within the DM, the synthetic datasets have comparable quality and diversity to the client images, subsequently enabling the training of global models that achieve performance equivalent to or even surpassing the ceiling of supervised centralized training. FedDISC works within one communication round, does not require any local training, and involves very minimal information uploading, greatly enhancing its practicality. Extensive experiments on three large-scale datasets demonstrate that FedDISC effectively addresses the semi-FL problem on non-IID clients and outperforms the compared SOTA methods. Sufficient visualization experiments also illustrate that the synthetic dataset generated by FedDISC exhibits comparable diversity and quality to the original client dataset, with a negligible possibility of leaking privacy-sensitive information of the clients. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation b/data/2024/aaai/Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation new file mode 100644 index 0000000000..7683f30f15 --- /dev/null +++ b/data/2024/aaai/Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation @@ -0,0 +1 @@ +Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs). However, a systematic examination of various quantization schemes, model families, and quantization bit precision has been absent from the literature. In this paper, we conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization using diverse methods such as round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. We apply these methods to two distinct model families with parameters ranging from 125M to 176B.
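Round-to-nearest (RTN), listed above as the simplest of these baselines, can be sketched in a few lines; the per-output-channel symmetric INT4 scheme below is a generic editorial illustration, not the paper's code, and the channel granularity is an assumption:

```python
import numpy as np

def rtn_quantize(weight, n_bits=4):
    """Generic round-to-nearest (RTN) sketch: symmetric per-output-channel
    weight quantization. Returns the dequantized weights and the scales."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for INT4
    scale = np.abs(weight).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard against all-zero rows
    q = np.clip(np.round(weight / scale), -qmax - 1, qmax)
    return q * scale, scale

w = np.random.randn(8, 16).astype(np.float32)
w_hat, _ = rtn_quantize(w, n_bits=4)
print(np.abs(w - w_hat).max())  # quantization error introduced by RTN
```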
Our contributions include: (1) a sensitivity analysis revealing that activation quantization is generally more susceptible to weight quantization, with smaller models often outperforming larger models in terms of activation quantization; (2) an evaluation and comparison of existing PTQ methods to optimize model size reduction while minimizing the impact on accuracy, revealing that none of the current methods can achieve the original model quality for quantization with either INT4-weight or INT4-weight-and-INT8-activation; (3) based on these insights, we propose an optimized method called Low-Rank Compensation (LoRC), which employs low-rank matrices to enhance model quality recovery with a minimal increase in model size. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection b/data/2024/aaai/Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection new file mode 100644 index 0000000000..1ede98348d --- /dev/null +++ b/data/2024/aaai/Exploring Self- and Cross-Triplet Correlations for Human-Object Interaction Detection @@ -0,0 +1 @@ +Human-Object Interaction (HOI) detection plays a vital role in scene understanding, which aims to predict the HOI triplet in the form of ⟨human, action, object⟩. Existing methods mainly extract multi-modal features (e.g., appearance, object semantics, human pose) and then fuse them together to directly predict HOI triplets. However, most of these methods focus on self-triplet aggregation, but ignore the potential cross-triplet dependencies, resulting in ambiguity of action prediction. In this work, we propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI detection. Specifically, we regard each triplet proposal as a graph where Human and Object represent nodes and Action indicates the edge, to aggregate self-triplet correlation. Also, we try to explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations. Besides, we leverage the CLIP model to help our SCTC obtain interaction-aware features via knowledge distillation, which provides useful action clues for HOI detection. Extensive experiments on HICO-DET and V-COCO datasets verify the effectiveness of our proposed SCTC. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction b/data/2024/aaai/Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction new file mode 100644 index 0000000000..82df9b01e1 --- /dev/null +++ b/data/2024/aaai/Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction @@ -0,0 +1 @@ +Visual prompts have provided an efficient means of addressing visual cross-domain problems. Previous works introduce domain prompts to tackle the classification Test-Time Adaptation (TTA) problem by placing image-level prompts on the input and fine-tuning prompts for each target domain. However, since the image-level prompts mask out continuous spatial details in the prompt-allocated region, they suffer from inaccurate contextual information and limited domain knowledge extraction, particularly when dealing with dense prediction TTA problems. To overcome these challenges, we propose a novel Sparse Visual Domain Prompts (SVDP) approach, which applies minimal trainable parameters (e.g., 0.1%) to pixels across the entire image and reserves more spatial information of the input.
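A minimal sketch of the sparse pixel-level prompting just described; only the roughly 0.1% sparsity is taken from the sentence above, while the additive formulation, the random mask, and the names are assumptions for illustration:

```python
import torch

def apply_sparse_prompt(image, prompt, mask):
    """Add trainable prompt values only at the sparsely selected pixel locations."""
    return image + prompt * mask

C, H, W = 3, 64, 64
image = torch.rand(1, C, H, W)
prompt = torch.zeros(1, C, H, W, requires_grad=True)  # trainable sparse prompt
keep = torch.rand(1, 1, H, W) < 0.001                 # ~0.1% of pixels carry a prompt
prompted = apply_sparse_prompt(image, prompt, keep.float())
print(int(keep.sum()), prompted.shape)
```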
To better apply SVDP in extracting domain-specific knowledge, we introduce the Domain Prompt Placement (DPP) method to adaptively allocate the trainable parameters of SVDP to the pixels with large distribution shifts. Furthermore, recognizing that each target domain sample exhibits a unique domain shift, we design the Domain Prompt Updating (DPU) strategy to optimize prompt parameters differently for each sample, facilitating efficient adaptation to the target domain. Extensive experiments were conducted on widely-used TTA and continual TTA benchmarks, and our proposed method achieves state-of-the-art performance in both semantic segmentation and depth estimation tasks. \ No newline at end of file diff --git a/data/2024/aaai/Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation b/data/2024/aaai/Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation new file mode 100644 index 0000000000..d9c6a5b37c --- /dev/null +++ b/data/2024/aaai/Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation @@ -0,0 +1 @@ +This paper tackles the problem of efficient and stable video semantic segmentation. While stability has been under-explored, prevalent work in efficient video semantic segmentation uses the keyframe paradigm. These methods efficiently process videos by only recomputing the low-level features and reusing high-level features computed at selected keyframes. In addition, the reused features stabilize the predictions across frames, thereby improving video consistency. However, dynamic scenes in the video can easily lead to misalignments between reused and recomputed features, which hampers performance. Moreover, relying on feature reuse to improve prediction consistency is brittle; an erroneous alignment of the features can easily lead to unstable predictions. Therefore, the keyframe paradigm exhibits a dilemma between stability and performance. We address this efficiency and stability challenge using a novel yet simple Temporal Feature Correlation (TFC) module. It uses the cosine similarity between two frames’ low-level features to inform the semantic label’s consistency across frames. Specifically, we selectively reuse label-consistent features across frames through linear interpolation and update others through sparse multi-scale deformable attention. As a result, we no longer directly reuse features to improve stability and thus effectively solve feature misalignment. This work provides a significant step towards efficient and stable video semantic segmentation. On the VSPW dataset, our method significantly improves the prediction consistency of image-based methods while being as fast and accurate. \ No newline at end of file diff --git a/data/2024/aaai/Exponent Relaxation of Polynomial Zonotopes and Its Applications in Formal Neural Network Verification b/data/2024/aaai/Exponent Relaxation of Polynomial Zonotopes and Its Applications in Formal Neural Network Verification new file mode 100644 index 0000000000..748e2b42c3 --- /dev/null +++ b/data/2024/aaai/Exponent Relaxation of Polynomial Zonotopes and Its Applications in Formal Neural Network Verification @@ -0,0 +1,9 @@ +Formal verification of neural networks is a challenging problem due to the complexity and nonlinearity of neural networks.
+ It has been shown that polynomial zonotopes can tightly enclose the output set of a neural network.
+ Unfortunately, the tight enclosure comes with additional complexity in the set representation, + thus, rendering subsequent operations expensive to compute, such as computing interval bounds and intersection checking. + To address this issue, we present a novel approach to restructure a polynomial zonotope to tightly enclose the original polynomial zonotope + while drastically reducing its complexity. + The restructuring is achieved by relaxing the exponents of the dependent factors of polynomial zonotopes and finding an appropriate approximation error. + We demonstrate the applicability of our approach on output sets of neural networks, + where we obtain tighter results in various subsequent operations, such as order reduction, zonotope enclosure, and range bounding. \ No newline at end of file diff --git a/data/2024/aaai/Exponential Hardness of Optimization from the Locality in Quantum Neural Networks b/data/2024/aaai/Exponential Hardness of Optimization from the Locality in Quantum Neural Networks new file mode 100644 index 0000000000..b35b2b66d0 --- /dev/null +++ b/data/2024/aaai/Exponential Hardness of Optimization from the Locality in Quantum Neural Networks @@ -0,0 +1 @@ +Quantum neural networks (QNNs) have become a leading paradigm for establishing near-term quantum applications in recent years. The trainability issue of QNNs has garnered extensive attention, spurring demand for a comprehensive analysis of QNNs in order to identify viable solutions. In this work, we propose a perspective that characterizes the trainability of QNNs based on their locality. We prove that the entire variation range of the loss function via adjusting any local quantum gate vanishes exponentially in the number of qubits with a high probability for a broad class of QNNs. This result reveals extra harsh constraints independent of gradients and unifies the restrictions on gradient-based and gradient-free optimizations naturally. We showcase the validity of our results with numerical simulations of representative models and examples. Our findings, as a fundamental property of random quantum circuits, deepen the understanding of the role of locality in QNNs and serve as a guideline for assessing the effectiveness of diverse training strategies for quantum neural networks. \ No newline at end of file diff --git a/data/2024/aaai/Exposing the Deception: Uncovering More Forgery Clues for Deepfake Detection b/data/2024/aaai/Exposing the Deception: Uncovering More Forgery Clues for Deepfake Detection new file mode 100644 index 0000000000..a3db4777ef --- /dev/null +++ b/data/2024/aaai/Exposing the Deception: Uncovering More Forgery Clues for Deepfake Detection @@ -0,0 +1,3 @@ +Deepfake technology has given rise to a spectrum of novel and compelling applications. Unfortunately, the widespread proliferation of high-fidelity fake videos has led to pervasive confusion and deception, shattering our faith that seeing is believing. One aspect that has been overlooked so far is that current deepfake detection approaches may easily fall into the trap of overfitting, focusing only on forgery clues within one or a few local regions. Moreover, existing works heavily rely on neural networks to extract forgery features, lacking theoretical constraints guaranteeing that sufficient forgery clues are extracted and superfluous features are eliminated. These deficiencies culminate in unsatisfactory accuracy and limited generalizability in real-life scenarios. 
+ +In this paper, we try to tackle these challenges through three designs: (1) We present a novel framework to capture broader forgery clues by extracting multiple non-overlapping local representations and fusing them into a global semantic-rich feature. (2) Based on the information bottleneck theory, we derive Local Information Loss to guarantee the orthogonality of local representations while preserving comprehensive task-relevant information. (3) Further, to fuse the local representations and remove task-irrelevant information, we arrive at a Global Information Loss through the theoretical analysis of mutual information. Empirically, our method achieves state-of-the-art performance on five benchmark datasets. Our code is available at https://github.com/QingyuLiu/Exposing-the-Deception; we hope it inspires further research. \ No newline at end of file diff --git a/data/2024/aaai/Expressive Forecasting of 3D Whole-Body Human Motions b/data/2024/aaai/Expressive Forecasting of 3D Whole-Body Human Motions new file mode 100644 index 0000000000..d8e4a93ff2 --- /dev/null +++ b/data/2024/aaai/Expressive Forecasting of 3D Whole-Body Human Motions @@ -0,0 +1,2 @@ +Human motion forecasting, with the goal of estimating future human behavior over a period of time, is a fundamental task in many real-world applications. However, existing works typically concentrate on foretelling the major joints of the human body without considering the delicate movements of the human hands. +In practical applications, hand gestures play an important role in human communication with the real world and express the primary intentions of human beings. In this work, we are the first to formulate the whole-body human pose forecasting task, which jointly predicts future body and gesture activities. Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) framework that aims to predict both coarse (body joints) and fine-grained (gestures) activities collaboratively, enabling expressive and cross-facilitated forecasting of 3D whole-body human motions. Specifically, our model involves two key constituents: cross-context alignment (XCA) and cross-context interaction (XCI). Considering the heterogeneous information within the whole body, XCA aims to align the latent features of various human components, while XCI focuses on effectively capturing the context interaction among the human components. We conduct extensive experiments on a newly-introduced large-scale benchmark and achieve state-of-the-art performance. The code is public for research purposes at https://github.com/Dingpx/EAI. \ No newline at end of file diff --git a/data/2024/aaai/Expressive Multi-Agent Communication via Identity-Aware Learning b/data/2024/aaai/Expressive Multi-Agent Communication via Identity-Aware Learning new file mode 100644 index 0000000000..2bb88d1bc7 --- /dev/null +++ b/data/2024/aaai/Expressive Multi-Agent Communication via Identity-Aware Learning @@ -0,0 +1 @@ +Information sharing through communication is essential for tackling complex multi-agent reinforcement learning tasks. Many existing multi-agent communication protocols can be viewed as instances of message passing graph neural networks (GNNs). However, due to the significantly limited expressive ability of the standard GNN method, the agent feature representations remain similar and indistinguishable even though the agents have different neighborhood structures.
This further results in the homogenization of agent behaviors and reduces the capability to solve tasks effectively. In this paper, we propose a multi-agent communication protocol via identity-aware learning (IDEAL), which explicitly enhances the distinguishability of agent feature representations to break the diversity bottleneck. Specifically, IDEAL extends existing multi-agent communication protocols by inductively considering the agents' identities during the message passing process. To obtain expressive feature representations for a given agent, IDEAL first extracts the ego network centered around that agent and then performs multiple rounds of heterogeneous message passing, where different parameter sets are applied to the central agent and the other surrounding agents within the ego network. IDEAL fosters expressive communication between agents and generates distinguishable feature representations, which promotes action diversity and individuality emergence. Experimental results on various benchmarks demonstrate IDEAL can be flexibly integrated into various multi-agent communication methods and enhances the corresponding performance. \ No newline at end of file diff --git a/data/2024/aaai/Expressive and Flexible Simulation of Information Spread Strategies in Social Networks Using Planning b/data/2024/aaai/Expressive and Flexible Simulation of Information Spread Strategies in Social Networks Using Planning new file mode 100644 index 0000000000..4b5e707817 --- /dev/null +++ b/data/2024/aaai/Expressive and Flexible Simulation of Information Spread Strategies in Social Networks Using Planning @@ -0,0 +1 @@ +In the digital age, understanding the dynamics of information spread and opinion formation within networks is paramount. This research introduces an innovative framework that combines the principles of opinion dynamics with the strategic capabilities of Automated Planning. We have developed, to the best of our knowledge, the first-ever numeric PDDL tailored for opinion dynamics. Our tool empowers users to visualize intricate networks, simulate the evolution of opinions, and strategically influence that evolution to achieve specific outcomes. By harnessing Automated Planning techniques, our framework offers a nuanced approach to devise sequences of actions tailored to transition a network from its current opinion landscape to a desired state. This holistic approach provides insights into the intricate interplay of individual nodes within a network and paves the way for targeted interventions. Furthermore, the tool facilitates human-AI collaboration, enabling users to not only understand information spread but also devise practical strategies to mitigate potential harmful outcomes arising from it. Demo Video link - https://tinyurl.com/3k7bp99h \ No newline at end of file diff --git a/data/2024/aaai/FACL-Attack: Frequency-Aware Contrastive Learning for Transferable Adversarial Attacks b/data/2024/aaai/FACL-Attack: Frequency-Aware Contrastive Learning for Transferable Adversarial Attacks new file mode 100644 index 0000000000..fe7705e570 --- /dev/null +++ b/data/2024/aaai/FACL-Attack: Frequency-Aware Contrastive Learning for Transferable Adversarial Attacks @@ -0,0 +1 @@ +Deep neural networks are known to be vulnerable to security risks due to the inherent transferable nature of adversarial examples. 
Despite the success of recent generative model-based attacks demonstrating strong transferability, it still remains a challenge to design an efficient attack strategy in a real-world strict black-box setting, where both the target domain and model architectures are unknown. In this paper, we seek to explore a feature contrastive approach in the frequency domain to generate adversarial examples that are robust in both cross-domain and cross-model settings. With that goal in mind, we propose two modules that are only employed during the training phase: a Frequency-Aware Domain Randomization (FADR) module to randomize domain-variant low- and high-range frequency components and a Frequency-Augmented Contrastive Learning (FACL) module to effectively separate domain-invariant mid-frequency features of clean and perturbed images. We demonstrate strong transferability of our generated adversarial perturbations through extensive cross-domain and cross-model experiments, while keeping the inference-time complexity unchanged. \ No newline at end of file diff --git a/data/2024/aaai/FAIR-FER: A Latent Alignment Approach for Mitigating Bias in Facial Expression Recognition (Student Abstract) b/data/2024/aaai/FAIR-FER: A Latent Alignment Approach for Mitigating Bias in Facial Expression Recognition (Student Abstract) new file mode 100644 index 0000000000..1cbdeaaa29 --- /dev/null +++ b/data/2024/aaai/FAIR-FER: A Latent Alignment Approach for Mitigating Bias in Facial Expression Recognition (Student Abstract) @@ -0,0 +1 @@ +Facial Expression Recognition (FER) is an extensively explored research problem in the domain of computer vision and artificial intelligence. FER, a supervised learning problem, requires significant training data representative of multiple socio-cultural demographic attributes. However, most FER datasets consist of images annotated by humans, which propagates individual and demographic biases. This work attempts to mitigate this bias using representation learning based on latent spaces, thereby increasing a deep learning model's fairness and overall accuracy. \ No newline at end of file diff --git a/data/2024/aaai/FAVOR: Full-Body AR-Driven Virtual Object Rearrangement Guided by Instruction Text b/data/2024/aaai/FAVOR: Full-Body AR-Driven Virtual Object Rearrangement Guided by Instruction Text new file mode 100644 index 0000000000..348ce964ca --- /dev/null +++ b/data/2024/aaai/FAVOR: Full-Body AR-Driven Virtual Object Rearrangement Guided by Instruction Text @@ -0,0 +1 @@ +Rearrangement operations form the crux of interactions between humans and their environment. The ability to generate natural, fluid sequences of these operations is of essential value in AR/VR and CG. Bridging a gap in the field, our study introduces FAVOR: a novel dataset for Full-body AR-driven Virtual Object Rearrangement that uniquely employs motion capture systems and AR eyeglasses. Comprising 3k diverse motion rearrangement sequences and 7.17 million interaction data frames, this dataset breaks new ground in research data. We also present a pipeline, FAVORITE, for producing digital human rearrangement motion sequences guided by instructions. Experimental results, both qualitative and quantitative, suggest that this dataset and pipeline deliver high-quality motion sequences. Our dataset, code, and appendix are available at https://kailinli.github.io/FAVOR.
\ No newline at end of file diff --git a/data/2024/aaai/FD3D: Exploiting Foreground Depth Map for Feature-Supervised Monocular 3D Object Detection b/data/2024/aaai/FD3D: Exploiting Foreground Depth Map for Feature-Supervised Monocular 3D Object Detection new file mode 100644 index 0000000000..ebca5ff0f3 --- /dev/null +++ b/data/2024/aaai/FD3D: Exploiting Foreground Depth Map for Feature-Supervised Monocular 3D Object Detection @@ -0,0 +1 @@ +Monocular 3D object detection usually adopts direct or hierarchical label supervision. Recently, distillation supervision has transferred spatial knowledge from LiDAR- or stereo-based teacher networks to monocular detectors, but the domain gap remains. To mitigate this issue and pursue adequate label manipulation, we exploit the foreground depth map for feature-supervised monocular 3D object detection, named FD3D, which develops high-quality instructive intermediate features to conduct desirable auxiliary feature supervision with only the original image and the annotation foreground object-wise depth map (AFOD) as input. Furthermore, we build up our instructive feature generation network to create instructive spatial features based on the sufficient correlation between image features and the pre-processed AFOD, where AFOD provides the attention focus only on foreground objects to achieve clearer guidance in the detection task. Moreover, we apply the auxiliary feature supervision at the pixel and distribution levels to achieve comprehensive spatial knowledge guidance. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both the KITTI and nuScenes datasets, with no external data and no extra inference computational cost. We also conduct quantitative and qualitative studies to reveal the effectiveness of our designs. \ No newline at end of file diff --git a/data/2024/aaai/FFT-Based Dynamic Token Mixer for Vision b/data/2024/aaai/FFT-Based Dynamic Token Mixer for Vision new file mode 100644 index 0000000000..47684a74f4 --- /dev/null +++ b/data/2024/aaai/FFT-Based Dynamic Token Mixer for Vision @@ -0,0 +1 @@ +Multi-head-self-attention (MHSA)-equipped models have achieved notable performance in computer vision. Their computational complexity is quadratic in the number of pixels in the input feature maps, resulting in slow processing, especially when dealing with high-resolution images. New types of token-mixer have been proposed as alternatives to MHSA to circumvent this problem: an FFT-based token-mixer involves global operations similar to MHSA but with lower computational complexity. However, despite its attractive properties, the FFT-based token-mixer has not been carefully examined in terms of its compatibility with the rapidly evolving MetaFormer architecture. Here, we propose a novel token-mixer called Dynamic Filter and novel image recognition models, DFFormer and CDFFormer, to close the gaps above. The results of image classification and downstream tasks, analysis, and visualization show that our models are effective. Notably, their throughput and memory efficiency when dealing with high-resolution image recognition are remarkable. Our results indicate that Dynamic Filter is one of the token-mixer options that should be seriously considered.
The code is available at https://github.com/okojoalg/dfformer \ No newline at end of file diff --git a/data/2024/aaai/FG-EmoTalk: Talking Head Video Generation with Fine-Grained Controllable Facial Expressions b/data/2024/aaai/FG-EmoTalk: Talking Head Video Generation with Fine-Grained Controllable Facial Expressions new file mode 100644 index 0000000000..e2c8099b32 --- /dev/null +++ b/data/2024/aaai/FG-EmoTalk: Talking Head Video Generation with Fine-Grained Controllable Facial Expressions @@ -0,0 +1 @@ +Although deep generative models have greatly improved one-shot video-driven talking head generation, few studies address fine-grained controllable facial expression editing, which is crucial for practical applications. Existing methods rely on a fixed set of predefined discrete emotion labels or simply copy expressions from input videos. This is limiting as expressions are complex, and methods using only emotion labels cannot generate fine-grained, accurate or mixed expressions. Generating talking head videos with precise expressions is also difficult using 3D model-based approaches, as 3DMM only models facial movements and tends to produce deviations. In this paper, we propose a novel framework enabling fine-grained facial expression editing in talking face generation. Our goal is to achieve expression control by manipulating the intensities of individual facial Action Units (AUs) or groups. First, compared with existing methods which decouple the face into pose and expression, we propose a disentanglement scheme to isolate three components from the human face, namely, appearance, pose, and expression. Second, we propose to use input AUs to control muscle group intensities in the generated face, and integrate the AU features with the disentangled expression latent code. Finally, we present a self-supervised training strategy with well-designed constraints. Experiments show our method achieves fine-grained expression control, produces high-quality talking head videos and outperforms baseline methods. \ No newline at end of file diff --git a/data/2024/aaai/FLAME: A Small Language Model for Spreadsheet Formulas b/data/2024/aaai/FLAME: A Small Language Model for Spreadsheet Formulas new file mode 100644 index 0000000000..96bb87d9ac --- /dev/null +++ b/data/2024/aaai/FLAME: A Small Language Model for Spreadsheet Formulas @@ -0,0 +1,15 @@ +Spreadsheets are a vital tool for end-user data management. Using large language +models for formula authoring assistance in these environments can be difficult, +as these models are expensive to train and challenging to deploy due to their +size (up to billions of parameters). We present FLAME, a transformer-based model +trained exclusively on Excel formulas that leverages domain insights to achieve +competitive performance while being substantially smaller (60M parameters) and +training on two orders of magnitude less data. We curate a training dataset +using sketch deduplication, introduce an Excel-specific formula tokenizer, and +use domain-specific versions of masked span prediction and noisy auto-encoding +as pre-training objectives. We evaluate FLAME on formula repair, formula +completion, and similarity-based formula retrieval. FLAME can outperform much +larger models, such as the Davinci (175B) and Cushman (12B) variants of Codex +and CodeT5 (220M), in 10 of 14 evaluation settings for the repair and completion +tasks. For formula retrieval, FLAME outperforms CodeT5, CodeBERT, and
\ No newline at end of file diff --git a/data/2024/aaai/FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection b/data/2024/aaai/FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection new file mode 100644 index 0000000000..c117fee597 --- /dev/null +++ b/data/2024/aaai/FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection @@ -0,0 +1 @@ +The superior performances of pre-trained foundation models in various visual tasks underscore their potential to enhance the 2D models' open-vocabulary ability. Existing methods explore analogous applications in the 3D space. However, most of them only center around knowledge extraction from singular foundation models, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from various foundation models can improve knowledge transfer from 2D pre-trained visual language models to the 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary without facing constraints from original 3D datasets. Specifically, to learn the open-vocabulary 3D localization ability, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition ability, We leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP. The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git. \ No newline at end of file diff --git a/data/2024/aaai/FMRNet: Image Deraining via Frequency Mutual Revision b/data/2024/aaai/FMRNet: Image Deraining via Frequency Mutual Revision new file mode 100644 index 0000000000..07d3af0111 --- /dev/null +++ b/data/2024/aaai/FMRNet: Image Deraining via Frequency Mutual Revision @@ -0,0 +1 @@ +The wavelet transform has emerged as a powerful tool in deciphering structural information within images. And now, the latest research suggests that combining the prowess of wavelet transform with neural networks can lead to unparalleled image deraining results. By harnessing the strengths of both the spatial domain and frequency space, this innovative approach is poised to revolutionize the field of image processing. The fascinating challenge of developing a comprehensive framework that takes into account the intrinsic frequency property and the correlation between rain residue and background is yet to be fully explored. In this work, we propose to investigate the potential relationships among rain-free and residue components at the frequency domain, forming a frequency mutual revision network (FMRNet) for image deraining. Specifically, we explore the mutual representation of rain residue and background components at frequency domain, so as to better separate the rain layer from clean background while preserving structural textures of the degraded images. 
Meanwhile, the rain distribution prediction from the low-frequency coefficient, which can be seen as a degradation prior, is used to refine the separation of rain residue and background components. Inversely, the updated rain residue is used to benefit the low-frequency rain distribution prediction, forming a multi-layer mutual learning scheme. Extensive experiments demonstrate that our proposed FMRNet delivers significant performance gains on the image deraining task across seven datasets, surpassing the state-of-the-art method ELFormer by 1.14 dB in PSNR on the Rain100L dataset, with a similar computation cost. Code and retrained models are available at https://github.com/kuijiang94/FMRNet. \ No newline at end of file diff --git a/data/2024/aaai/FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields b/data/2024/aaai/FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields new file mode 100644 index 0000000000..ee063642ac --- /dev/null +++ b/data/2024/aaai/FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields @@ -0,0 +1 @@ +We present FPRF, a feed-forward photorealistic style transfer method for large-scale 3D neural radiance fields. FPRF stylizes large-scale 3D scenes with arbitrary, multiple style reference images without additional optimization while preserving multi-view appearance consistency. Prior arts required tedious per-style/-scene optimization and were limited to small-scale 3D scenes. FPRF efficiently stylizes large-scale 3D scenes by introducing a style-decomposed 3D neural radiance field, which inherits AdaIN’s feed-forward stylization machinery, supporting arbitrary style reference images. Furthermore, FPRF supports multi-reference stylization with semantic correspondence matching and local AdaIN, which adds diverse user control over 3D scene styles. FPRF also preserves multi-view consistency by applying semantic matching and style transfer processes directly onto queried features in 3D space. In experiments, we demonstrate that FPRF achieves favorable photorealistic quality 3D scene stylization for large-scale scenes with diverse reference images. \ No newline at end of file diff --git a/data/2024/aaai/FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection b/data/2024/aaai/FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection new file mode 100644 index 0000000000..3884b2e9ff --- /dev/null +++ b/data/2024/aaai/FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection @@ -0,0 +1 @@ +Rotation-equivariance is an essential yet challenging property in oriented object detection. While general object detectors naturally leverage robustness to spatial shifts due to the translation-equivariance of conventional CNNs, achieving rotation-equivariance remains an elusive goal. Current detectors deploy various alignment techniques to derive rotation-invariant features, but still rely on high-capacity models and heavy data augmentation with all possible rotations. In this paper, we introduce a Fully Rotation-Equivariant Oriented Object Detector (FRED), whose entire process from the image to the bounding box prediction is strictly equivariant. Specifically, we decouple the invariant task (object classification) and the equivariant task (object localization) to achieve end-to-end equivariance. We represent the bounding box as a set of rotation-equivariant vectors to implement rotation-equivariant localization.
Moreover, we utilize these rotation-equivariant vectors as offsets in the deformable convolution, thereby enhancing the existing advantages of spatial adaptation. Leveraging full rotation-equivariance, our FRED demonstrates higher robustness to image-level rotation compared to existing methods. Furthermore, our experiments show that FRED is one step closer to non-axis-aligned learning. Compared to state-of-the-art methods, our proposed method delivers comparable performance on DOTA-v1.0 and outperforms them by 1.5 mAP on DOTA-v1.5, all while significantly reducing the model parameters to 16%. \ No newline at end of file diff --git a/data/2024/aaai/FRIH: Fine-Grained Region-Aware Image Harmonization b/data/2024/aaai/FRIH: Fine-Grained Region-Aware Image Harmonization new file mode 100644 index 0000000000..60bf1ce790 --- /dev/null +++ b/data/2024/aaai/FRIH: Fine-Grained Region-Aware Image Harmonization @@ -0,0 +1 @@ +Image harmonization aims to generate a more realistic appearance of foreground and background for a composite image. All the existing methods perform the same harmonization process for the whole foreground. However, the implanted foreground always contains different appearance patterns. Existing solutions ignore the differences among color blocks and lose specific details. Therefore, we propose a novel two-stage global-local framework for Fine-grained Region-aware Image Harmonization (FRIH). In the first stage, the whole input foreground mask is used to make a global coarse-grained harmonization. In the second stage, we adaptively cluster the input foreground mask into several submasks. Each submask and the coarsely adjusted image are concatenated respectively and fed into a lightweight cascaded module, refining the global harmonization result. Moreover, we further design a fusion prediction module to generate the final result, comprehensively utilizing the harmonization results of different degrees. Without bells and whistles, our FRIH achieves a competitive performance on the iHarmony4 dataset with a lightweight model. \ No newline at end of file diff --git a/data/2024/aaai/FT-GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis b/data/2024/aaai/FT-GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis new file mode 100644 index 0000000000..5cd27c27f7 --- /dev/null +++ b/data/2024/aaai/FT-GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis @@ -0,0 +1 @@ +Although singing voice synthesis (SVS) has made significant progress recently, Chinese opera synthesis, with its unique styles and various genres, deserves greater attention but is rarely studied due to the lack of training data and its high expressiveness. In this work, we build a high-quality Gezi Opera (a type of Chinese opera popular in Fujian and Taiwan) audio-text alignment dataset and formulate specific data annotation methods applicable to Chinese operas. We propose FT-GAN, an acoustic model for fine-grained tune modeling in Chinese opera synthesis based on the empirical analysis of the differences between Chinese operas and pop songs. To further improve the quality of the synthesized opera, we propose a speech pre-training strategy for additional knowledge injection. The experimental results show that FT-GAN outperforms the strong baselines in SVS on the Gezi Opera synthesis task. Extensive experiments further verify that FT-GAN performs well on synthesis tasks of other operas such as Peking Opera.
Audio samples, the dataset, and the code are available at https://zhengmidon.github.io/FTGAN.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/FaceCoresetNet: Differentiable Coresets for Face Set Recognition b/data/2024/aaai/FaceCoresetNet: Differentiable Coresets for Face Set Recognition new file mode 100644 index 0000000000..c1e4bbe3c8 --- /dev/null +++ b/data/2024/aaai/FaceCoresetNet: Differentiable Coresets for Face Set Recognition @@ -0,0 +1,3 @@ +In set-based face recognition, we aim to compute the most discriminative descriptor from an unbounded set of images and videos showing a single person. A discriminative descriptor balances two policies when aggregating information from a given set. The first is a quality-based policy: emphasizing high-quality and down-weighting low-quality images. The second is a diversity-based policy: emphasizing unique images in the set and down-weighting multiple occurrences of similar images as found in video clips, which can overwhelm the set representation. +This work frames face-set representation as a differentiable coreset selection problem. Our model learns how to select a small coreset of the input set that balances quality and diversity policies using a learned metric parameterized by the face quality, optimized end-to-end. The selection process is a differentiable farthest-point sampling (FPS) realized by approximating the non-differentiable Argmax operation with differentiable sampling from the Gumbel-Softmax distribution of distances. The small coreset is later used as queries in a self- and cross-attention architecture to enrich the descriptor with information from the whole set. Our model is order-invariant and linear in the input set size. +We set a new SOTA for set-based face verification on the IJB-B and IJB-C datasets. Our code is publicly available at https://github.com/ligaripash/FaceCoresetNet. \ No newline at end of file diff --git a/data/2024/aaai/FaceRSA: RSA-Aware Facial Identity Cryptography Framework b/data/2024/aaai/FaceRSA: RSA-Aware Facial Identity Cryptography Framework new file mode 100644 index 0000000000..c74e71bab2 --- /dev/null +++ b/data/2024/aaai/FaceRSA: RSA-Aware Facial Identity Cryptography Framework @@ -0,0 +1 @@ +With the flourishing of the Internet, sharing one's photos or automated processing of faces using computer vision technology has become an everyday occurrence. While people enjoy this convenience, concerns about identity privacy are also emerging. Therefore, some efforts introduced the concept of ``password'' from traditional cryptography such as RSA into the face anonymization and deanonymization task to protect the facial identity without compromising the usability of the face image. However, these methods either suffer from the poor visual quality of the synthesis results or do not possess the full cryptographic properties, resulting in compromised security. In this paper, we present the first facial identity cryptography framework with full properties analogous to RSA. Our framework leverages the powerful generative capabilities of StyleGAN to achieve megapixel-level facial identity anonymization and deanonymization. Thanks to the great semantic decoupling of StyleGAN's latent space, the identity encryption and decryption processes are performed in latent space by a well-designed password mapper in the manner of editing the latent code. Meanwhile, the password-related information is imperceptibly hidden in the edited latent code owing to the redundant nature of the latent space.
To make our cryptographic framework possess all the properties analogous to RSA, we propose three types of loss functions: single anonymization loss, sequential anonymization loss, and associated anonymization loss. Extensive experiments and ablation analyses demonstrate the superiority of our method in terms of the quality of synthesis results, identity-irrelevant attribute preservation, deanonymization accuracy, and completeness of properties analogous to RSA. \ No newline at end of file diff --git a/data/2024/aaai/FacetCRS: Multi-Faceted Preference Learning for Pricking Filter Bubbles in Conversational Recommender System b/data/2024/aaai/FacetCRS: Multi-Faceted Preference Learning for Pricking Filter Bubbles in Conversational Recommender System new file mode 100644 index 0000000000..79256c8b25 --- /dev/null +++ b/data/2024/aaai/FacetCRS: Multi-Faceted Preference Learning for Pricking Filter Bubbles in Conversational Recommender System @@ -0,0 +1 @@ +The filter bubble is a notorious issue in Recommender Systems (RSs), which describes the phenomenon whereby users are exposed to a limited and narrow range of information or content that reinforces their existing dominant preferences and beliefs. This results in a lack of exposure to diverse and varied content. Many existing works have predominantly examined filter bubbles in static or relatively-static recommendation settings. However, filter bubbles will be continuously intensified over time due to the feedback loop between the user and the system in real-world online recommendation. To address these issues, we propose a novel paradigm, Multi-Facet Preference Learning for Pricking Filter Bubbles in Conversational Recommender System (FacetCRS), which aims to burst filter bubbles in the conversational recommender system (CRS) through timely user-item interactions via natural language conversations. By considering diverse user preferences and intentions, FacetCRS automatically models user preferences along multiple facets, including entity-, word-, context-, and review-level facets, to capture diverse and dynamic user preferences and prick filter bubbles in the CRS. It is an end-to-end CRS framework that adaptively learns representations of various levels of preference facets and diverse types of external knowledge. Extensive experiments on two publicly available benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance in mitigating filter bubbles and enhancing recommendation quality in CRS. \ No newline at end of file diff --git a/data/2024/aaai/Fact-Driven Logical Reasoning for Machine Reading Comprehension b/data/2024/aaai/Fact-Driven Logical Reasoning for Machine Reading Comprehension new file mode 100644 index 0000000000..20b1ff6129 --- /dev/null +++ b/data/2024/aaai/Fact-Driven Logical Reasoning for Machine Reading Comprehension @@ -0,0 +1 @@ +Recent years have witnessed an increasing interest in training machines with reasoning ability, which deeply relies on accurately and clearly presented clue forms. The clues are usually modeled as entity-aware knowledge in existing studies. However, those entity-aware clues are primarily focused on commonsense, making them insufficient for tasks that require knowledge of temporary facts or events, particularly in logical reasoning for reading comprehension. To address this challenge, we are motivated to cover both commonsense and temporary knowledge clues hierarchically.
Specifically, we propose a general formalism of knowledge units by extracting backbone constituents of the sentence, such as the subject-verb-object formed ``facts''. We then construct a supergraph on top of the fact units, allowing for the benefit of sentence-level (relations among fact groups) and entity-level interactions (concepts or actions inside a fact). Experimental results on logical reasoning benchmarks and dialogue modeling datasets show that our approach improves the baselines substantially, and it is general across backbone models. Code is available at https://github.com/ozyyshr/FocalReasoner. \ No newline at end of file diff --git a/data/2024/aaai/Factored Online Planning in Many-Agent POMDPs b/data/2024/aaai/Factored Online Planning in Many-Agent POMDPs new file mode 100644 index 0000000000..010577b2f4 --- /dev/null +++ b/data/2024/aaai/Factored Online Planning in Many-Agent POMDPs @@ -0,0 +1 @@ +In centralized multi-agent systems, often modeled as multi-agent partially observable Markov decision processes (MPOMDPs), the action and observation spaces grow exponentially with the number of agents, making the value and belief estimation of single-agent online planning ineffective. Prior work partially tackles value estimation by exploiting the inherent structure of multi-agent settings via so-called coordination graphs. Additionally, belief estimation methods have been improved by incorporating the likelihood of observations into the approximation. However, the challenges of value estimation and belief estimation have only been tackled individually, which prevents existing methods from scaling to settings with many agents. Therefore, we address these challenges simultaneously. First, we introduce weighted particle filtering to a sample-based online planner for MPOMDPs. Second, we present a scalable approximation of the belief. Third, we bring an approach that exploits the typical locality of agent interactions to novel online planning algorithms for MPOMDPs operating on a so-called sparse particle filter tree. Our experimental evaluation against several state-of-the-art baselines shows that our methods (1) are competitive in settings with only a few agents and (2) improve over the baselines in the presence of many agents. \ No newline at end of file diff --git a/data/2024/aaai/Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning b/data/2024/aaai/Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning new file mode 100644 index 0000000000..195c7c4e88 --- /dev/null +++ b/data/2024/aaai/Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning @@ -0,0 +1 @@ +Unsupervised disentangled representation learning aims to recover semantically meaningful factors from real-world data without supervision, which is significant for model generalization and interpretability. Current methods mainly rely on assumptions of independence or informativeness of factors, regardless of interpretability. Intuitively, visually interpretable concepts better align with human-defined factors. However, exploiting visual interpretability as inductive bias is still under-explored. Inspired by the observation that most explanatory image factors can be represented by ``content + mask'', we propose a content-mask factorization network (CMFNet) to decompose an image into different groups of content codes and masks, which are further combined as content masks to represent different visual concepts. 
To ensure informativeness of the representations, the CMFNet is jointly learned with a generator conditioned on the content masks for reconstructing the input image. The conditional generator employs a diffusion model to leverage its robust distribution modeling capability. Our model is called the Factorized Diffusion Autoencoder (FDAE). To enhance disentanglement of visual concepts, we propose a content decorrelation loss and a mask entropy loss to decorrelate content masks in latent space and spatial space, respectively. Experiments on Shapes3d, MPI3D and Cars3d show that our method achieves advanced performance and can generate visually interpretable concept-specific masks. Source code and supplementary materials are available at https://github.com/wuancong/FDAE. \ No newline at end of file diff --git a/data/2024/aaai/Fair Allocation of Items in Multiple Regions b/data/2024/aaai/Fair Allocation of Items in Multiple Regions new file mode 100644 index 0000000000..906844f23e --- /dev/null +++ b/data/2024/aaai/Fair Allocation of Items in Multiple Regions @@ -0,0 +1 @@ +We initiate the study of fair allocation with the set of divisible or indivisible items distributed in multiple regions. The key requirement is that each agent can only obtain items from one region. In this work, we consider two kinds of fairness concepts: envy-based notions including envy-freeness (EF) and envy-freeness up to one/any item (EF1/EFX), and share-based notions including proportionality (PROP) and proportionality up to one/any item (PROP1/PROPX). On the negative side, we show NP-hardness and inapproximability results about the aforementioned fairness notions. On the positive side, we propose several algorithms to compute the partial allocations that satisfy envy-based notions and allocations that approximate the above fairness notions. \ No newline at end of file diff --git a/data/2024/aaai/Fair Graph Learning Using Constraint-Aware Priority Adjustment and Graph Masking in River Networks b/data/2024/aaai/Fair Graph Learning Using Constraint-Aware Priority Adjustment and Graph Masking in River Networks new file mode 100644 index 0000000000..e6f7b1b216 --- /dev/null +++ b/data/2024/aaai/Fair Graph Learning Using Constraint-Aware Priority Adjustment and Graph Masking in River Networks @@ -0,0 +1 @@ +Accurate prediction of water quality and quantity is crucial for sustainable development and human well-being. However, existing data-driven methods often suffer from spatial biases in model performance due to heterogeneous data, limited observations, and noisy sensor data. To overcome these challenges, we propose Fair-Graph, a novel graph-based recurrent neural network that leverages interrelated knowledge from multiple rivers to predict water flow and temperature within large-scale stream networks. Additionally, we introduce node-specific graph masks for information aggregation and adaptation to enhance prediction over heterogeneous river segments. To reduce performance disparities across river segments, we introduce a centralized coordination strategy that adjusts training priorities for segments. We evaluate the prediction of water temperature within the Delaware River Basin, and the prediction of streamflow using simulated data from U.S. National Water Model in the Houston River network. The results showcase improvements in predictive performance and highlight the proposed model's ability to maintain spatial fairness over different river segments. 
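A minimal sketch of the node-specific masked aggregation idea described above, written in PyTorch; the mask parameterization, the mean aggregation over a dense adjacency matrix, and the layer layout are illustrative assumptions for exposition, not details taken from the Fair-Graph paper.

import torch
import torch.nn as nn

class MaskedMeanAggregation(nn.Module):
    # Each node (river segment) owns a learnable mask that gates its aggregated neighbor message.
    def __init__(self, num_nodes: int, in_dim: int, out_dim: int):
        super().__init__()
        self.node_masks = nn.Parameter(torch.zeros(num_nodes, in_dim))
        self.update = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, in_dim) node features; adj: (num_nodes, num_nodes) binary adjacency.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        neigh_mean = adj @ h / deg                            # mean message from neighboring segments
        gated = torch.sigmoid(self.node_masks) * neigh_mean   # node-specific gating of the message
        return torch.relu(self.update(torch.cat([h, gated], dim=-1)))

# Toy usage on a 4-segment network
layer = MaskedMeanAggregation(num_nodes=4, in_dim=8, out_dim=16)
h = torch.randn(4, 8)
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
out = layer(h, adj)  # (4, 16) updated segment representations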
\ No newline at end of file diff --git a/data/2024/aaai/Fair Lotteries for Participatory Budgeting b/data/2024/aaai/Fair Lotteries for Participatory Budgeting new file mode 100644 index 0000000000..9c14a8d88b --- /dev/null +++ b/data/2024/aaai/Fair Lotteries for Participatory Budgeting @@ -0,0 +1 @@ +In pursuit of participatory budgeting (PB) outcomes with broader fairness guarantees, we initiate the study of lotteries over discrete PB outcomes. As the projects have heterogeneous costs, the amount spent may not be equal ex ante and ex post. To address this, we develop a technique to bound the amount by which the ex-post spend differs from the ex-ante spend---the property is termed budget balanced up to one project (BB1). With respect to fairness, we take a best-of-both-worlds perspective, seeking outcomes that are both ex-ante and ex-post fair. Towards this goal, we initiate a study of ex-ante fairness properties in PB, including Individual Fair Share (IFS), Unanimous Fair Share (UFS) and their stronger variants, as well as Group Fair Share (GFS). We show several incompatibility results between these ex-ante fairness notions and existing ex-post concepts based on justified representation. One of our main contributions is a randomized algorithm which simultaneously satisfies ex-ante Strong UFS, ex-post full justified representation (FJR) and ex-post BB1 for PB with binary utilities. \ No newline at end of file diff --git a/data/2024/aaai/Fair Multivariate Adaptive Regression Splines for Ensuring Equity and Transparency b/data/2024/aaai/Fair Multivariate Adaptive Regression Splines for Ensuring Equity and Transparency new file mode 100644 index 0000000000..d1dcc2d329 --- /dev/null +++ b/data/2024/aaai/Fair Multivariate Adaptive Regression Splines for Ensuring Equity and Transparency @@ -0,0 +1 @@ +Predictive analytics has been widely used in various domains, including education, to inform decision-making and improve outcomes. However, many predictive models are proprietary and inaccessible for evaluation or modification by researchers and practitioners, limiting their accountability and ethical design. Moreover, predictive models are often opaque and incomprehensible to the officials who use them, reducing their trust and utility. Furthermore, predictive models may introduce or exacerbate bias and inequity, as they have done in many sectors of society. Therefore, there is a need for transparent, interpretable, and fair predictive models that can be easily adopted and adapted by different stakeholders. In this paper, we propose a fair predictive model based on multivariate adaptive regression splines (MARS) that incorporates fairness measures in the learning process. MARS is a non-parametric regression model that performs feature selection, handles non-linear relationships, generates interpretable decision rules, and derives optimal splitting criteria on the variables. Specifically, we integrate fairness into the knot optimization algorithm and provide theoretical and empirical evidence of how it results in a fair knot placement. We apply our fairMARS model to real-world data and demonstrate its effectiveness in terms of accuracy and equity. Our paper contributes to the advancement of responsible and ethical predictive analytics for social good. 
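As a rough illustration of folding a fairness measure into MARS-style knot selection, the sketch below scores each candidate knot of a hinge basis max(0, x - t) by its residual sum of squares plus a penalty on the gap between group-wise mean residuals; the squared-error criterion, the disparity measure, and the trade-off weight lam are assumptions made for this sketch, not the exact fairMARS criterion.

import numpy as np

def knot_score(x, y, group, knot, lam=1.0):
    # Fit intercept + hinge basis max(0, x - knot) by least squares.
    basis = np.maximum(0.0, x - knot)
    X = np.column_stack([np.ones_like(x), basis])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    rss = float(resid @ resid)
    # Fairness term: spread of the absolute mean residual across protected groups.
    gaps = [abs(resid[group == g].mean()) for g in np.unique(group)]
    disparity = max(gaps) - min(gaps)
    return rss + lam * disparity

def best_knot(x, y, group, candidates, lam=1.0):
    # Pick the candidate knot with the lowest fairness-penalized score.
    return min(candidates, key=lambda t: knot_score(x, y, group, t, lam))

# Toy usage
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
group = rng.integers(0, 2, 200)
y = 2.0 * np.maximum(0.0, x - 4.0) + 0.5 * group + rng.normal(0, 0.3, 200)
print(best_knot(x, y, group, candidates=np.linspace(1, 9, 17)))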
\ No newline at end of file diff --git a/data/2024/aaai/Fair Participation via Sequential Policies b/data/2024/aaai/Fair Participation via Sequential Policies new file mode 100644 index 0000000000..39b44e0e16 --- /dev/null +++ b/data/2024/aaai/Fair Participation via Sequential Policies @@ -0,0 +1 @@ +Leading approaches to algorithmic fairness and policy-induced distribution shift are often misaligned with long-term objectives in sequential settings. We aim to correct these shortcomings by ensuring that both the objective and fairness constraints account for policy-induced distribution shift. First, we motivate this problem using an example in which individuals subject to algorithmic predictions modulate their willingness to participate with the policy maker. Fairness in this example is measured by the variance of group participation rates. Next, we develop a method for solving the resulting constrained, non-linear optimization problem and prove that this method converges to a fair, locally optimal policy given first-order information. Finally, we experimentally validate our claims in a semi-synthetic setting. \ No newline at end of file diff --git a/data/2024/aaai/Fair Representation Learning with Maximum Mean Discrepancy Distance Constraint (Student Abstract) b/data/2024/aaai/Fair Representation Learning with Maximum Mean Discrepancy Distance Constraint (Student Abstract) new file mode 100644 index 0000000000..6063b6f3e3 --- /dev/null +++ b/data/2024/aaai/Fair Representation Learning with Maximum Mean Discrepancy Distance Constraint (Student Abstract) @@ -0,0 +1 @@ +Unsupervised learning methods such as principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and autoencoding are regularly used for dimensionality reduction within statistical learning. However, despite a pivot toward fairness and explainability in machine learning over the past few years, there have been few rigorous attempts toward a generalized framework of fair and explainable representation learning. Our paper explores the possibility of such a framework that leverages maximum mean discrepancy to remove information derived from a protected class from generated representations. For the optimization, we introduce a binary search component to optimize the Lagrangian coefficients. We present rigorous mathematical analysis and experimental results of our framework applied to t-SNE. \ No newline at end of file diff --git a/data/2024/aaai/Fair Sampling in Diffusion Models through Switching Mechanism b/data/2024/aaai/Fair Sampling in Diffusion Models through Switching Mechanism new file mode 100644 index 0000000000..152e11effb --- /dev/null +++ b/data/2024/aaai/Fair Sampling in Diffusion Models through Switching Mechanism @@ -0,0 +1,3 @@ +Diffusion models have shown their effectiveness in generation tasks by well-approximating the underlying probability distribution. However, diffusion models are known to suffer from an amplified inherent bias from the training data in terms of fairness. While the sampling process of diffusion models can be controlled by conditional guidance, previous works have attempted to find empirical guidance to achieve quantitative fairness. +To address this limitation, we propose a fairness-aware sampling method, called the attribute switching mechanism, for diffusion models. Without additional training, the proposed sampling can obfuscate sensitive attributes in generated data without relying on classifiers.
+We mathematically prove and experimentally demonstrate the effectiveness of the proposed method on two key aspects: (i) the generation of fair data and (ii) the preservation of the utility of the generated data. \ No newline at end of file diff --git a/data/2024/aaai/Fair and Optimal Prediction via Post-Processing b/data/2024/aaai/Fair and Optimal Prediction via Post-Processing new file mode 100644 index 0000000000..034d456616 --- /dev/null +++ b/data/2024/aaai/Fair and Optimal Prediction via Post-Processing @@ -0,0 +1 @@ +In this talk I will discuss our recent work on characterizing the inherent tradeoff between fairness and accuracy in both classification and regression problems. I will also present a post-processing algorithm that derives optimal fair predictors from Bayes score functions. \ No newline at end of file diff --git a/data/2024/aaai/FairPlay: A Multi-Sided Fair Dynamic Pricing Policy for Hotels b/data/2024/aaai/FairPlay: A Multi-Sided Fair Dynamic Pricing Policy for Hotels new file mode 100644 index 0000000000..75da8ab129 --- /dev/null +++ b/data/2024/aaai/FairPlay: A Multi-Sided Fair Dynamic Pricing Policy for Hotels @@ -0,0 +1 @@ +In recent years, popular tourist destinations have faced overtourism. Local communities suffer from its consequences in several ways. Among other effects, overpricing and profiteering deeply harm local societies and economies. In this paper, we focus on the problem of determining fair hotel room prices. Specifically, we put forward a dynamic pricing policy where the price of a room depends not only on the demand of the hotel it belongs to but also on the demand of: (i) similar rooms in the area and (ii) their hotels. To this end, we model our setting as a cooperative game and exploit an appropriate game-theoretic solution concept that promotes fairness on both the customers' and the providers' side. Our simulation results, involving price adjustments across real-world hotel datasets, confirm that ours is a fair dynamic pricing policy, avoiding both over- and under-pricing hotel rooms. \ No newline at end of file diff --git a/data/2024/aaai/FairSIN: Achieving Fairness in Graph Neural Networks through Sensitive Information Neutralization b/data/2024/aaai/FairSIN: Achieving Fairness in Graph Neural Networks through Sensitive Information Neutralization new file mode 100644 index 0000000000..db64f0d307 --- /dev/null +++ b/data/2024/aaai/FairSIN: Achieving Fairness in Graph Neural Networks through Sensitive Information Neutralization @@ -0,0 +1 @@ +Despite the remarkable success of graph neural networks (GNNs) in modeling graph-structured data, like other machine learning models, GNNs are also susceptible to making biased predictions based on sensitive attributes, such as race and gender. For fairness consideration, recent state-of-the-art (SOTA) methods propose to filter out sensitive information from inputs or representations, e.g., edge dropping or feature masking. However, we argue that such filtering-based strategies may also filter out some non-sensitive feature information, leading to a sub-optimal trade-off between predictive performance and fairness. To address this issue, we unveil an innovative neutralization-based paradigm, where additional Fairness-facilitating Features (F3) are incorporated into node features or representations before message passing. The F3 are expected to statistically neutralize the sensitive bias in node representations and provide additional nonsensitive information.
We also provide theoretical explanations for our rationale, concluding that F3 can be realized by emphasizing the features of each node’s heterogeneous neighbors (neighbors with different sensitive attributes). We name our method FairSIN, and present three implementation variants from both data-centric and model-centric perspectives. Experimental results on five benchmark datasets with three different GNN backbones show that FairSIN significantly improves fairness metrics while maintaining high prediction accuracies. Code and the appendix can be found at https://github.com/BUPT-GAMMA/FariSIN. \ No newline at end of file diff --git a/data/2024/aaai/FairTrade: Achieving Pareto-Optimal Trade-Offs between Balanced Accuracy and Fairness in Federated Learning b/data/2024/aaai/FairTrade: Achieving Pareto-Optimal Trade-Offs between Balanced Accuracy and Fairness in Federated Learning new file mode 100644 index 0000000000..58080fa570 --- /dev/null +++ b/data/2024/aaai/FairTrade: Achieving Pareto-Optimal Trade-Offs between Balanced Accuracy and Fairness in Federated Learning @@ -0,0 +1 @@ +As Federated Learning (FL) gains prominence in distributed machine learning applications, achieving fairness without compromising predictive performance becomes paramount. The data being gathered from distributed clients in an FL environment often leads to class imbalance. In such scenarios, balanced accuracy rather than accuracy is the true representation of model performance. However, most state-of-the-art fair FL methods report accuracy as the measure of performance, which can lead to misguided interpretations of the model's effectiveness in mitigating discrimination. To the best of our knowledge, this work presents the first attempt towards achieving Pareto-optimal trade-offs between balanced accuracy and fairness in a federated environment (FairTrade). By utilizing multi-objective optimization, the framework negotiates the intricate balance between the model's balanced accuracy and fairness. The framework's agnostic design adeptly accommodates both statistical and causal fairness notions, ensuring its adaptability across diverse FL contexts. We provide empirical evidence of our framework's efficacy through extensive experiments on five real-world datasets and comparisons with six baselines. The empirical results underscore the potential of our framework in improving the trade-off between fairness and balanced accuracy in FL applications. \ No newline at end of file diff --git a/data/2024/aaai/Fairness under Covariate Shift: Improving Fairness-Accuracy Tradeoff with Few Unlabeled Test Samples b/data/2024/aaai/Fairness under Covariate Shift: Improving Fairness-Accuracy Tradeoff with Few Unlabeled Test Samples new file mode 100644 index 0000000000..61c88607a1 --- /dev/null +++ b/data/2024/aaai/Fairness under Covariate Shift: Improving Fairness-Accuracy Tradeoff with Few Unlabeled Test Samples @@ -0,0 +1 @@ +Covariate shift in the test data is a common practical phenomenon that can significantly downgrade both the accuracy and the fairness performance of the model. Ensuring fairness across different sensitive groups under covariate shift is of paramount importance due to societal implications like criminal justice. We operate in the unsupervised regime where only a small set of unlabeled test samples along with a labeled training set is available. Towards improving fairness under this highly challenging yet realistic scenario, we make three contributions.
The first is a novel composite weighted-entropy-based objective for prediction accuracy, which is optimized along with a representation matching loss for fairness. We experimentally verify that optimizing with our loss formulation outperforms a number of state-of-the-art baselines in the Pareto sense with respect to the fairness-accuracy tradeoff on several standard datasets. Our second contribution is a new setting we term Asymmetric Covariate Shift that, to the best of our knowledge, has not been studied before. Asymmetric covariate shift occurs when the distribution of covariates of one group shifts significantly compared to the other groups, which happens when a dominant group is over-represented. While this setting is extremely challenging for current baselines, we show that our proposed method significantly outperforms them. Our third contribution is theoretical: we show that our weighted entropy term, together with the prediction loss on the training set, approximates the test loss under covariate shift. Empirically and through formal sample complexity bounds, we show that this approximation to the unseen test loss does not depend on the importance sampling variance that affects many other baselines. \ No newline at end of file diff --git a/data/2024/aaai/Fairness with Censorship: Bridging the Gap between Fairness Research and Real-World Deployment b/data/2024/aaai/Fairness with Censorship: Bridging the Gap between Fairness Research and Real-World Deployment new file mode 100644 index 0000000000..68ae3e3008 --- /dev/null +++ b/data/2024/aaai/Fairness with Censorship: Bridging the Gap between Fairness Research and Real-World Deployment @@ -0,0 +1 @@ +Recent works in artificial intelligence fairness attempt to mitigate discrimination by proposing constrained optimization programs that achieve parity for some fairness statistics. Most assume the availability of class labels, which is impractical in many real-world applications such as precision medicine, actuarial analysis and recidivism prediction. To this end, this talk revisits fairness and reveals an idiosyncrasy of the existing fairness literature: the assumption that class labels are available, which limits its real-world utility. The primary contributions are a formulation of fairness with censorship to account for scenarios where the class label is not guaranteed, and a suite of corresponding new fairness notions, algorithms, and theoretical constructs to bridge the gap between the design of a "fair" model in the lab and its deployment in the real world. \ No newline at end of file diff --git a/data/2024/aaai/Fairness without Demographics through Shared Latent Space-Based Debiasing b/data/2024/aaai/Fairness without Demographics through Shared Latent Space-Based Debiasing new file mode 100644 index 0000000000..88eefa4122 --- /dev/null +++ b/data/2024/aaai/Fairness without Demographics through Shared Latent Space-Based Debiasing @@ -0,0 +1 @@ +Ensuring fairness in machine learning (ML) is crucial, particularly in applications that impact diverse populations. The majority of existing works heavily rely on the availability of protected features like race and gender. However, practical challenges such as privacy concerns and regulatory restrictions often prohibit the use of this data, limiting the scope of traditional fairness research.
To address this, we introduce a Shared Latent Space-based Debiasing (SLSD) method that transforms data from both the target domain, which lacks protected features, and a separate source domain, which contains these features, into correlated latent representations. This allows for joint training of a cross-domain protected group estimator on the representations. We then debias the downstream ML model with an adversarial learning technique that leverages the group estimator. We also present a relaxed variant of SLSD, the R-SLSD, that occasionally accesses a small subset of protected features from the target domain during its training phase. Our extensive experiments on benchmark datasets demonstrate that our methods consistently outperform existing state-of-the-art models in standard group fairness metrics. \ No newline at end of file diff --git a/data/2024/aaai/Fairness-Aware Structured Pruning in Transformers b/data/2024/aaai/Fairness-Aware Structured Pruning in Transformers new file mode 100644 index 0000000000..7174b6f3ed --- /dev/null +++ b/data/2024/aaai/Fairness-Aware Structured Pruning in Transformers @@ -0,0 +1 @@ +The increasing size of large language models (LLMs) has introduced challenges in their training and inference. Removing model components is perceived as a solution to tackle the large model sizes; however, existing pruning methods solely focus on performance, without considering an essential aspect for the responsible use of LLMs: model fairness. It is crucial to address the fairness of LLMs towards diverse groups, such as women, Black people, LGBTQ+ people, and Jewish communities, among others, as these models are deployed and made available to a wide audience. In this work, first, we investigate how attention heads impact fairness and performance in pre-trained transformer-based language models. We then propose a novel method to prune the attention heads that negatively impact fairness while retaining the heads critical for performance, i.e., language modeling capabilities. Our approach is practical in terms of time and resources, as it does not require fine-tuning the final pruned, and fairer, model. Our findings demonstrate a reduction in gender bias by 19%, 19.5%, 39.5%, 34.7%, 23%, and 8% for DistilGPT-2, GPT-2, GPT-Neo of two different sizes, GPT-J, and Llama 2 models, respectively, in comparison to the biased model, with only a slight decrease in performance. WARNING: This work uses language that is offensive in nature. \ No newline at end of file diff --git a/data/2024/aaai/Faithful Trip Recommender Using Diffusion Guidance (Student Abstract) b/data/2024/aaai/Faithful Trip Recommender Using Diffusion Guidance (Student Abstract) new file mode 100644 index 0000000000..2547818965 --- /dev/null +++ b/data/2024/aaai/Faithful Trip Recommender Using Diffusion Guidance (Student Abstract) @@ -0,0 +1 @@ +Trip recommendation aims to plan a user’s travel based on their specified preferences. Traditional heuristic and statistical approaches often fail to capture the intricate nuances of user intentions, leading to subpar performance. Recent deep-learning methods show attractive accuracy but struggle to generate faithful trajectories that match user intentions. In this work, we propose a DDPM-based incremental knowledge injection module to ensure the faithfulness of the generated trajectories. Experiments on two datasets verify the effectiveness of our approach.
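As an illustration of the head-pruning idea in "Fairness-Aware Structured Pruning in Transformers" above, the sketch below greedily masks attention heads whose removal reduces a bias score without degrading language modeling quality too much. The callables bias_score and ppl_score, the parameters k and max_ppl_rise, and the greedy rule itself are assumptions for illustration only, not the paper's actual selection procedure.

def select_heads_to_prune(heads, bias_score, ppl_score, model, k=4, max_ppl_rise=0.05):
    """Greedily pick up to k heads whose masking reduces bias while keeping perplexity near the baseline."""
    base_bias = bias_score(model, masked=set())
    base_ppl = ppl_score(model, masked=set())
    pruned, candidates = set(), list(heads)
    for _ in range(k):
        best = None
        for h in candidates:
            trial = pruned | {h}
            gain = base_bias - bias_score(model, masked=trial)   # bias reduction from masking this head
            cost = ppl_score(model, masked=trial) - base_ppl     # language-modeling degradation
            if gain > 0 and cost <= max_ppl_rise and (best is None or gain > best[1]):
                best = (h, gain)
        if best is None:
            break
        pruned.add(best[0])
        candidates.remove(best[0])
    return pruned

Since no fine-tuning is involved, a procedure of this shape only requires forward evaluations of the masked model, which is what makes this style of pruning cheap in time and resources.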
\ No newline at end of file diff --git a/data/2024/aaai/FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval b/data/2024/aaai/FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval new file mode 100644 index 0000000000..bacdcd2b24 --- /dev/null +++ b/data/2024/aaai/FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval @@ -0,0 +1 @@ +The goal of composed fashion image retrieval is to locate a target image based on a reference image and modified text. Recent methods utilize symmetric encoders (e.g., CLIP) pre-trained on large-scale non-fashion datasets. However, the input for this task exhibits an asymmetric nature, where the reference image contains rich content while the modified text is often brief. Therefore, methods employing symmetric encoders encounter a severe phenomenon: retrieval results dominated by reference images, leading to the oversight of modified text. We propose a Fashion Enhance-and-Refine Network (FashionERN) centered around two aspects: enhancing the text encoder and refining visual semantics. We introduce a Triple-branch Modifier Enhancement model, which injects relevant information from the reference image and aligns the modified text modality with the target image modality. Furthermore, we propose a Dual-guided Vision Refinement model that retains critical visual information through text-guided refinement and self-guided refinement processes. The combination of these two models significantly mitigates the reference dominance phenomenon, ensuring accurate fulfillment of modifier requirements. Comprehensive experiments demonstrate our approach's state-of-the-art performance on four commonly used datasets. \ No newline at end of file diff --git a/data/2024/aaai/Fast & Fair: A Collaborative Platform for Fair Division Applications b/data/2024/aaai/Fast & Fair: A Collaborative Platform for Fair Division Applications new file mode 100644 index 0000000000..f97583e3e6 --- /dev/null +++ b/data/2024/aaai/Fast & Fair: A Collaborative Platform for Fair Division Applications @@ -0,0 +1 @@ +Fair division, the study of how to fairly allocate resources among agents, has received substantial interest in the areas of artificial intelligence and multiagent systems. While there is an extensive theoretical literature on fair division by now, the developed algorithms are still mostly confined to research papers and inaccessible to the public. We attempt to bridge this gap by developing Fast & Fair, an open-source web application that hosts a number of fair allocation algorithms with user-friendly interfaces and explainable outcomes. In contrast to existing implementations, Fast & Fair is a collaborative platform that is open to community contributions and thereby facilitates the deployment of additional algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Fast Inter-frame Motion Prediction for Compressed Dynamic Point Cloud Attribute Enhancement b/data/2024/aaai/Fast Inter-frame Motion Prediction for Compressed Dynamic Point Cloud Attribute Enhancement new file mode 100644 index 0000000000..30e736ef74 --- /dev/null +++ b/data/2024/aaai/Fast Inter-frame Motion Prediction for Compressed Dynamic Point Cloud Attribute Enhancement @@ -0,0 +1 @@ +Recent years have witnessed the success of deep learning methods in quality enhancement of compressed point cloud. However, existing methods focus on geometry and attribute enhancement of single-frame point cloud. 
This paper proposes a novel compressed quality enhancement method for dynamic point clouds (DAE-MP). Specifically, we propose a fast inter-frame motion prediction module (IFMP) to explicitly estimate motion displacement and achieve inter-frame feature alignment. To maintain motion continuity between consecutive frames, we propose a motion consistency loss for supervised learning. Furthermore, a frequency component separation and fusion module is designed to extract rich frequency features adaptively. To the best of our knowledge, the proposed method is the first deep learning-based work to enhance the quality of compressed dynamic point clouds. Experimental results show that the proposed method can greatly improve the quality of compressed dynamic point clouds and provide a fast and efficient motion prediction plug-in for large-scale point clouds. For dynamic point cloud attributes with severe compression artifacts, our proposed DAE-MP method achieves a performance gain of up to 0.52 dB in PSNR. Moreover, the proposed IFMP module offers a degree of real-time processing capability for calculating the motion offset between dynamic point cloud frames. \ No newline at end of file diff --git a/data/2024/aaai/Fast Machine Unlearning without Retraining through Selective Synaptic Dampening b/data/2024/aaai/Fast Machine Unlearning without Retraining through Selective Synaptic Dampening new file mode 100644 index 0000000000..2deebcab0a --- /dev/null +++ b/data/2024/aaai/Fast Machine Unlearning without Retraining through Selective Synaptic Dampening @@ -0,0 +1 @@ +Machine unlearning, the ability for a machine learning model to forget, is becoming increasingly important to comply with data privacy regulations, as well as to remove harmful, manipulated, or outdated information. The key challenge lies in forgetting specific information while protecting model performance on the remaining data. While current state-of-the-art methods perform well, they typically require some level of retraining over the retained data in order to protect or restore model performance. This adds computational overhead and mandates that the training data remain available and accessible, which may not be feasible. In contrast, other methods employ a retrain-free paradigm; however, these approaches are prohibitively computationally expensive and do not perform on par with their retrain-based counterparts. We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data. First, SSD uses the Fisher information matrix of the training and forgetting data to select parameters that are disproportionately important to the forget set. Second, SSD induces forgetting by dampening these parameters proportional to their relative importance to the forget set with respect to the wider training data. We evaluate our method against several existing unlearning methods in a range of experiments using ResNet18 and Vision Transformer. Results show that the performance of SSD is competitive with retrain-based post hoc methods, demonstrating the viability of retrain-free post hoc unlearning approaches.
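The two steps described for Selective Synaptic Dampening (SSD) translate naturally into a short sketch: estimate diagonal Fisher information on the forget set and on the retained training data, then shrink the parameters whose forget-set importance dominates. The threshold thresh, the dampening constant lam, and the diagonal (squared-gradient) Fisher approximation below are assumptions for illustration, not the exact rules used in the paper.

import torch

def diag_fisher(model, loader, loss_fn):
    """Diagonal Fisher approximation: average squared gradient per parameter over a data loader."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def selective_dampening(model, fisher_forget, fisher_retain, thresh=10.0, lam=1.0, eps=1e-12):
    """Shrink parameters that are disproportionately important to the forget set."""
    with torch.no_grad():
        for n, p in model.named_parameters():
            ratio = fisher_forget[n] / (fisher_retain[n] + eps)
            mask = ratio > thresh                          # parameters to dampen
            scale = (lam / (ratio + eps)).clamp(max=1.0)   # stronger dampening for larger ratios
            p[mask] = p[mask] * scale[mask]

Because both steps are post hoc and gradient evaluations replace any retraining, the cost amounts to a small number of passes over the forget data and a sample of the training data.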
\ No newline at end of file diff --git a/data/2024/aaai/Fast and Controllable Post-training Sparsity: Learning Optimal Sparsity Allocation with Global Constraint in Minutes b/data/2024/aaai/Fast and Controllable Post-training Sparsity: Learning Optimal Sparsity Allocation with Global Constraint in Minutes new file mode 100644 index 0000000000..c90d5aea6a --- /dev/null +++ b/data/2024/aaai/Fast and Controllable Post-training Sparsity: Learning Optimal Sparsity Allocation with Global Constraint in Minutes @@ -0,0 +1 @@ +Neural network sparsity has attracted much research interest due to its similarity to biological schemes and its high energy efficiency. However, existing methods depend on lengthy training or fine-tuning, which prevents large-scale applications. Recently, some works focusing on post-training sparsity (PTS) have emerged. They avoid the high training cost but usually suffer from noticeable accuracy degradation because they neglect to choose a reasonable sparsity rate for each layer. Previous methods for finding sparsity rates mainly focus on the training-aware scenario and usually fail to converge stably under the PTS setting with limited data and a much lower training cost. In this paper, we propose a fast and controllable post-training sparsity (FCPTS) framework. By incorporating a differentiable bridge function and a controllable optimization objective, our method allows for rapid and accurate sparsity allocation learning in minutes, with the added assurance of convergence to a predetermined global sparsity rate. Equipped with these techniques, we can surpass the state-of-the-art methods by a large margin, e.g., over 30% improvement for ResNet-50 on ImageNet under a sparsity rate of 80%. Our plug-and-play code and supplementary materials are open-sourced at https://github.com/ModelTC/FCPTS. \ No newline at end of file diff --git a/data/2024/aaai/Fast and Knowledge-Free Deep Learning for General Game Playing (Student Abstract) b/data/2024/aaai/Fast and Knowledge-Free Deep Learning for General Game Playing (Student Abstract) new file mode 100644 index 0000000000..65a897de65 --- /dev/null +++ b/data/2024/aaai/Fast and Knowledge-Free Deep Learning for General Game Playing (Student Abstract) @@ -0,0 +1 @@ +We develop a method of adapting the AlphaZero model to General Game Playing (GGP) that focuses on faster model generation and requires less knowledge to be extracted from the game rules. The dataset generation uses MCTS playing instead of self-play; only the value network is used, and attention layers replace the convolutional ones. This allows us to abandon any assumptions about the action space and board topology. We implement the method within the Regular Boardgames GGP system and show that we can efficiently build models outperforming the UCT baseline for most games. \ No newline at end of file diff --git a/data/2024/aaai/Faster Stochastic Variance Reduction Methods for Compositional MiniMax Optimization b/data/2024/aaai/Faster Stochastic Variance Reduction Methods for Compositional MiniMax Optimization new file mode 100644 index 0000000000..b178a5ed76 --- /dev/null +++ b/data/2024/aaai/Faster Stochastic Variance Reduction Methods for Compositional MiniMax Optimization @@ -0,0 +1 @@ +This paper delves into the realm of stochastic optimization for compositional minimax optimization—a pivotal challenge across various machine learning domains, including deep AUC maximization and reinforcement learning policy evaluation.
Despite its significance, the problem of compositional minimax optimization is still under-explored. Adding to the complexity, current methods of compositional minimax optimization are plagued by sub-optimal complexities or heavy reliance on sizable batch sizes. To respond to these constraints, this paper introduces a novel method, called Nested STOchastic Recursive Momentum (NSTORM), which achieves the optimal sample complexity for obtaining a solution of the desired accuracy, matching existing minimax methods. We also demonstrate that NSTORM can achieve the same sample complexity under the Polyak-Lojasiewicz (PL) condition—an insightful extension of its capabilities. Yet, NSTORM encounters an issue with its requirement for low learning rates, potentially constraining its real-world applicability in machine learning. To overcome this hurdle, we present ADAptive NSTORM (ADA-NSTORM) with adaptive learning rates. We demonstrate that ADA-NSTORM achieves the same sample complexity, while the experimental results show that it is more effective in practice. All the derived complexities indicate that our proposed methods match the lower bounds of existing minimax optimization methods without requiring a large batch size at each iteration. Extensive experiments support the efficiency of our proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/FeatWalk: Enhancing Few-Shot Classification through Local View Leveraging b/data/2024/aaai/FeatWalk: Enhancing Few-Shot Classification through Local View Leveraging new file mode 100644 index 0000000000..25a1c11cb5 --- /dev/null +++ b/data/2024/aaai/FeatWalk: Enhancing Few-Shot Classification through Local View Leveraging @@ -0,0 +1 @@ +Few-shot learning is a challenging task due to the limited availability of training samples. Recent few-shot learning studies with meta-learning and simple transfer learning methods have achieved promising performance. However, the feature extractor pre-trained on the upstream dataset may neglect the extraction of certain features that could be crucial for downstream tasks. In this study, inspired by the process of human learning in few-shot tasks, where humans not only observe the whole image ('global view') but also attend to various local image regions ('local view') for a comprehensive understanding of detailed features, we propose a simple yet effective few-shot learning method called FeatWalk, which utilizes the complementary nature of global and local views, therefore providing an intuitive and effective solution to the problem of insufficient local information extraction from the pre-trained feature extractor. Our method can be easily and flexibly combined with various existing methods, further enhancing few-shot learning performance. Extensive experiments on multiple benchmark datasets consistently demonstrate the effectiveness and versatility of our method. The source code is available at https://github.com/exceefind/FeatWalk.
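To make the global-view/local-view intuition behind FeatWalk concrete, the sketch below scores a query image by averaging its prototype similarities over the full image and a few random local crops. The cropping scheme, the resize to a fixed input size, and the plain cosine-similarity fusion are illustrative assumptions, not the FeatWalk algorithm itself.

import torch
import torch.nn.functional as F

def local_views(img, n=5, frac=0.5):
    """Sample n random crops covering roughly `frac` of each spatial side of a CHW tensor."""
    _, h, w = img.shape
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    crops = []
    for _ in range(n):
        top = torch.randint(0, h - ch + 1, (1,)).item()
        left = torch.randint(0, w - cw + 1, (1,)).item()
        crops.append(img[:, top:top + ch, left:left + cw])
    return crops

def classify(img, extractor, prototypes, input_size=224):
    """Average cosine similarity to class prototypes over the global view and several local views."""
    views = [img] + local_views(img)
    views = [F.interpolate(v.unsqueeze(0), size=(input_size, input_size),
                           mode='bilinear', align_corners=False) for v in views]
    feats = torch.stack([F.normalize(extractor(v).squeeze(0), dim=-1) for v in views])
    sims = feats @ F.normalize(prototypes, dim=-1).T      # (n_views, n_classes)
    return sims.mean(dim=0).argmax().item()

Because the extractor stays frozen, view-level fusion of this kind can be layered on top of most pre-trained few-shot pipelines, which is consistent with the plug-in claim in the abstract.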
\ No newline at end of file diff --git a/data/2024/aaai/Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection b/data/2024/aaai/Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection new file mode 100644 index 0000000000..71c8646664 --- /dev/null +++ b/data/2024/aaai/Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection @@ -0,0 +1 @@ +Training neural networks that generalize well incurs large computational costs in many deep learning methods due to large-scale datasets and over-parameterized models. Despite the emergence of a number of coreset selection methods to reduce the computational costs, the problem of coreset distribution bias, i.e., the skewed distribution between the coreset and the entire dataset, has not been well studied. In this paper, we find that the closer the feature distribution of the coreset is to that of the entire dataset, the better the generalization performance of the coreset, particularly under extreme pruning. This motivates us to propose a simple yet effective method for coreset selection that alleviates the distribution bias between the coreset and the entire dataset, called feature distribution matching (FDMat). Unlike gradient-based methods, which select samples with larger gradient values or approximate the gradient values of the entire dataset, FDMat aims to select the coreset whose feature distribution is closest to that of the entire dataset. Specifically, FDMat casts coreset selection as an optimal transport problem from the coreset to the entire dataset in feature embedding spaces. Moreover, our method shows strong robustness due to the removal of samples far from the distribution, especially when the entire dataset contains noisy and class-imbalanced samples. Extensive experiments on multiple benchmarks show that FDMat improves the performance of coreset selection over existing coreset methods. The code is available at https://github.com/successhaha/FDMat. \ No newline at end of file diff --git a/data/2024/aaai/Feature Fusion from Head to Tail for Long-Tailed Visual Recognition b/data/2024/aaai/Feature Fusion from Head to Tail for Long-Tailed Visual Recognition new file mode 100644 index 0000000000..042c276e01 --- /dev/null +++ b/data/2024/aaai/Feature Fusion from Head to Tail for Long-Tailed Visual Recognition @@ -0,0 +1 @@ +The imbalanced distribution of long-tailed data presents a considerable challenge for deep learning models, as it causes them to prioritize the accurate classification of head classes but largely disregard tail classes. The biased decision boundary caused by inadequate semantic information in tail classes is one of the key factors contributing to their low recognition accuracy. To rectify this issue, we propose to augment tail classes by grafting the diverse semantic information from head classes, referred to as head-to-tail fusion (H2T). We replace a portion of the feature maps from tail classes with those belonging to head classes. These fused features substantially enhance the diversity of tail classes. Both theoretical analysis and practical experimentation demonstrate that H2T can contribute to a more optimized solution for the decision boundary. We seamlessly integrate H2T into the classifier adjustment stage, making it a plug-and-play module. Its simplicity and ease of implementation allow for smooth integration with existing long-tailed recognition methods, facilitating a further performance boost.
Extensive experiments on various long-tailed benchmarks demonstrate the effectiveness of the proposed H2T. The source code is available at https://github.com/Keke921/H2T. \ No newline at end of file diff --git a/data/2024/aaai/Feature Transportation Improves Graph Neural Networks b/data/2024/aaai/Feature Transportation Improves Graph Neural Networks new file mode 100644 index 0000000000..217836e8ee --- /dev/null +++ b/data/2024/aaai/Feature Transportation Improves Graph Neural Networks @@ -0,0 +1,3 @@ +Graph neural networks (GNNs) have shown remarkable success in learning representations for graph-structured data. However, GNNs still face challenges in modeling complex phenomena that involve feature transportation. In this paper, we propose a novel GNN architecture inspired by Advection-Diffusion-Reaction systems, called ADR-GNN. +Advection models feature transportation, while diffusion captures the local smoothing of features, and reaction represents the non-linear transformation between feature channels. We provide an analysis of the qualitative behavior of ADR-GNN, which shows the benefit of combining advection, diffusion, and reaction. +To demonstrate its efficacy, we evaluate ADR-GNN on real-world node classification and spatio-temporal datasets, and show that it improves or offers competitive performance compared to state-of-the-art networks. \ No newline at end of file diff --git a/data/2024/aaai/Feature Unlearning for Pre-trained GANs and VAEs b/data/2024/aaai/Feature Unlearning for Pre-trained GANs and VAEs new file mode 100644 index 0000000000..3a48c7d0ab --- /dev/null +++ b/data/2024/aaai/Feature Unlearning for Pre-trained GANs and VAEs @@ -0,0 +1 @@ +We tackle the problem of feature unlearning from pre-trained image generative models: GANs and VAEs. Unlike a common unlearning task where the unlearning target is a subset of the training set, we aim to unlearn a specific feature, such as hairstyle from facial images, from the pre-trained generative models. As the target feature is only present in a local region of an image, unlearning the entire image from the pre-trained model may result in losing other details in the remaining region of the image. To specify which features to unlearn, we collect randomly generated images that contain the target features. We then identify a latent representation corresponding to the target feature and use the representation to fine-tune the pre-trained model. Through experiments on the MNIST, CelebA, and FFHQ datasets, we show that target features are successfully removed while keeping the fidelity of the original models. Further experiments with an adversarial attack show that the unlearned model is more robust in the presence of malicious parties. \ No newline at end of file diff --git a/data/2024/aaai/FedCD: Federated Semi-Supervised Learning with Class Awareness Balance via Dual Teachers b/data/2024/aaai/FedCD: Federated Semi-Supervised Learning with Class Awareness Balance via Dual Teachers new file mode 100644 index 0000000000..5576f00f4b --- /dev/null +++ b/data/2024/aaai/FedCD: Federated Semi-Supervised Learning with Class Awareness Balance via Dual Teachers @@ -0,0 +1 @@ +Recent advancements in deep learning have greatly improved the efficiency of auxiliary medical diagnostics. However, concerns over patient privacy and data annotation costs restrict the viability of centralized training models. In response, federated semi-supervised learning has garnered substantial attention from medical institutions.
However, it faces challenges arising from knowledge discrepancies among local clients and class imbalance in non-independent and identically distributed data. Existing methods like class balance adaptation for addressing class imbalance often overlook low-confidence yet valuable rare samples in unlabeled data and may compromise client privacy. To address these issues, we propose a novel framework with class awareness balance and dual teacher distillation called FedCD. FedCD introduces a global-local framework to balance and purify global and local knowledge. Additionally, we introduce a novel class awareness balance module to effectively explore potential rare classes and encourage balanced learning in unlabeled clients. Importantly, our approach prioritizes privacy protection by only exchanging network parameters during communication. Experimental results on two medical datasets under various settings demonstrate the effectiveness of FedCD. The code is available at https://github.com/YunzZ-Liu/FedCD. \ No newline at end of file diff --git a/data/2024/aaai/FedCSL: A Scalable and Accurate Approach to Federated Causal Structure Learning b/data/2024/aaai/FedCSL: A Scalable and Accurate Approach to Federated Causal Structure Learning new file mode 100644 index 0000000000..49274b3dc4 --- /dev/null +++ b/data/2024/aaai/FedCSL: A Scalable and Accurate Approach to Federated Causal Structure Learning @@ -0,0 +1 @@ +As an emerging research direction, federated causal structure learning (CSL) aims at learning causal relationships from decentralized data across multiple clients while preserving data privacy. Existing federated CSL algorithms suffer from scalability and accuracy issues, since they require computationally expensive CSL algorithms to be executed at each client. Furthermore, in real-world scenarios, the number of samples held by each client varies significantly, and existing methods still assign equal weights to the learned structural information from each client, which severely harms the learning accuracy of those methods. To address these two limitations, we propose FedCSL, a scalable and accurate method for federated CSL. Specifically, FedCSL consists of two novel strategies: (1) a federated local-to-global learning strategy that enables FedCSL to scale to high-dimensional data for tackling the scalability issue, and (2) a novel weighted aggregation strategy that does not rely on any complex encryption techniques while preserving data privacy for tackling the accuracy issue. Extensive experiments on benchmark datasets, high-dimensional synthetic datasets and a real-world dataset verify the efficacy of the proposed FedCSL method. The source code is available at https://github.com/Xianjie-Guo/FedCSL. \ No newline at end of file diff --git a/data/2024/aaai/FedCompetitors: Harmonious Collaboration in Federated Learning with Competing Participants b/data/2024/aaai/FedCompetitors: Harmonious Collaboration in Federated Learning with Competing Participants new file mode 100644 index 0000000000..567090c9da --- /dev/null +++ b/data/2024/aaai/FedCompetitors: Harmonious Collaboration in Federated Learning with Competing Participants @@ -0,0 +1 @@ +Federated learning (FL) provides a privacy-preserving approach for collaborative training of machine learning models. Given the potential data heterogeneity, it is crucial to select appropriate collaborators for each FL participant (FL-PT) based on data complementarity. Recent studies have addressed this challenge. 
Similarly, it is imperative to consider the inter-individual relationships among FL-PTs, where some FL-PTs engage in competition. Although the FL literature has acknowledged the significance of this scenario, practical methods for establishing FL ecosystems remain largely unexplored. In this paper, we extend a principle from balance theory, namely “the friend of my enemy is my enemy”, to ensure the absence of conflicting interests within an FL ecosystem. The extended principle and the resulting problem are formulated via graph theory and integer linear programming. A polynomial-time algorithm is proposed to determine the collaborators of each FL-PT. The solution guarantees high scalability, allowing even competing FL-PTs to smoothly join the ecosystem without conflict of interest. The proposed framework jointly considers competition and data heterogeneity. Extensive experiments on real-world and synthetic data demonstrate its efficacy compared to five alternative approaches, and its ability to establish efficient collaboration networks among FL-PTs. \ No newline at end of file diff --git a/data/2024/aaai/FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning b/data/2024/aaai/FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning new file mode 100644 index 0000000000..6a0f3285e7 --- /dev/null +++ b/data/2024/aaai/FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning @@ -0,0 +1 @@ +Recently, foundation models have exhibited remarkable advancements in multi-modal learning. These models, equipped with millions (or billions) of parameters, typically require a substantial amount of data for finetuning. However, collecting and centralizing training data from diverse sectors becomes challenging due to distinct privacy regulations. Federated Learning (FL) emerges as a promising solution, enabling multiple clients to collaboratively train neural networks without centralizing their local data. To alleviate client computation burdens and communication overheads, previous works have adapted Parameter-efficient Finetuning (PEFT) methods for FL. In this way, only a small fraction of the model parameters are optimized and communicated during federated communications. Nevertheless, most previous works have focused on a single modality and neglected one common phenomenon, i.e., the presence of data heterogeneity across the clients. Therefore, in this work, we propose a finetuning framework tailored to heterogeneous multi-modal FL, called Federated Dual-Adapter Teacher (FedDAT). Specifically, our approach leverages a Dual-Adapter Teacher (DAT) to address data heterogeneity by regularizing the client local updates and applying Mutual Knowledge Distillation (MKD) for efficient knowledge transfer. FedDAT is the first approach that enables efficient distributed finetuning of foundation models for a variety of heterogeneous Vision-Language tasks. To demonstrate its effectiveness, we conduct extensive experiments on four multi-modality FL benchmarks with different types of data heterogeneity, where FedDAT substantially outperforms the existing centralized PEFT methods adapted for FL.
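The conflict-of-interest constraint behind FedCompetitors can be checked with a few lines of plain Python: close the competition relation under "the friend of my enemy is my enemy" and then verify that no collaboration group contains two (extended) competitors. The data structures and the fixpoint loop below are only an illustrative reading of that principle; the paper itself formulates and solves the problem with graph theory and integer linear programming.

from itertools import combinations

def extended_competitors(compete, collaborate):
    """Close the competition relation under 'the friend of my enemy is my enemy'."""
    comp = {frozenset(p) for p in compete}
    friends = {}
    for c, d in collaborate:
        friends.setdefault(c, set()).add(d)
        friends.setdefault(d, set()).add(c)
    changed = True
    while changed:
        changed = False
        for pair in list(comp):
            a, b = tuple(pair)
            for me, enemy in ((a, b), (b, a)):
                for friend in friends.get(enemy, ()):   # a friend of my enemy ...
                    new = frozenset((me, friend))       # ... is my enemy
                    if len(new) == 2 and new not in comp:
                        comp.add(new)
                        changed = True
    return comp

def conflict_free(groups, comp):
    """A proposed grouping of FL-PTs is valid if no group contains two (extended) competitors."""
    return all(frozenset((u, v)) not in comp
               for g in groups for u, v in combinations(g, 2))

A coordinator can run such a check on any candidate grouping; it is cheap compared with the integer program that actually searches for the collaborator sets.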
\ No newline at end of file diff --git a/data/2024/aaai/FedDiv: Collaborative Noise Filtering for Federated Learning with Noisy Labels b/data/2024/aaai/FedDiv: Collaborative Noise Filtering for Federated Learning with Noisy Labels new file mode 100644 index 0000000000..92b0a45444 --- /dev/null +++ b/data/2024/aaai/FedDiv: Collaborative Noise Filtering for Federated Learning with Noisy Labels @@ -0,0 +1 @@ +Federated Learning with Noisy Labels (F-LNL) aims at seeking an optimal server model via collaborative distributed learning by aggregating multiple client models trained with local noisy or clean samples. On the basis of a federated learning framework, recent advances primarily adopt label noise filtering to separate clean samples from noisy ones on each client, thereby mitigating the negative impact of label noise. However, these prior methods do not learn noise filters by exploiting knowledge across all clients, leading to sub-optimal and inferior noise filtering performance and thus damaging training stability. In this paper, we present FedDiv to tackle the challenges of F-LNL. Specifically, we propose a global noise filter called Federated Noise Filter for effectively identifying samples with noisy labels on every client, thereby raising stability during local training sessions. Without sacrificing data privacy, this is achieved by modeling the global distribution of label noise across all clients. Then, in an effort to make the global model achieve higher performance, we introduce a Predictive Consistency based Sampler to identify more credible local data for local model training, thus preventing noise memorization and further boosting the training stability. Extensive experiments on CIFAR-10, CIFAR-100, and Clothing1M demonstrate that FedDiv achieves superior performance over state-of-the-art F-LNL methods under different label noise settings for both IID and non-IID data partitions. Source code is publicly available at https://github.com/lijichang/FLNL-FedDiv. \ No newline at end of file diff --git a/data/2024/aaai/FedFixer: Mitigating Heterogeneous Label Noise in Federated Learning b/data/2024/aaai/FedFixer: Mitigating Heterogeneous Label Noise in Federated Learning new file mode 100644 index 0000000000..55e822ecd7 --- /dev/null +++ b/data/2024/aaai/FedFixer: Mitigating Heterogeneous Label Noise in Federated Learning @@ -0,0 +1 @@ +Federated Learning (FL) heavily depends on label quality for its performance. However, the label distribution among individual clients is always both noisy and heterogeneous. The high loss incurred by client-specific samples in heterogeneous label noise poses challenges for distinguishing between client-specific and noisy label samples, impacting the effectiveness of existing label noise learning approaches. To tackle this issue, we propose FedFixer, where the personalized model is introduced to cooperate with the global model to effectively select clean client-specific samples. In the dual models, updating the personalized model solely at a local level can lead to overfitting on noisy data due to limited samples, consequently affecting both the local and global models’ performance. To mitigate overfitting, we address this concern from two perspectives. Firstly, we employ a confidence regularizer to alleviate the impact of unconfident predictions caused by label noise. Secondly, a distance regularizer is implemented to constrain the disparity between the personalized and global models. 
We validate the effectiveness of FedFixer through extensive experiments on benchmark datasets. The results demonstrate that FedFixer can perform well in filtering noisy label samples on different clients, especially in highly heterogeneous label noise scenarios. \ No newline at end of file diff --git a/data/2024/aaai/FedGCR: Achieving Performance and Fairness for Federated Learning with Distinct Client Types via Group Customization and Reweighting b/data/2024/aaai/FedGCR: Achieving Performance and Fairness for Federated Learning with Distinct Client Types via Group Customization and Reweighting new file mode 100644 index 0000000000..e53dd95c44 --- /dev/null +++ b/data/2024/aaai/FedGCR: Achieving Performance and Fairness for Federated Learning with Distinct Client Types via Group Customization and Reweighting @@ -0,0 +1 @@ +To achieve better performance and greater fairness in Federated Learning (FL), much of the existing research has centered on individual clients, using domain adaptation techniques and redesigned aggregation schemes to counteract client data heterogeneity. However, an overlooked scenario exists where clients belong to distinctive groups, or, client types, in which groups of clients share similar characteristics such as device specifications or data patterns. Despite being common in group collaborations, this scenario has been overlooked in previous research, potentially leading to performance degradation and systemic biases against certain client types. To bridge this gap, we introduce Federated learning with Group Customization and Reweighting (FedGCR). FedGCR enhances both performance and fairness for FL with Distinct Client Types, consisting of a Federated Group Customization (FedGC) model to provide customization via a novel prompt tuning technique to mitigate the data disparity across different client-types, and a Federated Group Reweighting (FedGR) aggregation scheme to ensure uniform and unbiased performances between clients and between client types by a novel reweighting approach. Extensive experiment comparisons with prior FL methods in domain adaptation and fairness demonstrate the superiority of FedGCR in all metrics, including the overall accuracy and performance uniformity in both the group and the individual level. FedGCR achieves 82.74% accuracy and 12.26(↓) in performance uniformity on the Digit-Five dataset and 81.88% and 14.88%(↓) on DomainNet with a domain imbalance factor of 10, which significantly outperforms the state-of-the-art. Code is available at https://github.com/celinezheng/fedgcr. \ No newline at end of file diff --git a/data/2024/aaai/FedLF: Layer-Wise Fair Federated Learning b/data/2024/aaai/FedLF: Layer-Wise Fair Federated Learning new file mode 100644 index 0000000000..a51b520588 --- /dev/null +++ b/data/2024/aaai/FedLF: Layer-Wise Fair Federated Learning @@ -0,0 +1 @@ +Fairness has become an important concern in Federated Learning (FL). An unfair model that performs well for some clients while performing poorly for others can reduce the willingness of clients to participate. In this work, we identify a direct cause of unfairness in FL - the use of an unfair direction to update the global model, which favors some clients while conflicting with other clients’ gradients at the model and layer levels. To address these issues, we propose a layer-wise fair Federated Learning algorithm (FedLF). Firstly, we formulate a multi-objective optimization problem with an effective fair-driven objective for FL. 
A layer-wise fair direction is then calculated to mitigate the model- and layer-level gradient conflicts and reduce the improvement bias. We further provide a theoretical analysis of how FedLF can improve fairness and guarantee convergence. Extensive experiments on different learning tasks and models demonstrate that FedLF outperforms the SOTA FL algorithms in terms of accuracy and fairness. The source code is available at https://github.com/zibinpan/FedLF. \ No newline at end of file diff --git a/data/2024/aaai/FedLPS: Heterogeneous Federated Learning for Multiple Tasks with Local Parameter Sharing b/data/2024/aaai/FedLPS: Heterogeneous Federated Learning for Multiple Tasks with Local Parameter Sharing new file mode 100644 index 0000000000..3d4da5ba67 --- /dev/null +++ b/data/2024/aaai/FedLPS: Heterogeneous Federated Learning for Multiple Tasks with Local Parameter Sharing @@ -0,0 +1 @@ +Federated Learning (FL) has emerged as a promising solution in Edge Computing (EC) environments to process the proliferation of data generated by edge devices. By collaboratively optimizing the global machine learning models on distributed edge devices, FL circumvents the need for transmitting raw data and enhances user privacy. Despite practical successes, FL still confronts significant challenges including constrained edge device resources, the deployment of multiple tasks, and data heterogeneity. However, existing studies focus on mitigating the FL training costs of each single task while neglecting the resource consumption across multiple tasks in heterogeneous FL scenarios. In this paper, we propose Heterogeneous Federated Learning with Local Parameter Sharing (FedLPS) to fill this gap. FedLPS leverages principles from transfer learning to facilitate the deployment of multiple tasks on a single device by dividing the local model into a shareable encoder and task-specific encoders. To further reduce resource consumption, a channel-wise model pruning algorithm that shrinks the footprint of local models while accounting for both data and system heterogeneity is employed in FedLPS. Additionally, a novel heterogeneous model aggregation algorithm is proposed to aggregate the heterogeneous predictors in FedLPS. We implemented the proposed FedLPS on a real FL platform and compared it with state-of-the-art (SOTA) FL frameworks. The experimental results on five popular datasets and two modern DNN models illustrate that the proposed FedLPS significantly outperforms the SOTA FL frameworks by up to 4.88% and reduces the computational resource consumption by 21.3%. Our code is available at: https://github.com/jyzgh/FedLPS. \ No newline at end of file diff --git a/data/2024/aaai/FedMut: Generalized Federated Learning via Stochastic Mutation b/data/2024/aaai/FedMut: Generalized Federated Learning via Stochastic Mutation new file mode 100644 index 0000000000..7c03e3a976 --- /dev/null +++ b/data/2024/aaai/FedMut: Generalized Federated Learning via Stochastic Mutation @@ -0,0 +1 @@ +Although Federated Learning (FL) enables collaborative model training without sharing the raw data of clients, it encounters low-performance problems caused by various heterogeneous scenarios. Due to the limitation of dispatching the same global model to clients for local training, traditional Federated Average (FedAvg)-based FL models face the problem of easily getting stuck in a sharp solution, which results in training a low-performance global model.
To address this problem, this paper presents a novel FL approach named FedMut, which mutates the global model according to the gradient change to generate several intermediate models for the next round of training. Each intermediate model will be dispatched to a client for local training. Eventually, the global model converges to a flat region within the range of the mutated models and generalizes better than the global model trained by FedAvg. Experimental results on well-known datasets demonstrate the effectiveness of our FedMut approach in various data heterogeneity scenarios. \ No newline at end of file diff --git a/data/2024/aaai/FedNS: A Fast Sketching Newton-Type Algorithm for Federated Learning b/data/2024/aaai/FedNS: A Fast Sketching Newton-Type Algorithm for Federated Learning new file mode 100644 index 0000000000..54f48c3bde --- /dev/null +++ b/data/2024/aaai/FedNS: A Fast Sketching Newton-Type Algorithm for Federated Learning @@ -0,0 +1 @@ +Recent Newton-type federated learning algorithms have demonstrated linear convergence with respect to the communication rounds. However, communicating Hessian matrices is often infeasible due to their quadratic communication complexity. In this paper, we introduce a novel approach to tackle this issue while still achieving fast convergence rates. Our proposed method, named Federated Newton Sketch (FedNS), approximates the centralized Newton's method by communicating the sketched square-root Hessian instead of the exact Hessian. To enhance communication efficiency, we reduce the sketch size to match the effective dimension of the Hessian matrix. We provide a convergence analysis based on statistical learning for the federated Newton sketch approaches. Specifically, our approaches reach super-linear convergence rates w.r.t. the communication rounds for the first time. We validate the effectiveness of our algorithms through various experiments, which coincide with our theoretical findings. \ No newline at end of file diff --git a/data/2024/aaai/FedST: Federated Style Transfer Learning for Non-IID Image Segmentation b/data/2024/aaai/FedST: Federated Style Transfer Learning for Non-IID Image Segmentation new file mode 100644 index 0000000000..62fdabed77 --- /dev/null +++ b/data/2024/aaai/FedST: Federated Style Transfer Learning for Non-IID Image Segmentation @@ -0,0 +1 @@ +Federated learning collaboratively trains machine learning models among different clients while preserving data privacy, and it has become the mainstream approach to breaking data silos. However, the non-independent and identically distributed (i.e., Non-IID) characteristic of different image domains among different clients reduces the benefits of federated learning and has become a bottleneck problem restricting the accuracy and generalization of federated models. In this work, we propose a novel federated image segmentation method based on style transfer, FedST, which uses a denoising diffusion probabilistic model to achieve feature disentanglement and image synthesis of cross-domain image data between multiple clients. Thus it can share style features among clients while protecting the structure features of image data, which effectively alleviates the influence of the Non-IID phenomenon. Experiments show that our method achieves superior segmentation performance compared to state-of-the-art methods on four different Non-IID datasets in both objective and subjective assessments. The code is available at https://github.com/YoferChen/FedST.
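The mutation step in FedMut can be pictured with a short sketch: perturb the aggregated global model along the direction of the most recent round's update so that each client trains a slightly different intermediate model. The sign-alternating perturbation and the scale alpha below are simplifying assumptions for illustration; the paper derives its mutations from the gradient change itself.

import copy
import torch

def mutate_global_model(global_model, prev_global_model, num_clients, alpha=0.5):
    """Return one perturbed copy of the global model per client for the next training round."""
    delta = {n: p.detach() - q.detach()
             for (n, p), (_, q) in zip(global_model.named_parameters(),
                                       prev_global_model.named_parameters())}
    mutated = []
    for k in range(num_clients):
        sign = 1.0 if k % 2 == 0 else -1.0      # alternate directions around the global model
        m = copy.deepcopy(global_model)
        with torch.no_grad():
            for n, p in m.named_parameters():
                p.add_(sign * alpha * delta[n])
        mutated.append(m)
    return mutated

Averaging the locally trained mutants then pulls the next global model toward a region that is flat in every perturbed direction, which is the intuition behind the improved generalization reported above.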
\ No newline at end of file diff --git a/data/2024/aaai/FedTGP: Trainable Global Prototypes with Adaptive-Margin-Enhanced Contrastive Learning for Data and Model Heterogeneity in Federated Learning b/data/2024/aaai/FedTGP: Trainable Global Prototypes with Adaptive-Margin-Enhanced Contrastive Learning for Data and Model Heterogeneity in Federated Learning new file mode 100644 index 0000000000..962db0623f --- /dev/null +++ b/data/2024/aaai/FedTGP: Trainable Global Prototypes with Adaptive-Margin-Enhanced Contrastive Learning for Data and Model Heterogeneity in Federated Learning @@ -0,0 +1 @@ +Recently, Heterogeneous Federated Learning (HtFL) has attracted attention due to its ability to support heterogeneous models and data. To reduce the high communication cost of transmitting model parameters, a major challenge in HtFL, prototype-based HtFL methods are proposed to solely share class representatives, a.k.a. prototypes, among heterogeneous clients while maintaining the privacy of clients’ models. However, these prototypes are naively aggregated into global prototypes on the server using weighted averaging, resulting in suboptimal global knowledge that negatively impacts the performance of clients. To overcome this challenge, we introduce a novel HtFL approach called FedTGP, which leverages our Adaptive-margin-enhanced Contrastive Learning (ACL) to learn Trainable Global Prototypes (TGP) on the server. By incorporating ACL, our approach enhances prototype separability while preserving semantic meaning. Extensive experiments with twelve heterogeneous models demonstrate that our FedTGP surpasses state-of-the-art methods by up to 9.08% in accuracy while maintaining the communication and privacy advantages of prototype-based HtFL. Our code is available at https://github.com/TsingZ0/FedTGP. \ No newline at end of file diff --git a/data/2024/aaai/Federated Adaptive Prompt Tuning for Multi-Domain Collaborative Learning b/data/2024/aaai/Federated Adaptive Prompt Tuning for Multi-Domain Collaborative Learning new file mode 100644 index 0000000000..f59ed572a3 --- /dev/null +++ b/data/2024/aaai/Federated Adaptive Prompt Tuning for Multi-Domain Collaborative Learning @@ -0,0 +1 @@ +Federated learning (FL) enables multiple clients to collaboratively train a global model without disclosing their data. Previous research often requires training the complete set of model parameters. However, the emergence of powerful pre-trained models makes it possible to achieve higher performance with fewer learnable parameters in FL. In this paper, we propose a federated adaptive prompt tuning algorithm, FedAPT, for multi-domain collaborative image classification with powerful foundation models, like CLIP. Compared with direct federated prompt tuning, our core idea is to adaptively unlock specific domain knowledge for each test sample in order to provide it with a personalized prompt. To implement this idea, we design an adaptive prompt tuning module, which consists of a meta prompt, an adaptive network, and some keys. The server randomly generates a set of keys and assigns a unique key to each client. Then all clients cooperatively train the global adaptive network and meta prompt with the local datasets and the frozen keys. Ultimately, the global aggregation model can assign a personalized prompt to CLIP based on the domain features of each test sample. We perform extensive experiments on two multi-domain image classification datasets across two different settings -- supervised and unsupervised.
The results show that FedAPT can achieve better performance with less than 10% of the number of parameters of the fully trained model, and the global model can perform well in diverse client domains simultaneously. \ No newline at end of file diff --git a/data/2024/aaai/Federated Causality Learning with Explainable Adaptive Optimization b/data/2024/aaai/Federated Causality Learning with Explainable Adaptive Optimization new file mode 100644 index 0000000000..2392e42ead --- /dev/null +++ b/data/2024/aaai/Federated Causality Learning with Explainable Adaptive Optimization @@ -0,0 +1 @@ +Discovering the causality from observational data is a crucial task in various scientific domains. With increasing awareness of privacy, data are not allowed to be exposed, and it is very hard to learn causal graphs from dispersed data, since these data may have different distributions. In this paper, we propose a federated causal discovery strategy (FedCausal) to learn the unified global causal graph from decentralized heterogeneous data. We design a global optimization formula to naturally aggregate the causal graphs from client data and constrain the acyclicity of the global graph without exposing local data. Unlike other federated causal learning algorithms, FedCausal unifies the local and global optimizations into a complete directed acyclic graph (DAG) learning process with a flexible optimization objective. We prove that this optimization objective has a high interpretability and can adaptively handle homogeneous and heterogeneous data. Experimental results on synthetic and real datasets show that FedCausal can effectively deal with non-independently and identically distributed (non-iid) data and has a superior performance. \ No newline at end of file diff --git a/data/2024/aaai/Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users b/data/2024/aaai/Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users new file mode 100644 index 0000000000..94154169e1 --- /dev/null +++ b/data/2024/aaai/Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users @@ -0,0 +1 @@ +We study the problem of federated contextual combinatorial cascading bandits, where agents collaborate under the coordination of a central server to provide tailored recommendations to users. Existing works consider either a synchronous framework, necessitating full agent participation and global synchronization, or assume user homogeneity with identical behaviors. We overcome these limitations by considering (1) federated agents operating in an asynchronous communication paradigm, where no mandatory synchronization is required and all agents communicate independently with the server, (2) heterogeneous user behaviors, where users can be stratified into latent user clusters, each exhibiting distinct preferences. For this setting, we propose a UCB-type algorithm with delicate communication protocols. Through theoretical analysis, we give sub-linear regret bounds on par with those achieved in the synchronous framework, while incurring only logarithmic communication costs. Empirical evaluation on synthetic and real-world datasets validates our algorithm's superior performance in terms of regrets and communication costs. 
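To give a concrete sense of what a "UCB-type" recommendation score looks like in this setting, the sketch below ranks items by a LinUCB-style optimistic estimate and returns the top-K cascade. The statistics A and b, the exploration weight alpha, and the plain ridge-regression update are generic assumptions; the paper's algorithm additionally handles asynchronous communication and latent user clusters, which this sketch omits.

import numpy as np

def recommend_cascade(item_features, A, b, alpha=1.0, K=5):
    """Rank items by theta^T x + alpha * sqrt(x^T A^{-1} x) and return the indices of the top-K."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    scores = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in item_features]
    return list(np.argsort(scores)[::-1][:K])

def update(A, b, x, reward):
    """Ridge-regression style update after observing feedback on the examined item with features x."""
    return A + np.outer(x, x), b + reward * x

In a federated variant, each agent would maintain such statistics locally and only occasionally synchronize them through the server, in line with the logarithmic communication cost mentioned above.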
\ No newline at end of file diff --git a/data/2024/aaai/Federated Graph Learning under Domain Shift with Generalizable Prototypes b/data/2024/aaai/Federated Graph Learning under Domain Shift with Generalizable Prototypes new file mode 100644 index 0000000000..0b7dd5ab2d --- /dev/null +++ b/data/2024/aaai/Federated Graph Learning under Domain Shift with Generalizable Prototypes @@ -0,0 +1 @@ +Federated Graph Learning is a privacy-preserving collaborative approach for training a shared model on graph-structured data in a distributed environment. However, in real-world scenarios, the client graph data usually originate from diverse domains, which unavoidably hinders the generalization performance of the final global model. To address this challenge, we make the first attempt to investigate this scenario by learning a well-generalizable model. In order to improve the performance of the global model from different perspectives, we propose a novel framework called Federated Graph Learning with Generalizable Prototypes (FGGP). It decouples the global model into two levels and bridges them via prototypes. These prototypes, which are semantic centers derived from the feature extractor, can provide valuable classification information. At the classification model level, we eschew traditional classifiers and instead leverage clustered prototypes to capture rich domain information and enhance the discriminative capability of the classes, improving the performance of multi-domain predictions. Furthermore, at the feature extractor level, we go beyond traditional approaches by implicitly injecting distinct global knowledge and employing contrastive learning to obtain more powerful prototypes while enhancing the generalization ability of the feature extractor. Experimental results on various datasets are presented to validate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Federated Label-Noise Learning with Local Diversity Product Regularization b/data/2024/aaai/Federated Label-Noise Learning with Local Diversity Product Regularization new file mode 100644 index 0000000000..2804ed5526 --- /dev/null +++ b/data/2024/aaai/Federated Label-Noise Learning with Local Diversity Product Regularization @@ -0,0 +1,10 @@ +Training data in federated learning (FL) frameworks can have label noise, since they must be stored and annotated on clients' devices. +If trained over such corrupted data, the models learn incorrect knowledge from the label noise, which severely degrades their performance. +Although several FL schemes are designed to combat label noise, they suffer performance degradation when the clients' devices only have limited local training samples. +To this end, a new scheme called federated label-noise learning (FedLNL) is developed in this paper. +The key problem of FedLNL is how to estimate a noise transition matrix (NTM) accurately in the case of limited local training samples. +If a gradient-based update method is used to update the local NTM on each client's device, it can generate excessively large gradients for the local NTM, causing a high estimation error of the local NTM. +To tackle this issue, an alternating update method for the local NTM and the local classifier is designed in FedLNL, where the local NTM is updated by a Bayesian inference-based update method. +Such an alternating update method makes the loss function of existing NTM-based schemes not applicable to FedLNL.
+To enable federated optimization of FedLNL, a new regularizer on the parameters of the classifier called local diversity product regularizer is designed for the loss function of FedLNL. +The results show that FedLNL improves the test accuracy of a trained model by up to 25.98%, compared with the state-of-the-art FL schemes that tackle label-noise issues. \ No newline at end of file diff --git a/data/2024/aaai/Federated Modality-Specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation b/data/2024/aaai/Federated Modality-Specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation new file mode 100644 index 0000000000..257d402014 --- /dev/null +++ b/data/2024/aaai/Federated Modality-Specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation @@ -0,0 +1 @@ +Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, it is not uncommon that some FL participants only possess a subset of the complete imaging modalities, posing inter-modal heterogeneity as a challenge to effectively training a global model on all participants’ data. In addition, each participant would expect to obtain a personalized model tailored for its local data characteristics from the FL in such a scenario. In this work, we propose a new FL framework with federated modality-specific encoders and multimodal anchors (FedMEMA) to simultaneously address the two concurrent issues. Above all, FedMEMA employs an exclusive encoder for each modality to account for the inter-modal heterogeneity in the first place. In the meantime, while the encoders are shared by the participants, the decoders are personalized to meet individual needs. Specifically, a server with full-modal data employs a fusion decoder to aggregate and fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation reversely. Meanwhile, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the encoder parameters. On the other end, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up the information loss due to absent modalities while adapting the representations of present ones. FedMEMA is validated on the BraTS 2020 benchmark for multimodal brain tumor segmentation. Results show that it outperforms various up-to-date methods for multimodal and personalized FL and that its novel designs are effective. Our code is available. \ No newline at end of file diff --git a/data/2024/aaai/Federated Partial Label Learning with Local-Adaptive Augmentation and Regularization b/data/2024/aaai/Federated Partial Label Learning with Local-Adaptive Augmentation and Regularization new file mode 100644 index 0000000000..3df96f653d --- /dev/null +++ b/data/2024/aaai/Federated Partial Label Learning with Local-Adaptive Augmentation and Regularization @@ -0,0 +1 @@ +Partial label learning (PLL) expands the applicability of supervised machine learning models by enabling effective learning from weakly annotated overcomplete labels. Existing PLL methods however focus on the standard centralized learning scenarios. 
In this paper, we expand PLL into the distributed computation setting by formalizing a new learning scenario named as federated partial label learning (FedPLL), where the training data with partial labels are distributed across multiple local clients with privacy constraints. To address this challenging problem, we propose a novel Federated PLL method with Local-Adaptive Augmentation and Regularization (FedPLL-LAAR). In addition to alleviating the partial label noise with moving-average label disambiguation, the proposed method performs MixUp-based local-adaptive data augmentation to mitigate the challenge posed by insufficient and imprecisely annotated local data, and dynamically incorporates the guidance of global model to minimize client drift through adaptive gradient alignment regularization between the global and local models. Extensive experiments conducted on multiple datasets under the FedPLL setting demonstrate the effectiveness of the proposed FedPLL-LAAR method for federated partial label learning. \ No newline at end of file diff --git a/data/2024/aaai/Federated X-armed Bandit b/data/2024/aaai/Federated X-armed Bandit new file mode 100644 index 0000000000..cbaf168c0e --- /dev/null +++ b/data/2024/aaai/Federated X-armed Bandit @@ -0,0 +1 @@ +This work establishes the first framework of federated X-armed bandit, where different clients face heterogeneous local objective functions defined on the same domain and are required to collaboratively figure out the global optimum. We propose the first federated algorithm for such problems, named Fed-PNE. By utilizing the topological structure of the global objective inside the hierarchical partitioning and the weak smoothness property, our algorithm achieves sublinear cumulative regret with respect to both the number of clients and the evaluation budget. Meanwhile, it only requires logarithmic communications between the central server and clients, protecting the client privacy. Experimental results on synthetic functions and real datasets validate the advantages of Fed-PNE over various centralized and federated baseline algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Few Shot Part Segmentation Reveals Compositional Logic for Industrial Anomaly Detection b/data/2024/aaai/Few Shot Part Segmentation Reveals Compositional Logic for Industrial Anomaly Detection new file mode 100644 index 0000000000..9220babf05 --- /dev/null +++ b/data/2024/aaai/Few Shot Part Segmentation Reveals Compositional Logic for Industrial Anomaly Detection @@ -0,0 +1 @@ +Logical anomalies (LA) refer to data violating underlying logical constraints e.g., the quantity, arrangement, or composition of components within an image. Detecting accurately such anomalies requires models to reason about various component types through segmentation. However, curation of pixel-level annotations for semantic segmentation is both time-consuming and expensive. Although there are some prior few-shot or unsupervised co-part segmentation algorithms, they often fail on images with industrial object. These images have components with similar textures and shapes, and a precise differentiation proves challenging. In this study, we introduce a novel component segmentation model for LA detection that leverages a few labeled samples and unlabeled images sharing logical constraints. To ensure consistent segmentation across unlabeled images, we employ a histogram matching loss in conjunction with an entropy loss. 
As segmentation predictions play a crucial role, we propose to enhance both local and global sample validity detection by capturing key aspects from visual semantics via three memory banks: class histograms, component composition embeddings and patch-level representations. For effective LA detection, we propose an adaptive scaling strategy to standardize anomaly scores from different memory banks in inference. Extensive experiments on the public benchmark MVTec LOCO AD reveal our method achieves 98.1% AUROC in LA detection vs. 89.6% from competing methods. \ No newline at end of file diff --git a/data/2024/aaai/Few-Shot Learning from Augmented Label-Uncertain Queries in Bongard-HOI b/data/2024/aaai/Few-Shot Learning from Augmented Label-Uncertain Queries in Bongard-HOI new file mode 100644 index 0000000000..e462607fbc --- /dev/null +++ b/data/2024/aaai/Few-Shot Learning from Augmented Label-Uncertain Queries in Bongard-HOI @@ -0,0 +1 @@ +Detecting human-object interactions (HOI) in a few-shot setting remains a challenge. Existing meta-learning methods struggle to extract representative features for classification due to the limited data, while existing few-shot HOI models rely on HOI text labels for classification. Moreover, some query images may display visual similarity to those outside their class, such as similar backgrounds between different HOI classes. This makes learning more challenging, especially with limited samples. Bongard-HOI epitomizes this HOI few-shot problem, making it the benchmark we focus on in this paper. In our proposed method, we introduce novel label-uncertain query augmentation techniques to enhance the diversity of the query inputs, aiming to distinguish the positive HOI class from the negative ones. As these augmented inputs may or may not have the same class label as the original inputs, their class label is unknown. Those belonging to a different class become hard samples due to their visual similarity to the original ones. Additionally, we introduce a novel pseudo-label generation technique that enables a mean teacher model to learn from the augmented label-uncertain inputs. We propose to augment the negative support set for the student model to enrich the semantic information, fostering diversity that challenges and enhances the student’s learning. Experimental results demonstrate that our method sets a new state-of-the-art (SOTA) performance by achieving 68.74% accuracy on the Bongard-HOI benchmark, a significant improvement over the existing SOTA of 66.59%. In our evaluation on HICO-FS, a more general few-shot recognition dataset, our method achieves 73.27% accuracy, outperforming the previous SOTA of 71.20% in the 5- way 5-shot task. \ No newline at end of file diff --git a/data/2024/aaai/Few-Shot Learning via Repurposing Ensemble of Black-Box Models b/data/2024/aaai/Few-Shot Learning via Repurposing Ensemble of Black-Box Models new file mode 100644 index 0000000000..d7bd6e9e9a --- /dev/null +++ b/data/2024/aaai/Few-Shot Learning via Repurposing Ensemble of Black-Box Models @@ -0,0 +1 @@ +This paper investigates the problem of exploiting existing solution models of previous tasks to address a related target task with limited training data. Existing approaches addressing this problem often require access to the internal parameterization of the existing solution models and possibly their training data, which is not possible in many practical settings. 
To relax this requirement, we approach this problem from a new perspective of black-box re-purposing, which augments the target inputs and leverages their corresponding outputs generated by existing black-box APIs into a feature ensemble. We hypothesize that such a feature ensemble can be learned to incorporate and encode relevant black-box knowledge into the feature representation of target data, which will compensate for their scarcity. This hypothesis is confirmed via the reported successes of our proposed black-box ensemble in solving multiple few-shot learning tasks derived from various benchmark datasets. All reported results show consistently that the set of heterogeneous black-box solutions of previous tasks can indeed be reused and combined effectively to solve a reasonably related target task without requiring access to a large training dataset. This is the first step towards enabling new possibilities to further supplement existing techniques in transfer or meta learning with black-box knowledge. \ No newline at end of file diff --git a/data/2024/aaai/Few-Shot Neural Radiance Fields under Unconstrained Illumination b/data/2024/aaai/Few-Shot Neural Radiance Fields under Unconstrained Illumination new file mode 100644 index 0000000000..4fb1f2ffca --- /dev/null +++ b/data/2024/aaai/Few-Shot Neural Radiance Fields under Unconstrained Illumination @@ -0,0 +1 @@ +In this paper, we introduce a new challenge for synthesizing novel view images in practical environments with limited input multi-view images and varying lighting conditions. Neural radiance fields (NeRF), one of the pioneering works for this task, demand an extensive set of multi-view images taken under constrained illumination, which is often unattainable in real-world settings. While some previous works have managed to synthesize novel views given images with different illumination, their performance still relies on a substantial number of input multi-view images. To address this problem, we suggest ExtremeNeRF, which utilizes multi-view albedo consistency, supported by geometric alignment. Specifically, we extract intrinsic image components that should be illumination-invariant across different views, enabling direct appearance comparison between the input and novel view under unconstrained illumination. We offer thorough experimental results for task evaluation, employing the newly created NeRF Extreme benchmark, the first in-the-wild benchmark for novel view synthesis under multiple viewing directions and varying illuminations. \ No newline at end of file diff --git a/data/2024/aaai/Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language b/data/2024/aaai/Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language new file mode 100644 index 0000000000..1d9fcfc03e --- /dev/null +++ b/data/2024/aaai/Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language @@ -0,0 +1,2 @@ +Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since the untrimmed video is overlong, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions with the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours.
Since the video is down-sampled into fixed-length clips, some query-related frames may be filtered out, which blurs the specific boundary of the target moment and takes adjacent irrelevant frames as new boundaries, easily leading to cross-modal misalignment and introducing both boundary bias and reasoning bias. To this end, in this paper, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. Besides, our proposed SpotVMR can serve as a plug-and-play module, which achieves efficiency for state-of-the-art VMR methods while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search, conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search for the query-relevant moment. Also, a distillation loss is utilized to address the optimization issues arising from end-to-end joint training of the clip selector and VMR model. +Extensive experiments on three challenging datasets demonstrate its effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Find the Lady: Permutation and Re-synchronization of Deep Neural Networks b/data/2024/aaai/Find the Lady: Permutation and Re-synchronization of Deep Neural Networks new file mode 100644 index 0000000000..2cbd6bc175 --- /dev/null +++ b/data/2024/aaai/Find the Lady: Permutation and Re-synchronization of Deep Neural Networks @@ -0,0 +1,2 @@ +Deep neural networks are characterized by multiple symmetrical, equi-loss solutions that are redundant. Thus, the order of neurons in a layer and feature maps can be given arbitrary permutations, without affecting (or minimally affecting) their output. If we shuffle these neurons, or apply some perturbations to them (like fine-tuning), can we put them back in the original order, i.e., re-synchronize? Is there a possible corruption threat? Answering these questions is important for applications like neural network white-box watermarking for ownership tracking and integrity verification. +We advance a method to re-synchronize the order of permuted neurons. Our method is also effective if neurons are further altered by parameter pruning, quantization, and fine-tuning, showing robustness to integrity attacks. Additionally, we provide theoretical and practical evidence for the usual means to corrupt the integrity of the model, resulting in a solution to counter it. We test our approach on popular computer vision datasets and models, and we illustrate the threat and our countermeasure on a popular white-box watermarking method. \ No newline at end of file diff --git a/data/2024/aaai/Finding Visual Saliency in Continuous Spike Stream b/data/2024/aaai/Finding Visual Saliency in Continuous Spike Stream new file mode 100644 index 0000000000..7533d335a3 --- /dev/null +++ b/data/2024/aaai/Finding Visual Saliency in Continuous Spike Stream @@ -0,0 +1 @@ +As a bio-inspired vision sensor, the spike camera emulates the operational principles of the fovea, a compact retinal region, by employing spike discharges to encode the accumulation of per-pixel luminance intensity. Leveraging its high temporal resolution and bio-inspired neuromorphic design, the spike camera holds significant promise for advancing computer vision applications. Saliency detection mimics the behavior of human beings and captures the most salient region of a scene.
In this paper, we investigate visual saliency in the continuous spike stream for the first time. To effectively process the binary spike stream, we propose a Recurrent Spiking Transformer (RST) framework, which is based on a full spiking neural network. Our framework enables the extraction of spatio-temporal features from the continuous spatio-temporal spike stream while maintaining low power consumption. To facilitate the training and validation of our proposed model, we build a comprehensive real-world spike-based visual saliency dataset, enriched with numerous lighting conditions. Extensive experiments demonstrate the superior performance of our Recurrent Spiking Transformer framework in comparison to other spiking neural network-based methods. Our framework exhibits a substantial margin of improvement in capturing and highlighting visual saliency in the spike stream, which not only provides a new perspective for spike-based saliency segmentation but also shows a new paradigm for full SNN-based transformer models. The code and dataset are available at https://github.com/BIT-Vision/SVS. \ No newline at end of file diff --git "a/data/2024/aaai/Finding \316\265 and \316\264 of Traditional Disclosure Control Systems" "b/data/2024/aaai/Finding \316\265 and \316\264 of Traditional Disclosure Control Systems" new file mode 100644 index 0000000000..83a62a489e --- /dev/null +++ "b/data/2024/aaai/Finding \316\265 and \316\264 of Traditional Disclosure Control Systems" @@ -0,0 +1 @@ +This paper analyzes the privacy of traditional Statistical Disclosure Control (SDC) systems under a differential privacy interpretation. SDCs, such as cell suppression and swapping, promise to safeguard the confidentiality of data and are routinely adopted in data analyses with profound societal and economic impacts. Through a formal analysis and empirical evaluation of demographic data from real households in the U.S., the paper shows that widely adopted SDC systems not only induce vastly larger privacy losses than classical differential privacy mechanisms, but may also come at a greater cost to accuracy and fairness. \ No newline at end of file diff --git a/data/2024/aaai/Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction b/data/2024/aaai/Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction new file mode 100644 index 0000000000..0bd09e34c6 --- /dev/null +++ b/data/2024/aaai/Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction @@ -0,0 +1,2 @@ +Pixel-aligned implicit models, such as PIFu, PIFuHD, and ICON, are used for single-view clothed human reconstruction. These models need to be trained using a sampling training scheme. Existing sampling training schemes either fail to capture thin surfaces (e.g. ears, fingers) or cause noisy artefacts in reconstructed meshes. To address these problems, we introduce Fine Structure-Aware Sampling (FSS), a new sampling training scheme to train pixel-aligned implicit models for single-view human reconstruction. FSS resolves the aforementioned problems by proactively adapting to the thickness and complexity of surfaces. In addition, unlike existing sampling training schemes, FSS shows how normals of sample points can be capitalized on in the training process to improve results.
+Lastly, to further improve the training process, FSS proposes a mesh thickness loss signal for pixel-aligned implicit models. It becomes computationally feasible to introduce this loss once a slight reworking of the pixel-aligned implicit function framework is carried out. Our results show that our methods significantly outperform SOTA methods qualitatively and quantitatively. Our code is publicly available at https://github.com/kcyt/FSS. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Grained Distillation for Long Document Retrieval b/data/2024/aaai/Fine-Grained Distillation for Long Document Retrieval new file mode 100644 index 0000000000..c839a79a91 --- /dev/null +++ b/data/2024/aaai/Fine-Grained Distillation for Long Document Retrieval @@ -0,0 +1 @@ +Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become the de facto approach to improving a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the scope hypothesis that a long document may cover multiple topics. This maximizes their structural heterogeneity and poses a granular-mismatch issue, leading to inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces globally consistent representations across different fine granularities and then applies multi-granular aligned distillation merely during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, which show state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Grained Knowledge Selection and Restoration for Non-exemplar Class Incremental Learning b/data/2024/aaai/Fine-Grained Knowledge Selection and Restoration for Non-exemplar Class Incremental Learning new file mode 100644 index 0000000000..c076d8a74c --- /dev/null +++ b/data/2024/aaai/Fine-Grained Knowledge Selection and Restoration for Non-exemplar Class Incremental Learning @@ -0,0 +1,2 @@ +Non-exemplar class incremental learning aims to learn both the new and old tasks without accessing any training data from the past. This strict restriction increases the difficulty of alleviating catastrophic forgetting since all techniques can only be applied to current task data. Considering this challenge, we propose a novel framework of fine-grained knowledge selection and restoration. The conventional knowledge distillation-based methods place too strict constraints on the network parameters and features to prevent forgetting, which limits the training of new tasks. To loosen this constraint, we propose a novel fine-grained selective patch-level distillation to adaptively balance plasticity and stability. Some task-agnostic patches can be used to preserve the decision boundary of the old task, while patches containing the important foreground are favorable for learning the new task. + Moreover, we employ a task-agnostic mechanism to generate more realistic prototypes of old tasks with current task samples to reduce classifier bias for fine-grained knowledge restoration. Extensive experiments on CIFAR100, TinyImageNet and ImageNet-Subset demonstrate the effectiveness of our method. Code is available at https://github.com/scok30/vit-cil.
\ No newline at end of file diff --git a/data/2024/aaai/Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering b/data/2024/aaai/Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering new file mode 100644 index 0000000000..28770b309f --- /dev/null +++ b/data/2024/aaai/Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering @@ -0,0 +1 @@ +Reconstructing high-fidelity hand models with intricate textures plays a crucial role in enhancing human-object interaction and advancing real-world applications. Although state-of-the-art methods excel in texture generation and image rendering, they often face challenges in accurately capturing geometric details. Learning-based approaches usually offer better robustness and faster inference, but they tend to produce smoother results and require substantial amounts of training data. To address these issues, we present a novel fine-grained multi-view hand mesh reconstruction method that leverages inverse rendering to restore hand poses and intricate details. Firstly, our approach predicts a parametric hand mesh model through a Graph Convolutional Network (GCN)-based method from multi-view images. We further introduce a novel Hand Albedo and Mesh (HAM) optimization module to refine both the hand mesh and textures, which is capable of preserving the mesh topology. In addition, we suggest an effective mesh-based neural rendering scheme to simultaneously generate photo-realistic images and optimize mesh geometry by fusing the pre-trained rendering network with vertex features. We conduct comprehensive experiments on InterHand2.6M, DeepHandMesh and a dataset collected by ourselves, whose promising results show that our proposed approach outperforms the state-of-the-art methods on both reconstruction accuracy and rendering quality. Code and dataset are publicly available at https://github.com/agnJason/FMHR. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Grained Prototypes Distillation for Few-Shot Object Detection b/data/2024/aaai/Fine-Grained Prototypes Distillation for Few-Shot Object Detection new file mode 100644 index 0000000000..0032b84405 --- /dev/null +++ b/data/2024/aaai/Fine-Grained Prototypes Distillation for Few-Shot Object Detection @@ -0,0 +1 @@ +Few-shot object detection (FSOD) aims at extending a generic detector for novel object detection with only a few training examples. It has attracted great attention recently due to its practical significance. Meta-learning has been demonstrated to be an effective paradigm for this task. In general, methods based on meta-learning employ an additional support branch to encode novel examples (a.k.a. support images) into class prototypes, which are then fused with the query branch to facilitate the model prediction. However, the class-level prototypes are difficult to precisely generate, and they also lack detailed information, leading to instability in performance. New methods are required to capture the distinctive local context for more robust novel object detection. To this end, we propose to distill the most representative support features into fine-grained prototypes. These prototypes are then assigned to query feature maps based on the matching results, modeling the detailed feature relations between two branches. This process is realized by our Fine-Grained Feature Aggregation (FFA) module. Moreover, in terms of high-level feature fusion, we propose a Balanced Class-Agnostic Sampling (B-CAS) strategy and a Non-Linear Fusion (NLF) module from different perspectives.
They are complementary to each other and depict the high-level feature relations more effectively. Extensive experiments on PASCAL VOC and MS COCO benchmarks show that our method sets a new state-of-the-art performance in most settings. Our code is available at https://github.com/wangchen1801/FPD. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Tuning Graph Neural Networks by Preserving Graph Generative Patterns b/data/2024/aaai/Fine-Tuning Graph Neural Networks by Preserving Graph Generative Patterns new file mode 100644 index 0000000000..2286789009 --- /dev/null +++ b/data/2024/aaai/Fine-Tuning Graph Neural Networks by Preserving Graph Generative Patterns @@ -0,0 +1,8 @@ +Recently, the paradigm of pre-training and fine-tuning graph neural networks has been intensively studied and applied in a wide range of graph mining tasks. +Its success is generally attributed to the structural consistency between pre-training and downstream datasets, which, however, does not hold in many real-world scenarios. +Existing works have shown that the structural divergence between pre-training and downstream graphs significantly limits the transferability when using the vanilla fine-tuning strategy. This divergence leads to model overfitting on pre-training graphs and causes difficulties in capturing the structural properties of the downstream graphs. +In this paper, we identify the fundamental cause of structural divergence as the discrepancy of generative patterns between the pre-training and downstream graphs. +Furthermore, we propose G-Tuning to preserve the generative patterns of downstream graphs. +Given a downstream graph G, the core idea is to tune the pre-trained GNN so that it can reconstruct the generative patterns of G, the graphon W. +However, the exact reconstruction of a graphon is known to be computationally expensive. To overcome this challenge, we provide a theoretical analysis that establishes the existence of a set of alternative graphons called graphon bases for any given graphon. By utilizing a linear combination of these graphon bases, we can efficiently approximate W. This theoretical finding forms the basis of our model, as it enables effective learning of the graphon bases and their associated coefficients. +Compared with existing algorithms, G-Tuning demonstrates consistent performance improvement in 7 in-domain and 7 out-of-domain transfer learning experiments. \ No newline at end of file diff --git a/data/2024/aaai/Fine-Tuning Large Language Model Based Explainable Recommendation with Explainable Quality Reward b/data/2024/aaai/Fine-Tuning Large Language Model Based Explainable Recommendation with Explainable Quality Reward new file mode 100644 index 0000000000..218020f739 --- /dev/null +++ b/data/2024/aaai/Fine-Tuning Large Language Model Based Explainable Recommendation with Explainable Quality Reward @@ -0,0 +1 @@ +Large language model-based explainable recommendation (LLM-based ER) systems can provide remarkable human-like explanations and have widely received attention from researchers. However, the original LLM-based ER systems face three low-quality problems in their generated explanations, i.e., lack of personalization, inconsistency, and questionable explanation data. 
To address these problems, we propose a novel LLM-based ER model denoted as LLM2ER to serve as a backbone and devise two innovative explainable quality reward models for fine-tuning such a backbone in a reinforcement learning paradigm, ultimately yielding a fine-tuned model denoted as LLM2ER-EQR, which can provide high-quality explanations. LLM2ER-EQR can generate personalized, informative, and consistent high-quality explanations learned from questionable-quality explanation datasets. Extensive experiments conducted on three real-world datasets demonstrate that our model can generate fluent, diverse, informative, and highly personalized explanations. \ No newline at end of file diff --git a/data/2024/aaai/Finetuning LLMs for Automatic Concept to TTI Prompt Generation (Student Abstract) b/data/2024/aaai/Finetuning LLMs for Automatic Concept to TTI Prompt Generation (Student Abstract) new file mode 100644 index 0000000000..ba4b2c7553 --- /dev/null +++ b/data/2024/aaai/Finetuning LLMs for Automatic Concept to TTI Prompt Generation (Student Abstract) @@ -0,0 +1 @@ +Our work explores bridging the gap between large language models and text-to-image models to create a tool for quickly and easily generating high-quality images from a given concept. In our experiments we successfully improved image quality with only a preliminary utilization of the available resources for finetuning. \ No newline at end of file diff --git a/data/2024/aaai/Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs b/data/2024/aaai/Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs new file mode 100644 index 0000000000..d6f0371ad2 --- /dev/null +++ b/data/2024/aaai/Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs @@ -0,0 +1 @@ +We study the multi-agent multi-armed bandit (MAMAB) problem, where agents are factored into overlapping groups. Each group represents a hyperedge, forming a hypergraph over the agents. At each round of interaction, the learner pulls a joint arm (composed of individual arms for each agent) and receives a reward according to the hypergraph structure. Specifically, we assume there is a local reward for each hyperedge, and the reward of the joint arm is the sum of these local rewards. Previous work introduced the multi-agent Thompson sampling (MATS) algorithm and derived a Bayesian regret bound. However, it remains an open problem how to derive a frequentist regret bound for Thompson sampling in this multi-agent setting. To address these issues, we propose an efficient variant of MATS, the epsilon-exploring Multi-Agent Thompson Sampling (eps-MATS) algorithm, which performs MATS exploration with probability epsilon and adopts a greedy policy otherwise. We prove that eps-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size. We also derive a lower bound for this setting, which implies our frequentist regret upper bound is optimal up to constant and logarithmic terms when the hypergraph is sufficiently sparse. Thorough experiments on standard MAMAB problems demonstrate the superior performance and the improved computational efficiency of eps-MATS compared with existing algorithms in the same setting.
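As a rough illustration of the epsilon-exploring idea described above, the sketch below draws local-reward parameters from their posteriors with probability epsilon (the Thompson-sampling step) and otherwise uses posterior means (the greedy step), then maximizes the factored reward over joint arms by brute force. Gaussian posteriors, the dictionary layout, and the exhaustive maximization are assumptions made for illustration; they are not claimed to match the paper's implementation.

```python
import itertools
import numpy as np

def eps_mats_step(posteriors, hyperedges, arm_sizes, eps, rng):
    """One decision step of an epsilon-exploring Thompson-sampling scheme
    on a factored (hypergraph) bandit.

    posteriors[e]: maps a tuple of local arms (one per agent in hyperedge e)
                   to a (mean, std) posterior for that hyperedge's local reward.
    hyperedges[e]: tuple of agent indices covered by hyperedge e.
    arm_sizes[i]:  number of arms available to agent i.
    """
    explore = rng.random() < eps
    sampled = {
        e: {a: (rng.normal(m, s) if explore else m) for a, (m, s) in table.items()}
        for e, table in posteriors.items()
    }

    best_joint, best_val = None, -np.inf
    for joint in itertools.product(*[range(n) for n in arm_sizes]):
        # Joint-arm value = sum of (sampled or greedy) local rewards over hyperedges.
        val = sum(sampled[e][tuple(joint[i] for i in agents)]
                  for e, agents in hyperedges.items())
        if val > best_val:
            best_joint, best_val = joint, val
    return best_joint
```

After the joint arm is played, the posterior of each hyperedge's pulled local arm would be updated with the observed local reward, exactly as in standard Thompson sampling.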
\ No newline at end of file diff --git a/data/2024/aaai/FlexKBQA: A Flexible LLM-Powered Framework for Few-Shot Knowledge Base Question Answering b/data/2024/aaai/FlexKBQA: A Flexible LLM-Powered Framework for Few-Shot Knowledge Base Question Answering new file mode 100644 index 0000000000..a5891111f5 --- /dev/null +++ b/data/2024/aaai/FlexKBQA: A Flexible LLM-Powered Framework for Few-Shot Knowledge Base Question Answering @@ -0,0 +1 @@ +Knowledge base question answering (KBQA) is a critical yet challenging task due to the vast number of entities within knowledge bases and the diversity of natural language questions posed by users. Unfortunately, the performance of most KBQA models tends to decline significantly in real-world scenarios where high-quality annotated data is insufficient. To mitigate the burden associated with manual annotation, we introduce FlexKBQA by utilizing Large Language Models (LLMs) as program translators for addressing the challenges inherent in the few-shot KBQA task. Specifically, FlexKBQA leverages automated algorithms to sample diverse programs, such as SPARQL queries, from the knowledge base, which are subsequently converted into natural language questions via LLMs. This synthetic dataset facilitates training a specialized lightweight model for the KB. Additionally, to reduce the barrier posed by the distribution shift between synthetic data and real user questions, FlexKBQA introduces an execution-guided self-training method to iteratively leverage unlabeled user questions. Furthermore, we explore harnessing the inherent reasoning capability of LLMs to enhance the entire framework. Consequently, FlexKBQA delivers substantial flexibility, encompassing data annotation and deployment, while being domain agnostic. Through extensive experiments on GrailQA, WebQSP, and KQA Pro, we observe that under few-shot and even the more challenging zero-shot scenarios, FlexKBQA achieves impressive results with a few annotations, surpassing all previous baselines and even approaching the performance of supervised models, achieving a remarkable 93% performance relative to the fully supervised models. We posit that FlexKBQA represents a significant advancement towards exploring better integration of large and lightweight models. Code is available at https://github.com/leezythu/FlexKBQA. \ No newline at end of file diff --git a/data/2024/aaai/FlexiBO: A Decoupled Cost-Aware Multi-objective Optimization Approach for Deep Neural Networks (Abstract Reprint) b/data/2024/aaai/FlexiBO: A Decoupled Cost-Aware Multi-objective Optimization Approach for Deep Neural Networks (Abstract Reprint) new file mode 100644 index 0000000000..68866dfcad --- /dev/null +++ b/data/2024/aaai/FlexiBO: A Decoupled Cost-Aware Multi-objective Optimization Approach for Deep Neural Networks (Abstract Reprint) @@ -0,0 +1 @@ +The design of machine learning systems often requires trading off different objectives, for example, prediction error and energy consumption for deep neural networks (DNNs). Typically, no single design performs well in all objectives; therefore, finding Pareto-optimal designs is of interest. The search for Pareto-optimal designs involves evaluating designs in an iterative process, and the measurements are used to evaluate an acquisition function that guides the search process. However, measuring different objectives incurs different costs.
For example, the cost of measuring the prediction error of DNNs is orders of magnitude higher than that of measuring the energy consumption of a pre-trained DNN as it requires re-training the DNN. Current state-of-the-art methods do not consider this difference in objective evaluation cost, potentially incurring expensive evaluations of objective functions in the optimization process. In this paper, we develop a novel decoupled and cost-aware multi-objective optimization algorithm, which we call Flexible Multi-Objective Bayesian Optimization (FlexiBO) to address this issue. For evaluating each design, FlexiBO selects the objective with higher relative gain by weighting the improvement of the hypervolume of the Pareto region with the measurement cost of each objective. This strategy, therefore, balances the expense of collecting new information with the knowledge gained through objective evaluations, preventing FlexiBO from performing expensive measurements for little to no gain. We evaluate FlexiBO on seven state-of-the-art DNNs for image recognition, natural language processing (NLP), and speech-to-text translation. Our results indicate that, given the same total experimental budget, FlexiBO discovers designs with 4.8% to 12.4% lower hypervolume error than the best method in state-of-the-art multi-objective optimization. \ No newline at end of file diff --git a/data/2024/aaai/Flood Insights: Integrating Remote and Social Sensing Data for Flood Exposure, Damage, and Urgent Needs Mapping b/data/2024/aaai/Flood Insights: Integrating Remote and Social Sensing Data for Flood Exposure, Damage, and Urgent Needs Mapping new file mode 100644 index 0000000000..cdbfb42161 --- /dev/null +++ b/data/2024/aaai/Flood Insights: Integrating Remote and Social Sensing Data for Flood Exposure, Damage, and Urgent Needs Mapping @@ -0,0 +1 @@ +The absence of comprehensive situational awareness information poses a significant challenge for humanitarian organizations during their response efforts. We present Flood Insights, an end-to-end system that ingests data from multiple non-traditional data sources such as remote sensing, social sensing, and geospatial data. We employ state-of-the-art natural language processing and computer vision models to identify flood exposure, ground-level damage and flood reports, and most importantly, urgent needs of affected people. We deploy and test the system during a recent real-world catastrophe, the 2022 Pakistan floods, to surface critical situational and damage information at the district level. We validated the system's effectiveness through geographic regression analysis using official ground-truth data, showcasing its strong performance and explanatory power. Moreover, the system was commended by the United Nations Development Programme stationed in Pakistan, as well as local authorities, for pinpointing hard-hit districts and enhancing disaster response. 
\ No newline at end of file diff --git a/data/2024/aaai/Flow-Event Autoencoder: Event Stream Object Recognition Dataset Generation with Arbitrary High Temporal Resolution b/data/2024/aaai/Flow-Event Autoencoder: Event Stream Object Recognition Dataset Generation with Arbitrary High Temporal Resolution new file mode 100644 index 0000000000..8dba40ea99 --- /dev/null +++ b/data/2024/aaai/Flow-Event Autoencoder: Event Stream Object Recognition Dataset Generation with Arbitrary High Temporal Resolution @@ -0,0 +1 @@ +Event camera has unique advantages in high temporal resolution and dynamic range and has shown potentials in several computer vision tasks. However, due to the novelty of this hardware, there’s a lack of large benchmark DVS event-stream datasets, including datasets for object recognition. In this work, we proposed an encoder-decoder method to augment event stream dataset from image and optical flow with arbitrary temporal resolution for object recognition task. We believe this proposed method can be generalized well in augmenting event stream vision data for object recognition and will help advance the development of event vision paradigm. \ No newline at end of file diff --git a/data/2024/aaai/Fluctuation-Based Adaptive Structured Pruning for Large Language Models b/data/2024/aaai/Fluctuation-Based Adaptive Structured Pruning for Large Language Models new file mode 100644 index 0000000000..42459d83fd --- /dev/null +++ b/data/2024/aaai/Fluctuation-Based Adaptive Structured Pruning for Large Language Models @@ -0,0 +1,2 @@ +Network Pruning is a promising way to address the huge computing resource demands of the deployment and inference of Large Language Models (LLMs). Retraining-free is important for LLMs' pruning methods. However, almost all of the existing retraining-free pruning approaches for LLMs focus on unstructured pruning, which requires specific hardware support for acceleration. In this paper, we propose a novel retraining-free structured pruning framework for LLMs, named FLAP (FLuctuation-based Adaptive +Structured Pruning). It is hardware-friendly by effectively reducing storage and enhancing inference speed. For effective structured pruning of LLMs, we highlight three critical elements that demand the utmost attention: formulating structured importance metrics, adaptively searching the global compressed model, and implementing compensation mechanisms to mitigate performance loss. First, FLAP determines whether the output feature map is easily recoverable when a column of weight is removed, based on the fluctuation pruning metric. Then it standardizes the importance scores to adaptively determine the global compressed model structure. At last, FLAP adds additional bias terms to recover the output feature maps using the baseline values. We thoroughly evaluate our approach on a variety of language benchmarks. Without any retraining, our method significantly outperforms the state-of-the-art methods, including LLM-Pruner and the extension of Wanda in structured pruning. The code is released at https://github.com/CASIA-IVA-Lab/FLAP. 
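A minimal sketch of the two ingredients named in the abstract, a fluctuation-style importance metric and bias compensation from baseline activations, applied to a single torch.nn.Linear layer is given below. The exact metric, the global standardization of scores, and FLAP's adaptive search for the compressed structure are not reproduced; the function name and the keep_ratio parameter are illustrative.

```python
import torch

@torch.no_grad()
def fluctuation_prune_linear(linear, calib_inputs, keep_ratio=0.5):
    """Prune input channels of a Linear layer by an activation-fluctuation score,
    folding the pruned channels' mean contribution into the bias (illustrative).

    calib_inputs: (N, in_features) calibration activations feeding this layer.
    """
    mean = calib_inputs.mean(dim=0)                  # baseline value per input channel
    fluct = calib_inputs.var(dim=0, unbiased=False)  # fluctuation per input channel
    # Channels whose activations barely fluctuate are well recovered by their mean,
    # so importance = fluctuation weighted by the column's squared weight mass.
    importance = fluct * linear.weight.pow(2).sum(dim=0)

    n_keep = max(1, int(keep_ratio * importance.numel()))
    keep = torch.topk(importance, n_keep).indices.sort().values
    drop = torch.tensor([j for j in range(importance.numel()) if j not in set(keep.tolist())],
                        dtype=torch.long)

    # Compensation: replace pruned channels by their mean contribution via the bias.
    bias = linear.bias.clone() if linear.bias is not None else torch.zeros(linear.out_features)
    if drop.numel() > 0:
        bias += linear.weight[:, drop] @ mean[drop]

    pruned = torch.nn.Linear(n_keep, linear.out_features, bias=True)
    pruned.weight.copy_(linear.weight[:, keep])
    pruned.bias.copy_(bias)
    return pruned, keep
```

A caller would run a small calibration batch through the model to collect calib_inputs for each layer, prune layer by layer, and remember the kept indices so upstream layers produce matching channels.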
\ No newline at end of file diff --git a/data/2024/aaai/FoSp: Focus and Separation Network for Early Smoke Segmentation b/data/2024/aaai/FoSp: Focus and Separation Network for Early Smoke Segmentation new file mode 100644 index 0000000000..e85b086054 --- /dev/null +++ b/data/2024/aaai/FoSp: Focus and Separation Network for Early Smoke Segmentation @@ -0,0 +1 @@ +Early smoke segmentation (ESS) enables the accurate identification of smoke sources, facilitating the prompt extinguishing of fires and preventing large-scale gas leaks. However, ESS poses greater challenges than conventional object segmentation and regular smoke segmentation due to the small scale and transparent appearance of early smoke, which can result in a high miss detection rate and low precision. To address these issues, a Focus and Separation Network (FoSp) is proposed. We first introduce a Focus module employing a bidirectional cascade, which guides low-resolution and high-resolution features towards mid-resolution to locate and determine the scope of smoke, reducing the miss detection rate. Next, we propose a Separation module that separates smoke images into a pure smoke foreground and a smoke-free background, fundamentally enhancing the contrast between smoke and background and improving segmentation precision. Finally, a Domain Fusion module is developed to integrate the distinctive features of the two modules, balancing recall and precision to achieve a high F_beta. Furthermore, to promote the development of ESS, we introduce a high-quality real-world dataset called SmokeSeg, which contains more small and transparent smoke than the existing datasets. Experimental results show that our model achieves the best performance on three available smoke segmentation datasets: SYN70K (mIoU: 83.00%), SMOKE5K (F_beta: 81.6%) and SmokeSeg (F_beta: 72.05%). The code can be found at https://github.com/LujianYao/FoSp. \ No newline at end of file diff --git a/data/2024/aaai/FoX: Formation-Aware Exploration in Multi-Agent Reinforcement Learning b/data/2024/aaai/FoX: Formation-Aware Exploration in Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..8678b8d5ee --- /dev/null +++ b/data/2024/aaai/FoX: Formation-Aware Exploration in Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Recently, deep multi-agent reinforcement learning (MARL) has gained significant popularity due to its success in various cooperative multi-agent tasks. However, exploration still remains a challenging problem in MARL due to the partial observability of the agents and the exploration space that can grow exponentially as the number of agents increases. First, in order to address the scalability issue of the exploration space, we define a formation-based equivalence relation on the exploration space and aim to reduce the search space by exploring only meaningful states in different formations. Then, we propose a novel formation-aware exploration (FoX) framework that encourages partially observable agents to visit states in diverse formations by guiding them to be well aware of their current formation solely based on their own observations. Numerical results show that the proposed FoX framework significantly outperforms state-of-the-art MARL algorithms on Google Research Football (GRF) and sparse StarCraft II multi-agent challenge (SMAC) tasks.
\ No newline at end of file diff --git a/data/2024/aaai/FocalDreamer: Text-Driven 3D Editing via Focal-Fusion Assembly b/data/2024/aaai/FocalDreamer: Text-Driven 3D Editing via Focal-Fusion Assembly new file mode 100644 index 0000000000..ef75a99cab --- /dev/null +++ b/data/2024/aaai/FocalDreamer: Text-Driven 3D Editing via Focal-Fusion Assembly @@ -0,0 +1 @@ +While text-3D editing has made significant strides in leveraging score distillation sampling, emerging approaches still fall short in delivering separable, precise and consistent outcomes that are vital to content creation. In response, we introduce FocalDreamer, a framework that merges base shape with editable parts according to text prompts for fine-grained editing within desired regions. Specifically, equipped with geometry union and dual-path rendering, FocalDreamer assembles independent 3D parts into a complete object, tailored for convenient instance reuse and part-wise control. We propose geometric focal loss and style consistency regularization, which encourage focal fusion and congruent overall appearance. Furthermore, FocalDreamer generates high-fidelity geometry and PBR textures which are compatible with widely-used graphics engines. Extensive experiments have highlighted the superior editing capabilities of FocalDreamer in both quantitative and qualitative evaluations. \ No newline at end of file diff --git a/data/2024/aaai/Focus Stacking with High Fidelity and Superior Visual Effects b/data/2024/aaai/Focus Stacking with High Fidelity and Superior Visual Effects new file mode 100644 index 0000000000..f821fdd5af --- /dev/null +++ b/data/2024/aaai/Focus Stacking with High Fidelity and Superior Visual Effects @@ -0,0 +1 @@ +Focus stacking is a technique in computational photography, and it synthesizes a single all-in-focus image from different focal plane images. It is difficult for previous works to produce a high-quality all-in-focus image that meets two goals: high-fidelity to its source images and good visual effects without defects or abnormalities. This paper proposes a novel method based on optical imaging process analysis and modeling. Based on a foreground segmentation - diffusion elimination architecture, the foreground segmentation makes most of the areas in full-focus images heritage information from the source images to achieve high fidelity; diffusion elimination models the physical imaging process and is specially used to solve the transition region (TR) problem that is a long-term neglected issue and degrades visual effects of synthesized images. Based on extensive experiments on simulated dataset, existing realistic dataset and our proposed BetaFusion dataset, the results show that our proposed method can generate high-quality all-in-focus images by achieving two goals simultaneously, especially can successfully solve the TR problem and eliminate the visual effect degradation of synthesized images caused by the TR problem. \ No newline at end of file diff --git a/data/2024/aaai/Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning b/data/2024/aaai/Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning new file mode 100644 index 0000000000..676552a8fa --- /dev/null +++ b/data/2024/aaai/Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning @@ -0,0 +1,3 @@ +Visual Reinforcement Learning (RL) is a promising approach to achieve human-like intelligence. However, it currently faces challenges in learning efficiently within noisy environments. 
In contrast, humans can quickly identify task-relevant objects in distraction-filled surroundings by applying previously acquired common knowledge. Recently, foundational models in natural language processing and computer vision have achieved remarkable successes, and the common knowledge within these models can significantly benefit downstream task training. Inspired by these achievements, we aim to incorporate common knowledge from foundational models into visual RL. We propose a novel Focus-Then-Decide (FTD) framework, allowing the agent to make decisions based solely on task-relevant objects. To achieve this, we introduce an attention mechanism to select task-relevant objects from the object set returned by a foundational segmentation model, and only use the task-relevant objects for the subsequent training of the decision module. Additionally, we specifically employ two generic self-supervised objectives to facilitate the rapid learning of this attention mechanism. Experimental results on challenging tasks based on the DeepMind Control Suite and Franka Emika Robotics demonstrate that our method can quickly and accurately pinpoint objects of interest in noisy environments. Consequently, it achieves a significant performance improvement over current state-of-the-art algorithms. +Project Page: https://www.lamda.nju.edu.cn/chenc/FTD.html +Code: https://github.com/LAMDA-RL/FTD \ No newline at end of file diff --git a/data/2024/aaai/Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos b/data/2024/aaai/Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos new file mode 100644 index 0000000000..0f576d4b37 --- /dev/null +++ b/data/2024/aaai/Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos @@ -0,0 +1 @@ +Generating text-editable and pose-controllable character videos is in high demand for creating various digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and the pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only the keypoint-image pairs are used for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition ability of the pre-trained T2I model. The code and models are available at https://follow-your-pose.github.io/.
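The zero-initialized convolutional encoder mentioned above is the key to leaving the pre-trained T2I backbone untouched at the start of stage one: if the pose branch's final projection starts at zero, the branch is initially a no-op and only gradually learns to inject pose information. The toy PyTorch sketch below shows this pattern; the layer sizes and module layout are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ZeroInitPoseEncoder(nn.Module):
    """Toy pose-conditioning branch whose final projection is zero-initialized,
    so at initialization it adds nothing to the frozen T2I features."""

    def __init__(self, in_ch=3, feat_ch=320):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.SiLU(),
        )
        self.out = nn.Conv2d(128, feat_ch, 1)
        nn.init.zeros_(self.out.weight)  # zero-init: the branch starts as a no-op
        nn.init.zeros_(self.out.bias)

    def forward(self, pose_map, backbone_feat):
        # Residual injection: backbone_feat is assumed to share the spatial size
        # of pose_map and to have feat_ch channels.
        return backbone_feat + self.out(self.body(pose_map))
```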
\ No newline at end of file diff --git a/data/2024/aaai/Forecasting Bimanual Object Manipulation Sequences from Unimanual Observations b/data/2024/aaai/Forecasting Bimanual Object Manipulation Sequences from Unimanual Observations new file mode 100644 index 0000000000..87d8c7de51 --- /dev/null +++ b/data/2024/aaai/Forecasting Bimanual Object Manipulation Sequences from Unimanual Observations @@ -0,0 +1 @@ +Learning to forecast bimanual object manipulation sequences from unimanual observations has broad applications in assistive robots and augmented reality. This challenging task requires us to first infer motion from the missing arm and the object it would have been manipulating were the person bimanual, then forecast the human and object motion while maintaining hand-object contact during manipulation. Previous attempts model the hand-object interactions only implicitly, and thus tend to produce unrealistic motion where the objects float in air. We address this with a novel neural network that (i) identifies and forecasts the pose for only the objects undergoing motion through an object motion module and (ii) refines human pose predictions by encouraging hand-object contact during manipulation through an ensemble of human pose predictors. The components are also designed to be generic enough for use in both unimanual and bimanual contexts. Our approach outperforms the state-of-the-art pose forecasting methods on bimanual manipulation datasets. \ No newline at end of file diff --git a/data/2024/aaai/Formal Logic Enabled Personalized Federated Learning through Property Inference b/data/2024/aaai/Formal Logic Enabled Personalized Federated Learning through Property Inference new file mode 100644 index 0000000000..3d47c4f738 --- /dev/null +++ b/data/2024/aaai/Formal Logic Enabled Personalized Federated Learning through Property Inference @@ -0,0 +1 @@ +Recent advancements in federated learning (FL) have greatly facilitated the development of decentralized collaborative applications, particularly in the domain of Artificial Intelligence of Things (AIoT). However, a critical aspect missing from the current research landscape is the ability to enable data-driven client models with symbolic reasoning capabilities. Specifically, the inherent heterogeneity of participating client devices poses a significant challenge, as each client exhibits unique logic reasoning properties. Failing to consider these device-specific specifications can result in critical properties being missed in the client predictions, leading to suboptimal performance. In this work, we propose a new training paradigm that leverages temporal logic reasoning to address this issue. Our approach involves enhancing the training process by incorporating mechanically generated logic expressions for each FL client. Additionally, we introduce the concept of aggregation clusters and develop a partitioning algorithm to effectively group clients based on the alignment of their temporal reasoning properties. We evaluate the proposed method on two tasks: a real-world traffic volume prediction task consisting of sensory data from fifteen states and a smart city multi-task prediction utilizing synthetic data. The evaluation results exhibit clear improvements, with performance accuracy improved by up to 54% across all sequential prediction models. 
\ No newline at end of file diff --git a/data/2024/aaai/Fostering Trustworthiness in Machine Learning Algorithms b/data/2024/aaai/Fostering Trustworthiness in Machine Learning Algorithms new file mode 100644 index 0000000000..566b4cd8eb --- /dev/null +++ b/data/2024/aaai/Fostering Trustworthiness in Machine Learning Algorithms @@ -0,0 +1 @@ +Recent years have seen a surge in research that develops and applies machine learning algorithms to create intelligent learning systems. However, traditional machine learning algorithms have primarily focused on optimizing accuracy and efficiency, and they often fail to consider how to foster trustworthiness in their design. As a result, machine learning models usually face a trust crisis in real-world applications. Driven by these urgent concerns about trustworthiness, in this talk, I will introduce my research efforts towards the goal of making machine learning trustworthy. Specifically, I will delve into the following key research topics: security vulnerabilities and robustness, model explanations, and privacy-preserving mechanisms. \ No newline at end of file diff --git a/data/2024/aaai/Foundations of Autonomous Vehicles: A Curriculum Model for Developing Competencies in Artificial Intelligence and the Internet of Things for Grades 7-10 b/data/2024/aaai/Foundations of Autonomous Vehicles: A Curriculum Model for Developing Competencies in Artificial Intelligence and the Internet of Things for Grades 7-10 new file mode 100644 index 0000000000..e80b98685d --- /dev/null +++ b/data/2024/aaai/Foundations of Autonomous Vehicles: A Curriculum Model for Developing Competencies in Artificial Intelligence and the Internet of Things for Grades 7-10 @@ -0,0 +1,2 @@ +A few states (e.g., Maryland, Georgia, and Florida) have initiated efforts to incorporate artificial intelligence outcomes in K-12 education but others are still relying on informal spaces for learning and literacy in this area. In this manuscript, we share the curriculum and content of an informal effort focused on students in grades 7-10. We combined artificial intelligence competencies with Internet of Things skills to enable meaningful learning covering all Five Big Ideas in AI. In our one-week summer camp, students experimented with perceptions by working with vision, infrared, and ultrasonic sensors. They learned about representation through work with neural network playgrounds. Students engaged in supervised learning of an image processing model and used the model to control the actions of a robot car. Natural interactions and societal impacts were assessed as students observed the robot car's behavior. +Results demonstrate that our curriculum was successful in achieving its objectives. Excluding the robot car kit, the curriculum was created using free platforms and tools. This program could be replicated in informal settings by any educator or collaborator with a computer science background. This paper describes our summer camp curriculum, its components and their implementation, the lessons learned, and potential future enhancements. 
\ No newline at end of file diff --git a/data/2024/aaai/Foundations of Reactive Synthesis for Declarative Process Specifications b/data/2024/aaai/Foundations of Reactive Synthesis for Declarative Process Specifications new file mode 100644 index 0000000000..b659bddddb --- /dev/null +++ b/data/2024/aaai/Foundations of Reactive Synthesis for Declarative Process Specifications @@ -0,0 +1 @@ +Given a specification of Linear-time Temporal Logic interpreted over finite traces (LTLf), the reactive synthesis problem asks to find a finitely-representable, terminating controller that reacts to the uncontrollable actions of an environment in order to enforce a desired system specification. In this paper we study, for the first time, the foundations of reactive synthesis for DECLARE, a well-established declarative, pattern-based business process modelling language grounded in LTLf. We provide a threefold contribution. First, we define a reactive synthesis problem for DECLARE. Second, we show how an arbitrary DECLARE specification can be polynomially encoded into an equivalent pure-past one in LTLf, and exploit this to define an EXPTIME algorithm for DECLARE synthesis. Third, we derive a symbolic version of this algorithm, by introducing a novel translation of pure-past temporal formulas into symbolic deterministic finite automata. \ No newline at end of file diff --git a/data/2024/aaai/Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing b/data/2024/aaai/Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing new file mode 100644 index 0000000000..1bcacd7c24 --- /dev/null +++ b/data/2024/aaai/Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing @@ -0,0 +1 @@ +Mobile edge computing (MEC) is a promising paradigm for real-time applications with intensive computational needs (e.g., autonomous driving), as it can reduce the processing delay. In this work, we focus on the timeliness of computational-intensive updates, measured by Age-of-Information (AoI), and study how to jointly optimize the task updating and offloading policies for AoI with fractional form. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time-average AoI. The uncertain edge load dynamics, the nature of the fractional objective, and hybrid continuous-discrete action space (due to the joint optimization) make this problem challenging and existing approaches not directly applicable. To this end, we propose a fractional reinforcement learning (RL) framework and prove its convergence. We further design a model-free fractional deep RL (DRL) algorithm, where each device makes scheduling decisions with the hybrid action space without knowing the system dynamics and decisions of other devices. Experimental results show that our proposed algorithms reduce the average AoI by up to 57.6% compared with several non-fractional benchmarks. 
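The abstract above characterizes the objective only as fractional (a ratio of expectations) and does not spell out the update rule. For orientation, one classical way to reduce such a ratio objective to a sequence of ordinary, non-fractional problems is a Dinkelbach-style iteration; the sketch below is that generic technique, not the paper's algorithm, and the policy-evaluation and subproblem-solving routines are placeholders:

def dinkelbach_style(evaluate_policy, solve_subproblem, lam0=0.0, iters=20, tol=1e-6):
    """Generic sketch for minimizing N(pi)/D(pi) (e.g., a time-average AoI ratio).
    evaluate_policy(pi) -> (N, D): expected numerator / denominator under policy pi.
    solve_subproblem(lam) -> pi:   policy (approximately) minimizing N(pi) - lam * D(pi)."""
    lam = lam0
    pi = None
    for _ in range(iters):
        pi = solve_subproblem(lam)      # an ordinary, non-fractional problem
        N, D = evaluate_policy(pi)
        new_lam = N / D                 # update the ratio estimate
        if abs(new_lam - lam) < tol:
            break
        lam = new_lam
    return pi, lam

Each subproblem is an ordinary optimization over policies and could in principle be handed to a standard (deep) RL solver.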
\ No newline at end of file diff --git a/data/2024/aaai/Frame Semantic Role Labeling Using Arbitrary-Order Conditional Random Fields b/data/2024/aaai/Frame Semantic Role Labeling Using Arbitrary-Order Conditional Random Fields new file mode 100644 index 0000000000..11f77601bd --- /dev/null +++ b/data/2024/aaai/Frame Semantic Role Labeling Using Arbitrary-Order Conditional Random Fields @@ -0,0 +1 @@ +This paper presents an approach to frame semantic role labeling (FSRL), a task in natural language processing that identifies semantic roles within a text following the theory of frame semantics. Unlike previous approaches which do not adequately model correlations and interactions amongst arguments, we propose arbitrary-order conditional random fields (CRFs) that are capable of modeling full interaction amongst an arbitrary number of arguments of a given predicate. To achieve tractable representation and inference, we apply canonical polyadic decomposition to the arbitrary-order factor in our proposed CRF and utilize mean-field variational inference for approximate inference. We further unfold our iterative inference procedure into a recurrent neural network that is connected to our neural encoder and scorer, enabling end-to-end training and inference. Finally, we also improve our model with several techniques such as span-based scoring and decoding. Our experiments show that our approach achieves state-of-the-art performance in FSRL. \ No newline at end of file diff --git a/data/2024/aaai/Frequency Oracle for Sensitive Data Monitoring (Student Abstract) b/data/2024/aaai/Frequency Oracle for Sensitive Data Monitoring (Student Abstract) new file mode 100644 index 0000000000..3736444c2b --- /dev/null +++ b/data/2024/aaai/Frequency Oracle for Sensitive Data Monitoring (Student Abstract) @@ -0,0 +1 @@ +As data privacy issues grow, finding the best privacy preservation algorithm for each situation is increasingly essential. This research has focused on understanding the frequency oracles (FO) privacy preservation algorithms. FO conduct the frequency estimation of any value in the domain. The aim is to explore how each can be best used and recommend which one to use with which data type. We experimented with different data scenarios and federated learning settings. Results showed clear guidance on when to use a specific algorithm. \ No newline at end of file diff --git a/data/2024/aaai/Frequency Shuffling and Enhancement for Open Set Recognition b/data/2024/aaai/Frequency Shuffling and Enhancement for Open Set Recognition new file mode 100644 index 0000000000..f15fb225de --- /dev/null +++ b/data/2024/aaai/Frequency Shuffling and Enhancement for Open Set Recognition @@ -0,0 +1 @@ +Open-Set Recognition (OSR) aims to accurately identify known classes while effectively rejecting unknown classes to guarantee reliability. Most existing OSR methods focus on learning in the spatial domain, where subtle texture and global structure are potentially intertwined. Empirical studies have shown that DNNs trained in the original spatial domain are inclined to over-perceive subtle texture. The biased semantic perception could lead to catastrophic over-confidence when predicting both known and unknown classes. To this end, we propose an innovative approach by decomposing the spatial domain to the frequency domain to separately consider global (low-frequency) and subtle (high-frequency) information, named Frequency Shuffling and Enhancement (FreSH). 
To alleviate the overfitting of subtle texture, we introduce the High-Frequency Shuffling (HFS) strategy, which generates diverse high-frequency information and promotes the capture of low-frequency invariance. Moreover, to enhance the perception of global structure, we propose the Low-Frequency Residual (LFR) learning procedure, which constructs a composite feature space integrating low-frequency and original spatial features. Experiments on various benchmarks demonstrate that the proposed FreSH consistently outperforms the state of the art by a considerable margin. \ No newline at end of file diff --git a/data/2024/aaai/Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector b/data/2024/aaai/Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector new file mode 100644 index 0000000000..8c5b2213ee --- /dev/null +++ b/data/2024/aaai/Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector @@ -0,0 +1 @@ +Multimodal content, such as text mixed with images, presents significant challenges to rumor detection in social media. Existing multimodal rumor detection has focused on mixing tokens among spatial and sequential locations for unimodal representation or fusing clues of rumor veracity across modalities. However, these methods suffer from less discriminative unimodal representations and are vulnerable to intricate location dependencies in the time-consuming fusion of spatial and sequential tokens. This work makes the first attempt at multimodal rumor detection in the frequency domain, which efficiently transforms spatial features into the frequency spectrum and obtains highly discriminative spectrum features for multimodal representation and fusion. A novel Frequency Spectrum Representation and fUsion network (FSRU) with dual contrastive learning reveals that the frequency spectrum is more effective for multimodal representation and fusion, extracting the informative components for rumor detection. FSRU involves three novel mechanisms: utilizing the Fourier transform to convert features in the spatial domain to the frequency domain, the unimodal spectrum compression module, and the cross-modal spectrum co-selection module in the frequency domain. Substantial experiments show that FSRU achieves satisfactory multimodal rumor detection performance. \ No newline at end of file diff --git a/data/2024/aaai/Frequency-Adaptive Pan-Sharpening with Mixture of Experts b/data/2024/aaai/Frequency-Adaptive Pan-Sharpening with Mixture of Experts new file mode 100644 index 0000000000..772a8f0601 --- /dev/null +++ b/data/2024/aaai/Frequency-Adaptive Pan-Sharpening with Mixture of Experts @@ -0,0 +1 @@ +Pan-sharpening involves reconstructing missing high-frequency information in multi-spectral images with low spatial resolution, using a higher-resolution panchromatic image as guidance. Despite this inborn connection with the frequency domain, existing pan-sharpening research has scarcely investigated potential solutions in the frequency domain. To this end, we propose a novel Frequency Adaptive Mixture of Experts (FAME) learning framework for pan-sharpening, which consists of three key components: the Adaptive Frequency Separation Prediction Module, the Sub-Frequency Learning Expert Module, and the Expert Mixture Module. In detail, the first leverages the discrete cosine transform to perform frequency separation by predicting the frequency mask.
On the basis of the generated mask, the second component uses a low-frequency MOE and a high-frequency MOE to enable effective reconstruction of the low-frequency and high-frequency information, respectively. Finally, the fusion module dynamically weights the high-frequency and low-frequency MOE knowledge to adapt to remote sensing images with significant content variations. Quantitative and qualitative experiments over multiple datasets demonstrate that our method performs best against other state-of-the-art methods and exhibits strong generalization ability for real-world scenes. Code will be made publicly available at https://github.com/alexhe101/FAME-Net. \ No newline at end of file diff --git a/data/2024/aaai/Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Domain Learning b/data/2024/aaai/Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Domain Learning new file mode 100644 index 0000000000..c65336128e --- /dev/null +++ b/data/2024/aaai/Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Domain Learning @@ -0,0 +1 @@ +This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images despite limited training data. Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries. However, the rapid advancements in synthesis technology have led to specific artifacts for each generation model. Consequently, these detectors have exhibited a lack of proficiency in learning the frequency domain and tend to overfit to the artifacts present in the training data, leading to suboptimal performance on unseen sources. To address this issue, we introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors. Our method forces the detector to continuously focus on high-frequency information, exploiting high-frequency representations of features across spatial and channel dimensions. Additionally, we incorporate a straightforward frequency domain learning module to learn source-agnostic features. It involves convolutional layers applied to both the phase spectrum and the amplitude spectrum between the Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (iFFT). Extensive experimentation involving 17 GANs demonstrates the effectiveness of our proposed method, showcasing state-of-the-art performance (+9.8\%) while requiring fewer parameters. The code is available at https://github.com/chuangchuangtan/FreqNet-DeepfakeDetection. \ No newline at end of file diff --git a/data/2024/aaai/Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation b/data/2024/aaai/Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation new file mode 100644 index 0000000000..075aae6aa4 --- /dev/null +++ b/data/2024/aaai/Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation @@ -0,0 +1 @@ +Recently, text-to-image diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing flexible image translation via user-provided text prompts. This paper proposes the frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework contributing a novel solution to text-guided I2I from a frequency-domain perspective.
At the heart of our framework is a feature-space frequency-domain filtering module based on Discrete Cosine Transform, which extracts image features carrying different DCT spectral bands to control the text-to-image generation process of the Latent Diffusion Model, realizing versatile I2I applications including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related methods, FCDiffusion establishes a unified text-driven I2I framework suiting diverse I2I application scenarios simply by switching among different frequency control branches. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. Our project is publicly available at: https://xianggao1102.github.io/FCDiffusion/. \ No newline at end of file diff --git a/data/2024/aaai/Friendly Attacks to Improve Channel Coding Reliability b/data/2024/aaai/Friendly Attacks to Improve Channel Coding Reliability new file mode 100644 index 0000000000..fa7b44037f --- /dev/null +++ b/data/2024/aaai/Friendly Attacks to Improve Channel Coding Reliability @@ -0,0 +1 @@ +This paper introduces a novel approach called "friendly attack" aimed at enhancing the performance of error correction channel codes. Inspired by the concept of adversarial attacks, our method leverages the idea of introducing slight perturbations to the neural network input, resulting in a substantial impact on the network's performance. By introducing small perturbations to fixed-point modulated codewords before transmission, we effectively improve the decoder's performance without violating the input power constraint. The perturbation design is accomplished by a modified iterative fast gradient method. This study investigates various decoder architectures suitable for computing gradients to obtain the desired perturbations. Specifically, we consider belief propagation (BP) for LDPC codes; the error correcting code transformer, BP and neural BP (NBP) for polar codes, and neural BCJR for convolutional codes. We demonstrate that the proposed friendly attack method can improve the reliability across different channels, modulations, codes, and decoders. This method allows us to increase the reliability of communication with a legacy receiver by simply modifying the transmitted codeword appropriately. \ No newline at end of file diff --git a/data/2024/aaai/From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery b/data/2024/aaai/From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery new file mode 100644 index 0000000000..e20bc81424 --- /dev/null +++ b/data/2024/aaai/From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery @@ -0,0 +1 @@ +Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments of in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently encounter the issue of data scarcity, hampering their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). 
We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model scale, reduced data size and lower training cost, highlighting its efficiency. Furthermore, our method shows a sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery. \ No newline at end of file diff --git a/data/2024/aaai/From Coarse to Fine: A Distillation Method for Fine-Grained Emotion-Causal Span Pair Extraction in Conversation b/data/2024/aaai/From Coarse to Fine: A Distillation Method for Fine-Grained Emotion-Causal Span Pair Extraction in Conversation new file mode 100644 index 0000000000..cb811b9c86 --- /dev/null +++ b/data/2024/aaai/From Coarse to Fine: A Distillation Method for Fine-Grained Emotion-Causal Span Pair Extraction in Conversation @@ -0,0 +1,7 @@ +We study the problem of extracting emotions and the causes behind these emotions in conversations. +Existing methods either tackle them separately or jointly model them at the coarse-grained level of emotions (fewer emotion categories) and causes (utterance-level causes). +In this work, we aim to jointly extract more fine-grained emotions and causes. +We construct a fine-grained dataset, FG-RECCON, which includes 16 fine-grained emotion categories and span-level causes. +To further improve the fine-grained extraction performance, we propose to utilize causal discourse knowledge via knowledge distillation. +Specifically, the teacher model learns to predict causal connective words between utterances, and then guides the student model in identifying both the fine-grained emotion labels and causal spans. +Experimental results demonstrate that our distillation method achieves state-of-the-art performance on both the RECCON and FG-RECCON datasets. \ No newline at end of file diff --git a/data/2024/aaai/From Consumers to Critical Users: Prompty, an AI Literacy Tool for High School Students b/data/2024/aaai/From Consumers to Critical Users: Prompty, an AI Literacy Tool for High School Students new file mode 100644 index 0000000000..27941a0657 --- /dev/null +++ b/data/2024/aaai/From Consumers to Critical Users: Prompty, an AI Literacy Tool for High School Students @@ -0,0 +1 @@ +In an age where Large Language Models (LLMs) expedite the generation of text, the skills for critically evaluating and creating meaningful text using these models are often lacking. To help classroom teachers address this, we introduce Prompty, a specialized teaching tool co-designed to facilitate both critical and effective use of LLMs. Prompty serves multiple learning goals: it allows students to critically evaluate text generated by LLMs, aids in their writing practice, and provides a deeper understanding of how LLMs function, all within a student-friendly environment secured by essential guardrails. Prompty was co-designed in collaboration with high school teachers as part of CRAFT, an initiative by Stanford University to promote AI literacy. It was pilot-tested in a high school English class to serve as an AI writing assistant, focusing on the critical evaluation of machine-generated text. This trial yielded preliminary evidence that attests to the tool's effectiveness in fulfilling its educational goals.
The findings from the pilot study indicate that easy-to-use tools like Prompty have great potential. These tools can be adapted to fit the goals of individual teachers. They can help in achieving subject-specific learning goals while serving as an effective way to teach AI concepts in high school. \ No newline at end of file diff --git a/data/2024/aaai/From GARCH to Neural Network for Volatility Forecast b/data/2024/aaai/From GARCH to Neural Network for Volatility Forecast new file mode 100644 index 0000000000..d4c1b65cbd --- /dev/null +++ b/data/2024/aaai/From GARCH to Neural Network for Volatility Forecast @@ -0,0 +1 @@ +Volatility, as a measure of uncertainty, plays a crucial role in numerous financial activities such as risk management. The Econometrics and Machine Learning communities have developed two distinct approaches for financial volatility forecasting: the stochastic approach and the neural network (NN) approach. Despite their individual strengths, these methodologies have conventionally evolved in separate research trajectories with little interaction between them. This study endeavors to bridge this gap by establishing an equivalence relationship between models of the GARCH family and their corresponding NN counterparts. With the equivalence relationship established, we introduce an innovative approach, named GARCH-NN, for constructing NN-based volatility models. It obtains the NN counterparts of GARCH models and integrates them as components into an established NN architecture, thereby seamlessly infusing volatility stylized facts (SFs) inherent in the GARCH models into the neural network. We develop the GARCH-LSTM model to showcase the power of GARCH-NN approach. Experiment results validate that amalgamating the NN counterparts of the GARCH family models into established NN models leads to enhanced outcomes compared to employing the stochastic and NN models in isolation. \ No newline at end of file diff --git a/data/2024/aaai/From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space b/data/2024/aaai/From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space new file mode 100644 index 0000000000..73259f3ee4 --- /dev/null +++ b/data/2024/aaai/From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space @@ -0,0 +1 @@ +Deep Neural Networks are prone to learning spurious correlations embedded in the training data, leading to potentially biased predictions. This poses risks when deploying these models for high-stake decision-making, such as in medical applications. Current methods for post-hoc model correction either require input-level annotations which are only possible for spatially localized biases, or augment the latent feature space, thereby hoping to enforce the right reasons. We present a novel method for model correction on the concept level that explicitly reduces model sensitivity towards biases via gradient penalization. When modeling biases via Concept Activation Vectors, we highlight the importance of choosing robust directions, as traditional regression-based approaches such as Support Vector Machines tend to result in diverging directions. We effectively mitigate biases in controlled and real-world settings on the ISIC, Bone Age, ImageNet and CelebA datasets using VGG, ResNet and EfficientNet architectures. Code and Appendix are available on https://github.com/frederikpahde/rrclarc. 
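The abstract above states the core mechanism, penalizing the model's gradient along a Concept Activation Vector (CAV) in latent space, but not the exact loss. A minimal PyTorch-style sketch of one way such a penalty could be wired up is below; the choice of layer, the unit-norm CAV, and the weighting factor lam are illustrative assumptions rather than the authors' formulation:

import torch
import torch.nn.functional as F

def cav_gradient_penalty_loss(model_head, latent, labels, cav, lam=1.0):
    """Sketch: cross-entropy plus a penalty on the sensitivity of the target
    logit along a unit-norm concept activation vector `cav` in latent space.
    `model_head` maps latent features to logits."""
    latent = latent.detach().requires_grad_(True)   # treat latent as the input of the head
    logits = model_head(latent)
    ce = F.cross_entropy(logits, labels)
    target_logit = logits.gather(1, labels.unsqueeze(1)).sum()
    grads = torch.autograd.grad(target_logit, latent, create_graph=True)[0]
    directional = (grads.flatten(1) @ cav.flatten()).pow(2).mean()  # sensitivity along the concept
    return ce + lam * directional

Driving this directional derivative toward zero is what "reducing model sensitivity towards biases" would amount to in this sketch.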
\ No newline at end of file diff --git a/data/2024/aaai/From Raw Video to Pedagogical Insights: A Unified Framework for Student Behavior Analysis b/data/2024/aaai/From Raw Video to Pedagogical Insights: A Unified Framework for Student Behavior Analysis new file mode 100644 index 0000000000..2ae2cec78c --- /dev/null +++ b/data/2024/aaai/From Raw Video to Pedagogical Insights: A Unified Framework for Student Behavior Analysis @@ -0,0 +1 @@ +Understanding student behavior in educational settings is critical in improving both the quality of pedagogy and the level of student engagement. While various AI-based models exist for classroom analysis, they tend to specialize in limited tasks and lack generalizability across diverse educational environments. Additionally, these models often fall short in ensuring student privacy and in providing actionable insights accessible to educators. To bridge this gap, we introduce a unified, end-to-end framework by leveraging temporal action detection techniques and advanced large language models for a more nuanced student behavior analysis. Our proposed framework provides an end-to-end pipeline that starts with raw classroom video footage and culminates in the autonomous generation of pedagogical reports. It offers a comprehensive and scalable solution for student behavior analysis. Experimental validation confirms the capability of our framework to accurately identify student behaviors and to produce pedagogically meaningful insights, thereby setting the stage for future AI-assisted educational assessments. \ No newline at end of file diff --git a/data/2024/aaai/From Retrieval to Generation: A Simple and Unified Generative Model for End-to-End Task-Oriented Dialogue b/data/2024/aaai/From Retrieval to Generation: A Simple and Unified Generative Model for End-to-End Task-Oriented Dialogue new file mode 100644 index 0000000000..b75fbe8baf --- /dev/null +++ b/data/2024/aaai/From Retrieval to Generation: A Simple and Unified Generative Model for End-to-End Task-Oriented Dialogue @@ -0,0 +1 @@ +Retrieving appropriate records from the external knowledge base to generate informative responses is the core capability of end-to-end task-oriented dialogue systems (EToDs). Most of the existing methods additionally train the retrieval model or use the memory network to retrieve the knowledge base, which decouples the knowledge retrieval task from the response generation task, making it difficult to jointly optimize and failing to capture the internal relationship between the two tasks. In this paper, we propose a simple and unified generative model for task-oriented dialogue systems, which recasts the EToDs task as a single sequence generation task and uses maximum likelihood training to train the two tasks in a unified manner. To prevent the generation of non-existent records, we design the prefix trie to constrain the model generation, which ensures consistency between the generated records and the existing records in the knowledge base. Experimental results on three public benchmark datasets demonstrate that our method achieves robust performance on generating system responses and outperforms the baseline systems. To facilitate future research in this area, the code is available at https://github.com/dzy1011/Uni-ToD. 
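The prefix trie used above to keep generated records consistent with the knowledge base is a standard constrained-decoding device; the abstract does not give its construction, so the following is a generic sketch over token ids, with the tokenization and the point at which decoding enters the trie left as assumptions:

class TrieNode:
    def __init__(self):
        self.children = {}   # token_id -> TrieNode
        self.is_end = False

def build_trie(record_token_ids):
    """Build a prefix trie from tokenized knowledge-base records."""
    root = TrieNode()
    for ids in record_token_ids:
        node = root
        for tok in ids:
            node = node.children.setdefault(tok, TrieNode())
        node.is_end = True
    return root

def allowed_next_tokens(root, prefix):
    """Tokens that keep the generated record consistent with the KB."""
    node = root
    for tok in prefix:
        if tok not in node.children:
            return set()     # prefix not in the KB; nothing is allowed
        node = node.children[tok]
    return set(node.children.keys())

# usage sketch: at each decoding step inside a record span, mask the LM logits
# so that only tokens in allowed_next_tokens(trie, generated_so_far) can be sampled.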
\ No newline at end of file diff --git a/data/2024/aaai/From Static to Dynamic: Knowledge Metabolism for Large Language Models b/data/2024/aaai/From Static to Dynamic: Knowledge Metabolism for Large Language Models new file mode 100644 index 0000000000..c403999194 --- /dev/null +++ b/data/2024/aaai/From Static to Dynamic: Knowledge Metabolism for Large Language Models @@ -0,0 +1,3 @@ +The immense parameter space of Large Language Models (LLMs) endows them with superior knowledge retention capabilities, allowing them to excel in a variety of natural language processing tasks. However, it also instigates difficulties in consistently tuning LMs to incorporate the most recent knowledge, which may further lead LMs to produce inaccurate and fabricated content. +To alleviate this issue, we propose a knowledge metabolism framework for LLMs. This framework proactively sustains the credibility of knowledge through an auxiliary external memory component and directly delivers pertinent knowledge for LM inference, thereby suppressing hallucinations caused by obsolete internal knowledge during the LM inference process. +Benchmark experiments demonstrate DynaMind's effectiveness in overcoming this challenge. The code and demo of DynaMind are available at: https://github.com/Elfsong/DynaMind. \ No newline at end of file diff --git a/data/2024/aaai/From Statistical Relational to Neuro-Symbolic Artificial Intelligence b/data/2024/aaai/From Statistical Relational to Neuro-Symbolic Artificial Intelligence new file mode 100644 index 0000000000..cf81c285ed --- /dev/null +++ b/data/2024/aaai/From Statistical Relational to Neuro-Symbolic Artificial Intelligence @@ -0,0 +1 @@ +The integration of learning and reasoning is one of the key challenges in artificial intelligence and machine learning today. The area of Neuro-Symbolic AI (NeSy) tackles this challenge by integrating symbolic reasoning with neural networks. In our recent work, we provided an introduction to NeSy by drawing several parallels to another field that has a rich tradition in integrating learning and reasoning, namely Statistical Relational Artificial Intelligence (StarAI). \ No newline at end of file diff --git a/data/2024/aaai/From Toxic to Trustworthy: Using Self-Distillation and Semi-supervised Methods to Refine Neural Networks b/data/2024/aaai/From Toxic to Trustworthy: Using Self-Distillation and Semi-supervised Methods to Refine Neural Networks new file mode 100644 index 0000000000..a481a83bc3 --- /dev/null +++ b/data/2024/aaai/From Toxic to Trustworthy: Using Self-Distillation and Semi-supervised Methods to Refine Neural Networks @@ -0,0 +1 @@ +Despite the tremendous success of deep neural networks (DNNs) across various fields, their susceptibility to potential backdoor attacks seriously threatens their application security, particularly in safety-critical or security-sensitive ones. Given this growing threat, there is a pressing need for research into purging backdoors from DNNs. However, prior efforts on erasing backdoor triggers not only failed to withstand increasingly powerful attacks but also resulted in reduced model performance. In this paper, we propose From Toxic to Trustworthy (FTT), an innovative approach to eliminate backdoor triggers while simultaneously enhancing model accuracy. Following the stringent and practical assumption of limited availability of clean data, we introduce a self-attention distillation (SAD) method to remove the backdoor by aligning the shallow and deep parts of the network. 
Furthermore, we first devise a semi-supervised learning (SSL) method that leverages ubiquitous and available poisoned data to further purify backdoors and improve accuracy. Extensive experiments on various attacks and models have shown that our FTT can reduce the attack success rate from 97% to 1% and improve the accuracy of 4% on average, demonstrating its effectiveness in mitigating backdoor attacks and improving model performance. Compared to state-of-the-art (SOTA) methods, our FTT can reduce the attack success rate by 2 times and improve the accuracy by 5%, shedding light on backdoor cleansing. \ No newline at end of file diff --git a/data/2024/aaai/Frozen CLIP Transformer Is an Efficient Point Cloud Encoder b/data/2024/aaai/Frozen CLIP Transformer Is an Efficient Point Cloud Encoder new file mode 100644 index 0000000000..248e857ef6 --- /dev/null +++ b/data/2024/aaai/Frozen CLIP Transformer Is an Efficient Point Cloud Encoder @@ -0,0 +1 @@ +The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task token and fed into the frozen CLIP transformer to learn point cloud representation. The intuition is that the proposed point cloud tokenizer projects the input point cloud into a unified token space that is similar to the 2D images. Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the CLIP transformer can serve as an efficient point cloud encoder and our method achieves promising performance on both indoor and outdoor benchmarks. In particular, performance gains brought by our EPCL are 19.7 AP50 on ScanNet V2 detection, 4.4 mIoU on S3DIS segmentation and 1.2 mIoU on SemanticKITTI segmentation compared to contemporary pretrained models. Code is available at \url{https://github.com/XiaoshuiHuang/EPCL}. \ No newline at end of file diff --git a/data/2024/aaai/Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning b/data/2024/aaai/Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning new file mode 100644 index 0000000000..38f469270a --- /dev/null +++ b/data/2024/aaai/Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning @@ -0,0 +1,5 @@ +Large Language Models (LLM) exhibit zero-shot mathematical reasoning capacity as a behavior emergent with scale, commonly manifesting as chain-of-thoughts (CoT) reasoning. However, multiple empirical findings suggest that this prowess is exclusive to LLMs that have exorbitant sizes (beyond 50 billion parameters). 
Meanwhile, educational neuroscientists suggest that symbolic algebraic manipulation be introduced around the same time as arithmetic word problems so as to modularize language-to-formulation, symbolic manipulation of the formulation, and endgame arithmetic. +In this paper, we start with the hypothesis that much smaller LMs, which are weak at multi-step reasoning, can achieve reasonable arithmetic reasoning if arithmetic word problems are posed as a formalize-then-solve task. +In our architecture, which we call SyReLM, the LM serves the role of a translator to map natural language arithmetic questions into a formal language (FL) description. A symbolic solver then evaluates the FL expression to obtain the answer. +A small frozen LM, equipped with an efficient low-rank adapter, is capable of generating FL expressions that incorporate natural language descriptions of the arithmetic problem (e.g., variable names and their purposes, formal expressions combining variables, etc.). +We adopt policy-gradient reinforcement learning to train the adapted LM, informed by the non-differentiable symbolic solver. This marks a sharp departure from the recent development in tool-augmented LLMs, in which the external tools (e.g., calculator, Web search, etc.) are essentially detached from the learning phase of the LM. SyReLM shows massive improvements (e.g., +30.65 absolute point improvement in accuracy on the SVAMP dataset using GPT-J 6B model) over base LMs, while keeping our testbed easy to diagnose and interpret, and within the reach of most researchers. \ No newline at end of file diff --git a/data/2024/aaai/Full Bayesian Significance Testing for Neural Networks b/data/2024/aaai/Full Bayesian Significance Testing for Neural Networks new file mode 100644 index 0000000000..1aa05c17f1 --- /dev/null +++ b/data/2024/aaai/Full Bayesian Significance Testing for Neural Networks @@ -0,0 +1 @@ +Significance testing aims to determine whether a proposition about the population distribution is the truth or not given observations. However, traditional significance testing often needs to derive the distribution of the testing statistic, failing to deal with complex nonlinear relationships. In this paper, we propose to conduct Full Bayesian Significance Testing for neural networks, called nFBST, to overcome the limitation in relationship characterization of traditional approaches. A Bayesian neural network is utilized to fit the nonlinear and multi-dimensional relationships with small errors and avoid hard theoretical derivation by computing the evidence value. Besides, nFBST can test not only global significance but also local and instance-wise significance, which previous testing methods don't focus on. Moreover, nFBST is a general framework that can be extended based on the measures selected, such as Grad-nFBST, LRP-nFBST, DeepLIFT-nFBST, LIME-nFBST. A range of experiments on both simulated and real data are conducted to show the advantages of our method. 
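The abstract above builds on the Full Bayesian Significance Test but does not define the evidence value it computes. As background only, the sketch below estimates a classical FBST-style evidence value from posterior samples of a scalar statistic; how nFBST obtains those posterior samples from a Bayesian neural network, and which statistic it tests, are not reproduced here:

import numpy as np
from scipy.stats import gaussian_kde

def fbst_evidence(posterior_samples, null_value=0.0):
    """Sketch of a classical FBST-style evidence value from posterior samples of a
    scalar statistic (e.g., a feature-importance measure). It is one minus the
    posterior mass of the tangential set, i.e., the region whose posterior density
    exceeds the density at the null value. Small evidence means the null
    (e.g., 'no effect') is poorly supported."""
    kde = gaussian_kde(posterior_samples)
    density_at_null = kde(null_value)[0]
    density_at_samples = kde(posterior_samples)
    tangential_mass = np.mean(density_at_samples > density_at_null)
    return 1.0 - tangential_mass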
\ No newline at end of file diff --git a/data/2024/aaai/Full-Body Motion Reconstruction with Sparse Sensing from Graph Perspective b/data/2024/aaai/Full-Body Motion Reconstruction with Sparse Sensing from Graph Perspective new file mode 100644 index 0000000000..9730293727 --- /dev/null +++ b/data/2024/aaai/Full-Body Motion Reconstruction with Sparse Sensing from Graph Perspective @@ -0,0 +1 @@ +Estimating 3D full-body pose from sparse sensor data is a pivotal technique employed for the reconstruction of realistic human motions in Augmented Reality and Virtual Reality. However, translating sparse sensor signals into comprehensive human motion remains a challenge since the sparsely distributed sensors in common VR systems fail to capture the motion of full human body. In this paper, we use well-designed Body Pose Graph (BPG) to represent the human body and translate the challenge into a prediction problem of graph missing nodes. Then, we propose a novel full-body motion reconstruction framework based on BPG. To establish BPG, nodes are initially endowed with features extracted from sparse sensor signals. Features from identifiable joint nodes across diverse sensors are amalgamated and processed from both temporal and spatial perspectives. Temporal dynamics are captured using the Temporal Pyramid Structure, while spatial relations in joint movements inform the spatial attributes. The resultant features serve as the foundational elements of the BPG nodes. To further refine the BPG, node features are updated through a graph neural network that incorporates edge reflecting varying joint relations. Our method's effectiveness is evidenced by the attained state-of-the-art performance, particularly in lower body motion, outperforming other baseline methods. Additionally, an ablation study validates the efficacy of each module in our proposed framework. \ No newline at end of file diff --git a/data/2024/aaai/Fully Data-Driven Pseudo Label Estimation for Pointly-Supervised Panoptic Segmentation b/data/2024/aaai/Fully Data-Driven Pseudo Label Estimation for Pointly-Supervised Panoptic Segmentation new file mode 100644 index 0000000000..db968a2d61 --- /dev/null +++ b/data/2024/aaai/Fully Data-Driven Pseudo Label Estimation for Pointly-Supervised Panoptic Segmentation @@ -0,0 +1 @@ +The core of pointly-supervised panoptic segmentation is estimating accurate dense pseudo labels from sparse point labels to train the panoptic head. Previous works generate pseudo labels mainly based on hand-crafted rules, such as connecting multiple points into polygon masks, or assigning the label information of labeled pixels to unlabeled pixels based on the artificially defined traversing distance. The accuracy of pseudo labels is limited by the quality of the hand-crafted rules (polygon masks are rough at object contour regions, and the traversing distance error will result in wrong pseudo labels). To overcome the limitation of hand-crafted rules, we estimate pseudo labels with a fully data-driven pseudo label branch, which is optimized by point labels end-to-end and predicts more accurate pseudo labels than previous methods. We also train an auxiliary semantic branch with point labels, it assists the training of the pseudo label branch by transferring semantic segmentation knowledge through shared parameters. Experiments on Pascal VOC and MS COCO demonstrate that our approach is effective and shows state-of-the-art performance compared with related works. Codes are available at https://github.com/BraveGroup/FDD. 
\ No newline at end of file diff --git a/data/2024/aaai/Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data b/data/2024/aaai/Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data new file mode 100644 index 0000000000..b0cff78198 --- /dev/null +++ b/data/2024/aaai/Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data @@ -0,0 +1 @@ +Multivariate Time-Series (MTS) data is crucial in various application fields. With its sequential and multi-source (multiple sensors) properties, MTS data inherently exhibits Spatial-Temporal (ST) dependencies, involving temporal correlations between timestamps and spatial correlations between sensors in each timestamp. To effectively leverage this information, Graph Neural Network-based methods (GNNs) have been widely adopted. However, existing approaches separately capture spatial dependency and temporal dependency and fail to capture the correlations between Different sEnsors at Different Timestamps (DEDT). Overlooking such correlations hinders the comprehensive modelling of ST dependencies within MTS data, thus restricting existing GNNs from learning effective representations. To address this limitation, we propose a novel method called Fully-Connected Spatial-Temporal Graph Neural Network (FC-STGNN), including two key components namely FC graph construction and FC graph convolution. For graph construction, we design a decay graph to connect sensors across all timestamps based on their temporal distances, enabling us to fully model the ST dependencies by considering the correlations between DEDT. Further, we devise FC graph convolution with a moving-pooling GNN layer to effectively capture the ST dependencies for learning effective representations. Extensive experiments show the effectiveness of FC-STGNN on multiple MTS datasets compared to SOTA methods. The code is available at https://github.com/Frank-Wang-oss/FCSTGNN. \ No newline at end of file diff --git a/data/2024/aaai/Fusing Conditional Submodular GAN and Programmatic Weak Supervision b/data/2024/aaai/Fusing Conditional Submodular GAN and Programmatic Weak Supervision new file mode 100644 index 0000000000..d05e6d99d6 --- /dev/null +++ b/data/2024/aaai/Fusing Conditional Submodular GAN and Programmatic Weak Supervision @@ -0,0 +1,3 @@ +Programmatic Weak Supervision (PWS) and generative models serve as crucial tools that enable researchers to maximize the utility of existing datasets without resorting to laborious data gathering and manual annotation processes. PWS uses various weak supervision techniques to estimate the underlying class labels of data, while generative models primarily concentrate on sampling from the underlying distribution of the given dataset. Although these methods have the potential to complement each other, they have mostly been studied independently. + Recently, WSGAN proposed a mechanism to fuse these two models. Their approach utilizes the discrete latent factors of InfoGAN for the training of the label models and leverages the class-dependent information of the label models to generate images of specific classes. However, the disentangled latent factor learned by the InfoGAN may not necessarily be class specific and hence could potentially affect the label model's accuracy. Moreover, the prediction of the label model is often noisy in nature and can have a detrimental impact on the quality of images generated by GAN. 
In our work, we address these challenges by (i) implementing a noise-aware classifier using the pseudo labels generated by the label model, (ii) utilizing the prediction of the noise-aware classifier for training the label model as well as generation of class-conditioned images. Additionally, We also investigate the effect of training the classifier with a subset of the dataset within a defined uncertainty budget on pseudo labels. We accomplish this by formalizing the subset selection problem as submodular maximization with a knapsack constraint on the entropy of pseudo labels. We conduct experiments on multiple datasets and demonstrate the efficacy of our methods on several tasks vis-a-vis the current state-of-the-art methods. Our implementation is +available at https://github.com/kyrs/subpws-gan \ No newline at end of file diff --git a/data/2024/aaai/Fusion-Vital: Video-RF Fusion Transformer for Advanced Remote Physiological Measurement b/data/2024/aaai/Fusion-Vital: Video-RF Fusion Transformer for Advanced Remote Physiological Measurement new file mode 100644 index 0000000000..b135f20fb7 --- /dev/null +++ b/data/2024/aaai/Fusion-Vital: Video-RF Fusion Transformer for Advanced Remote Physiological Measurement @@ -0,0 +1 @@ +Remote physiology, which involves monitoring vital signs without the need for physical contact, has great potential for various applications. Current remote physiology methods rely only on a single camera or radio frequency (RF) sensor to capture the microscopic signatures from vital movements. However, our study shows that fusing deep RGB and RF features from both sensor streams can further improve performance. Because these multimodal features are defined in distinct dimensions and have varying contextual importance, the main challenge in the fusion process lies in the effective alignment of them and adaptive integration of features under dynamic scenarios. To address this challenge, we propose a novel vital sensing model, named Fusion-Vital, that combines the RGB and RF modalities through the new introduction of pairwise input formats and transformer-based fusion strategies. We also perform comprehensive experiments based on a newly collected and released remote vital dataset comprising synchronized video-RF sensors, showing the superiority of the fusion approach over the previous single-sensor baselines in various aspects. \ No newline at end of file diff --git a/data/2024/aaai/FusionFormer: A Concise Unified Feature Fusion Transformer for 3D Pose Estimation b/data/2024/aaai/FusionFormer: A Concise Unified Feature Fusion Transformer for 3D Pose Estimation new file mode 100644 index 0000000000..1e6b03ae57 --- /dev/null +++ b/data/2024/aaai/FusionFormer: A Concise Unified Feature Fusion Transformer for 3D Pose Estimation @@ -0,0 +1 @@ +Depth uncertainty is a core challenge in 3D human pose estimation, especially when the camera parameters are unknown. Previous methods try to reduce the impact of depth uncertainty by multi-view and/or multi-frame feature fusion to utilize more spatial and temporal information. However, they generally lead to marginal improvements and their performance still cannot match the camera-parameter-required methods. The reason is that their handcrafted fusion schemes cannot fuse the features flexibly, e.g., the multi-view and/or multi-frame features are fused separately. 
Moreover, the diverse and complicated fusion schemes make the principle for developing effective fusion schemes unclear and also raise the open problem of whether simpler and more elegant fusion schemes exist. To address these issues, this paper proposes an extremely concise unified feature fusion transformer (FusionFormer) with minimized handcrafted design for 3D pose estimation. FusionFormer fuses both the multi-view and multi-frame features in a unified fusion scheme, in which all the features are accessible to each other and thus can be fused flexibly. Experimental results on several mainstream datasets demonstrate that FusionFormer achieves state-of-the-art performance. To the best of our knowledge, this is the first camera-parameter-free method to outperform the existing camera-parameter-required methods, revealing the tremendous potential of camera-parameter-free models. These impressive experimental results together with our concise feature fusion scheme resolve the above open problem. Another appealing feature of FusionFormer we observe is that, benefiting from its effective fusion scheme, it achieves impressive performance with a smaller model size and fewer FLOPs. \ No newline at end of file diff --git "a/data/2024/aaai/F\302\263-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis" "b/data/2024/aaai/F\302\263-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis" new file mode 100644 index 0000000000..82e4205f3c --- /dev/null +++ "b/data/2024/aaai/F\302\263-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis" @@ -0,0 +1 @@ +Recently, Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inference with such large models incurs huge costs. Previous inference acceleration works either require costly retraining or are model-specific. To address this issue, instead of retraining we explore the inference process of two mainstream T2V models using transformers and diffusion models. The exploration reveals redundancy in the temporal attention modules of both models, which are commonly utilized to establish temporal relations among frames. Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights. Specifically, when aggregate temporal attention values are ranked below a certain ratio, the corresponding weights are pruned. Extensive experiments on three datasets using a classic transformer-based model, CogVideo, and a typical diffusion-based model, Tune-A-Video, verify the effectiveness of F3-Pruning in inference acceleration, quality assurance and broad applicability.
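The pruning rule above (prune temporal attention weights whose aggregate attention values rank below a certain ratio) is stated without further detail. The sketch below is one plausible reading for a single temporal attention layer, where aggregating attention probabilities over the batch and all frame pairs, and pruning at the granularity of heads, are assumptions for illustration:

import torch

def prune_temporal_attention_heads(attn_probs, prune_ratio=0.3):
    """attn_probs: [batch, heads, frames, frames] temporal attention weights.
    Returns a {0,1} mask over heads: heads whose aggregate attention ranks in
    the bottom `prune_ratio` fraction are pruned (masked to zero)."""
    scores = attn_probs.mean(dim=(0, 2, 3))           # aggregate attention per head
    k = int(prune_ratio * scores.numel())
    mask = torch.ones_like(scores)
    if k > 0:
        _, prune_idx = torch.topk(scores, k, largest=False)
        mask[prune_idx] = 0.0
    return mask   # multiply head outputs by mask[None, :, None, None] at inference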
\ No newline at end of file diff --git a/data/2024/aaai/G-LIME: Statistical Learning for Local Interpretations of Deep Neural Networks Using Global Priors (Abstract Reprint) b/data/2024/aaai/G-LIME: Statistical Learning for Local Interpretations of Deep Neural Networks Using Global Priors (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/G-NAS: Generalizable Neural Architecture Search for Single Domain Generalization Object Detection b/data/2024/aaai/G-NAS: Generalizable Neural Architecture Search for Single Domain Generalization Object Detection new file mode 100644 index 0000000000..e76e590220 --- /dev/null +++ b/data/2024/aaai/G-NAS: Generalizable Neural Architecture Search for Single Domain Generalization Object Detection @@ -0,0 +1 @@ +In this paper, we focus on a realistic yet challenging task, Single Domain Generalization Object Detection (S-DGOD), where only one source domain's data can be used for training object detectors, but have to generalize multiple distinct target domains. In S-DGOD, both high-capacity fitting and generalization abilities are needed due to the task's complexity. Differentiable Neural Architecture Search (NAS) is known for its high capacity for complex data fitting and we propose to leverage Differentiable NAS to solve S-DGOD. However, it may confront severe over-fitting issues due to the feature imbalance phenomenon, where parameters optimized by gradient descent are biased to learn from the easy-to-learn features, which are usually non-causal and spuriously correlated to ground truth labels, such as the features of background in object detection data. Consequently, this leads to serious performance degradation, especially in generalizing to unseen target domains with huge domain gaps between the source domain and target domains. To address this issue, we propose the Generalizable loss (G-loss), which is an OoD-aware objective, preventing NAS from over-fitting by using gradient descent to optimize parameters not only on a subset of easy-to-learn features but also the remaining predictive features for generalization, and the overall framework is named G-NAS. Experimental results on the S-DGOD urban-scene datasets demonstrate that the proposed G-NAS achieves SOTA performance compared to baseline methods. Codes are available at https://github.com/wufan-cse/G-NAS. \ No newline at end of file diff --git a/data/2024/aaai/G2L-CariGAN: Caricature Generation from Global Structure to Local Features b/data/2024/aaai/G2L-CariGAN: Caricature Generation from Global Structure to Local Features new file mode 100644 index 0000000000..470bfab4aa --- /dev/null +++ b/data/2024/aaai/G2L-CariGAN: Caricature Generation from Global Structure to Local Features @@ -0,0 +1 @@ +Existing GAN-based approaches to caricature generation mainly focus on exaggerating a character’s global facial structure. This often leads to the failure in highlighting significant facial features such as big eyes and hook nose. To address this limitation, we propose a new approach termed as G2L-CariGAN, which uses feature maps of spatial dimensions instead of latent codes for geometric exaggeration. G2L-CariGAN first exaggerates the global facial structure of the character on a low-dimensional feature map and then exaggerates its local facial features on a high-dimensional feature map. Moreover, we develop a caricature identity loss function based on feature maps, which well retains the character's identity after exaggeration. 
Our experiments have demonstrated that G2L-CariGAN outperforms the state-of-arts in terms of the quality of exaggerating a character and retaining its identity. \ No newline at end of file diff --git a/data/2024/aaai/G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model b/data/2024/aaai/G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model new file mode 100644 index 0000000000..b2416321f4 --- /dev/null +++ b/data/2024/aaai/G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model @@ -0,0 +1 @@ +The Sign Language Production (SLP) project aims to automatically translate spoken languages into sign sequences. Our approach focuses on the transformation of sign gloss sequences into their corresponding sign pose sequences (G2P). In this paper, we present a novel solution for this task by converting the continuous pose space generation problem into a discrete sequence generation problem. We introduce the Pose-VQVAE framework, which combines Variational Autoencoders (VAEs) with vector quantization to produce a discrete latent representation for continuous pose sequences. Additionally, we propose the G2P-DDM model, a discrete denoising diffusion architecture for length-varied discrete sequence data, to model the latent prior. To further enhance the quality of pose sequence generation in the discrete space, we present the CodeUnet model to leverage spatial-temporal information. Lastly, we develop a heuristic sequential clustering method to predict variable lengths of pose sequences for corresponding gloss sequences. Our results show that our model outperforms state-of-the-art G2P models on the public SLP evaluation benchmark. For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm. \ No newline at end of file diff --git a/data/2024/aaai/GAD-PVI: A General Accelerated Dynamic-Weight Particle-Based Variational Inference Framework b/data/2024/aaai/GAD-PVI: A General Accelerated Dynamic-Weight Particle-Based Variational Inference Framework new file mode 100644 index 0000000000..4d9f422d71 --- /dev/null +++ b/data/2024/aaai/GAD-PVI: A General Accelerated Dynamic-Weight Particle-Based Variational Inference Framework @@ -0,0 +1 @@ +Particle-based Variational Inference (ParVI) methods approximate the target distribution by iteratively evolving finite weighted particle systems. Recent advances of ParVI methods reveal the benefits of accelerated position update strategies and dynamic weight adjustment approaches. In this paper, we propose the first ParVI framework that possesses both accelerated position update and dynamical weight adjustment simultaneously, named the General Accelerated Dynamic-Weight Particle-based Variational Inference (GAD-PVI) framework. Generally, GAD-PVI simulates the semi-Hamiltonian gradient flow on a novel Information-Fisher-Rao space, which yields an additional decrease on the local functional dissipation. GAD-PVI is compatible with different dissimilarity functionals and associated smoothing approaches under three information metrics. Experiments on both synthetic and real-world data demonstrate the faster convergence and reduced approximation error of GAD-PVI methods over the state-of-the-art. 
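As background for the class of methods GAD-PVI belongs to, the sketch below shows a generic weighted ParVI step that combines an SVGD-style position update with a multiplicative weight adjustment. It is only a toy illustration of "position update plus dynamic weight adjustment" in spirit; it does not implement the semi-Hamiltonian flow, the Information-Fisher-Rao space, or any other specifics of GAD-PVI, and all function names and step sizes are assumptions.

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """RBF kernel matrix and its gradient w.r.t. the first argument."""
    diff = x[:, None, :] - x[None, :, :]          # (n, n, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h ** 2))
    gradK = -diff / (h ** 2) * K[:, :, None]      # d k(x_i, x_j) / d x_i
    return K, gradK

def weighted_parvi_step(x, w, logp, grad_logp, step_x=0.05, step_w=0.3):
    """One position update (weighted SVGD-style transport) followed by one
    dynamic weight adjustment (multiplicative reweighting toward the target)."""
    K, gradK = rbf_kernel(x)
    G = grad_logp(x)                               # (n, d)
    # Transport: attraction along the target score plus kernel repulsion.
    drift = np.einsum("j,ji,jd->id", w, K, G) + np.einsum("j,jid->id", w, gradK)
    x_new = x + step_x * drift
    # Reweighting: particles in relatively high-density regions gain mass.
    lp = logp(x_new)
    log_w = np.log(w) + step_w * (lp - np.sum(w * lp))
    w_new = np.exp(log_w - log_w.max())
    return x_new, w_new / w_new.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(3.0, 1.0, size=(100, 2))        # particles start off-target
    w = np.full(100, 1.0 / 100)
    logp = lambda z: -0.5 * (z ** 2).sum(-1)       # standard Gaussian target (unnormalized)
    grad_logp = lambda z: -z
    for _ in range(200):
        x, w = weighted_parvi_step(x, w, logp, grad_logp)
    print("weighted mean:", (w[:, None] * x).sum(0))   # should move toward [0, 0]
```

The separation between the transport term and the reweighting term mirrors the two ingredients the abstract highlights: where the particles move, and how much mass each particle carries.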
\ No newline at end of file diff --git a/data/2024/aaai/GAMC: An Unsupervised Method for Fake News Detection Using Graph Autoencoder with Masking b/data/2024/aaai/GAMC: An Unsupervised Method for Fake News Detection Using Graph Autoencoder with Masking new file mode 100644 index 0000000000..cfe339de6f --- /dev/null +++ b/data/2024/aaai/GAMC: An Unsupervised Method for Fake News Detection Using Graph Autoencoder with Masking @@ -0,0 +1 @@ +With the rise of social media, the spread of fake news has become a significant concern, potentially misleading public perceptions and impacting social stability. Deep learning methods such as CNNs, RNNs, and Transformer-based models like BERT have enhanced fake news detection, but they primarily focus on content and do not consider the social context during news propagation. Graph-based techniques have incorporated the social context but are limited by the need for large labeled datasets. To address these challenges, this paper introduces GAMC, an unsupervised fake news detection technique using the Graph Autoencoder with Masking and Contrastive learning. By leveraging both the context and content of news propagation as self-supervised signals, our method reduces the dependency on labeled datasets. Specifically, GAMC begins by applying data augmentation to the original news propagation graphs. These augmented graphs are then encoded using a graph encoder and reconstructed via a graph decoder. Finally, a composite loss function that encompasses both reconstruction error and contrastive loss is designed. First, it ensures the model can effectively capture the latent features by minimizing the discrepancy between reconstructed and original graph representations. Second, it aligns the representations of augmented graphs that originate from the same source. Experiments on a real-world dataset validate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/GCNext: Towards the Unity of Graph Convolutions for Human Motion Prediction b/data/2024/aaai/GCNext: Towards the Unity of Graph Convolutions for Human Motion Prediction new file mode 100644 index 0000000000..9ad37ccc9f --- /dev/null +++ b/data/2024/aaai/GCNext: Towards the Unity of Graph Convolutions for Human Motion Prediction @@ -0,0 +1 @@ +The past few years have witnessed the dominance of Graph Convolutional Networks (GCNs) in human motion prediction. Various styles of graph convolutions have been proposed, each meticulously designed and incorporated into a carefully crafted network architecture. This paper breaks the limits of existing knowledge by proposing Universal Graph Convolution (UniGC), a novel graph convolution concept that re-conceptualizes different graph convolutions as its special cases. Leveraging UniGC at the network level, we propose GCNext, a novel GCN-building paradigm that dynamically determines the best-fitting graph convolutions both sample-wise and layer-wise. GCNext offers multiple use cases, including training a new GCN from scratch or refining a preexisting GCN. Experiments on the Human3.6M, AMASS, and 3DPW datasets show that, by incorporating unique module-to-network designs, GCNext yields up to 9x lower computational cost than existing GCN methods, on top of achieving state-of-the-art performance. Our code is available at https://github.com/BradleyWang0416/GCNext. 
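To illustrate the idea of re-conceptualizing different graph convolutions as special cases of one operator, here is a minimal sketch of a "universal" graph-convolution layer that blends several candidate adjacency patterns with a learned gate. It shows only layer-wise selection (the paper also describes sample-wise selection), and the class name `UniGCLayer`, the candidate set, and the gating scheme are assumptions for illustration rather than the authors' design.

```python
import torch
import torch.nn as nn

class UniGCLayer(nn.Module):
    """Illustrative 'universal' graph convolution: several candidate adjacency
    patterns are blended by a learned gate, so one layer can mimic different
    hand-designed graph-convolution styles."""

    def __init__(self, num_joints, in_dim, out_dim, skeleton_adj):
        super().__init__()
        candidates = torch.stack([
            torch.eye(num_joints),                            # no mixing (per-joint transform)
            skeleton_adj,                                     # fixed skeletal connectivity
            torch.ones(num_joints, num_joints) / num_joints,  # fully-connected averaging
        ])
        self.register_buffer("candidates", candidates)
        self.learned_adj = nn.Parameter(torch.zeros(num_joints, num_joints))
        self.gate = nn.Parameter(torch.zeros(len(candidates) + 1))  # one logit per style
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                     # x: (batch, joints, in_dim)
        adj = torch.cat([self.candidates, self.learned_adj.unsqueeze(0)], dim=0)
        weights = torch.softmax(self.gate, dim=0)             # layer-wise choice of style
        mixed_adj = torch.einsum("k,kij->ij", weights, adj)
        return self.proj(torch.einsum("ij,bjd->bid", mixed_adj, x))

if __name__ == "__main__":
    skeleton = torch.zeros(5, 5)
    for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
        skeleton[a, b] = skeleton[b, a] = 1.0
    layer = UniGCLayer(num_joints=5, in_dim=8, out_dim=16, skeleton_adj=skeleton)
    print(layer(torch.randn(2, 5, 8)).shape)   # torch.Size([2, 5, 16])
```

Setting the gate to a one-hot vector recovers a single conventional graph-convolution style, which is the sense in which such a layer subsumes existing variants.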
\ No newline at end of file diff --git a/data/2024/aaai/GEAR-Up: Generative AI and External Knowledge-Based Retrieval: Upgrading Scholarly Article Searches for Systematic Reviews b/data/2024/aaai/GEAR-Up: Generative AI and External Knowledge-Based Retrieval: Upgrading Scholarly Article Searches for Systematic Reviews new file mode 100644 index 0000000000..5124db575b --- /dev/null +++ b/data/2024/aaai/GEAR-Up: Generative AI and External Knowledge-Based Retrieval: Upgrading Scholarly Article Searches for Systematic Reviews @@ -0,0 +1 @@ +This paper addresses the time-intensive nature of systematic reviews (SRs) and proposes a solution leveraging advancements in Generative AI (e.g., ChatGPT) and external knowledge augmentation (e.g., Retrieval-Augmented Generation). The proposed system, GEAR-Up, automates query development and translation in SRs, enhancing efficiency by enriching user queries with context from language models and knowledge graphs. Collaborating with librarians, qualitative evaluations demonstrate improved reproducibility and search strategy quality. Access the demo at https://youtu.be/zMdP56GJ9mU. \ No newline at end of file diff --git a/data/2024/aaai/GLDL: Graph Label Distribution Learning b/data/2024/aaai/GLDL: Graph Label Distribution Learning new file mode 100644 index 0000000000..39ffbe3520 --- /dev/null +++ b/data/2024/aaai/GLDL: Graph Label Distribution Learning @@ -0,0 +1 @@ +Label Distribution Learning (LDL), as a more general learning setting than generic single-label and multi-label learning, has been commonly used in computer vision and many other applications. To date, existing LDL approaches are designed and applied to data without considering the interdependence between instances. In this paper, we propose a Graph Label Distribution Learning (GLDL) framework, which explicitly models three types of relationships: instance-instance, label-label, and instance-label, to learn the label distribution for networked data. A label-label network is learned to capture label-to-label correlation, through which GLDL can accurately learn label distributions for nodes. Dual graph convolution network (GCN) Co-training with heterogeneous message passing ensures two GCNs, one focusing on instance-instance relationship and the other one targeting label-label correlation, are jointly trained such that instance-instance relationship can help induce label-label correlation and vice versa. Our theoretical study derives the error bound of GLDL. For verification, four benchmark datasets with label distributions for nodes are created using common graph benchmarks. The experiments show that considering dependency helps learn better label distributions for networked data, compared to state-of-the-art LDL baseline. In addition, GLDL not only outperforms simple GCN and graph attention networks (GAT) using distribution loss but is also superior to its variant considering label-label relationship as a static network. GLDL and its benchmarks are the first research endeavors to address LDL for graphs. Code and benchmark data are released for public access. 
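For readers unfamiliar with label distribution learning on graphs, the sketch below shows the simplest possible instance-instance branch: a two-layer GCN whose output rows are label distributions, trained against a KL-divergence objective. It omits the label-label network, the dual-GCN co-training, and the heterogeneous message passing described above; the helper names and shapes are assumptions made for illustration.

```python
import numpy as np

def normalize_adj(A):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def graph_label_distribution_forward(A, X, W1, W2):
    """Two-layer GCN whose output rows are label distributions (sum to 1)."""
    A_norm = normalize_adj(A)
    H = np.maximum(A_norm @ X @ W1, 0.0)       # ReLU hidden layer
    return softmax(A_norm @ H @ W2)

def kl_distribution_loss(P_true, P_pred, eps=1e-12):
    """Mean KL(P_true || P_pred), the usual LDL training objective."""
    return np.mean(np.sum(P_true * (np.log(P_true + eps) - np.log(P_pred + eps)), axis=1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, h, c = 6, 8, 16, 4                    # nodes, features, hidden units, labels
    A = (rng.random((n, n)) < 0.3).astype(float)
    A = np.maximum(A, A.T)                      # undirected instance-instance graph
    X = rng.normal(size=(n, d))
    W1, W2 = 0.1 * rng.normal(size=(d, h)), 0.1 * rng.normal(size=(h, c))
    P_true = softmax(rng.normal(size=(n, c)))   # ground-truth label distributions
    P_pred = graph_label_distribution_forward(A, X, W1, W2)
    print("KL loss:", kl_distribution_loss(P_true, P_pred))
```

The point of the graph propagation step is that a node's predicted label distribution is informed by its neighbors, which is the instance-instance dependency the abstract argues plain LDL ignores.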
\ No newline at end of file diff --git a/data/2024/aaai/GLH-Water: A Large-Scale Dataset for Global Surface Water Detection in Large-Size Very-High-Resolution Satellite Imagery b/data/2024/aaai/GLH-Water: A Large-Scale Dataset for Global Surface Water Detection in Large-Size Very-High-Resolution Satellite Imagery new file mode 100644 index 0000000000..1906dac709 --- /dev/null +++ b/data/2024/aaai/GLH-Water: A Large-Scale Dataset for Global Surface Water Detection in Large-Size Very-High-Resolution Satellite Imagery @@ -0,0 +1 @@ +Global surface water detection in very-high-resolution (VHR) satellite imagery can directly serve major applications such as refined flood mapping and water resource assessment. Although achievements have been made in detecting surface water in small-size satellite images corresponding to local geographic scales, datasets and methods suitable for mapping and analyzing global surface water have yet to be explored. To encourage the development of this task and facilitate the implementation of relevant applications, we propose the GLH-water dataset that consists of 250 satellite images and 40.96 billion pixels labeled surface water annotations that are distributed globally and contain water bodies exhibiting a wide variety of types (e.g. , rivers, lakes, and ponds in forests, irrigated fields, bare areas, and urban areas). Each image is of the size 12,800 × 12,800 pixels at 0.3 meter spatial resolution. To build a benchmark for GLH-water, we perform extensive experiments employing representative surface water detection models, popular semantic segmentation models, and ultra-high resolution segmentation models. Furthermore, we also design a strong baseline with the novel pyramid consistency loss (PCL) to initially explore this challenge, increasing IoU by 2.4% over the next best baseline. Finally, we implement the cross-dataset generalization and pilot area application experiments, and the superior performance illustrates the strong generalization and practical application value of GLH-water dataset. Project page: https://jack-bo1220.github.io/project/GLH-water.html \ No newline at end of file diff --git a/data/2024/aaai/GLOP: Learning Global Partition and Local Construction for Solving Large-Scale Routing Problems in Real-Time b/data/2024/aaai/GLOP: Learning Global Partition and Local Construction for Solving Large-Scale Routing Problems in Real-Time new file mode 100644 index 0000000000..b10c4f7b6b --- /dev/null +++ b/data/2024/aaai/GLOP: Learning Global Partition and Local Construction for Solving Large-Scale Routing Problems in Real-Time @@ -0,0 +1 @@ +The recent end-to-end neural solvers have shown promise for small-scale routing problems but suffered from limited real-time scaling-up performance. This paper proposes GLOP (Global and Local Optimization Policies), a unified hierarchical framework that efficiently scales toward large-scale routing problems. GLOP hierarchically partitions large routing problems into Travelling Salesman Problems (TSPs) and TSPs into Shortest Hamiltonian Path Problems. For the first time, we hybridize non-autoregressive neural heuristics for coarse-grained problem partitions and autoregressive neural heuristics for fine-grained route constructions, leveraging the scalability of the former and the meticulousness of the latter. Experimental results show that GLOP achieves competitive and state-of-the-art real-time performance on large-scale routing problems, including TSP, ATSP, CVRP, and PCTSP. 
Our code is available at: https://github.com/henry-yeh/GLOP. \ No newline at end of file diff --git a/data/2024/aaai/GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval b/data/2024/aaai/GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval new file mode 100644 index 0000000000..d61e0d3a60 --- /dev/null +++ b/data/2024/aaai/GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval @@ -0,0 +1 @@ +Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead. To solve the efficiency problem of PRVR methods, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly. During frame interactions, we incorporate Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames instead of the whole video. Then generated representations will contain multi-scale clip information, achieving implicit clip modeling. In addition, PRVR methods ignore semantic differences between text queries relevant to the same video, leading to a sparse embedding space. We propose a query diverse loss to distinguish these text queries, making the embedding space more intensive and contain more semantic information. Extensive experiments on three large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer. \ No newline at end of file diff --git a/data/2024/aaai/GMP-AR: Granularity Message Passing and Adaptive Reconciliation for Temporal Hierarchy Forecasting b/data/2024/aaai/GMP-AR: Granularity Message Passing and Adaptive Reconciliation for Temporal Hierarchy Forecasting new file mode 100644 index 0000000000..407915aecc --- /dev/null +++ b/data/2024/aaai/GMP-AR: Granularity Message Passing and Adaptive Reconciliation for Temporal Hierarchy Forecasting @@ -0,0 +1 @@ +Time series forecasts of different temporal granularity are widely used in real-world applications, e.g., sales prediction in days and weeks for making different inventory plans. However, these tasks are usually solved separately without ensuring coherence, which is crucial for aligning downstream decisions. Previous works mainly focus on ensuring coherence with some straightforward methods, e.g., aggregation from the forecasts of fine granularity to the coarse ones, and allocation from the coarse granularity to the fine ones. These methods merely take the temporal hierarchical structure to maintain coherence without improving the forecasting accuracy. In this paper, we propose a novel granularity message-passing mechanism (GMP) that leverages temporal hierarchy information to improve forecasting performance and also utilizes an adaptive reconciliation (AR) strategy to maintain coherence without performance loss. Furthermore, we introduce an optimization module to achieve task-based targets while adhering to more real-world constraints. Experiments on real-world datasets demonstrate that our framework (GMP-AR) achieves superior performances on temporal hierarchical forecasting tasks compared to state-of-the-art methods. 
In addition, our framework has been successfully applied to a real-world task of payment traffic management in Alipay by integrating with the task-based optimization module. \ No newline at end of file diff --git a/data/2024/aaai/GO-DICE: Goal-Conditioned Option-Aware Offline Imitation Learning via Stationary Distribution Correction Estimation b/data/2024/aaai/GO-DICE: Goal-Conditioned Option-Aware Offline Imitation Learning via Stationary Distribution Correction Estimation new file mode 100644 index 0000000000..9f1e1e3687 --- /dev/null +++ b/data/2024/aaai/GO-DICE: Goal-Conditioned Option-Aware Offline Imitation Learning via Stationary Distribution Correction Estimation @@ -0,0 +1 @@ +Offline imitation learning (IL) refers to learning expert behavior solely from demonstrations, without any additional interaction with the environment. Despite significant advances in offline IL, existing techniques find it challenging to learn policies for long-horizon tasks and require significant re-training when task specifications change. Towards addressing these limitations, we present GO-DICE, an offline IL technique for goal-conditioned long-horizon sequential tasks. GO-DICE discerns a hierarchy of sub-tasks from demonstrations and uses these to learn separate policies for sub-task transitions and action execution, respectively; this hierarchical policy learning facilitates long-horizon reasoning. Inspired by the expansive DICE family of techniques, policy learning at both levels transpires within the space of stationary distributions. Further, both policies are learnt with goal conditioning to minimize the need for retraining when task goals change. Experimental results substantiate that GO-DICE outperforms recent baselines, as evidenced by a marked improvement in the completion rate of increasingly challenging pick-and-place MuJoCo robotic tasks. GO-DICE is also capable of leveraging imperfect demonstrations and partial task segmentation when available, both of which boost task performance relative to learning from expert demonstrations alone. \ No newline at end of file diff --git a/data/2024/aaai/GOALNET: Interleaving Neural Goal Predicate Inference with Classical Planning for Generalization in Robot Instruction Following b/data/2024/aaai/GOALNET: Interleaving Neural Goal Predicate Inference with Classical Planning for Generalization in Robot Instruction Following new file mode 100644 index 0000000000..5230053147 --- /dev/null +++ b/data/2024/aaai/GOALNET: Interleaving Neural Goal Predicate Inference with Classical Planning for Generalization in Robot Instruction Following @@ -0,0 +1 @@ +Our goal is to enable a robot to learn how to sequence its actions to perform high-level tasks specified as natural language instructions, given successful demonstrations from a human partner. Our novel neuro-symbolic solution GOALNET builds an iterative two-step approach that interleaves (i) inferring the next subgoal predicate implied by the language instruction, for a given world state, and (ii) synthesizing a feasible subgoal-reaching plan from that state. The agent executes the plan, and the two steps are repeated. GOALNET combines (i) learning, where dense representations are acquired for the language instruction and the world state via a neural network prediction model, enabling generalization to novel settings, and (ii) planning, where the cause-effect modeling by a classical planner eschews irrelevant predicates, facilitating multi-stage decision making in large domains. 
GOALNET obtains a 78% improvement in the goal-reaching rate in comparison to several state-of-the-art approaches on benchmark data with multi-stage instructions. Further, GOALNET can generalize to novel instructions for scenes with unseen objects. Source code is available at https://github.com/reail-iitd/goalnet. \ No newline at end of file diff --git a/data/2024/aaai/GOODAT: Towards Test-Time Graph Out-of-Distribution Detection b/data/2024/aaai/GOODAT: Towards Test-Time Graph Out-of-Distribution Detection new file mode 100644 index 0000000000..1e65db39e5 --- /dev/null +++ b/data/2024/aaai/GOODAT: Towards Test-Time Graph Out-of-Distribution Detection @@ -0,0 +1 @@ +Graph neural networks (GNNs) have found widespread application in modeling graph data across diverse domains. While GNNs excel in scenarios where the testing data shares the distribution of their training counterparts (in-distribution, ID), they often exhibit incorrect predictions when confronted with samples from an unfamiliar distribution (out-of-distribution, OOD). To identify and reject OOD samples with GNNs, recent studies have explored graph OOD detection, often focusing on training a specific model or modifying the data on top of a well-trained GNN. Despite their effectiveness, these methods demand heavy training resources and costs, as they need to optimize the GNN-based models on training data. Moreover, their reliance on modifying the original GNNs and accessing training data further restricts their universality. To this end, this paper introduces a method to detect Graph Out-of-Distribution At Test-time (namely GOODAT), a data-centric, unsupervised, and plug-and-play solution that operates independently of training data and modifications of the GNN architecture. With a lightweight graph masker, GOODAT can learn informative subgraphs from test samples, enabling the capture of distinct graph patterns between OOD and ID samples. To optimize the graph masker, we meticulously design three unsupervised objective functions based on the graph information bottleneck principle, motivating the masker to capture compact yet informative subgraphs for OOD detection. Comprehensive evaluations confirm that our GOODAT method outperforms state-of-the-art benchmarks across a variety of real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting b/data/2024/aaai/GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting new file mode 100644 index 0000000000..1d00724890 --- /dev/null +++ b/data/2024/aaai/GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting @@ -0,0 +1 @@ +Time series forecasting is an essential area of machine learning with a wide range of real-world applications. Most previous forecasting models aim to capture dynamic characteristics from uni-modal numerical historical data. Although extra knowledge can boost time series forecasting performance, such information is hard to collect. In addition, how to fuse the multimodal information is non-trivial. In this paper, we first propose a general principle for collecting the corresponding textual information from different data sources with the help of modern large language models (LLMs). Then, we propose a prompt-based LLM framework, named GPT4MTS, that utilizes both the numerical data and the textual information simultaneously. 
In practice, we propose a GDELT-based multimodal time series dataset for news impact forecasting, which provides a concise and well-structured version of time series dataset with textual information for further research in communication. Through extensive experiments, we demonstrate the effectiveness of our proposed method on forecasting tasks with extra-textual information. \ No newline at end of file diff --git a/data/2024/aaai/GSDD: Generative Space Dataset Distillation for Image Super-resolution b/data/2024/aaai/GSDD: Generative Space Dataset Distillation for Image Super-resolution new file mode 100644 index 0000000000..e097c988f6 --- /dev/null +++ b/data/2024/aaai/GSDD: Generative Space Dataset Distillation for Image Super-resolution @@ -0,0 +1 @@ +Single image super-resolution (SISR), especially in the real world, usually builds a large amount of LR-HR image pairs to learn representations that contain rich textural and structural information. However, relying on massive data for model training not only reduces training efficiency, but also causes heavy data storage burdens. In this paper, we attempt a pioneering study on dataset distillation (DD) for SISR problems to explore how data could be slimmed and compressed for the task. Unlike previous coreset selection methods which select a few typical examples directly from the original data, we remove the limitation that the selected data cannot be further edited, and propose to synthesize and optimize samples to preserve more task-useful representations. Concretely, by utilizing pre-trained GANs as a suitable approximation of realistic data distribution, we propose GSDD, which distills data in a latent generative space based on GAN-inversion techniques. By optimizing them to match with the practical data distribution in an informative feature space, the distilled data could then be synthesized. Experimental results demonstrate that when trained with our distilled data, GSDD can achieve comparable performance to the state-of-the-art (SOTA) SISR algorithms, while a nearly ×8 increase in training efficiency and a saving of almost 93.2% data storage space can be realized. Further experiments on challenging real-world data also demonstrate the promising generalization ability of GSDD. \ No newline at end of file diff --git a/data/2024/aaai/GSENet: Global Semantic Enhancement Network for Lane Detection b/data/2024/aaai/GSENet: Global Semantic Enhancement Network for Lane Detection new file mode 100644 index 0000000000..aac547aa78 --- /dev/null +++ b/data/2024/aaai/GSENet: Global Semantic Enhancement Network for Lane Detection @@ -0,0 +1 @@ +Lane detection is the cornerstone of autonomous driving. Although existing methods have achieved promising results, there are still limitations in addressing challenging scenarios such as abnormal weather, occlusion, and curves. These scenarios with low visibility usually require to rely on the broad information of the entire scene provided by global semantics and local texture information to predict the precise position and shape of the lane lines. In this paper, we propose a Global Semantic Enhancement Network for lane detection, which involves a complete set of systems for feature extraction and global features transmission. Traditional methods for global feature extraction usually require deep convolution layer stacks. 
However, this approach of obtaining global features solely through a larger receptive field not only fails to capture precise global features but also leads to an overly deep model, which results in slow inference speed. To address these challenges, we propose a novel operation called the Global feature Extraction Module (GEM). Additionally, we introduce the Top Layer Auxiliary Module (TLAM) as a channel for feature distillation, which facilitates a bottom-up transmission of global features. Furthermore, we introduce two novel loss functions: the Angle Loss, which accounts for the angle between predicted and ground truth lanes, and the Generalized Line IoU Loss, which handles scenarios where the predicted lanes deviate significantly from the ground truth under harsh conditions. The experimental results reveal that the proposed method exhibits remarkable superiority over the current state-of-the-art techniques for lane detection. Our code is available at: https://github.com/crystal250/GSENet. \ No newline at end of file diff --git a/data/2024/aaai/GSN: Generalisable Segmentation in Neural Radiance Field b/data/2024/aaai/GSN: Generalisable Segmentation in Neural Radiance Field new file mode 100644 index 0000000000..c83fa6d29b --- /dev/null +++ b/data/2024/aaai/GSN: Generalisable Segmentation in Neural Radiance Field @@ -0,0 +1 @@ +Traditional Radiance Field (RF) representations capture details of a specific scene and must be trained afresh on each scene. Semantic feature fields have been added to RFs to facilitate several segmentation tasks. Generalised RF representations learn the principles of view interpolation. A generalised RF can render new views of an unknown and untrained scene, given a few views. We present a way to distil feature fields into the generalised GNT representation. Our GSN representation generates new views of unseen scenes on the fly along with consistent, per-pixel semantic features. This enables multi-view segmentation of arbitrary new scenes. We show different semantic features being distilled into generalised RFs. Our multi-view segmentation results are on par with methods that use traditional RFs. GSN closes the gap between standard and generalisable RF methods significantly. Project Page: https://vinayak-vg.github.io/GSN/ \ No newline at end of file diff --git a/data/2024/aaai/GSO-Net: Grid Surface Optimization via Learning Geometric Constraints b/data/2024/aaai/GSO-Net: Grid Surface Optimization via Learning Geometric Constraints new file mode 100644 index 0000000000..8e6a6e7c07 --- /dev/null +++ b/data/2024/aaai/GSO-Net: Grid Surface Optimization via Learning Geometric Constraints @@ -0,0 +1 @@ +In the context of surface representations, we find a natural structural similarity between grid surface data and image data. Motivated by this observation, we propose a novel approach: encoding grid surfaces as geometric images and using image processing methods to address surface optimization-related problems. As a result, we have created the first dataset for grid surface optimization and devised a learning-based grid surface optimization network specifically tailored to geometric images, addressing the surface optimization problem through a data-driven paradigm of learning geometric constraints. We conduct extensive experiments on developable surface optimization, surface flattening, and surface denoising tasks using the designed network and datasets. 
The results demonstrate that our proposed method not only addresses the surface optimization problem better than traditional numerical optimization methods, especially for complex surfaces, but also boosts the optimization speed by multiple orders of magnitude. This pioneering study successfully applies deep learning methods to the field of surface optimization and provides a new solution paradigm for similar tasks, which will provide inspiration and guidance for future developments in the field of discrete surface optimization. The code and dataset are available at https://github.com/chaoyunwang/GSO-Net. \ No newline at end of file diff --git a/data/2024/aaai/G^2SAM: Graph-Based Global Semantic Awareness Method for Multimodal Sarcasm Detection b/data/2024/aaai/G^2SAM: Graph-Based Global Semantic Awareness Method for Multimodal Sarcasm Detection new file mode 100644 index 0000000000..eeb020b18f --- /dev/null +++ b/data/2024/aaai/G^2SAM: Graph-Based Global Semantic Awareness Method for Multimodal Sarcasm Detection @@ -0,0 +1 @@ +Multimodal sarcasm detection, aiming to detect the ironic sentiment within multimodal social data, has gained substantial popularity in both the natural language processing and computer vision communities. Recently, graph-based studies by drawing sentimental relations to detect multimodal sarcasm have made notable advancements. However, they have neglected exploiting graph-based global semantic congruity from existing instances to facilitate the prediction, which ultimately hinders the model's performance. In this paper, we introduce a new inference paradigm that leverages global graph-based semantic awareness to handle this task. Firstly, we construct fine-grained multimodal graphs for each instance and integrate them into semantic space to draw graph-based relations. During inference, we leverage global semantic congruity to retrieve k-nearest neighbor instances in semantic space as references for voting on the final prediction. To enhance the semantic correlation of representation in semantic space, we also introduce label-aware graph contrastive learning to further improve the performance. Experimental results demonstrate that our model achieves state-of-the-art (SOTA) performance in multimodal sarcasm detection. The code will be available at https://github.com/upccpu/G2SAM. \ No newline at end of file diff --git a/data/2024/aaai/GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers b/data/2024/aaai/GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers new file mode 100644 index 0000000000..6c2fa42778 --- /dev/null +++ b/data/2024/aaai/GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers @@ -0,0 +1 @@ +Transformers based on attention mechanisms exhibit vulnerability to adversarial examples, posing a substantial threat to the security of their applications. Aiming to solve this problem, the concept of robustness certification is introduced to formally ascertain the presence of any adversarial example within a specified region surrounding a given sample. However, prior works have neglected the dependencies among inputs of softmax (the most complex function in attention mechanisms) during linear relaxations. This oversight has consequently led to imprecise certification results. In this work, we introduce GaLileo, a general linear relaxation framework designed to certify the robustness of Transformers. 
GaLileo effectively surmounts the trade-off between precision and efficiency in robustness certification through our innovative n-dimensional relaxation approach. Notably, our relaxation technique represents a pioneering effort as the first linear relaxation for n-dimensional functions such as softmax. Our novel approach successfully transcends the challenges posed by the curse of dimensionality inherent in linear relaxations, thereby enhancing linear bounds by incorporating input dependencies. Our evaluations encompassed a thorough analysis utilizing the SST and Yelp datasets along with diverse Transformers of different depths and widths. The experimental results demonstrate that, as compared to the baseline method CROWN-BaF, GaLileo achieves up to 3.24 times larger certified radii while requiring similar running times. Additionally, GaLileo successfully attains certification for Transformers' robustness against multi-word lp perturbations, marking a notable accomplishment in this field. \ No newline at end of file diff --git a/data/2024/aaai/Game-Theoretic Unlearnable Example Generator b/data/2024/aaai/Game-Theoretic Unlearnable Example Generator new file mode 100644 index 0000000000..c28d3f24cc --- /dev/null +++ b/data/2024/aaai/Game-Theoretic Unlearnable Example Generator @@ -0,0 +1 @@ +Unlearnable example attacks are data poisoning attacks aiming to degrade the clean test accuracy of deep learning by adding imperceptible perturbations to the training samples, which can be formulated as a bi-level optimization problem. However, directly solving this optimization problem is intractable for deep neural networks. In this paper, we investigate unlearnable example attacks from a game-theoretic perspective, by formulating the attack as a nonzero sum Stackelberg game. First, the existence of game equilibria is proved under the normal setting and the adversarial training setting. It is shown that the game equilibrium gives the most powerful poison attack in that the victim has the lowest test accuracy among all networks within the same hypothesis space when certain loss functions are used. Second, we propose a novel attack method, called the Game Unlearnable Example (GUE), which has three main gradients. (1) The poisons are obtained by directly solving the equilibrium of the Stackelberg game with a first-order algorithm. (2) We employ an autoencoder-like generative network model as the poison attacker. (3) A novel payoff function is introduced to evaluate the performance of the poison. Comprehensive experiments demonstrate that GUE can effectively poison the model in various scenarios. Furthermore, the GUE still works by using a relatively small percentage of the training data to train the generator, and the poison generator can generalize to unseen data well. Our implementation code can be found at https://github.com/hong-xian/gue. \ No newline at end of file diff --git a/data/2024/aaai/Gated Attention Coding for Training High-Performance and Efficient Spiking Neural Networks b/data/2024/aaai/Gated Attention Coding for Training High-Performance and Efficient Spiking Neural Networks new file mode 100644 index 0000000000..e00b8e4bbc --- /dev/null +++ b/data/2024/aaai/Gated Attention Coding for Training High-Performance and Efficient Spiking Neural Networks @@ -0,0 +1 @@ +Spiking neural networks (SNNs) are emerging as an energy-efficient alternative to traditional artificial neural networks (ANNs) due to their unique spike-based event-driven nature. 
Coding is crucial in SNNs as it converts external input stimuli into spatio-temporal feature sequences. However, most existing deep SNNs rely on direct coding, which generates weak spike representations and lacks the temporal dynamics inherent in human vision. Hence, we introduce Gated Attention Coding (GAC), a plug-and-play module that leverages the multi-dimensional gated attention unit to efficiently encode inputs into powerful representations before feeding them into the SNN architecture. GAC functions as a preprocessing layer that does not disrupt the spike-driven nature of the SNN, making it amenable to efficient neuromorphic hardware implementation with minimal modifications. Through a theoretical analysis based on an observer model, we demonstrate that GAC's attention mechanism improves temporal dynamics and coding efficiency. Experiments on the CIFAR10/100 and ImageNet datasets demonstrate that GAC achieves state-of-the-art accuracy with remarkable efficiency. Notably, we improve top-1 accuracy by 3.10% on CIFAR100 with only 6 time steps and by 1.07% on ImageNet, while reducing energy usage to 66.9% of previous works. To the best of our knowledge, this is the first exploration of an attention-based dynamic coding scheme in deep SNNs, with exceptional effectiveness and efficiency on large-scale datasets. Code is available at https://github.com/bollossom/GAC. \ No newline at end of file diff --git a/data/2024/aaai/Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding b/data/2024/aaai/Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding new file mode 100644 index 0000000000..06c811e30e --- /dev/null +++ b/data/2024/aaai/Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding @@ -0,0 +1 @@ +In the weakly supervised temporal video grounding literature, previous methods use predetermined single Gaussian proposals, which lack the ability to express the diverse events described by the sentence query. To enhance the expression ability of a proposal, we propose a Gaussian mixture proposal (GMP) that can depict arbitrary shapes by learning the importance, centroid, and range of every Gaussian in the mixture. In learning the GMP, each Gaussian is not trained in a feature space but is implemented over a temporal location. Thus, the conventional feature-based learning for Gaussian mixture models is not valid in our case. In this special setting, to learn a moderately coupled Gaussian mixture that captures diverse events, we newly propose a pull-push learning scheme using pulling and pushing losses, each of which plays a role opposite to the other. The effects of the components of our scheme are verified in depth with extensive ablation studies, and the overall scheme achieves state-of-the-art performance. Our code is available at https://github.com/sunoh-kim/pps. \ No newline at end of file diff --git a/data/2024/aaai/Gaussian Process Neural Additive Models b/data/2024/aaai/Gaussian Process Neural Additive Models new file mode 100644 index 0000000000..68db548611 --- /dev/null +++ b/data/2024/aaai/Gaussian Process Neural Additive Models @@ -0,0 +1 @@ +Deep neural networks have revolutionized many fields, but their black-box nature also occasionally prevents their wider adoption in fields such as healthcare and finance, where interpretable and explainable models are required. 
The recent development of Neural Additive Models (NAMs) poses a major step in the direction of interpretable deep learning for tabular datasets. In this paper, we propose a new subclass of NAMs that utilize a single-layer neural network construction of the Gaussian process via random Fourier features, which we call Gaussian Process Neural Additive Models (GP-NAM). GP-NAMs have the advantage of a convex objective function and number of trainable parameters that grows linearly with feature dimensions. It suffers no loss in performance compared with deeper NAM approaches because GPs are well-suited to learning complex non-parametric univariate functions. We demonstrate the performance of GP-NAM on several tabular datasets, showing that it achieves comparable performance in both classification and regression tasks with a massive reduction in the number of parameters. \ No newline at end of file diff --git a/data/2024/aaai/Gaze Target Detection by Merging Human Attention and Activity Cues b/data/2024/aaai/Gaze Target Detection by Merging Human Attention and Activity Cues new file mode 100644 index 0000000000..eb6febc727 --- /dev/null +++ b/data/2024/aaai/Gaze Target Detection by Merging Human Attention and Activity Cues @@ -0,0 +1 @@ +Despite achieving impressive performance, current methods for detecting gaze targets, which depend on visual saliency and spatial scene geometry, continue to face challenges when it comes to detecting gaze targets within intricate image backgrounds. One of the primary reasons for this lies in the oversight of the intricate connection between human attention and activity cues. In this study, we introduce an innovative approach that amalgamates the visual saliency detection with the body-part & object interaction both guided by the soft gaze attention. This fusion enables precise and dependable detection of gaze targets amidst intricate image backgrounds. Our approach attains state-of-the-art performance on both the Gazefollow benchmark and the GazeVideoAttn benchmark. In comparison to recent methods that rely on intricate 3D reconstruction of a single input image, our approach, which solely leverages 2D image information, still exhibits a substantial lead across all evaluation metrics, positioning it closer to human-level performance. These outcomes underscore the potent effectiveness of our proposed method in the gaze target detection task. \ No newline at end of file diff --git a/data/2024/aaai/Gaze from Origin: Learning for Generalized Gaze Estimation by Embedding the Gaze Frontalization Process b/data/2024/aaai/Gaze from Origin: Learning for Generalized Gaze Estimation by Embedding the Gaze Frontalization Process new file mode 100644 index 0000000000..94dad26a63 --- /dev/null +++ b/data/2024/aaai/Gaze from Origin: Learning for Generalized Gaze Estimation by Embedding the Gaze Frontalization Process @@ -0,0 +1 @@ +Gaze estimation aims to accurately estimate the direction or position at which a person is looking. With the development of deep learning techniques, a number of gaze estimation methods have been proposed and achieved state-of-the-art performance. However, these methods are limited to within-dataset settings, whose performance drops when tested on unseen datasets. We argue that this is caused by infinite and continuous gaze labels. To alleviate this problem, we propose using gaze frontalization as an auxiliary task to constrain gaze estimation. 
Based on this, we propose a novel gaze domain generalization framework named Gaze Frontalization-based Auxiliary Learning (GFAL) Framework which embeds the gaze frontalization process, i.e., guiding the feature so that the eyeball can rotate and look at the front (camera), without any target domain information during training. Experimental results show that our proposed framework is able to achieve state-of-the-art performance on gaze domain generalization task, which is competitive with or even superior to the SOTA gaze unsupervised domain adaptation methods. \ No newline at end of file diff --git a/data/2024/aaai/Gaze-Based Interaction Adaptation for People with Involuntary Head Movements (Student Abstract) b/data/2024/aaai/Gaze-Based Interaction Adaptation for People with Involuntary Head Movements (Student Abstract) new file mode 100644 index 0000000000..ef284edea5 --- /dev/null +++ b/data/2024/aaai/Gaze-Based Interaction Adaptation for People with Involuntary Head Movements (Student Abstract) @@ -0,0 +1 @@ +Gaze estimation is an important research area in computer vision and machine learning. Eye-tracking and gaze-based interactions have made assistive technology (AT) more accessible to people with physical limitations. However, a non-negligible proportion of existing AT users, including those having dyskinetic cerebral palsy (CP) or severe intellectual disabilities (ID), have difficulties in using eye trackers due to their involuntary body movements. In this paper, we propose an adaptation method pertaining to head movement prediction and fixation smoothing to stabilize our target users' gaze points on the screen and improve their user experience (UX) in gaze-based interaction. Our empirical experimentation shows that our method significantly shortens the users' selection time and increases their selection accuracy. \ No newline at end of file diff --git a/data/2024/aaai/General Commerce Intelligence: Glocally Federated NLP-Based Engine for Privacy-Preserving and Sustainable Personalized Services of Multi-Merchants b/data/2024/aaai/General Commerce Intelligence: Glocally Federated NLP-Based Engine for Privacy-Preserving and Sustainable Personalized Services of Multi-Merchants new file mode 100644 index 0000000000..13391abfd9 --- /dev/null +++ b/data/2024/aaai/General Commerce Intelligence: Glocally Federated NLP-Based Engine for Privacy-Preserving and Sustainable Personalized Services of Multi-Merchants @@ -0,0 +1 @@ +One of the most crucial capabilities in the commercial sector is a personalized prediction of a customer's next purchase. We present a novel method of creating a commerce intelligence engine that caters to multiple merchants intended for the UB Platform, managed by e-payment company Harex InfoTech. To cultivate this intelligence, we utilized payment receipt data and created a Natural Language Processing (NLP)-based commerce model using a Transformer to accommodate multinational and merchant trade. Our model, called General Commerce Intelligence (GCI), provides a range of services for merchants, including product recommendations, product brainstorming, product bundling, event promotions, collaborative marketing, target marketing, and demand fore-casting etc. To bolster user privacy and foster sustainable business collaboration, especially among micro-, small-, and medium-sized enterprises (MSMEs), the GCI model was trained through federated learning, especially with glocalization. 
This study delves into the structure, development, and assessment of GCI, showcasing its transformative capacity to implement User Centric AI and re-shape the global commerce landscape to benefit MSMEs. \ No newline at end of file diff --git a/data/2024/aaai/Generalisation through Negation and Predicate Invention b/data/2024/aaai/Generalisation through Negation and Predicate Invention new file mode 100644 index 0000000000..5cb93d8921 --- /dev/null +++ b/data/2024/aaai/Generalisation through Negation and Predicate Invention @@ -0,0 +1,2 @@ +The ability to generalise from a small number of examples is a fundamental challenge in machine learning. To tackle this challenge, we introduce an inductive logic programming (ILP) approach that combines negation and predicate invention. +Combining these two features allows an ILP system to generalise better by learning rules with universally quantified body-only variables. We implement our idea in NOPI, which can learn normal logic programs with predicate invention, including Datalog programs with stratified negation. Our experimental results on multiple domains show that our approach can improve predictive accuracies and learning times. \ No newline at end of file diff --git a/data/2024/aaai/Generalising Planning Environment Redesign b/data/2024/aaai/Generalising Planning Environment Redesign new file mode 100644 index 0000000000..73974e49de --- /dev/null +++ b/data/2024/aaai/Generalising Planning Environment Redesign @@ -0,0 +1 @@ +In Environment Design, one interested party seeks to affect another agent's decisions by applying changes to the environment. Most research on planning environment (re)design assumes the interested party's objective is to facilitate the recognition of goals and plans, and search over the space of environment modifications to find the minimal set of changes that simplify those tasks and optimise a particular metric. This search space is usually intractable, so existing approaches devise metric-dependent pruning techniques for performing search more efficiently. This results in approaches that are not able to generalise across different objectives and/or metrics. In this paper, we argue that the interested party could have objectives and metrics that are not necessarily related to recognising agents' goals or plans. Thus, to generalise the task of Planning Environment Redesign, we develop a general environment redesign approach that is metric-agnostic and leverages recent research on top-quality planning to efficiently redesign planning environments according to any interested party's objective and metric. Experiments over a set of environment redesign benchmarks show that our general approach outperforms existing approaches when using well-known metrics, such as facilitating the recognition of goals, as well as its effectiveness when solving environment redesign tasks that optimise a novel set of different metrics. \ No newline at end of file diff --git a/data/2024/aaai/Generalizable Fourier Augmentation for Unsupervised Video Object Segmentation b/data/2024/aaai/Generalizable Fourier Augmentation for Unsupervised Video Object Segmentation new file mode 100644 index 0000000000..7c96aa8383 --- /dev/null +++ b/data/2024/aaai/Generalizable Fourier Augmentation for Unsupervised Video Object Segmentation @@ -0,0 +1,4 @@ +The performance of existing unsupervised video object segmentation methods typically suffers from severe performance degradation on test videos when tested in out-of-distribution scenarios. 
The primary reason is that the test data in real- +world may not follow the independent and identically distribution (i.i.d.) assumption, leading to domain shift. In this paper, we propose a generalizable fourier augmentation method during training to improve the generalization ability of the model. To achieve this, we perform Fast Fourier Transform (FFT) over the intermediate spatial domain features in each layer to yield corresponding frequency representations, including amplitude components (encoding scene-aware styles such as texture, color, contrast of the scene) and phase components (encoding rich semantics). We produce a variety of style features via Gaussian sampling to augment the training data, thereby improving the generalization capability of the model. To further improve the cross-domain generalization +performance of the model, we design a phase feature update strategy via exponential moving average using phase features from past frames in an online update manner, which could help the model to learn cross-domain-invariant features. Extensive experiments show that our proposed method achieves +the state-of-the-art performance on popular benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Generalizable Policy Improvement via Reinforcement Sampling (Student Abstract) b/data/2024/aaai/Generalizable Policy Improvement via Reinforcement Sampling (Student Abstract) new file mode 100644 index 0000000000..748df5ec47 --- /dev/null +++ b/data/2024/aaai/Generalizable Policy Improvement via Reinforcement Sampling (Student Abstract) @@ -0,0 +1 @@ +Current policy gradient techniques excel in refining policies over sampled states but falter when generalizing to unseen states. To address this, we introduce Reinforcement Sampling (RS), a novel method leveraging a generalizable action value function to sample improved decisions. RS is able to improve the decision quality whenever the action value estimation is accurate. It works by improving the agent's decision on the fly on the states the agent is visiting. Compared with the historically experienced states in which conventional policy gradient methods improve the policy, the currently visited states are more relevant to the agent. Our method sufficiently exploits the generalizability of the value function on unseen states and sheds new light on the future development of generalizable reinforcement learning. \ No newline at end of file diff --git a/data/2024/aaai/Generalization Analysis of Machine Learning Algorithms via the Worst-Case Data-Generating Probability Measure b/data/2024/aaai/Generalization Analysis of Machine Learning Algorithms via the Worst-Case Data-Generating Probability Measure new file mode 100644 index 0000000000..536cb4f7fc --- /dev/null +++ b/data/2024/aaai/Generalization Analysis of Machine Learning Algorithms via the Worst-Case Data-Generating Probability Measure @@ -0,0 +1 @@ +In this paper, the worst-case probability measure over the data is introduced as a tool for characterizing the generalization capabilities of machine learning algorithms. More specifically, the worst-case probability measure is a Gibbs probability measure and the unique solution to the maximization of the expected loss under a relative entropy constraint with respect to a reference probability measure. 
Fundamental generalization metrics, such as the sensitivity of the expected loss, the sensitivity of the empirical risk, and the generalization gap are shown to have closed-form expressions involving the worst-case data-generating probability measure. Existing results for the Gibbs algorithm, such as characterizing the generalization gap as a sum of mutual information and lautum information, up to a constant factor, are recovered. A novel parallel is established between the worst-case data-generating probability measure and the Gibbs algorithm. Specifically, the Gibbs probability measure is identified as a fundamental commonality of the model space and the data space for machine learning algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Generalize for Future: Slow and Fast Trajectory Learning for CTR Prediction b/data/2024/aaai/Generalize for Future: Slow and Fast Trajectory Learning for CTR Prediction new file mode 100644 index 0000000000..063c38ed39 --- /dev/null +++ b/data/2024/aaai/Generalize for Future: Slow and Fast Trajectory Learning for CTR Prediction @@ -0,0 +1 @@ +Deep neural networks (DNNs) have achieved significant advancements in click-through rate (CTR) prediction by demonstrating strong generalization on training data. However, in real-world scenarios, the assumption of independent and identically distributed (i.i.d.) conditions, which is fundamental to this problem, is often violated due to temporal distribution shifts. This violation can lead to suboptimal model performance when optimizing empirical risk without access to future data, resulting in overfitting on the training data and convergence to a single sharp minimum. To address this challenge, we propose a novel model updating framework called Slow and Fast Trajectory Learning (SFTL) network. SFTL aims to mitigate the discrepancy between past and future domains while quickly adapting to recent changes in small temporal drifts. This mechanism entails two interactions among three complementary learners: (i) the Working Learner, which updates model parameters using modern optimizers (e.g., Adam, Adagrad) and serves as the primary learner in the recommendation system, (ii) the Slow Learner, which is updated in each temporal domain by directly assigning the model weights of the working learner, and (iii) the Fast Learner, which is updated in each iteration by assigning exponentially moving average weights of the working learner. Additionally, we propose a novel rank-based trajectory loss to facilitate interaction between the working learner and trajectory learner, aiming to adapt to temporal drift and enhance performance in the current domain compared to the past. We provide theoretical understanding and conduct extensive experiments on real-world CTR prediction datasets to validate the effectiveness and efficiency of SFTL in terms of both convergence speed and model performance. The results demonstrate the superiority of SFTL over existing approaches. \ No newline at end of file diff --git a/data/2024/aaai/Generalized Bradley-Terry Models for Score Estimation from Paired Comparisons b/data/2024/aaai/Generalized Bradley-Terry Models for Score Estimation from Paired Comparisons new file mode 100644 index 0000000000..fb42a143f8 --- /dev/null +++ b/data/2024/aaai/Generalized Bradley-Terry Models for Score Estimation from Paired Comparisons @@ -0,0 +1 @@ +Many applications, e.g. in content recommendation, sports, or recruitment, leverage the comparisons of alternatives to score those alternatives. 
The classical Bradley-Terry model and its variants have been widely used to do so. The historical model considers binary comparisons (victory/defeat) between alternatives, while more recent developments allow finer comparisons to be taken into account. In this article, we introduce a probabilistic model encompassing a broad variety of paired comparisons that can take discrete or continuous values. We do so by considering a well-behaved subset of the exponential family, which we call the family of generalized Bradley-Terry (GBT) models, as it includes the classical Bradley-Terry model and many of its variants. Remarkably, we prove that all GBT models are guaranteed to yield a strictly convex negative log-likelihood. Moreover, assuming a Gaussian prior on alternatives' scores, we prove that the maximum a posteriori (MAP) of GBT models, whose existence, uniqueness and fast computation are thus guaranteed, varies monotonically with respect to comparisons (the more A beats B, the better the score of A) and is Lipschitz-resilient with respect to each new comparison (a single new comparison can only have a bounded effect on all the estimated scores). These desirable properties make GBT models appealing for practical use. We illustrate some features of GBT models on simulations. \ No newline at end of file diff --git a/data/2024/aaai/Generalized Planning for the Abstraction and Reasoning Corpus b/data/2024/aaai/Generalized Planning for the Abstraction and Reasoning Corpus new file mode 100644 index 0000000000..966179c1aa --- /dev/null +++ b/data/2024/aaai/Generalized Planning for the Abstraction and Reasoning Corpus @@ -0,0 +1 @@ +The Abstraction and Reasoning Corpus (ARC) is a general artificial intelligence benchmark that poses difficulties for pure machine learning methods due to its requirement for fluid intelligence with a focus on reasoning and abstraction. In this work, we introduce an ARC solver, Generalized Planning for Abstract Reasoning (GPAR). It casts an ARC problem as a generalized planning (GP) problem, where a solution is formalized as a planning program with pointers. We express each ARC problem using the standard Planning Domain Definition Language (PDDL) coupled with external functions representing object-centric abstractions. We show how to scale up GP solvers via domain knowledge specific to ARC in the form of restrictions over the actions model, predicates, arguments and valid structure of planning programs. Our experiments demonstrate that GPAR outperforms the state-of-the-art solvers on the object-centric tasks of the ARC, showing the effectiveness of GP and the expressiveness of PDDL to model ARC problems. The challenges provided by the ARC benchmark motivate research to advance existing GP solvers and understand new relations with other planning computational models. Code is available at github.com/you68681/GPAR. \ No newline at end of file diff --git a/data/2024/aaai/Generalized Planning in PDDL Domains with Pretrained Large Language Models b/data/2024/aaai/Generalized Planning in PDDL Domains with Pretrained Large Language Models new file mode 100644 index 0000000000..c181202be9 --- /dev/null +++ b/data/2024/aaai/Generalized Planning in PDDL Domains with Pretrained Large Language Models @@ -0,0 +1 @@ +Recent work has considered whether large language models (LLMs) can function as planners: given a task, generate a plan. 
We investigate whether LLMs can serve as generalized planners: given a domain and training tasks, generate a program that efficiently produces plans for other tasks in the domain. In particular, we consider PDDL domains and use GPT-4 to synthesize Python programs. We also consider (1) Chain-of-Thought (CoT) summarization, where the LLM is prompted to summarize the domain and propose a strategy in words before synthesizing the program; and (2) automated debugging, where the program is validated with respect to the training tasks, and in case of errors, the LLM is re-prompted with four types of feedback. We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines. Overall, we find that GPT-4 is a surprisingly powerful generalized planner. We also conclude that automated debugging is very important, that CoT summarization has non-uniform impact, that GPT-4 is far superior to GPT-3.5, and that just two training tasks are often sufficient for strong generalization. \ No newline at end of file diff --git a/data/2024/aaai/Generalized Variational Inference via Optimal Transport b/data/2024/aaai/Generalized Variational Inference via Optimal Transport new file mode 100644 index 0000000000..327986eaa0 --- /dev/null +++ b/data/2024/aaai/Generalized Variational Inference via Optimal Transport @@ -0,0 +1 @@ +Variational Inference (VI) has gained popularity as a flexible approximate inference scheme for computing posterior distributions in Bayesian models. Original VI methods use Kullback-Leibler (KL) divergence to construct variational objectives. However, KL divergence has zero-forcing behavior and is completely agnostic to the metric of the underlying data distribution, resulting in poor approximations. To alleviate this issue, we propose a new variational objective by using Optimal Transport (OT) distance, which is a metric-aware divergence, to measure the difference between approximate posteriors and priors. The superior performance of the OT distance enables us to learn more accurate approximations. We further enhance the objective by gradually including the OT term using a hyperparameter λ for over-parameterized models. We develop a Variational inference method with OT (VOT), which presents a gradient-based black-box framework for solving Bayesian models, even when the density function of the approximate distribution is not available. We provide a consistency analysis of the approximate posteriors and demonstrate the practical effectiveness on Bayesian neural networks and variational autoencoders. \ No newline at end of file diff --git a/data/2024/aaai/Generalizing across Temporal Domains with Koopman Operators b/data/2024/aaai/Generalizing across Temporal Domains with Koopman Operators new file mode 100644 index 0000000000..8b9fbfce77 --- /dev/null +++ b/data/2024/aaai/Generalizing across Temporal Domains with Koopman Operators @@ -0,0 +1 @@ +In the field of domain generalization, the task of constructing a predictive model capable of generalizing to a target domain without access to target data remains challenging. This problem becomes further complicated when considering evolving dynamics between domains. While various approaches have been proposed to address this issue, a comprehensive understanding of the underlying generalization theory is still lacking. In this study, we contribute novel theoretical results showing that aligning conditional distributions leads to a reduction of the generalization bound.
Our analysis serves as a key motivation for solving the Temporal Domain Generalization (TDG) problem through the application of Koopman Neural Operators, resulting in Temporal Koopman Networks (TKNets). By employing Koopman Neural Operators, we effectively address the time-evolving distributions encountered in TDG using the principles of Koopman theory, where measurement functions are sought to establish linear transition relations between evolving domains. Through empirical evaluations conducted on synthetic and real-world datasets, we validate the effectiveness of our proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Generating Diagnostic and Actionable Explanations for Fair Graph Neural Networks b/data/2024/aaai/Generating Diagnostic and Actionable Explanations for Fair Graph Neural Networks new file mode 100644 index 0000000000..1d4386352d --- /dev/null +++ b/data/2024/aaai/Generating Diagnostic and Actionable Explanations for Fair Graph Neural Networks @@ -0,0 +1 @@ +A plethora of fair graph neural networks (GNNs) have been proposed to promote algorithmic fairness for high-stakes real-life contexts. Meanwhile, explainability is generally proposed to help machine learning practitioners debug models by providing human-understandable explanations. However, little explainability work has been devoted to generating explanations for fairness diagnosis in GNNs. From the explainability perspective, this paper explores two questions: what subgraph patterns cause the biased behavior of GNNs, and what actions can practitioners take to rectify the bias? By answering these two questions, this paper aims to produce compact, diagnostic, and actionable explanations that account for the discriminatory behavior. Specifically, we formulate the problem of generating diagnostic and actionable explanations as a multi-objective combinatorial optimization problem. To solve the problem, a dedicated multi-objective evolutionary algorithm is presented to ensure GNNs' explainability and fairness in one go. In particular, an influenced-nodes-based gradient approximation is developed to boost the computational efficiency of the evolutionary algorithm. We provide a theoretical analysis to illustrate the effectiveness of the proposed framework. Extensive experiments have been conducted to demonstrate the superiority of the proposed method in terms of classification performance, fairness, and interpretability.
We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space, using a small reference set of images, a technique that we call SeedSelect. SeedSelect does not require retraining or finetuning the diffusion model. We assess the faithfulness, quality and diversity of SeedSelect in creating rare objects and generating complex formations like hand images, and find it consistently achieves superior performance. We further show the advantage of SeedSelect in semantic data augmentation. Generating semantically appropriate images can successfully improve performance in few-shot recognition benchmarks, for classes from the head and from the tail of the training data of diffusion models. \ No newline at end of file diff --git a/data/2024/aaai/Generating Universal Adversarial Perturbations for Quantum Classifiers b/data/2024/aaai/Generating Universal Adversarial Perturbations for Quantum Classifiers new file mode 100644 index 0000000000..16fa665f3d --- /dev/null +++ b/data/2024/aaai/Generating Universal Adversarial Perturbations for Quantum Classifiers @@ -0,0 +1 @@ +Quantum Machine Learning (QML) has emerged as a promising field of research, aiming to leverage the capabilities of quantum computing to enhance existing machine learning methodologies. Recent studies have revealed that, like their classical counterparts, QML models based on Parametrized Quantum Circuits (PQCs) are also vulnerable to adversarial attacks. Moreover, the existence of Universal Adversarial Perturbations (UAPs) in the quantum domain has been demonstrated theoretically in the context of quantum classifiers. In this work, we introduce QuGAP: a novel framework for generating UAPs for quantum classifiers. We conceptualize the notion of additive UAPs for PQC-based classifiers and theoretically demonstrate their existence. We then utilize generative models (QuGAP-A) to craft additive UAPs and experimentally show that quantum classifiers are susceptible to such attacks. Moreover, we formulate a new method for generating unitary UAPs (QuGAP-U) using quantum generative models and a novel loss function based on fidelity constraints. We evaluate the performance of the proposed framework and show that our method achieves state-of-the-art misclassification rates, while maintaining high fidelity between legitimate and adversarial samples. \ No newline at end of file diff --git a/data/2024/aaai/Generation of Visual Representations for Multi-Modal Mathematical Knowledge b/data/2024/aaai/Generation of Visual Representations for Multi-Modal Mathematical Knowledge new file mode 100644 index 0000000000..d24523fd07 --- /dev/null +++ b/data/2024/aaai/Generation of Visual Representations for Multi-Modal Mathematical Knowledge @@ -0,0 +1 @@ +In this paper we introduce MaRE, a tool designed to generate representations in multiple modalities for a given mathematical problem while ensuring the correctness and interpretability of the transformations between different representations. The theoretical foundation for this tool is Representational Systems Theory (RST), a mathematical framework for studying the structure and transformations of representations. In MaRE’s web front-end user interface, a set of probability equations in Bayesian Notation can be rigorously transformed into Area Diagrams, Contingency Tables, and Probability Trees with just one click, utilising a back-end engine based on RST. 
A table of the cognitive costs that a representation places on a particular user profile, based on the cognitive Representational Interpretive Structure Theory (RIST), is produced at the same time. MaRE is general and domain independent, applicable to other representations encoded in RST. It may enhance mathematical education and research, facilitating multi-modal knowledge representation and discovery. \ No newline at end of file diff --git a/data/2024/aaai/Generative Calibration of Inaccurate Annotation for Label Distribution Learning b/data/2024/aaai/Generative Calibration of Inaccurate Annotation for Label Distribution Learning new file mode 100644 index 0000000000..74b8d56319 --- /dev/null +++ b/data/2024/aaai/Generative Calibration of Inaccurate Annotation for Label Distribution Learning @@ -0,0 +1 @@ +Label distribution learning (LDL) is an effective learning paradigm for handling label ambiguity. Applying LDL typically requires datasets annotated with label distributions. However, obtaining supervised data for LDL is a challenging task. Due to the randomness of label annotation, the annotator can produce inaccurate annotation results for an instance, affecting the accuracy and generalization ability of the LDL model. To address this problem, we propose a generative approach to calibrate the inaccurate annotation for LDL using variational inference techniques. Specifically, we assume that instances with similar features share similar latent label distributions. The feature vectors and label distributions are generated by a Gaussian mixture and a Dirichlet mixture, respectively. The relationship between them is established through a shared categorical variable, which effectively utilizes the label distributions of instances with similar features and achieves a more accurate label distribution through the generative approach. Furthermore, we use a confusion matrix to model the factors that contribute to the inaccuracy during the annotation process, which captures the relationship between label distributions and inaccurate label distributions. Finally, the label distribution is used to calibrate the available information in the noisy dataset to obtain the ground-truth label distribution. \ No newline at end of file diff --git a/data/2024/aaai/Generative Model Perception Rectification Algorithm for Trade-Off between Diversity and Quality b/data/2024/aaai/Generative Model Perception Rectification Algorithm for Trade-Off between Diversity and Quality new file mode 100644 index 0000000000..98dd3e6d2f --- /dev/null +++ b/data/2024/aaai/Generative Model Perception Rectification Algorithm for Trade-Off between Diversity and Quality @@ -0,0 +1 @@ +How to balance the diversity and quality of results from generative models through perception rectification poses a significant challenge. Abnormal perception in generative models is typically caused by two factors: inadequate model structure and imbalanced data distribution. In response to this issue, we propose the dynamic model perception rectification algorithm (DMPRA) for generalized generative models. The core idea is to gain a comprehensive perception of the data in the generative model by appropriately highlighting the low-density samples in the perception space, also known as the minor group samples. The entire process can be summarized as "search-evaluation-adjustment".
To identify low-density regions in the data manifold within the perception space of generative models, we introduce a filtering method based on extended neighborhood sampling. Based on the informational value of samples from low-density regions, our proposed mechanism generates informative weights to assess the significance of these samples in correcting the models' perception. By using dynamic adjustment, DMPRA ensures simultaneous enhancement of diversity and quality in the presence of imbalanced data distribution. Experimental results indicate that the algorithm has effectively improved Generative Adversarial Nets (GANs), Normalizing Flows (Flows), Variational Auto-Encoders (VAEs), and Diffusion Models (Diffusion). \ No newline at end of file diff --git a/data/2024/aaai/Generative Model for Decision Trees b/data/2024/aaai/Generative Model for Decision Trees new file mode 100644 index 0000000000..6cbc176abe --- /dev/null +++ b/data/2024/aaai/Generative Model for Decision Trees @@ -0,0 +1 @@ +Decision trees are among the most popular supervised models due to their interpretability and knowledge representation resembling human reasoning. Commonly-used decision tree induction algorithms are based on greedy top-down strategies. Although these approaches are known to be an efficient heuristic, the resulting trees are only locally optimal and tend to have overly complex structures. On the other hand, optimal decision tree algorithms attempt to create an entire decision tree at once to achieve global optimality. We place our proposal between these approaches by designing a generative model for decision trees. Our method first learns a latent decision tree space through a variational architecture using pre-trained decision tree models. Then, it adopts a genetic procedure to explore such latent space to find a compact decision tree with good predictive performance. We compare our proposal against classical tree induction methods, optimal approaches, and ensemble models. The results show that our proposal can generate accurate and shallow, i.e., interpretable, decision trees. \ No newline at end of file diff --git a/data/2024/aaai/Generative Model-Based Feature Knowledge Distillation for Action Recognition b/data/2024/aaai/Generative Model-Based Feature Knowledge Distillation for Action Recognition new file mode 100644 index 0000000000..478f4ac688 --- /dev/null +++ b/data/2024/aaai/Generative Model-Based Feature Knowledge Distillation for Action Recognition @@ -0,0 +1 @@ +Knowledge distillation (KD), a technique widely employed in computer vision, has emerged as a de facto standard for improving the performance of small neural networks. However, prevailing KD-based approaches in video tasks primarily focus on designing loss functions and fusing cross-modal information. This overlooks the spatial-temporal feature semantics, resulting in limited advancements in model compression. Addressing this gap, our paper introduces an innovative knowledge distillation framework, with the generative model for training a lightweight student model. In particular, the framework is organized into two steps: the initial phase is Feature Representation, wherein a generative model-based attention module is trained to represent feature semantics; Subsequently, the Generative-based Feature Distillation phase encompasses both Generative Distillation and Attention Distillation, with the objective of transferring attention-based feature semantics with the generative model. 
The efficacy of our approach is demonstrated through comprehensive experiments on diverse popular datasets, which show considerable improvements on the video action recognition task. Moreover, the effectiveness of our proposed framework is validated on the more intricate video action detection task. Our code is available at https://github.com/aaai-24/Generative-based-KD. \ No newline at end of file diff --git a/data/2024/aaai/Generative Multi-Modal Knowledge Retrieval with Large Language Models b/data/2024/aaai/Generative Multi-Modal Knowledge Retrieval with Large Language Models new file mode 100644 index 0000000000..57a991fbb3 --- /dev/null +++ b/data/2024/aaai/Generative Multi-Modal Knowledge Retrieval with Large Language Models @@ -0,0 +1 @@ +Knowledge retrieval with multi-modal queries plays a crucial role in supporting knowledge-intensive multi-modal applications. However, existing methods face challenges in terms of their effectiveness and training efficiency, especially when it comes to training and integrating multiple retrievers to handle multi-modal queries. In this paper, we propose an innovative end-to-end generative framework for multi-modal knowledge retrieval. Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases, even when trained with limited data. We retrieve knowledge via a two-step process: 1) generating knowledge clues related to the queries, and 2) obtaining the relevant documents by searching databases using the knowledge clues. In particular, we first introduce an object-aware prefix-tuning technique to guide multi-grained visual learning. Then, we align multi-grained visual features into the textual feature space of the LLM, employing the LLM to capture cross-modal interactions. Subsequently, we construct instruction data with a unified format for model training. Finally, we propose a knowledge-guided generation strategy to impose prior constraints in the decoding steps, thereby promoting the generation of distinctive knowledge clues. Through experiments conducted on three benchmarks, we demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
Based on this, we conduct extensive experiments across two multi-modal tracking tasks, three baseline methods, and four challenging benchmarks. The experimental results demonstrate that the proposed generative-based fusion mechanism achieves state-of-the-art performance by setting new records on GTOT, LasHeR and RGBD1K. Code will be available at https://github.com/Zhangyong-Tang/GMMT. \ No newline at end of file diff --git a/data/2024/aaai/Generator Assisted Mixture of Experts for Feature Acquisition in Batch b/data/2024/aaai/Generator Assisted Mixture of Experts for Feature Acquisition in Batch new file mode 100644 index 0000000000..597b7c0303 --- /dev/null +++ b/data/2024/aaai/Generator Assisted Mixture of Experts for Feature Acquisition in Batch @@ -0,0 +1,2 @@ +Given a set of observations, feature acquisition is about finding the subset of unobserved features which would enhance accuracy. Such problems have been explored in a sequential setting in prior work, where the model receives feedback from every new feature acquired and chooses either to explore more features or to predict. However, sequential acquisition is not feasible in some settings where time is of the essence. We consider the problem of feature acquisition in batch, where the subset of features to be queried is chosen based on the currently observed features, then acquired as a batch, followed by prediction. We solve this problem using several technical innovations. First, we use a feature generator to draw a subset of the synthetic features for some examples, which reduces the cost of oracle queries. Second, to make the feature acquisition problem tractable for the large set of heterogeneous observed features, we partition the data into buckets by borrowing tools from locality-sensitive hashing and then train a mixture of experts model. Third, we design a tractable lower bound of the original objective.
+We use a greedy algorithm combined with model training to solve the underlying problem. Experiments on four datasets show that our approach outperforms existing methods in terms of the trade-off between accuracy and feature acquisition cost.
In our research, we view the iterative updating of molecule conformations in the diffusion process as consistent with molecular dynamics, and we introduce a novel molecule generation method named Geometric-Facilitated Molecular Diffusion (GFMDiff). For the first challenge, we introduce a Dual-track Transformer Network (DTN) to fully excavate global spatial relationships and learn high-quality representations that contribute to accurate predictions of features and geometries. As for the second challenge, we design a Geometric-facilitated Loss (GFLoss) which intervenes in the formation of bonds during training, instead of directly embedding edges into the latent space. Comprehensive experiments on current benchmarks demonstrate the superiority of GFMDiff. \ No newline at end of file diff --git a/data/2024/aaai/Geometry-Guided Domain Generalization for Monocular 3D Object Detection b/data/2024/aaai/Geometry-Guided Domain Generalization for Monocular 3D Object Detection new file mode 100644 index 0000000000..58ae33a1c2 --- /dev/null +++ b/data/2024/aaai/Geometry-Guided Domain Generalization for Monocular 3D Object Detection @@ -0,0 +1 @@ +Monocular 3D object detection (M3OD) is important for autonomous driving. However, existing deep learning-based methods easily suffer from performance degradation in real-world scenarios due to the substantial domain gap between training and testing. M3OD's domain gaps are complex, including camera intrinsic parameters, extrinsic parameters, image appearance, etc. Existing works primarily focus on the domain gaps of camera intrinsic parameters, ignoring other key factors. Moreover, at the feature level, conventional domain-invariant learning methods generally cause the negative transfer issue, as they ignore the dependency between geometry tasks and domains. To tackle these issues, in this paper, we propose MonoGDG, a geometry-guided domain generalization framework for M3OD, which effectively addresses the domain gap at both camera and feature levels. Specifically, MonoGDG consists of two major components. One is geometry-based image reprojection, which mitigates the impact of camera discrepancy by unifying intrinsic parameters, randomizing camera orientations, and unifying the field-of-view range. The other is geometry-dependent feature disentanglement, which overcomes the negative transfer problems by incorporating domain-shared and domain-specific features. Additionally, we leverage a depth-disentangled domain discriminator and a domain-aware geometry regression attention mechanism to account for the geometry-domain dependency. Extensive experiments on multiple autonomous driving benchmarks demonstrate that our method achieves state-of-the-art performance in domain generalization for M3OD.
Intelligent tutoring has attracted tremendous attention and is a particularly challenging setting for applying OPE to human-involved systems, because student subgroups can favor different pedagogical policies and because of the costly procedure in which policies have to be induced fully offline and then deployed directly in the upcoming semester. In this work, we formulate on-demand pedagogical policy selection (ODPS) to tackle the challenges for OPE in intelligent tutoring. We propose a pipeline, EduPlanner, as a concrete solution for ODPS. Our pipeline results in a theoretically unbiased estimator, and enables efficient and customized policy selection by identifying subgroups over both historical data and on-arrival initial logs. We evaluate our approach on the Probability ITS that has been used in real classrooms for over eight years. Our study shows significant improvements in students' learning outcomes with EduPlanner, especially for those in low-performing subgroups. \ No newline at end of file diff --git a/data/2024/aaai/GigaHumanDet: Exploring Full-Body Detection on Gigapixel-Level Images b/data/2024/aaai/GigaHumanDet: Exploring Full-Body Detection on Gigapixel-Level Images new file mode 100644 index 0000000000..6399908679 --- /dev/null +++ b/data/2024/aaai/GigaHumanDet: Exploring Full-Body Detection on Gigapixel-Level Images @@ -0,0 +1 @@ +Performing person detection in super-high-resolution images has been a challenging task. For such a task, modern detectors, which usually encode a box using center and width/height, struggle with accuracy due to two factors: 1) Human characteristic: people come in various postures, and the center, having a high degree of freedom, makes it difficult to capture robust visual patterns; 2) Image characteristic: due to the vast scale diversity of the input (gigapixel-level), distance regression (for width and height) is hard to pinpoint, especially for a person of substantial scale who is near the camera. To address these challenges, we propose GigaHumanDet, an innovative solution aimed at further enhancing detection accuracy for gigapixel-level images. GigaHumanDet employs the corner modeling method to avoid the potential issues of a high degree of freedom in center pinpointing. To better distinguish similar-looking persons and enforce instance consistency of corner pairs, an instance-guided learning approach is designed to capture discriminative individual semantics. Further, we devise reliable shape-aware bodyness equipped with a multi-precision strategy as the human corner matching guidance to be appropriately adapted to the single-view large scene. Experimental results on the PANDA and STCrowd datasets show the superiority and strong applicability of our design. Notably, our model achieves 82.4% in terms of AP, outperforming current state-of-the-art methods by more than 10%.
To address this issue, we propose a novel model using Graphs for Forecasting Irregularly Sampled Time Series with missing values, which we call GraFITi. GraFITi first converts the time series to a Sparsity Structure Graph, which is a sparse bipartite graph, and then reformulates the forecasting problem as an edge weight prediction task in the graph. It uses the power of Graph Neural Networks to learn the graph and predict the target edge weights. GraFITi has been tested on three real-world and one synthetic irregularly sampled time series datasets with missing values and compared with various state-of-the-art models. The experimental results demonstrate that GraFITi improves the forecasting accuracy by up to 17% and reduces the run time by up to a factor of 5 compared to the state-of-the-art forecasting models. \ No newline at end of file diff --git a/data/2024/aaai/Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation b/data/2024/aaai/Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation new file mode 100644 index 0000000000..3e25be11d6 --- /dev/null +++ b/data/2024/aaai/Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation @@ -0,0 +1 @@ +Recently, the Table Structure Recognition (TSR) task, which aims to identify table structure in machine-readable formats, has received increasing interest in the community. Despite impressive success, most methods based on a single table component cannot perform well on irregular table cases affected by not only complicated inner structure but also exterior capture distortion. In this paper, we frame this as the Complex TSR problem, where the performance degeneration of existing methods is attributable to their inefficient component usage and redundant post-processing. To mitigate it, we shift our perspective from table component extraction towards the efficient leverage of multiple components, which awaits further exploration in the field. Specifically, we propose a seminal method, termed GrabTab, equipped with a newly proposed Component Deliberator, to handle various types of tables in a unified framework. Thanks to its progressive deliberation mechanism, our GrabTab can flexibly accommodate most complex tables, with reasonable components selected but without complicated post-processing involved. Quantitative experimental results on public benchmarks demonstrate that our method significantly outperforms the state-of-the-art, especially under more challenging scenes.
The proposed method uses backpropagation with a straight-through operator on a dense DT representation, to jointly optimize all tree parameters. Our approach outperforms existing methods on binary classification benchmarks and achieves competitive results for multi-class tasks. The implementation is available under: https://github.com/s-marton/GradTree \ No newline at end of file diff --git a/data/2024/aaai/Gradient-Guided Modality Decoupling for Missing-Modality Robustness b/data/2024/aaai/Gradient-Guided Modality Decoupling for Missing-Modality Robustness new file mode 100644 index 0000000000..5d9c39be9c --- /dev/null +++ b/data/2024/aaai/Gradient-Guided Modality Decoupling for Missing-Modality Robustness @@ -0,0 +1 @@ +Multimodal learning with incomplete input data (missing modality) is very practical and challenging. In this work, we conduct an in-depth analysis of this challenge and find that modality dominance has a significant negative impact on the model training, greatly degrading the missing modality performance. Motivated by Grad-CAM, we introduce a novel indicator, gradients, to monitor and reduce modality dominance which widely exists in the missing-modality scenario. In aid of this indicator, we present a novel Gradient-guided Modality Decoupling (GMD) method to decouple the dependency on dominating modalities. Specifically, GMD removes the conflicted gradient components from different modalities to achieve this decoupling, significantly improving the performance. In addition, to flexibly handle modal-incomplete data, we design a parameter-efficient Dynamic Sharing (DS) framework which can adaptively switch on/off the network parameters based on whether one modality is available. We conduct extensive experiments on three popular multimodal benchmarks, including BraTS 2018 for medical segmentation, CMU-MOSI, and CMU-MOSEI for sentiment analysis. The results show that our method can significantly outperform the competitors, showing the effectiveness of the proposed solutions. Our code is released here: https://github.com/HaoWang420/Gradient-guided-Modality-Decoupling. \ No newline at end of file diff --git a/data/2024/aaai/Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing b/data/2024/aaai/Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing new file mode 100644 index 0000000000..fa74e25aa0 --- /dev/null +++ b/data/2024/aaai/Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing @@ -0,0 +1 @@ +GAN-based image attribute editing firstly leverages GAN Inversion to project real images into the latent space of GAN and then manipulates corresponding latent codes. Recent inversion methods mainly utilize additional high-bit features to improve image details preservation, as low-bit codes cannot faithfully reconstruct source images, leading to the loss of details. However, during editing, existing works fail to accurately complement the lost details and suffer from poor editability. The main reason is they inject all the lost details indiscriminately at one time, which inherently induces the position and quantity of details to overfit source images, resulting in inconsistent content and artifacts in edited images. This work argues that details should be gradually injected into both the reconstruction and editing process in a multi-stage coarse-to-fine manner for better detail preservation and high editability. 
Therefore, a novel dual-stream framework is proposed to accurately complement details at each stage. The Reconstruction Stream is employed to embed coarse-to-fine lost details into residual features and then adaptively add them to the GAN generator. In the Editing Stream, residual features are accurately aligned by our Selective Attention mechanism and then injected into the editing process in a multi-stage manner. Extensive experiments have shown the superiority of our framework in both reconstruction accuracy and editing quality compared with existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Gramformer: Learning Crowd Counting via Graph-Modulated Transformer b/data/2024/aaai/Gramformer: Learning Crowd Counting via Graph-Modulated Transformer new file mode 100644 index 0000000000..eb4ddbddaf --- /dev/null +++ b/data/2024/aaai/Gramformer: Learning Crowd Counting via Graph-Modulated Transformer @@ -0,0 +1 @@ +The Transformer has been popular in recent crowd counting work since it breaks the limited receptive field of traditional CNNs. However, since crowd images always contain a large number of similar patches, the self-attention mechanism in the Transformer tends to find a homogenized solution where the attention maps of almost all patches are identical. In this paper, we address this problem by proposing Gramformer: a graph-modulated transformer that enhances the network by adjusting the attention and input node features, respectively, on the basis of two different types of graphs. Firstly, an attention graph is proposed to diversify attention maps so that they attend to complementary information. The graph is built upon the dissimilarities between patches, modulating the attention in an anti-similarity fashion. Secondly, a feature-based centrality encoding is proposed to discover the centrality positions or importance of nodes. We encode them with a proposed centrality indices scheme to modulate the node features and similarity relationships. Extensive experiments on four challenging crowd counting datasets have validated the competitiveness of the proposed method. Code is available at https://github.com/LoraLinH/Gramformer. \ No newline at end of file diff --git a/data/2024/aaai/Graph Anomaly Detection via Prototype-Aware Label Propagation (Student Abstract) b/data/2024/aaai/Graph Anomaly Detection via Prototype-Aware Label Propagation (Student Abstract) new file mode 100644 index 0000000000..05baed8d65 --- /dev/null +++ b/data/2024/aaai/Graph Anomaly Detection via Prototype-Aware Label Propagation (Student Abstract) @@ -0,0 +1 @@ +Detecting anomalies on attributed graphs is a challenging task since labelled anomalies are highly labour-intensive to obtain, requiring specialized domain knowledge, which makes anomalous samples far less available than normal ones. Moreover, graphs contain complex structure information as well as attribute information, so anomalies can be hidden in the structure space, the attribute space, or a mix of both. In this paper, we propose a novel model for graph anomaly detection named ProGAD. Specifically, ProGAD takes advantage of label propagation to infer high-quality pseudo labels by considering the structure and attribute inconsistencies between normal and abnormal samples. Meanwhile, ProGAD introduces the prior knowledge of the class distribution to correct and refine pseudo labels with a prototype-aware strategy. Experiments demonstrate that ProGAD achieves strong performance compared with the current state-of-the-art methods.
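To make the label-propagation step that the ProGAD abstract above builds on concrete, here is a minimal NumPy sketch of symmetric-normalized label propagation over an attributed graph's adjacency matrix. The prototype-aware correction is abstracted away, and the function name, the damping factor alpha, and the iteration count are illustrative assumptions rather than the paper's implementation.

import numpy as np

def propagate_labels(A, Y, alpha=0.85, iters=50):
    # A: (n, n) symmetric adjacency matrix; Y: (n, c) one-hot seed labels (zero rows for unlabeled nodes).
    d = A.sum(axis=1)
    d[d == 0] = 1.0                               # guard isolated nodes
    S = A / np.sqrt(np.outer(d, d))               # D^{-1/2} A D^{-1/2}
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1.0 - alpha) * Y   # spread labels, stay anchored to the seeds
    row_sums = F.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return F / row_sums                           # row-normalized pseudo-label scores

A prototype-aware variant could, for example, re-anchor Y between sweeps using class prototypes; that refinement is exactly the part left unspecified in this sketch.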
\ No newline at end of file diff --git a/data/2024/aaai/Graph Anomaly Detection with Diffusion Model-Based Graph Enhancement (Student Abstract) b/data/2024/aaai/Graph Anomaly Detection with Diffusion Model-Based Graph Enhancement (Student Abstract) new file mode 100644 index 0000000000..64cced81b4 --- /dev/null +++ b/data/2024/aaai/Graph Anomaly Detection with Diffusion Model-Based Graph Enhancement (Student Abstract) @@ -0,0 +1 @@ +Graph anomaly detection has gained significant research interest across various domains. Due to the lack of labeled data, contrastive learning has been applied in detecting anomalies and various scales of contrastive strategies have been initiated. However, these methods might force two instances (e.g., node-level and subgraph-level representations) with different category labels to be consistent during model training, which can adversely impact the model robustness. To tackle this problem, we present a novel contrastive learning framework with the Diffusion model-based graph Enhancement module for Graph Anomaly Detection, DEGAD. In this framework, we design a diffusion model-based graph enhancement module to manipulate neighbors to generate enhanced graphs, which can efficiently alleviate the inconsistent problem. Further, based on the enhanced graphs, we present a multi-scale contrastive module to discriminate anomalies. Experimental results demonstrate the superiority of our model. \ No newline at end of file diff --git a/data/2024/aaai/Graph Bayesian Optimization for Multiplex Influence Maximization b/data/2024/aaai/Graph Bayesian Optimization for Multiplex Influence Maximization new file mode 100644 index 0000000000..b3275f7ec2 --- /dev/null +++ b/data/2024/aaai/Graph Bayesian Optimization for Multiplex Influence Maximization @@ -0,0 +1,4 @@ +Influence maximization (IM) is the problem of identifying a limited number of initial influential users within a social network to maximize the number of influenced users. However, previous research has mostly focused on individual information propagation, neglecting the simultaneous and interactive dissemination of multiple information items. In reality, when users encounter a piece of information, such as a smartphone product, they often associate it with related products in their minds, such as earphones or computers from the same brand. Additionally, information platforms frequently recommend related content to users, amplifying this cascading effect and leading to multiplex influence diffusion. + +This paper first formulates the Multiplex Influence Maximization (Multi-IM) problem using multiplex diffusion models with an information association mechanism. In this problem, the seed set is a combination of influential users and information. To effectively manage the combinatorial complexity, we propose Graph Bayesian Optimization for Multi-IM (GBIM). The multiplex diffusion process is thoroughly investigated using a highly effective global kernelized attention message-passing module. This module, in conjunction with Bayesian linear regression (BLR), produces a scalable surrogate model. A data acquisition module incorporating the exploration-exploitation trade-off is developed to optimize the seed set further. +Extensive experiments on synthetic and real-world datasets have proven our proposed framework effective. The code is available at https://github.com/zirui-yuan/GBIM. 
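As a rough illustration of the Bayesian-linear-regression surrogate and exploration-exploitation acquisition mentioned in the GBIM abstract above, the following is a minimal sketch. The kernelized attention message-passing features are abstracted into a given feature matrix, and the function names, priors, and UCB-style acquisition are assumptions made for illustration, not GBIM's actual design.

import numpy as np

def blr_posterior(Phi, y, alpha=1.0, beta=25.0):
    # Phi: (n, d) features of already-evaluated seed sets; y: (n,) observed influence spreads.
    d = Phi.shape[1]
    S_inv = alpha * np.eye(d) + beta * (Phi.T @ Phi)
    S = np.linalg.inv(S_inv)            # posterior covariance of the weights
    m = beta * (S @ (Phi.T @ y))        # posterior mean of the weights
    return m, S

def ucb_acquisition(phi_c, m, S, beta=25.0, kappa=2.0):
    # Score one candidate seed set by predictive mean plus an exploration bonus.
    mu = phi_c @ m
    var = 1.0 / beta + phi_c @ S @ phi_c
    return mu + kappa * np.sqrt(var)

The candidate with the highest acquisition score would be queried next and its observed spread appended to (Phi, y), closing the Bayesian optimization loop.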
\ No newline at end of file diff --git a/data/2024/aaai/Graph Clustering Methods Derived from Column Subset Selection (Student Abstract) b/data/2024/aaai/Graph Clustering Methods Derived from Column Subset Selection (Student Abstract) new file mode 100644 index 0000000000..8c9ab6c45d --- /dev/null +++ b/data/2024/aaai/Graph Clustering Methods Derived from Column Subset Selection (Student Abstract) @@ -0,0 +1 @@ +Spectral clustering is a powerful clustering technique. It leverages the spectral properties of graphs to partition data points into meaningful clusters. The most common criterion for evaluating multi-way spectral clustering is NCut. Column Subset Selection is an important optimization technique in the domain of feature selection and dimension reduction which aims to identify a subset of columns of a given data matrix that can be used to approximate the entire matrix. We show that column subset selection can be used to compute spectral clustering and use this to obtain new graph clustering algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Graph Context Transformation Learning for Progressive Correspondence Pruning b/data/2024/aaai/Graph Context Transformation Learning for Progressive Correspondence Pruning new file mode 100644 index 0000000000..d5cf1f71d2 --- /dev/null +++ b/data/2024/aaai/Graph Context Transformation Learning for Progressive Correspondence Pruning @@ -0,0 +1 @@ +Most existing correspondence pruning methods concentrate only on gathering as much context information as possible while neglecting effective ways to utilize such information. To tackle this dilemma, in this paper we propose the Graph Context Transformation Network (GCT-Net), which enhances context information to conduct consensus guidance for progressive correspondence pruning. Specifically, we design the Graph Context Enhance Transformer, which first generates the graph network and then transforms it into multi-branch graph contexts. Moreover, it employs self-attention and cross-attention to magnify the characteristics of each graph context, emphasizing the unique as well as shared essential information. To further apply the recalibrated graph contexts to the global domain, we propose the Graph Context Guidance Transformer. This module adopts a confidence-based sampling strategy to temporarily screen high-confidence vertices for guiding accurate classification by searching for global consensus between the screened vertices and the remaining ones. The extensive experimental results on outlier removal and relative pose estimation clearly demonstrate the superior performance of GCT-Net compared to state-of-the-art methods across outdoor and indoor datasets.
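For the column-subset-selection abstract above, the following is only an illustrative sketch of one way a selected column subset can induce a graph partition (pivoted QR on a normalized similarity matrix, then assignment to the selected "anchor" columns). It is not the authors' algorithm; the anchor-assignment step in particular is an assumption made for the example.

import numpy as np
from scipy.linalg import qr

def css_partition(W, k):
    # W: (n, n) symmetric non-negative similarity matrix; k: number of clusters.
    d = W.sum(axis=1)
    d[d == 0] = 1.0
    S = W / np.sqrt(np.outer(d, d))           # normalized similarity D^{-1/2} W D^{-1/2}
    _, _, piv = qr(S, pivoting=True)          # pivoted QR ranks columns by explanatory power
    anchors = piv[:k]                         # the selected column subset
    return np.argmax(S[:, anchors], axis=1)   # assign each node to its strongest anchor column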
By analyzing GCL with the structural causal model (SCM), we discover that traditional GCL may not learn invariant representations well due to the non-causal information contained in the graph. How can we fix it and encourage the current GCL to learn better invariant representations? The SCM offers two requirements and motivates us to propose a novel GCL method. Particularly, we introduce the spectral graph augmentation to simulate the intervention upon non-causal factors. Then we design the invariance objective and independence objective to better capture the causal factors. Specifically, (i) the invariance objective encourages the encoder to capture the invariant information contained in causal variables, and (ii) the independence objective aims to reduce the influence of confounders on the causal variables. Experimental results demonstrate the effectiveness of our approach on node classification tasks. \ No newline at end of file diff --git a/data/2024/aaai/Graph Disentangled Contrastive Learning with Personalized Transfer for Cross-Domain Recommendation b/data/2024/aaai/Graph Disentangled Contrastive Learning with Personalized Transfer for Cross-Domain Recommendation new file mode 100644 index 0000000000..e092c988fa --- /dev/null +++ b/data/2024/aaai/Graph Disentangled Contrastive Learning with Personalized Transfer for Cross-Domain Recommendation @@ -0,0 +1,2 @@ +Cross-Domain Recommendation (CDR) has been proven to effectively alleviate the data sparsity problem in Recommender Systems (RS). Recent CDR methods often disentangle user features into domain-invariant and domain-specific features for efficient cross-domain knowledge transfer. Despite showcasing robust performance, three crucial aspects remain unexplored for existing disentangled CDR approaches: i) The significance nuances of the interaction behaviors are ignored in generating disentangled features; ii)
+The user features are disentangled without reference to the individual items to be recommended; iii) The general knowledge transfer overlooks the user's personality when interacting with diverse items. To this end, we propose a Graph Disentangled Contrastive framework for CDR (GDCCDR) with personalized transfer by meta-networks. An adaptive parameter-free filter is proposed to gauge the significance of diverse interactions, thereby facilitating more refined disentangled representations. In light of the success of Contrastive Learning (CL) in RS, we propose two CL-based constraints for item-aware disentanglement. Proximate CL ensures the coherence of domain-invariant features between domains, while eliminatory CL strives to disentangle features within each domain using mutual information between users and items. Finally, for domain-invariant features, we adopt meta-networks to achieve personalized transfer. Experimental results on four real-world datasets demonstrate the superiority of GDCCDR over state-of-the-art methods.
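The causal graph-contrastive-learning abstract above names an invariance objective and an independence objective; the snippet below is a hedged sketch of what such objectives can look like at the representation level (mean cosine alignment between views and a cross-covariance penalty between causal and confounder features). The exact formulations in the paper may differ; these forms are assumptions made for illustration.

import numpy as np

def invariance_loss(z1, z2, eps=1e-8):
    # z1, z2: (n, d) embeddings of the two augmented views of the same nodes.
    z1 = z1 / (np.linalg.norm(z1, axis=1, keepdims=True) + eps)
    z2 = z2 / (np.linalg.norm(z2, axis=1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(z1 * z2, axis=1)))   # 1 - mean cosine similarity

def independence_loss(z_causal, z_conf):
    # Penalize linear dependence between causal features and confounder features.
    zc = z_causal - z_causal.mean(axis=0)
    zs = z_conf - z_conf.mean(axis=0)
    cov = zc.T @ zs / max(len(zc) - 1, 1)
    return float(np.sum(cov ** 2))                          # squared cross-covariance penalty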
\ No newline at end of file diff --git a/data/2024/aaai/Graph Invariant Learning with Subgraph Co-mixup for Out-of-Distribution Generalization b/data/2024/aaai/Graph Invariant Learning with Subgraph Co-mixup for Out-of-Distribution Generalization new file mode 100644 index 0000000000..3752de1ba6 --- /dev/null +++ b/data/2024/aaai/Graph Invariant Learning with Subgraph Co-mixup for Out-of-Distribution Generalization @@ -0,0 +1,3 @@ +Graph neural networks (GNNs) have been demonstrated to perform well in graph representation learning, but they often lack generalization capability when tackling out-of-distribution (OOD) data. Graph invariant learning methods, backed by the invariance principle across multiple predefined environments, have shown effectiveness in dealing with this issue. However, existing methods heavily rely on well-predefined or accurately generated environment partitions, which are hard to obtain in practice, leading to sub-optimal OOD generalization performance.
+In this paper, we propose a novel graph invariant learning method based on an invariant and variant patterns co-mixup strategy, which is capable of jointly generating mixed multiple environments and capturing invariant patterns from the mixed graph data. Specifically, we first adopt a subgraph extractor to identify invariant subgraphs. Subsequently, we design one novel co-mixup strategy, i.e., jointly conducting environment mixup and invariant mixup. For the environment mixup, we mix the variant environment-related subgraphs so as to generate sufficiently diverse multiple environments, which is important to guarantee the quality of the graph invariant learning. For the invariant mixup, we mix the invariant subgraphs, further encouraging the model to capture invariant patterns behind graphs while getting rid of spurious correlations for OOD generalization. We demonstrate that the proposed environment mixup and invariant mixup can mutually promote each other.
+Extensive experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms the state of the art under various distribution shifts. \ No newline at end of file diff --git a/data/2024/aaai/Graph Learning in 4D: A Quaternion-Valued Laplacian to Enhance Spectral GCNs b/data/2024/aaai/Graph Learning in 4D: A Quaternion-Valued Laplacian to Enhance Spectral GCNs new file mode 100644 index 0000000000..e30afa8343 --- /dev/null +++ b/data/2024/aaai/Graph Learning in 4D: A Quaternion-Valued Laplacian to Enhance Spectral GCNs @@ -0,0 +1 @@ +We introduce QuaterGCN, a spectral Graph Convolutional Network (GCN) with quaternion-valued weights at whose core lies the Quaternionic Laplacian, a quaternion-valued Laplacian matrix by whose proposal we generalize two widely-used Laplacian matrices: the classical Laplacian (defined for undirected graphs) and the complex-valued Sign-Magnetic Laplacian (proposed within the spectral GCN SigMaNet to handle digraphs with weights of arbitrary sign). In addition to its generality, QuaterGCN is the only Laplacian to completely preserve the (di)graph topology that we are aware of, as it can handle graphs and digraphs containing antiparallel pairs of edges (digons) of different weight without reducing them to a single (directed or undirected) edge as done by other Laplacians. Experimental results show the superior performance of QuaterGCN compared to other state-of-the-art GCNs, particularly in scenarios where the information the digons carry is crucial to successfully address the task at hand.
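Returning to the Subgraph Co-mixup abstract above: the actual method mixes variant and invariant subgraphs themselves, which is hard to show compactly, so the sketch below works only at the representation level under that simplifying assumption. The function name and the Beta-distributed mixing coefficient are illustrative, not the paper's design.

import numpy as np

def co_mixup(h_inv, h_var, rng, alpha=2.0):
    # h_inv, h_var: (n, d) invariant / variant subgraph embeddings for a batch of graphs.
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(h_inv))
    mixed_env = lam * h_var + (1.0 - lam) * h_var[perm]   # environment mixup: diversify environments
    mixed_inv = lam * h_inv + (1.0 - lam) * h_inv[perm]   # invariant mixup: blend invariant patterns
    return mixed_inv, mixed_env, lam, perm

# Example usage: rng = np.random.default_rng(0); co_mixup(np.zeros((8, 16)), np.ones((8, 16)), rng)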
\ No newline at end of file diff --git a/data/2024/aaai/Graph Neural Networks with Soft Association between Topology and Attribute b/data/2024/aaai/Graph Neural Networks with Soft Association between Topology and Attribute new file mode 100644 index 0000000000..bdd4e21bd4 --- /dev/null +++ b/data/2024/aaai/Graph Neural Networks with Soft Association between Topology and Attribute @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have shown great performance in learning representations for graph-structured data. However, recent studies have found that the interference between topology and attribute can lead to distorted node representations. Most GNNs are designed based on homophily assumptions; thus, they cannot be applied to graphs with heterophily. This research critically analyzes the propagation principles of various GNNs and the corresponding challenges from an optimization perspective. A novel GNN called Graph Neural Networks with Soft Association between Topology and Attribute (GNN-SATA) is proposed. Different embeddings are utilized to gain insights into attributes and structures while establishing their interconnections through soft association. Further, as integral components of the soft association, a Graph Pruning Module (GPM) and Graph Augmentation Module (GAM) are developed. These modules dynamically remove or add edges to the adjacency relationships to make the model better fit graphs with homophily or heterophily. Experimental results on homophilic and heterophilic graph datasets convincingly demonstrate that the proposed GNN-SATA effectively captures more accurate adjacency relationships and outperforms state-of-the-art approaches. Especially on the heterophilic graph dataset Squirrel, GNN-SATA achieves a 2.81% improvement in accuracy, utilizing merely 27.19% of the original number of adjacency relationships. Our code is released at https://github.com/wwwfadecom/GNN-SATA. \ No newline at end of file diff --git a/data/2024/aaai/Graph Neural Prompting with Large Language Models b/data/2024/aaai/Graph Neural Prompting with Large Language Models new file mode 100644 index 0000000000..bfa9ac72a4 --- /dev/null +++ b/data/2024/aaai/Graph Neural Prompting with Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) have shown remarkable generalization capability with exceptional performance in various language modeling tasks. However, they still exhibit inherent limitations in precisely capturing and returning grounded knowledge. While existing work has explored utilizing knowledge graphs (KGs) to enhance language modeling via joint training and customized model architectures, applying this to LLMs is problematic owing to their large number of parameters and high computational cost. Therefore, how to enhance pre-trained LLMs using grounded knowledge, e.g., retrieval-augmented generation, remains an open question. In this work, we propose Graph Neural Prompting (GNP), a novel plug-and-play method to assist pre-trained LLMs in learning beneficial knowledge from KGs. GNP encompasses various designs, including a standard graph neural network encoder, a cross-modality pooling module, a domain projector, and a self-supervised link prediction objective. Extensive experiments on multiple datasets demonstrate the superiority of GNP on both commonsense and biomedical reasoning tasks across different LLM sizes and settings. Code is available at https://github.com/meettyj/GNP.
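As a rough illustration of the plug-and-play idea, the sketch below encodes a retrieved KG subgraph with a simple one-layer GNN, pools it, and projects it into the LLM embedding space as a soft prompt; the module names, shapes, and mean-aggregation scheme are assumptions for illustration, not GNP's exact architecture.

```python
# Hedged sketch: encode a KG subgraph, pool it, project it into the LLM's embedding
# space, and prepend it as a soft prompt in front of the frozen LLM's token embeddings.
import torch
import torch.nn as nn

class GraphPrompt(nn.Module):
    def __init__(self, node_dim: int, llm_dim: int, prompt_len: int = 4):
        super().__init__()
        self.msg = nn.Linear(node_dim, node_dim)                 # simple message transform
        self.proj = nn.Linear(node_dim, prompt_len * llm_dim)    # "domain projector"-style mapping
        self.prompt_len, self.llm_dim = prompt_len, llm_dim

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.msg(adj @ x / deg) + x)              # one mean-aggregation GNN layer
        g = h.mean(dim=0)                                        # pool nodes into a graph vector
        return self.proj(g).view(self.prompt_len, self.llm_dim)

prompt = GraphPrompt(node_dim=128, llm_dim=768)(torch.randn(10, 128), torch.ones(10, 10))
# `prompt` (4 x 768) would be concatenated in front of the frozen LLM's input embeddings.
```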
\ No newline at end of file diff --git a/data/2024/aaai/Graph Reasoning Transformers for Knowledge-Aware Question Answering b/data/2024/aaai/Graph Reasoning Transformers for Knowledge-Aware Question Answering new file mode 100644 index 0000000000..f0150b638b --- /dev/null +++ b/data/2024/aaai/Graph Reasoning Transformers for Knowledge-Aware Question Answering @@ -0,0 +1 @@ +Augmenting Language Models (LMs) with structured knowledge graphs (KGs) aims to leverage structured world knowledge to enhance the capability of LMs to complete knowledge-intensive tasks. However, existing methods are unable to effectively utilize the structured knowledge in a KG due to their inability to capture the rich relational semantics of knowledge triplets. Moreover, the modality gap between natural language text and KGs has become a challenging obstacle when aligning and fusing cross-modal information. To address these challenges, we propose a novel knowledge-augmented question answering (QA) model, namely, Graph Reasoning Transformers (GRT). Different from conventional node-level methods, the GRT treats knowledge triplets as atomic knowledge and utilizes a triplet-level graph encoder to capture triplet-level graph features. Furthermore, to alleviate the negative effect of the modality gap on joint reasoning, we propose a representation alignment pretraining to align the cross-modal representations and introduce a cross-modal information fusion module with attention bias to enable fine-grained information fusion. Extensive experiments conducted on three knowledge-intensive QA benchmarks show that the GRT outperforms the state-of-the-art KG-augmented QA systems, demonstrating the effectiveness and adaptability of our proposed model. \ No newline at end of file diff --git a/data/2024/aaai/Graph of Thoughts: Solving Elaborate Problems with Large Language Models b/data/2024/aaai/Graph of Thoughts: Solving Elaborate Problems with Large Language Models new file mode 100644 index 0000000000..cdd82e3ebc --- /dev/null +++ b/data/2024/aaai/Graph of Thoughts: Solving Elaborate Problems with Large Language Models @@ -0,0 +1,19 @@ +We introduce Graph of Thoughts (GoT): a framework that +advances prompting capabilities in large language models +(LLMs) beyond those offered by paradigms such as +Chain-of-Thought or Tree of Thoughts (ToT). The key idea and +primary advantage of GoT is the ability to model the information +generated by an LLM as an arbitrary graph, where units of +information ("LLM thoughts") are vertices, and edges correspond +to dependencies between these vertices. This approach enables +combining arbitrary LLM thoughts into synergistic outcomes, +distilling the essence of whole networks of thoughts, +or enhancing thoughts using feedback loops. We illustrate +that GoT offers advantages over state of the art on different +tasks, for example increasing the quality of sorting by 62% +over ToT, while simultaneously reducing costs by >31%. +We ensure that GoT is extensible with new thought +transformations and thus can be used to spearhead new prompting +schemes.
This work brings the LLM reasoning closer to human +thinking or brain mechanisms such as recurrence, both +of which form complex networks. \ No newline at end of file diff --git a/data/2024/aaai/Graph-Aware Contrasting for Multivariate Time-Series Classification b/data/2024/aaai/Graph-Aware Contrasting for Multivariate Time-Series Classification new file mode 100644 index 0000000000..de00d523cf --- /dev/null +++ b/data/2024/aaai/Graph-Aware Contrasting for Multivariate Time-Series Classification @@ -0,0 +1 @@ +Contrastive learning, as a self-supervised learning paradigm, has become popular for Multivariate Time-Series (MTS) classification. It ensures consistency across different views of unlabeled samples and then learns effective representations for these samples. Existing contrastive learning methods mainly focus on achieving temporal consistency with temporal augmentation and contrasting techniques, aiming to preserve temporal patterns against perturbations for MTS data. However, they overlook spatial consistency that requires the stability of individual sensors and their correlations. As MTS data typically originate from multiple sensors, ensuring spatial consistency becomes essential for the overall performance of contrastive learning on MTS data. Thus, we propose Graph-Aware Contrasting for spatial consistency across MTS data. Specifically, we propose graph augmentations including node and edge augmentations to preserve the stability of sensors and their correlations, followed by graph contrasting with both node- and graph-level contrasting to extract robust sensor- and global-level features. We further introduce multi-window temporal contrasting to ensure temporal consistency in the data for each sensor. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on various MTS classification tasks. The code is available at https://github.com/Frank-Wang-oss/TS-GAC. \ No newline at end of file diff --git a/data/2024/aaai/Graph-Based Prediction and Planning Policy Network (GP3Net) for Scalable Self-Driving in Dynamic Environments Using Deep Reinforcement Learning b/data/2024/aaai/Graph-Based Prediction and Planning Policy Network (GP3Net) for Scalable Self-Driving in Dynamic Environments Using Deep Reinforcement Learning new file mode 100644 index 0000000000..f0e6a054dc --- /dev/null +++ b/data/2024/aaai/Graph-Based Prediction and Planning Policy Network (GP3Net) for Scalable Self-Driving in Dynamic Environments Using Deep Reinforcement Learning @@ -0,0 +1 @@ +Recent advancements in motion planning for Autonomous Vehicles (AVs) show great promise in using expert driver behaviors in non-stationary driving environments. However, learning only from expert drivers lacks the generalizability needed to recover from domain shifts and near-failure scenarios due to the dynamic behavior of traffic participants and weather conditions. A deep Graph-based Prediction and Planning Policy Network (GP3Net) framework is proposed for non-stationary environments that encodes the interactions between traffic participants with contextual information and provides decisions for safe maneuvers for the AV. A spatio-temporal graph models the interactions between traffic participants for predicting the future trajectories of those participants. The predicted trajectories are utilized to generate a future occupancy map around the AV with uncertainties embedded to anticipate the evolving non-stationary driving environments.
Then the contextual information and future occupancy maps are fed into the policy network of the GP3Net framework, which is trained using the Proximal Policy Optimization (PPO) algorithm. The performance of the proposed GP3Net is evaluated on standard CARLA benchmarking scenarios with domain shifts of traffic patterns (urban, highway, and mixed). The results show that the GP3Net outperforms previous state-of-the-art imitation learning-based planning models for different towns. Further, in unseen weather conditions, GP3Net completes the desired route with fewer traffic infractions. Finally, the results emphasize the advantage of including the prediction module to enhance safety measures in non-stationary environments. \ No newline at end of file diff --git a/data/2024/aaai/Grey-Box Bayesian Optimization for Sensor Placement in Assisted Living Environments b/data/2024/aaai/Grey-Box Bayesian Optimization for Sensor Placement in Assisted Living Environments new file mode 100644 index 0000000000..2d1848ca96 --- /dev/null +++ b/data/2024/aaai/Grey-Box Bayesian Optimization for Sensor Placement in Assisted Living Environments @@ -0,0 +1 @@ +Optimizing the configuration and placement of sensors is crucial for reliable fall detection, indoor localization, and activity recognition in assisted living spaces. We propose a novel, sample-efficient approach to find a high-quality sensor placement in an arbitrary indoor space based on grey-box Bayesian optimization and simulation-based evaluation. Our key technical contribution lies in capturing domain-specific knowledge about the spatial distribution of activities and incorporating it into the iterative selection of query points in Bayesian optimization. Considering two simulated indoor environments and a real-world dataset containing human activities and sensor triggers, we show that our proposed method performs better than state-of-the-art black-box optimization techniques in identifying high-quality sensor placements, leading to an accurate activity recognition model in terms of F1-score, while also requiring a significantly lower (51.3% on average) number of expensive function queries. \ No newline at end of file diff --git a/data/2024/aaai/GridFormer: Point-Grid Transformer for Surface Reconstruction b/data/2024/aaai/GridFormer: Point-Grid Transformer for Surface Reconstruction new file mode 100644 index 0000000000..8c0a2b01c8 --- /dev/null +++ b/data/2024/aaai/GridFormer: Point-Grid Transformer for Surface Reconstruction @@ -0,0 +1 @@ +Implicit neural networks have emerged as a crucial technology in 3D surface reconstruction. To reconstruct continuous surfaces from discrete point clouds, encoding the input points into regular grid features (plane or volume) has been commonly employed in existing approaches. However, these methods typically use the grid as an index for uniformly scattering point features. Compared with the irregular point features, the regular grid features may sacrifice some reconstruction details but improve efficiency. To take full advantage of these two types of features, we introduce a novel and high-efficiency attention mechanism between the grid and point features named Point-Grid Transformer (GridFormer). This mechanism treats the grid as a transfer point connecting the space and point cloud. Our method maximizes the spatial expressiveness of grid features and maintains computational efficiency. Furthermore, optimizing predictions over the entire space could potentially result in blurred boundaries.
To address this issue, we further propose a boundary optimization strategy incorporating margin binary cross-entropy loss and boundary sampling. This approach enables us to achieve a more precise representation of the object structure. Our experiments validate that our method is effective and outperforms the state-of-the-art approaches on widely used benchmarks by producing more precise geometry reconstructions. The code is available at https://github.com/list17/GridFormer. \ No newline at end of file diff --git a/data/2024/aaai/GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection b/data/2024/aaai/GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection new file mode 100644 index 0000000000..39086e3435 --- /dev/null +++ b/data/2024/aaai/GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection @@ -0,0 +1 @@ +Visual grounding, a crucial vision-language task involving the understanding of the visual context based on the query expression, requires the model to capture the interactions between objects, as well as various spatial and attribute information. However, the annotation data for the visual grounding task is limited due to its time-consuming and labor-intensive annotation process, resulting in the trained models being constrained from generalizing their capability to a broader domain. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from existing models trained on image-text pairs and pure object detection data, both of which are more conveniently obtainable and offer a broader domain compared to visual grounding annotation data. GroundVLP proposes a fusion mechanism that combines the heatmap from GradCAM and the object proposals of open-vocabulary detectors. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets, surpassing prior zero-shot state-of-the-art by approximately 28% on the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to or even better than some non-VLP-based supervised models on the Flickr30k entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP. \ No newline at end of file diff --git a/data/2024/aaai/Guiding a Harsh-Environments Robust Detector via RAW Data Characteristic Mining b/data/2024/aaai/Guiding a Harsh-Environments Robust Detector via RAW Data Characteristic Mining new file mode 100644 index 0000000000..8bf2075284 --- /dev/null +++ b/data/2024/aaai/Guiding a Harsh-Environments Robust Detector via RAW Data Characteristic Mining @@ -0,0 +1 @@ +Consumer-grade cameras capture the RAW physical description of a scene and then process the image signals to obtain high-quality RGB images that are faithful to human visual perception. Conventionally, dense prediction scenes require high-precision recognition of objects in RGB images. However, it is challenging for predictions on RGB data to exhibit the expected adaptability and robustness in harsh environments. By capitalizing on the broader color gamut and higher bit depth offered by RAW data, in this paper, we demonstrate that RAW data can significantly improve the accuracy and robustness of object detectors in harsh environments.
Firstly, we propose a general Pipeline for RAW Detection (PRD), along with a preprocessing strategy tailored to RAW data. Secondly, we design the RAW Corruption Benchmark (RCB) to address the dearth of benchmarks that reflect realistic scenarios in harsh environments. Thirdly, we demonstrate the significant improvement of RAW images in object detection for low-light and corrupted scenes. Specifically, our experiments indicate that PRD (using FCOS) outperforms RGB detection by 13.9 mAP on LOD-Snow without generating restored images. Finally, we introduce a new nonlinear method called Functional Regularization (FR), which can effectively mine the unique characteristics of RAW data. The code is available at https://github.com/DreamerCCC/RawMining. \ No newline at end of file diff --git a/data/2024/aaai/GxVAEs: Two Joint VAEs Generate Hit Molecules from Gene Expression Profiles b/data/2024/aaai/GxVAEs: Two Joint VAEs Generate Hit Molecules from Gene Expression Profiles new file mode 100644 index 0000000000..aedaaddafb --- /dev/null +++ b/data/2024/aaai/GxVAEs: Two Joint VAEs Generate Hit Molecules from Gene Expression Profiles @@ -0,0 +1 @@ +The de novo generation of hit-like molecules that show bioactivity and drug-likeness is an important task in computer-aided drug discovery. Although artificial intelligence can generate molecules with desired chemical properties, most previous studies have ignored the influence of disease-related cellular environments. This study proposes a novel deep generative model called GxVAEs to generate hit-like molecules from gene expression profiles by leveraging two joint variational autoencoders (VAEs). The first VAE, ProfileVAE, extracts latent features from gene expression profiles. The extracted features serve as the conditions that guide the second VAE, which is called MolVAE, in generating hit-like molecules. GxVAEs bridge the gap between molecular generation and the cellular environment in a biological system, and produce molecules that are biologically meaningful in the context of specific diseases. Experiments and case studies on the generation of therapeutic molecules show that GxVAEs outperforms current state-of-the-art baselines and yields hit-like molecules with potential bioactivity and drug-like properties. We were able to successfully generate potential molecular structures with therapeutic effects for various diseases from patients’ disease profiles. \ No newline at end of file diff --git a/data/2024/aaai/H2GFormer: Horizontal-to-Global Voxel Transformer for 3D Semantic Scene Completion b/data/2024/aaai/H2GFormer: Horizontal-to-Global Voxel Transformer for 3D Semantic Scene Completion new file mode 100644 index 0000000000..ed48a2b024 --- /dev/null +++ b/data/2024/aaai/H2GFormer: Horizontal-to-Global Voxel Transformer for 3D Semantic Scene Completion @@ -0,0 +1 @@ +3D Semantic Scene Completion (SSC) has emerged as a novel task in vision-based holistic 3D scene understanding. Its objective is to densely predict the occupancy and category of each voxel in a 3D scene based on input from either LiDAR or images. Currently, many transformer-based semantic scene completion frameworks employ simple yet popular Cross-Attention and Self-Attention mechanisms to integrate and infer dense geometric and semantic information of voxels. However, they overlook the distinctions among voxels in the scene, especially in outdoor scenarios where the horizontal direction contains more variations.
Moreover, voxels located at object boundaries and within the interior of objects exhibit varying levels of positional significance. To address this issue, we propose a transformer-based SSC framework called H2GFormer that incorporates a horizontal-to-global approach. This framework takes into full consideration the variations of voxels in the horizontal direction and the characteristics of voxels on object boundaries. We introduce a horizontal window-to-global attention (W2G) module that effectively fuses semantic information by first diffusing it horizontally from reliably visible voxels and then propagating the semantic understanding to global voxels, ensuring a more reliable fusion of semantic-aware features. Moreover, an Internal-External Position Awareness Loss (IoE-PALoss) is utilized during network training to emphasize the critical positions within the transition regions between objects. The experiments conducted on the SemanticKITTI dataset demonstrate that H2GFormer exhibits superior performance in both geometric and semantic completion tasks. Our code is available at https://github.com/Ryanwy1/H2GFormer. \ No newline at end of file diff --git a/data/2024/aaai/HACDR-Net: Heterogeneous-Aware Convolutional Network for Diabetic Retinopathy Multi-Lesion Segmentation b/data/2024/aaai/HACDR-Net: Heterogeneous-Aware Convolutional Network for Diabetic Retinopathy Multi-Lesion Segmentation new file mode 100644 index 0000000000..c723b1b76c --- /dev/null +++ b/data/2024/aaai/HACDR-Net: Heterogeneous-Aware Convolutional Network for Diabetic Retinopathy Multi-Lesion Segmentation @@ -0,0 +1 @@ +Diabetic Retinopathy (DR), the leading cause of blindness in diabetic patients, is diagnosed based on the condition of multiple retinal lesions. As a difficult task in medical image segmentation, DR multi-lesion segmentation faces the following main concerns. On the one hand, retinal lesions vary in location, shape, and size. On the other hand, because some lesions occupy only a very small part of the entire fundus image, the high proportion of background leads to difficulties in lesion segmentation. To solve the above problems, we propose a heterogeneous-aware convolutional network (HACDR-Net) that comprises heterogeneous cross-convolution, heterogeneous modulated deformable convolution, and optional near-far-aware convolution. Our network introduces an adaptive aggregation module to summarize the heterogeneous feature maps and capture diverse lesion areas in the heterogeneous receptive field along the channel and spatial dimensions. In addition, to solve the problem of the highly imbalanced proportion of focal areas, we design a new medical image segmentation loss function, Noise Adjusted Loss (NALoss). NALoss balances the predictive feature distribution of background and lesion by jointly exploiting Gaussian noise and hard example mining, thus enhancing awareness of lesions. We conduct experiments on the public datasets IDRiD and DDR, and the experimental results show that the proposed method achieves better performance than other state-of-the-art methods. The code is open-sourced on github.com/xqh180110910537/HACDR-Net.
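The abstract does not spell out NALoss, so the following sketch should be read as one plausible interpretation rather than the paper's definition: a pixel-wise BCE in which background logits are perturbed with Gaussian noise and only the hardest background pixels are kept, in the spirit of joint noise injection and hard example mining.

```python
# Highly speculative sketch of a noise-adjusted, hard-example-mined loss for
# imbalanced lesion segmentation. Not the paper's actual NALoss definition.
import torch
import torch.nn.functional as F

def noise_adjusted_bce(logits, targets, sigma=0.1, hard_ratio=0.25):
    """logits, targets: (N,) flattened pixel predictions and {0,1} lesion labels."""
    noisy = logits + sigma * torch.randn_like(logits) * (targets == 0)  # perturb background only
    loss = F.binary_cross_entropy_with_logits(noisy, targets.float(), reduction="none")
    pos = loss[targets == 1]                                  # keep every lesion pixel
    neg = loss[targets == 0]
    k = max(1, int(hard_ratio * neg.numel()))
    hard_neg, _ = neg.topk(k)                                 # mine the hardest background pixels
    return torch.cat([pos, hard_neg]).mean()

loss = noise_adjusted_bce(torch.randn(1024), (torch.rand(1024) > 0.9).long())
```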
\ No newline at end of file diff --git a/data/2024/aaai/HAGO-Net: Hierarchical Geometric Massage Passing for Molecular Representation Learning b/data/2024/aaai/HAGO-Net: Hierarchical Geometric Massage Passing for Molecular Representation Learning new file mode 100644 index 0000000000..9b65d699eb --- /dev/null +++ b/data/2024/aaai/HAGO-Net: Hierarchical Geometric Massage Passing for Molecular Representation Learning @@ -0,0 +1 @@ +Molecular representation learning has emerged as a game-changer at the intersection of AI and chemistry, with great potential in applications such as drug design and materials discovery. A substantial obstacle in successfully applying molecular representation learning is the difficulty of effectively and completely characterizing and learning molecular geometry, which has not been well addressed to date. To overcome this challenge, we propose a novel framework that features a novel geometric graph, termed HAGO-Graph, and a specifically designed geometric graph learning model, HAGO-Net. In the framework, the foundation is HAGO-Graph, which enables a complete characterization of molecular geometry in a hierarchical manner. Specifically, we leverage the concept of n-body in physics to characterize geometric patterns at multiple spatial scales. We then specifically design a message passing scheme, HAGO-MPS, and implement the scheme as a geometric graph neural network, HAGO-Net, to effectively learn the representation of HAGO-Graph by horizontal and vertical aggregation. We further prove that DHAGO-Net, the derivative function of HAGO-Net, is an equivariant model. The proposed models are validated by extensive comparisons on four challenging benchmarks. Notably, the models exhibited state-of-the-art performance in molecular chirality identification and property prediction, including on five properties of the QM9 dataset. The models also achieved competitive results on the molecular dynamics prediction task. \ No newline at end of file diff --git a/data/2024/aaai/HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors b/data/2024/aaai/HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors new file mode 100644 index 0000000000..2164b093ed --- /dev/null +++ b/data/2024/aaai/HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors @@ -0,0 +1 @@ +Mainstream human activity recognition (HAR) algorithms are developed based on RGB cameras, which usually suffer from issues with illumination, fast motion, privacy preservation, and large energy consumption. Meanwhile, the biologically inspired event cameras have attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, low power, etc. As it is a newly emerging sensor, there is not even a realistic large-scale dataset for HAR. Considering its great practical value, in this paper, we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provide extensive baselines for future works to compare against. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event stream based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, and then encodes and fuses the dual-view representations using Transformer networks.
Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets fully validate the effectiveness of our model. Both the dataset and source code will be released at https://github.com/Event-AHU/HARDVS. \ No newline at end of file diff --git a/data/2024/aaai/HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting b/data/2024/aaai/HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting new file mode 100644 index 0000000000..84aec85d92 --- /dev/null +++ b/data/2024/aaai/HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting @@ -0,0 +1 @@ +Multivariate time series (MTS) prediction has been widely adopted in various scenarios. Recently, some methods have employed patching to enhance local semantics and improve model performance. However, fixed-length patches are prone to losing temporal boundary information, such as complete peaks and periods. Moreover, existing methods mainly focus on modeling long-term dependencies across patches, while paying little attention to other dimensions (e.g., short-term dependencies within patches and complex interactions among cross-variable patches). To address these challenges, we propose a pure MLP-based HDMixer, aiming to acquire patches with richer semantic information and to efficiently model hierarchical interactions. Specifically, we design a Length-Extendable Patcher (LEP) tailored to MTS, which enriches the boundary information of patches and alleviates semantic incoherence in the series. Subsequently, we devise a Hierarchical Dependency Explorer (HDE) based on pure MLPs. This explorer effectively models short-term dependencies within patches, long-term dependencies across patches, and complex interactions among variables. Extensive experiments on 9 real-world datasets demonstrate the superiority of our approach. The code is available at https://github.com/hqh0728/HDMixer. \ No newline at end of file diff --git a/data/2024/aaai/HDformer: A Higher-Dimensional Transformer for Detecting Diabetes Utilizing Long-Range Vascular Signals b/data/2024/aaai/HDformer: A Higher-Dimensional Transformer for Detecting Diabetes Utilizing Long-Range Vascular Signals new file mode 100644 index 0000000000..2315966379 --- /dev/null +++ b/data/2024/aaai/HDformer: A Higher-Dimensional Transformer for Detecting Diabetes Utilizing Long-Range Vascular Signals @@ -0,0 +1 @@ +Diabetes mellitus is a global concern, and early detection can prevent serious complications. 50% of those with diabetes live undiagnosed, disproportionately afflicting low-income groups. Non-invasive methods have emerged for timely detection; however, their limited accuracy constrains clinical usage. In this research, we present a novel Higher Dimensional Transformer (HDformer), the first Transformer-based architecture which utilizes long-range photoplethysmography (PPG) to detect diabetes. The long-range PPG maximizes signal contextual information when compared to the less-than-30-second signals commonly used in existing research. To increase the computational efficiency of HDformer’s long-range processing, a new attention module, Time Square Attention (TSA), is invented to achieve linear computational complexity with respect to the token volume while retaining the local/global dependencies. TSA converts the 1D inputs into 2D representations, grouping the adjacent points into a single 2D token.
It then generates dynamic patches and feeds them into a gated mixture-of-experts (MoE) network, optimizing the learning on different attention areas. HDformer achieves state-of-the-art results (sensitivity 98.4, accuracy 97.3, specificity 92.8, AUC 0.929) on the standard MIMIC-III dataset, surpassing existing research. Furthermore, we develop an end-to-end solution where a low-cost wearable is prototyped to connect with the HDformer in the Cloud via a mobile app. This scalable, convenient, and affordable approach provides instantaneous detection and continuous monitoring for individuals. It aids doctors in easily screening for diabetes and safeguards underprivileged communities. The enhanced versatility of HDformer allows for efficient processing and learning of long-range signals in general one-dimensional time-series sequences, particularly for all biomedical waveforms. \ No newline at end of file diff --git a/data/2024/aaai/HEAP: Unsupervised Object Discovery and Localization with Contrastive Grouping b/data/2024/aaai/HEAP: Unsupervised Object Discovery and Localization with Contrastive Grouping new file mode 100644 index 0000000000..a182ab65b5 --- /dev/null +++ b/data/2024/aaai/HEAP: Unsupervised Object Discovery and Localization with Contrastive Grouping @@ -0,0 +1 @@ +Unsupervised object discovery and localization aims to detect or segment objects in an image without any supervision. Recent efforts have demonstrated a notable potential to identify salient foreground objects by utilizing self-supervised transformer features. However, their scope is limited to patch-level features within an image, neglecting region/image-level and cross-image relationships at a broader scale. Moreover, these methods cannot differentiate various semantics from multiple instances. To address these problems, we introduce a Hierarchical mErging framework via contrAstive grouPing (HEAP). Specifically, a novel lightweight head with a cross-attention mechanism is designed to adaptively group intra-image patches into semantically coherent regions based on correlation among self-supervised features. Further, to ensure the distinguishability among various regions, we introduce a region-level contrastive clustering loss to pull closer similar regions across images. Also, an image-level contrastive loss is introduced to push foreground and background representations apart, with which foreground objects and background are accordingly discovered. HEAP facilitates efficient hierarchical image decomposition, which contributes to more accurate object discovery while also enabling differentiation among objects of various classes. Extensive experimental results on semantic segmentation retrieval, unsupervised object discovery, and saliency detection tasks demonstrate that HEAP achieves state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/HGE: Embedding Temporal Knowledge Graphs in a Product Space of Heterogeneous Geometric Subspaces b/data/2024/aaai/HGE: Embedding Temporal Knowledge Graphs in a Product Space of Heterogeneous Geometric Subspaces new file mode 100644 index 0000000000..271c8ff735 --- /dev/null +++ b/data/2024/aaai/HGE: Embedding Temporal Knowledge Graphs in a Product Space of Heterogeneous Geometric Subspaces @@ -0,0 +1 @@ +Temporal knowledge graphs represent temporal facts (s,p,o,τ) relating a subject s and an object o via a relation label p at time τ, where τ could be a time point or a time interval.
Temporal knowledge graphs may exhibit static temporal patterns at distinct points in time and dynamic temporal patterns between different timestamps. In order to learn a rich set of static and dynamic temporal patterns and apply them for inference, several embedding approaches have been suggested in the literature. However, as most of them resort to single underlying embedding spaces, their capability to model all kinds of temporal patterns is severely limited by having to adhere to the geometric properties of a single embedding space. We lift this limitation with an embedding approach that maps temporal facts into a product space of several heterogeneous geometric subspaces with distinct geometric properties, i.e., Complex, Dual, and Split-complex spaces. In addition, we propose a temporal-geometric attention mechanism to integrate information from different geometric subspaces conveniently according to the captured relational and temporal information. Experimental results on standard temporal benchmark datasets show that our approach compares favorably against state-of-the-art models. \ No newline at end of file diff --git a/data/2024/aaai/HGPrompt: Bridging Homogeneous and Heterogeneous Graphs for Few-Shot Prompt Learning b/data/2024/aaai/HGPrompt: Bridging Homogeneous and Heterogeneous Graphs for Few-Shot Prompt Learning new file mode 100644 index 0000000000..b8045108ef --- /dev/null +++ b/data/2024/aaai/HGPrompt: Bridging Homogeneous and Heterogeneous Graphs for Few-Shot Prompt Learning @@ -0,0 +1,3 @@ +Graph neural networks (GNNs) and heterogeneous graph neural networks (HGNNs) are prominent techniques for homogeneous and heterogeneous graph representation learning, yet their performance in an end-to-end supervised framework greatly depends on the availability of task-specific supervision. To reduce the labeling cost, pre-training on self-supervised pretext tasks has become a popular paradigm, but there is often a gap between the pre-trained model and downstream tasks, stemming from the divergence in their objectives. To bridge the gap, prompt learning has risen as a promising direction especially in few-shot settings, without the need to fully fine-tune the pre-trained model. While there has been some early exploration of prompt-based learning on graphs, these efforts primarily deal with homogeneous graphs, ignoring the heterogeneous graphs that are prevalent in downstream applications. In this paper, we propose HGPROMPT, a +novel pre-training and prompting framework to unify not only pre-training and downstream tasks but also homogeneous and heterogeneous graphs via a dual-template design. Moreover, we propose a dual-prompt design in HGPROMPT to assist a downstream task in locating the most relevant prior, bridging the gaps caused by not only feature variations but also heterogeneity differences across tasks. Finally, we thoroughly evaluate and analyze HGPROMPT through extensive experiments +on three public datasets. \ No newline at end of file diff --git a/data/2024/aaai/HISR: Hybrid Implicit Surface Representation for Photorealistic 3D Human Reconstruction b/data/2024/aaai/HISR: Hybrid Implicit Surface Representation for Photorealistic 3D Human Reconstruction new file mode 100644 index 0000000000..e4aded37e5 --- /dev/null +++ b/data/2024/aaai/HISR: Hybrid Implicit Surface Representation for Photorealistic 3D Human Reconstruction @@ -0,0 +1 @@ +Neural reconstruction and rendering strategies have demonstrated state-of-the-art performances due, in part, to their ability to preserve high-level shape details.
Existing approaches, however, represent objects as either implicit surface functions or neural volumes and still struggle to recover shapes with heterogeneous materials, in particular human skin, hair, or clothes. To this end, we present a new hybrid implicit surface representation to model human shapes. This representation is composed of two surface layers that represent opaque and translucent regions on the clothed human body. We segment different regions automatically using visual cues and learn to reconstruct two signed distance functions (SDFs). We perform surface-based rendering on opaque regions (e.g., body, face, clothes) to preserve high-fidelity surface normals and volume rendering on translucent regions (e.g., hair). Experiments demonstrate that our approach obtains state-of-the-art results on 3D human reconstructions, and also shows competitive performances on other objects. \ No newline at end of file diff --git a/data/2024/aaai/HONGAT: Graph Attention Networks in the Presence of High-Order Neighbors b/data/2024/aaai/HONGAT: Graph Attention Networks in the Presence of High-Order Neighbors new file mode 100644 index 0000000000..71cc1c6033 --- /dev/null +++ b/data/2024/aaai/HONGAT: Graph Attention Networks in the Presence of High-Order Neighbors @@ -0,0 +1 @@ +Graph Attention Networks (GATs), which compute node representations from their lower-order neighbors, are a state-of-the-art architecture for representation learning with graphs. In practice, however, the high-order neighbors that turn out to be useful remain largely unexploited in GATs. Efforts on this issue remain limited. This paper proposes a simple and effective high-order neighbor GAT (HONGAT) model to both effectively exploit informative high-order neighbors and address over-smoothing at the decision boundary of nodes. Two tightly coupled novel technologies, namely common-neighbor similarity and a new masking matrix, are introduced. Specifically, high-order neighbors are fully explored by generic high-order common-neighbor-based similarity; since the typical averaging range no longer works well for preventing severe over-smoothing, a new masking mechanism without any extra hyperparameter is employed. Extensive empirical evaluation on real-world datasets clearly shows the necessity of the new algorithm's ability to explore high-order neighbors, which yields significant gains over previous state-of-the-art graph attention methods. \ No newline at end of file diff --git a/data/2024/aaai/HOP to the Next Tasks and Domains for Continual Learning in NLP b/data/2024/aaai/HOP to the Next Tasks and Domains for Continual Learning in NLP new file mode 100644 index 0000000000..6247e88fb3 --- /dev/null +++ b/data/2024/aaai/HOP to the Next Tasks and Domains for Continual Learning in NLP @@ -0,0 +1 @@ +Continual Learning (CL) aims to learn a sequence of problems (i.e., tasks and domains) by transferring knowledge acquired on previous problems, whilst avoiding forgetting of past ones. Different from previous approaches which focused on CL for one NLP task or domain in a specific use-case, in this paper, we address a more general CL setting to learn from a sequence of problems in a unique framework.
Our method, HOP, permits hopping across tasks and domains by addressing the CL problem along three directions: (i) we employ a set of adapters to generalize a large pre-trained model to unseen problems, (ii) we compute high-order moments over the distribution of embedded representations to distinguish independent and correlated statistics across different tasks and domains, and (iii) we process this enriched information with auxiliary heads specialized for each end problem. An extensive experimental campaign on 4 NLP applications, 5 benchmarks, and 2 CL setups demonstrates the effectiveness of HOP. \ No newline at end of file diff --git a/data/2024/aaai/HORIZON: High-Resolution Semantically Controlled Panorama Synthesis b/data/2024/aaai/HORIZON: High-Resolution Semantically Controlled Panorama Synthesis new file mode 100644 index 0000000000..c86b568287 --- /dev/null +++ b/data/2024/aaai/HORIZON: High-Resolution Semantically Controlled Panorama Synthesis @@ -0,0 +1 @@ +Panorama synthesis endeavors to craft captivating 360-degree visual landscapes, immersing users in the heart of virtual worlds. Nevertheless, contemporary panoramic synthesis techniques grapple with the challenge of semantically guiding the content generation process. Although recent breakthroughs in visual synthesis have unlocked the potential for semantic control in 2D flat images, a direct application of these methods to panorama synthesis yields distorted content. In this study, we unveil an innovative framework for generating high-resolution panoramas, adeptly addressing the issues of spherical distortion and edge discontinuity through sophisticated spherical modeling. Our pioneering approach empowers users with semantic control, harnessing both image and text inputs, while concurrently streamlining the generation of high-resolution panoramas using parallel decoding. We rigorously evaluate our methodology on a diverse array of indoor and outdoor datasets, establishing its superiority over recent related work, in terms of both quantitative and qualitative performance metrics. Our research elevates the controllability, efficiency, and fidelity of panorama synthesis to new levels. \ No newline at end of file diff --git a/data/2024/aaai/HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation b/data/2024/aaai/HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation new file mode 100644 index 0000000000..72f05d1c11 --- /dev/null +++ b/data/2024/aaai/HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation @@ -0,0 +1 @@ +Point-supervised Temporal Action Localization (PSTAL) is an emerging research direction for label-efficient learning. However, current methods mainly focus on optimizing the network either at the snippet-level or the instance-level, neglecting the inherent reliability of point annotations at both levels. In this paper, we propose a Hierarchical Reliability Propagation (HR-Pro) framework, which consists of two reliability-aware stages: Snippet-level Discrimination Learning and Instance-level Completeness Learning; both stages explore the efficient propagation of high-confidence cues in point annotations. For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class.
We then employ a Reliability-aware Attention Block to capture both intra-video and inter-video dependencies of snippets, resulting in more discriminative and robust snippet representations. For instance-level learning, we propose a point-based proposal generation approach as a means of connecting snippets and instances, which produces high-confidence proposals for further optimization at the instance level. Through multi-level reliability-aware learning, we obtain more reliable confidence scores and more accurate temporal boundaries of predicted proposals. Our HR-Pro achieves state-of-the-art performance on multiple challenging benchmarks, including an impressive average mAP of 60.3% on THUMOS14. Notably, our HR-Pro largely surpasses all previous point-supervised methods, and even outperforms several competitive fully-supervised methods. Code will be available at https://github.com/pipixin321/HR-Pro. \ No newline at end of file diff --git a/data/2024/aaai/Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling b/data/2024/aaai/Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling new file mode 100644 index 0000000000..960523870b --- /dev/null +++ b/data/2024/aaai/Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling @@ -0,0 +1 @@ +Hands are the main medium through which people interact with the world. Generating proper 3D motion for hand-object interaction is vital for applications such as virtual reality and robotics. Although grasp tracking or object manipulation synthesis can produce coarse hand motion, this kind of motion is inevitably noisy and full of jitter. To address this problem, we propose a data-driven method for coarse motion refinement. First, we design a hand-centric representation to describe the dynamic spatial-temporal relation between hands and objects. Compared to the object-centric representation, our hand-centric representation is straightforward and does not require an ambiguous projection process that converts object-based prediction into hand motion. Second, to capture the dynamic clues of hand-object interaction, we propose a new architecture that models the spatial and temporal structure in a hierarchical manner. Extensive experiments demonstrate that our method outperforms previous methods by a noticeable margin. \ No newline at end of file diff --git a/data/2024/aaai/Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning b/data/2024/aaai/Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning new file mode 100644 index 0000000000..6988bf3550 --- /dev/null +++ b/data/2024/aaai/Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning @@ -0,0 +1 @@ +Safety in goal-directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories, which have demonstrated good performance primarily in short-horizon tasks. In this paper, we are specifically interested in solving temporally extended decision-making problems, such as robots cleaning different areas in a house while avoiding slippery and unsafe areas (e.g., stairs) and retaining enough charge to move to a charging dock, in the presence of complex safety constraints.
Our key contribution is a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper-level constrained search agent (which computes a reward-maximizing policy from a given start to a far-away goal state while satisfying cost constraints) with a low-level goal-conditioned RL agent (which estimates cost and reward values to move between nearby states). A major advantage of CoSHRL is that it can handle constraints on the cost value distribution (e.g., on Conditional Value at Risk, CVaR) and can adjust to flexible constraint thresholds without retraining. We perform extensive experiments with different types of safety constraints to demonstrate the utility of our approach over leading approaches in constrained and hierarchical RL. \ No newline at end of file diff --git a/data/2024/aaai/Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation b/data/2024/aaai/Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation new file mode 100644 index 0000000000..3744cb1a5a --- /dev/null +++ b/data/2024/aaai/Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation @@ -0,0 +1 @@ +Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. Successful existing models have employed various techniques to avoid this problem, most of which require data augmentation or aim to make the average soft assignment across the dataset the same for each cluster. We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments. Using a Bayesian framework, we derive an intuitive optimization objective that can be straightforwardly included in the training of the encoder network. Tested on four image datasets, it avoids collapse more robustly than other methods and leads to more accurate clustering. We also conduct further experiments and analyses justifying our choice to regularize the hard cluster assignments. Code is available at https://github.com/Lou1sM/online_hard_clustering. \ No newline at end of file diff --git a/data/2024/aaai/Hardness of Random Reordered Encodings of Parity for Resolution and CDCL b/data/2024/aaai/Hardness of Random Reordered Encodings of Parity for Resolution and CDCL new file mode 100644 index 0000000000..3f04afc877 --- /dev/null +++ b/data/2024/aaai/Hardness of Random Reordered Encodings of Parity for Resolution and CDCL @@ -0,0 +1 @@ +Parity reasoning is challenging for Conflict-Driven Clause Learning (CDCL) SAT solvers. This has been observed even for simple formulas encoding two contradictory parity constraints with different variable orders (Chew and Heule 2020). We provide an analytical explanation for their hardness by showing that they require exponential resolution refutations with high probability when the variable order is chosen at random. We obtain this result by proving that these formulas, which are known to be Tseitin formulas, have Tseitin graphs of linear treewidth with high probability. Since such Tseitin formulas require exponential resolution refutations, our result follows.
We generalize this argument to a new class of formulas that capture a basic form of parity reasoning involving a sum of two random parity constraints with random orders. Even when the variable order for the sum is chosen favorably, these formulas remain hard for resolution. In contrast, we prove that they have short DRAT refutations. We show experimentally that the running time of CDCL SAT solvers on both classes of formulas grows exponentially with their treewidth. \ No newline at end of file diff --git a/data/2024/aaai/Harmonious Mobility for Robots that Work with and around People b/data/2024/aaai/Harmonious Mobility for Robots that Work with and around People new file mode 100644 index 0000000000..4fb4551bce --- /dev/null +++ b/data/2024/aaai/Harmonious Mobility for Robots that Work with and around People @@ -0,0 +1 @@ +The integration of advances from machine learning and computer vision with the classical autonomy stack has brought successful robot deployments in fulfilment, manufacturing, and transportation. However, unstructured and dynamic environments such as pedestrian spaces and streets, workplaces, and homes pose additional challenges such as modeling human behavior, understanding user perceptions, and ensuring human safety and comfort. My work addresses such challenges to enable robots to fluently work with and around people to increase productivity and assist users. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing Edge Information for Improved Robustness in Vision Transformers b/data/2024/aaai/Harnessing Edge Information for Improved Robustness in Vision Transformers new file mode 100644 index 0000000000..0434b33e6f --- /dev/null +++ b/data/2024/aaai/Harnessing Edge Information for Improved Robustness in Vision Transformers @@ -0,0 +1 @@ +Deep Neural Networks (DNNs) have demonstrated remarkable accuracy in vision classification tasks. However, they exhibit vulnerability to added perturbations known as adversarial attacks. Previous studies hypothesize that this vulnerability might stem from the fact that high-accuracy DNNs heavily rely on irrelevant and non-robust features, such as textures and the background. In this work, we reveal that edge information extracted from images can provide relevant and robust features related to shapes and the foreground. These features assist pretrained DNNs in achieving improved adversarial robustness without compromising their accuracy on clean images. A lightweight and plug-and-play EdgeNet is proposed, which can be seamlessly integrated into existing pretrained DNNs, including Vision Transformers, a recent family of state-of-the-art models for vision classification. Our EdgeNet can process edges derived from either clean natural images or noisy adversarial images, yielding robust features which can be injected into the intermediate layers of the frozen backbone DNNs. The cost of obtaining such edges using conventional edge detection algorithms (e.g., the Canny edge detector) is marginal, and the cost of training the EdgeNet is equivalent to that of fine-tuning the backbone network with techniques such as Adapter.
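The overall recipe suggests a sketch along the following lines (hedged: the input path, module sizes, and injection point are placeholder assumptions, not the paper's exact design): extract Canny edges with OpenCV, encode them with a small trainable adapter, and add the resulting tokens to an intermediate layer of a frozen ViT.

```python
# Hedged sketch: cheap Canny edges encoded by a small trainable adapter whose output
# would be added to the token features of a frozen ViT at an intermediate block.
import cv2
import torch
import torch.nn as nn

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)          # placeholder image path
edges = cv2.Canny(gray, 100, 200)                               # robust, low-cost edge map
edge_t = torch.from_numpy(edges).float().unsqueeze(0).unsqueeze(0) / 255.0  # (1, 1, H, W)

edge_net = nn.Sequential(                                       # lightweight adapter (trainable)
    nn.Conv2d(1, 32, kernel_size=3, stride=16, padding=1), nn.GELU(),
    nn.Conv2d(32, 768, kernel_size=1),                           # hypothetical ViT hidden size 768
)
edge_feat = edge_net(edge_t).flatten(2).transpose(1, 2)          # (1, num_patches, 768)
# During fine-tuning, only the adapter (Adapter-style) is updated; the ViT stays frozen,
# and edge_feat is added to the token features at a chosen intermediate layer.
```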
\ No newline at end of file diff --git a/data/2024/aaai/Harnessing Holistic Discourse Features and Triadic Interaction for Sentiment Quadruple Extraction in Dialogues b/data/2024/aaai/Harnessing Holistic Discourse Features and Triadic Interaction for Sentiment Quadruple Extraction in Dialogues new file mode 100644 index 0000000000..5a2b079c81 --- /dev/null +++ b/data/2024/aaai/Harnessing Holistic Discourse Features and Triadic Interaction for Sentiment Quadruple Extraction in Dialogues @@ -0,0 +1 @@ +Dialogue Aspect-based Sentiment Quadruple (DiaASQ) is a newly emerged task aiming to extract the sentiment quadruple (i.e., targets, aspects, opinions, and sentiments) from conversations. While showing promising performance, the prior DiaASQ approach unfortunately struggles with the key challenges of DiaASQ, including insufficient modeling of discourse features and a lack of explicit modeling for quadruple extraction, which hinders further task improvement. To this end, we introduce a novel framework that not only capitalizes on comprehensive discourse feature modeling, but also captures the intrinsic interaction for optimal quadruple extraction. On the one hand, drawing upon multiple discourse features, our approach constructs a token-level heterogeneous graph and enhances token interactions through a heterogeneous attention network. We further propose a novel triadic scorer, strengthening weak token relations within a quadruple, thereby enhancing the cohesion of the quadruple extraction. Experimental results on the DiaASQ benchmark showcase that our model significantly outperforms existing baselines across both English and Chinese datasets. Our code is available at https://bit.ly/3v27pqA. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models b/data/2024/aaai/Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models new file mode 100644 index 0000000000..c986ae05ad --- /dev/null +++ b/data/2024/aaai/Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models @@ -0,0 +1,3 @@ +Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at accelerating parallel workloads and dense vector-matrix multiplications. Potentially more efficient neural network models utilizing sparsity and recurrence cannot leverage the full power of SIMD processors and are thus at a severe disadvantage compared to today's prominent parallel architectures like Transformers and CNNs, thereby hindering the path towards more sustainable AI. +To overcome this limitation, we explore sparse and recurrent model training on a massively parallel multiple instruction multiple data (MIMD) architecture with distributed local memory. We implement a training routine based on backpropagation through time (BPTT) for the brain-inspired class of Spiking Neural Networks (SNNs) that feature binary sparse activations. We observe a massive advantage in using sparse activation tensors with a MIMD processor, the Intelligence Processing Unit (IPU), compared to GPUs. On training workloads, our results demonstrate 5-10x throughput gains compared to A100 GPUs and up to 38x gains for higher levels of activation sparsity, without a significant slowdown in training convergence or reduction in final model performance.
Furthermore, our results show highly promising trends for both single and multi IPU configurations as we scale up to larger model sizes. +Our work paves the way towards more efficient, non-standard models via AI training hardware beyond GPUs, and competitive large scale SNN models. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing Network Effect for Fake News Mitigation: Selecting Debunkers via Self-Imitation Learning b/data/2024/aaai/Harnessing Network Effect for Fake News Mitigation: Selecting Debunkers via Self-Imitation Learning new file mode 100644 index 0000000000..b2b13bd5d1 --- /dev/null +++ b/data/2024/aaai/Harnessing Network Effect for Fake News Mitigation: Selecting Debunkers via Self-Imitation Learning @@ -0,0 +1 @@ +This study aims to minimize the influence of fake news on social networks by deploying debunkers to propagate true news. This is framed as a reinforcement learning problem, where, at each stage, one user is selected to propagate true news. A challenging issue is episodic reward where the "net" effect of selecting individual debunkers cannot be discerned from the interleaving information propagation on social networks, and only the collective effect from mitigation efforts can be observed. Existing Self-Imitation Learning (SIL) methods have shown promise in learning from episodic rewards, but are ill-suited to the real-world application of fake news mitigation because of their poor sample efficiency. To learn a more effective debunker selection policy for fake news mitigation, this study proposes NAGASIL - Negative sampling and state Augmented Generative Adversarial Self-Imitation Learning, which consists of two improvements geared towards fake news mitigation: learning from negative samples, and an augmented state representation to capture the "real" environment state by integrating the current observed state with the previous state-action pairs from the same campaign. Experiments on two social networks show that NAGASIL yields superior performance to standard GASIL and state-of-the-art fake news mitigation models. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing the Power of Beta Scoring in Deep Active Learning for Multi-Label Text Classification b/data/2024/aaai/Harnessing the Power of Beta Scoring in Deep Active Learning for Multi-Label Text Classification new file mode 100644 index 0000000000..955fb4de54 --- /dev/null +++ b/data/2024/aaai/Harnessing the Power of Beta Scoring in Deep Active Learning for Multi-Label Text Classification @@ -0,0 +1 @@ +Within the scope of natural language processing, the domain of multi-label text classification is uniquely challenging due to its expansive and uneven label distribution. The complexity deepens due to the demand for an extensive set of annotated data for training an advanced deep learning model, especially in specialized fields where the labeling task can be labor-intensive and often requires domain-specific knowledge. Addressing these challenges, our study introduces a novel deep active learning strategy, capitalizing on the Beta family of proper scoring rules within the Expected Loss Reduction framework. It computes the expected increase in scores using the Beta Scoring Rules, which are then transformed into sample vector representations. These vector representations guide the diverse selection of informative sample, directly linking this process to the model's expected proper score. 
Comprehensive evaluations across both synthetic and real datasets reveal our method's capability to often outperform established acquisition techniques in multi-label text classification, presenting encouraging outcomes across various architectural and dataset scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Harnessing the Power of SVD: An SVA Module for Enhanced Signal Classification b/data/2024/aaai/Harnessing the Power of SVD: An SVA Module for Enhanced Signal Classification new file mode 100644 index 0000000000..f69af3feb3 --- /dev/null +++ b/data/2024/aaai/Harnessing the Power of SVD: An SVA Module for Enhanced Signal Classification @@ -0,0 +1 @@ +Deep learning methods have achieved outstanding performance in various signal tasks. However, due to degraded signals in real electromagnetic environments, it is crucial to seek methods that can improve the representation of signal features. In this paper, a Singular Value decomposition-based Attention (SVA) module is proposed to explore the structure of signal data and adaptively enhance intrinsic features. Using a deep neural network as a base model, SVA performs feature semantic subspace learning through a decomposition layer and combines it with an attention layer to achieve adaptive enhancement of signal features. Moreover, we consider the gradient explosion problem brought by SVA and optimize SVA to improve the stability of training. Extensive experimental results demonstrate that applying SVA to a generalized classification model can significantly improve its representational ability, making its recognition performance competitive with, or even better than, the state-of-the-art task-specific models. \ No newline at end of file diff --git a/data/2024/aaai/Hawkes-Enhanced Spatial-Temporal Hypergraph Contrastive Learning Based on Criminal Correlations b/data/2024/aaai/Hawkes-Enhanced Spatial-Temporal Hypergraph Contrastive Learning Based on Criminal Correlations new file mode 100644 index 0000000000..bad00195a9 --- /dev/null +++ b/data/2024/aaai/Hawkes-Enhanced Spatial-Temporal Hypergraph Contrastive Learning Based on Criminal Correlations @@ -0,0 +1 @@ +Crime prediction is a crucial yet challenging task within urban computing, which benefits public safety and resource optimization. Over the years, various models have been proposed, and spatial-temporal hypergraph learning models have recently shown outstanding performance. However, three correlations underlying crime are ignored, thus hindering the performance of previous models. Specifically, there are two spatial correlations and one temporal correlation, i.e., (1) co-occurrence of different types of crimes (type spatial correlation), (2) the closer to the crime center, the more dangerous the neighborhood area is (neighbor spatial correlation), and (3) the closer two timestamps are, the more relevant the events are (Hawkes temporal correlation). To this end, we propose the Hawkes-enhanced Spatial-Temporal Hypergraph Contrastive Learning framework (HCL), which mines the aforementioned correlations via two specific strategies. Concretely, contrastive learning strategies are designed for the two spatial correlations, and Hawkes process modeling is adopted for the temporal correlation. Extensive experiments demonstrate the promising capacities of HCL from four aspects, i.e., superiority, transferability, effectiveness, and sensitivity.
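To make the Hawkes temporal correlation concrete, an exponential excitation kernel simply down-weights pairs of events as their time gap grows; a minimal sketch (the kernel form and decay rate here are illustrative, not HCL's exact formulation):

    import numpy as np

    def hawkes_weights(timestamps, decay=0.1):
        """Pairwise relevance w[i, j] = exp(-decay * |t_i - t_j|): closer events matter more."""
        t = np.asarray(timestamps, dtype=float)
        gaps = np.abs(t[:, None] - t[None, :])
        return np.exp(-decay * gaps)

    # Example: three crime events at hours 0, 1, and 10.
    print(hawkes_weights([0.0, 1.0, 10.0]).round(3))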
\ No newline at end of file diff --git a/data/2024/aaai/Hear You Say You: An Efficient Framework for Marine Mammal Sounds' Classification b/data/2024/aaai/Hear You Say You: An Efficient Framework for Marine Mammal Sounds' Classification new file mode 100644 index 0000000000..c86bfb81b9 --- /dev/null +++ b/data/2024/aaai/Hear You Say You: An Efficient Framework for Marine Mammal Sounds' Classification @@ -0,0 +1 @@ +Marine mammals and their ecosystem face significant threats from, for example, military active sonar and marine transportation. To mitigate this harm, early detection and classification of marine mammals are essential. While recent efforts have utilized spectrogram analysis and machine learning techniques, there remain challenges in their efficiency. Therefore, we propose a novel knowledge distillation framework, named XCFSMN, for this problem. We construct a teacher model that fuses the features extracted from an X-vector extractor, a DenseNet and Cross-Covariance attended compact Feed-Forward Sequential Memory Network (cFSMN). The teacher model transfers knowledge to a simpler cFSMN model through a temperature-cooling strategy for efficient learning. Compared to multiple convolutional neural network backbones and transformers, the proposed framework achieves state-of-the-art efficiency and performance. The improved model size is approximately 20 times smaller and the inference time can be 10 times shorter without affecting the model’s accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Heterogeneous Test-Time Training for Multi-Modal Person Re-identification b/data/2024/aaai/Heterogeneous Test-Time Training for Multi-Modal Person Re-identification new file mode 100644 index 0000000000..46240c4b7e --- /dev/null +++ b/data/2024/aaai/Heterogeneous Test-Time Training for Multi-Modal Person Re-identification @@ -0,0 +1 @@ +Multi-modal person re-identification (ReID) seeks to mitigate challenging lighting conditions by incorporating diverse modalities. Most existing multi-modal ReID methods concentrate on leveraging complementary multi-modal information via fusion or interaction. However, the relationships among heterogeneous modalities and the domain traits of unlabeled test data are rarely explored. In this paper, we propose a Heterogeneous Test-time Training (HTT) framework for multi-modal person ReID. We first propose a Cross-identity Inter-modal Margin (CIM) loss to amplify the differentiation among distinct identity samples. Moreover, we design a Multi-modal Test-time Training (MTT) strategy to enhance the generalization of the model by leveraging the relationships in the heterogeneous modalities and the information existing in the test data. Specifically, in the training stage, we utilize the CIM loss to further enlarge the distance between anchor and negative by forcing the inter-modal distance to maintain the margin, resulting in an enhancement of the discriminative capacity of the ultimate descriptor. Subsequently, since the test data contains characteristics of the target domain, we adapt the MTT strategy to optimize the network before the inference by using self-supervised tasks designed based on relationships among modalities. Experimental results on benchmark multi-modal ReID datasets RGBNT201, Market1501-MM, RGBN300, and RGBNT100 validate the effectiveness of the proposed method. The codes can be found at https://github.com/ziwang1121/HTT. 
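One way to read the CIM loss described above is as a margin constraint on cross-identity, inter-modal pairs; a schematic PyTorch sketch (the distance, reduction, and margin value are assumptions, not the authors' exact formulation):

    import torch
    import torch.nn.functional as F

    def cim_loss(anchor_feats, negative_feats, margin=0.3):
        """Penalize cross-identity inter-modal pairs whose distance falls below the margin.

        anchor_feats:   (B, D) features from one modality (e.g., RGB)
        negative_feats: (B, D) different-identity features from another modality (e.g., NIR/TIR)
        """
        a = F.normalize(anchor_feats, dim=1)
        n = F.normalize(negative_feats, dim=1)
        dist = (a - n).pow(2).sum(dim=1).sqrt()      # Euclidean distance per pair
        return F.relu(margin - dist).mean()          # zero loss once the margin is maintained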
\ No newline at end of file diff --git a/data/2024/aaai/HiFi-Gas: Hierarchical Federated Learning Incentive Mechanism Enhanced Gas Usage Estimation b/data/2024/aaai/HiFi-Gas: Hierarchical Federated Learning Incentive Mechanism Enhanced Gas Usage Estimation new file mode 100644 index 0000000000..fe61c6d406 --- /dev/null +++ b/data/2024/aaai/HiFi-Gas: Hierarchical Federated Learning Incentive Mechanism Enhanced Gas Usage Estimation @@ -0,0 +1 @@ +Gas usage estimation plays a critical role in various aspects of the power generation and delivery business, including budgeting, resource planning, and environmental preservation. Federated Learning (FL) has demonstrated its potential in enhancing the accuracy and reliability of gas usage estimation by enabling distributedly owned data to be leveraged, while ensuring privacy and confidentiality. However, to effectively motivate stakeholders to contribute their high-quality local data and computational resources for this purpose, incentive mechanism design is key. In this paper, we report our experience designing and deploying the Hierarchical FL Incentive mechanism for Gas usage estimation (HiFi-Gas) system. It is designed to cater to the unique structure of gas companies and their affiliated heating stations. HiFi-Gas provides effective incentivization in a hierarchical federated learning framework that consists of a horizontal federated learning (HFL) component for effective collaboration among gas companies and multiple vertical federated learning (VFL) components for the gas company and its affiliated heating stations. To motivate active participation and ensure fairness among gas companies and heating stations, we incorporate a multi-dimensional contribution-aware reward distribution function that considers both data quality and model contributions. Since its deployment in the ENN Group in December 2022, HiFi-Gas has successfully provided incentives for gas companies and heating stations to actively participate in FL training, resulting in more than 12% higher average gas usage estimation accuracy and substantial gas procurement cost savings. This implementation marks the first successful deployment of a hierarchical FL incentive approach in the energy industry. \ No newline at end of file diff --git a/data/2024/aaai/HiHPQ: Hierarchical Hyperbolic Product Quantization for Unsupervised Image Retrieval b/data/2024/aaai/HiHPQ: Hierarchical Hyperbolic Product Quantization for Unsupervised Image Retrieval new file mode 100644 index 0000000000..f295351bdd --- /dev/null +++ b/data/2024/aaai/HiHPQ: Hierarchical Hyperbolic Product Quantization for Unsupervised Image Retrieval @@ -0,0 +1 @@ +Existing unsupervised deep product quantization methods primarily aim for the increased similarity between different views of the identical image, whereas the delicate multi-level semantic similarities preserved between images are overlooked. Moreover, these methods predominantly focus on the Euclidean space for computational convenience, compromising their ability to map the multi-level semantic relationships between images effectively. To mitigate these shortcomings, we propose a novel unsupervised product quantization method dubbed Hierarchical Hyperbolic Product Quantization (HiHPQ), which learns quantized representations by incorporating hierarchical semantic similarity within hyperbolic geometry. 
Specifically, we propose a hyperbolic product quantizer, where the hyperbolic codebook attention mechanism and the quantized contrastive learning on the hyperbolic product manifold are introduced to expedite quantization. Furthermore, we propose a hierarchical semantics learning module, designed to enhance the distinction between similar and non-matching images for a query by utilizing the extracted hierarchical semantics as an additional training supervision. Experiments on benchmark image datasets show that our proposed method outperforms state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Hidden Follower Detection: How Is the Gaze-Spacing Pattern Embodied in Frequency Domain? b/data/2024/aaai/Hidden Follower Detection: How Is the Gaze-Spacing Pattern Embodied in Frequency Domain? new file mode 100644 index 0000000000..3e42f12b2e --- /dev/null +++ b/data/2024/aaai/Hidden Follower Detection: How Is the Gaze-Spacing Pattern Embodied in Frequency Domain? @@ -0,0 +1 @@ +Spatiotemporal social behavior analysis is a technique that studies the social behavior patterns of objects and estimates their risks based on their trajectories. In social public scenarios such as train stations, hidden following behavior has become one of the most challenging issues due to its probability of evolving into violent events, which exceeds 25%. In recent years, research on hidden following detection (HFD) has focused on differences in time series between hidden followers and normal pedestrians under two temporal characteristics: gaze and spatial distance. However, the time-domain representation of time series is irreversible and usually causes the loss of critical information. In this paper, we deeply study the expression efficiency of time/frequency-domain features of time series. By exploring how features can be recovered back to the source time series, we establish a fidelity estimation method for feature expression and a selection model for frequency-domain features based on the signal-to-distortion ratio (SDR). Experimental results demonstrate that feature fidelity and HFD performance are positively correlated, and that the fidelity and HFD performance of frequency-domain features are significantly better than those of time-domain features. On both real and simulated datasets, the accuracy of the proposed method is increased by 3%, and the gaze-only module is improved by 10%. Related research has explored new methods for optimal feature selection based on fidelity, new patterns for efficient feature expression of hidden following behavior, and the mechanism of multimodal collaborative identification. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchical Aligned Multimodal Learning for NER on Tweet Posts b/data/2024/aaai/Hierarchical Aligned Multimodal Learning for NER on Tweet Posts new file mode 100644 index 0000000000..93a4a70ce9 --- /dev/null +++ b/data/2024/aaai/Hierarchical Aligned Multimodal Learning for NER on Tweet Posts @@ -0,0 +1 @@ +Mining structured knowledge from tweets using named entity recognition (NER) can be beneficial for many downstream applications such as recommendation and intention understanding. With tweet posts tending to be multimodal, multimodal named entity recognition (MNER) has attracted more attention. In this paper, we propose a novel approach, which can dynamically align the image and text sequence and achieve multi-level cross-modal learning to augment textual word representation for MNER improvement.
To be specific, our framework can be split into three main stages: the first stage focuses on intra-modality representation learning to derive the implicit global and local knowledge of each modality, the second evaluates the relevance between the text and its accompanying image and integrates different grained visual information based on the relevance, the third enforces semantic refinement via iterative cross-modal interactions and co-attention. We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchical Multi-Marginal Optimal Transport for Network Alignment b/data/2024/aaai/Hierarchical Multi-Marginal Optimal Transport for Network Alignment new file mode 100644 index 0000000000..e93d793364 --- /dev/null +++ b/data/2024/aaai/Hierarchical Multi-Marginal Optimal Transport for Network Alignment @@ -0,0 +1 @@ +Finding node correspondence across networks, namely multi-network alignment, is an essential prerequisite for joint learning on multiple networks. Despite great success in aligning networks in pairs, the literature on multi-network alignment is sparse due to the exponentially growing solution space and lack of high-order discrepancy measures. To fill this gap, we propose a hierarchical multi-marginal optimal transport framework named HOT for multi-network alignment. To handle the large solution space, multiple networks are decomposed into smaller aligned clusters via the fused Gromov-Wasserstein (FGW) barycenter. To depict high-order relationships across multiple networks, the FGW distance is generalized to the multi-marginal setting, based on which networks can be aligned jointly. A fast proximal point method is further developed with guaranteed convergence to a local optimum. Extensive experiments and analysis show that our proposed HOT achieves significant improvements over the state-of-the-art in both effectiveness and scalability. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchical Planning and Learning for Robots in Stochastic Settings Using Zero-Shot Option Invention b/data/2024/aaai/Hierarchical Planning and Learning for Robots in Stochastic Settings Using Zero-Shot Option Invention new file mode 100644 index 0000000000..d247b9343a --- /dev/null +++ b/data/2024/aaai/Hierarchical Planning and Learning for Robots in Stochastic Settings Using Zero-Shot Option Invention @@ -0,0 +1 @@ +This paper addresses the problem of inventing and using hierarchical representations for stochastic robot-planning problems. Rather than using hand-coded state or action representations as input, it presents new methods for learning how to create a high-level action representation for long-horizon, sparse reward robot planning problems in stochastic settings with unknown dynamics. After training, this system yields a robot-specific but environment independent planning system. Given new problem instances in unseen stochastic environments, it first creates zero-shot options (without any experience on the new environment) with dense pseudo-rewards and then uses them to solve the input problem in a hierarchical planning and refinement process. Theoretical results identify sufficient conditions for completeness of the presented approach. 
Extensive empirical analysis shows that even in settings that go beyond these sufficient conditions, this approach convincingly outperforms baselines by 2x in terms of solution time with orders of magnitude improvement in solution quality. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchical and Incremental Structural Entropy Minimization for Unsupervised Social Event Detection b/data/2024/aaai/Hierarchical and Incremental Structural Entropy Minimization for Unsupervised Social Event Detection new file mode 100644 index 0000000000..ad3cd58b3d --- /dev/null +++ b/data/2024/aaai/Hierarchical and Incremental Structural Entropy Minimization for Unsupervised Social Event Detection @@ -0,0 +1 @@ +As a trending approach for social event detection, graph neural network (GNN)-based methods enable a fusion of natural language semantics and the complex social network structural information, thus showing SOTA performance. However, GNN-based methods can miss useful message correlations. Moreover, they require manual labeling for training and predetermining the number of events for prediction. In this work, we address social event detection via graph structural entropy (SE) minimization. While keeping the merits of the GNN-based methods, the proposed framework, HISEvent, constructs more informative message graphs, is unsupervised, and does not require the number of events given a priori. Specifically, we incrementally explore the graph neighborhoods using 1-dimensional (1D) SE minimization to supplement the existing message graph with edges between semantically related messages. We then detect events from the message graph by hierarchically minimizing 2-dimensional (2D) SE. Our proposed 1D and 2D SE minimization algorithms are customized for social event detection and effectively tackle the efficiency problem of the existing SE minimization algorithms. Extensive experiments show that HISEvent consistently outperforms GNN-based methods and achieves the new SOTA for social event detection under both closed- and open-set settings while being efficient and robust. \ No newline at end of file diff --git a/data/2024/aaai/Hierarchize Pareto Dominance in Multi-Objective Stochastic Linear Bandits b/data/2024/aaai/Hierarchize Pareto Dominance in Multi-Objective Stochastic Linear Bandits new file mode 100644 index 0000000000..66f7939038 --- /dev/null +++ b/data/2024/aaai/Hierarchize Pareto Dominance in Multi-Objective Stochastic Linear Bandits @@ -0,0 +1 @@ +Multi-objective Stochastic Linear bandit (MOSLB) plays a critical role in the sequential decision-making paradigm, however, most existing methods focus on the Pareto dominance among different objectives without considering any priority. In this paper, we study bandit algorithms under mixed Pareto-lexicographic orders, which can reflect decision makers' preferences. We adopt the Grossone approach to deal with these orders and develop the notion of Pareto-lexicographic optimality to evaluate the learners' performance. Our work represents a first attempt to address these important and realistic orders in bandit algorithms. To design algorithms under these orders, the upper confidence bound (UCB) policy and the prior free lexicographical filter are adapted to approximate the optimal arms at each round. Moreover, the framework of the algorithms involves two stages in pursuit of the balance between exploration and exploitation. Theoretical analysis as well as numerical experiments demonstrate the effectiveness of our algorithms. 
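The Pareto component of such mixed orders is straightforward to state in code; a small helper for keeping only non-dominated arms given estimated reward vectors (illustrative only; the paper's algorithms additionally use UCB-style confidence bounds and a lexicographic filter):

    import numpy as np

    def dominates(u, v):
        """u Pareto-dominates v if it is no worse in every objective and better in at least one."""
        u, v = np.asarray(u), np.asarray(v)
        return np.all(u >= v) and np.any(u > v)

    def pareto_front(reward_estimates):
        """Return indices of arms whose estimated reward vectors are not dominated by any other arm."""
        front = []
        for i, u in enumerate(reward_estimates):
            if not any(dominates(v, u) for j, v in enumerate(reward_estimates) if j != i):
                front.append(i)
        return front

    # Example with three arms and two objectives: the third arm is dominated by the first.
    print(pareto_front([[0.5, 0.9], [0.6, 0.7], [0.4, 0.6]]))  # -> [0, 1]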
\ No newline at end of file diff --git a/data/2024/aaai/High Significant Fault Detection in Azure Core Workload Insights b/data/2024/aaai/High Significant Fault Detection in Azure Core Workload Insights new file mode 100644 index 0000000000..a919fc2d7d --- /dev/null +++ b/data/2024/aaai/High Significant Fault Detection in Azure Core Workload Insights @@ -0,0 +1 @@ +Azure Core workload insights have time-series data with different metric units. Faults or Anomalies are observed in these time-series data owing to faults observed with respect to metric name, resources region, dimensions, and its dimension value associated with the data. For Azure Core, an important task is to highlight faults or anomalies to the user on a dashboard that they can perceive easily. The number of anomalies reported should be highly significant and in a limited number, e.g., 5-20 anomalies reported per hour. The reported anomalies will have significant user perception and high reconstruction error in any time-series forecasting model. Hence, our task is to automatically identify 'high significant anomalies' and their associated information for user perception. \ No newline at end of file diff --git a/data/2024/aaai/High-Dimensional Analysis for Generalized Nonlinear Regression: From Asymptotics to Algorithm b/data/2024/aaai/High-Dimensional Analysis for Generalized Nonlinear Regression: From Asymptotics to Algorithm new file mode 100644 index 0000000000..0e6cbaca8b --- /dev/null +++ b/data/2024/aaai/High-Dimensional Analysis for Generalized Nonlinear Regression: From Asymptotics to Algorithm @@ -0,0 +1 @@ +Overparameterization often leads to benign overfitting, where deep neural networks can be trained to overfit the training data but still generalize well on unseen data. However, it lacks a generalized asymptotic framework for nonlinear regressions and connections to conventional complexity notions. In this paper, we propose a generalized high-dimensional analysis for nonlinear regression models, including various nonlinear feature mapping methods and subsampling. Specifically, we first provide an implicit regularization parameter and asymptotic equivalents related to a classical complexity notion, i.e., effective dimension. We then present a high-dimensional analysis for nonlinear ridge regression and extend it to ridgeless regression in the under-parameterized and over-parameterized regimes, respectively. We find that the limiting risks decrease with the effective dimension. Motivated by these theoretical findings, we propose an algorithm, namely RFRed, to improve generalization ability. Finally, we validate our theoretical findings and the proposed algorithm through several experiments. \ No newline at end of file diff --git a/data/2024/aaai/High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field b/data/2024/aaai/High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field new file mode 100644 index 0000000000..1fd71ca721 --- /dev/null +++ b/data/2024/aaai/High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field @@ -0,0 +1 @@ +One crucial aspect of 3D head avatar reconstruction lies in the details of facial expressions. 
Although recent NeRF-based photo-realistic 3D head avatar methods achieve high-quality avatar rendering, they still encounter challenges in retaining intricate facial expression details because they overlook the potential of specific expression variations at different spatial positions when conditioning the radiance field. Motivated by this observation, we introduce a novel Spatially-Varying Expression (SVE) conditioning. The SVE can be obtained by a simple MLP-based generation network, encompassing both spatial positional features and global expression information. Benefiting from the rich and diverse information of the SVE at different positions, the proposed SVE-conditioned NeRF can deal with intricate facial expressions and achieve realistic rendering and geometry details of high-fidelity 3D head avatars. Additionally, to further elevate the geometric and rendering quality, we introduce a new coarse-to-fine training strategy, including a geometry initialization strategy at the coarse stage and an adaptive importance sampling strategy at the fine stage. Extensive experiments indicate that our method outperforms other state-of-the-art (SOTA) methods in rendering and geometry quality on mobile phone-collected and public datasets. Code and data can be found at https://github.com/minghanqin/AvatarSVE. \ No newline at end of file diff --git a/data/2024/aaai/High-Fidelity Diffusion-Based Image Editing b/data/2024/aaai/High-Fidelity Diffusion-Based Image Editing new file mode 100644 index 0000000000..17fac845a9 --- /dev/null +++ b/data/2024/aaai/High-Fidelity Diffusion-Based Image Editing @@ -0,0 +1 @@ +Diffusion models have attained remarkable success in the domains of image generation and editing. It is widely recognized that employing more inversion and denoising steps in diffusion models leads to improved image reconstruction quality. However, the editing performance of diffusion models tends to remain unsatisfactory even with increasing denoising steps. The deficiency in editing could be attributed to the conditional Markovian property of the editing process, where errors accumulate throughout denoising steps. To tackle this challenge, we first propose an innovative framework where a rectifier module is incorporated to modulate diffusion model weights with residual features from the original images, thereby providing compensatory information to bridge the fidelity gap. Furthermore, we introduce a novel learning paradigm aimed at minimizing error propagation during the editing process, which trains the editing procedure in a manner similar to denoising score-matching. Extensive experiments demonstrate that our proposed framework and training strategy achieve high-fidelity reconstruction and editing results across various levels of denoising steps, while exhibiting exceptional performance in terms of both quantitative metrics and qualitative assessments. Lastly, we explore our model's generalization through several applications like image-to-image translation and out-of-domain image editing. \ No newline at end of file diff --git a/data/2024/aaai/High-Fidelity Gradient Inversion in Distributed Learning b/data/2024/aaai/High-Fidelity Gradient Inversion in Distributed Learning new file mode 100644 index 0000000000..6204b06aca --- /dev/null +++ b/data/2024/aaai/High-Fidelity Gradient Inversion in Distributed Learning @@ -0,0 +1 @@ +Distributed learning frameworks aim to train global models by sharing gradients among clients while preserving the data privacy of each individual client.
However, extensive research has demonstrated that these learning frameworks do not absolutely ensure the privacy, as training data can be reconstructed from shared gradients. Nevertheless, the existing privacy-breaking attack methods have certain limitations. Some are applicable only to small models, while others can only recover images in small batch size and low resolutions, or with low fidelity. Furthermore, when there are some data with the same label in a training batch, existing attack methods usually perform poorly. In this work, we successfully address the limitations of existing attacks by two steps. Firstly, we model the coefficient of variation (CV) of features and design an evolutionary algorithm based on the minimum CV to accurately reconstruct the labels of all training data. After that, we propose a stepwise gradient inversion attack, which dynamically adapts the objective function, thereby effectively and rationally promoting the convergence of attack results towards an optimal solution. With these two steps, our method is able to recover high resolution images (224*224 pixel, from ImageNet and Web) with high fidelity in distributed learning scenarios involving complex models and larger batch size. Experiment results demonstrate the superiority of our approach, reveal the potential vulnerabilities of the distributed learning paradigm, and emphasize the necessity of developing more secure mechanisms. Source code is available at https://github.com/MiLab-HITSZ/2023YeHFGradInv. \ No newline at end of file diff --git a/data/2024/aaai/High-Order Structure Based Middle-Feature Learning for Visible-Infrared Person Re-identification b/data/2024/aaai/High-Order Structure Based Middle-Feature Learning for Visible-Infrared Person Re-identification new file mode 100644 index 0000000000..312b4ae873 --- /dev/null +++ b/data/2024/aaai/High-Order Structure Based Middle-Feature Learning for Visible-Infrared Person Re-identification @@ -0,0 +1 @@ +Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same persons captured by visible (VIS) and infrared (IR) cameras. Existing VI-ReID methods ignore high-order structure information of features while being relatively difficult to learn a reasonable common feature space due to the large modality discrepancy between VIS and IR images. To address the above problems, we propose a novel high-order structure based middle-feature learning network (HOS-Net) for effective VI-ReID. Specifically, we first leverage a short- and long-range feature extraction (SLE) module to effectively exploit both short-range and long-range features. Then, we propose a high-order structure learning (HSL) module to successfully model the high-order relationship across different local features of each person image based on a whitened hypergraph network. This greatly alleviates model collapse and enhances feature representations. Finally, we develop a common feature space learning (CFL) module to learn a discriminative and reasonable common feature space based on middle features generated by aligning features from different modalities and ranges. In particular, a modality-range identity-center contrastive (MRIC) loss is proposed to reduce the distances between the VIS, IR, and middle features, smoothing the training process. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets show that our HOS-Net achieves superior state-of-the-art performance. Our code is available at https://github.com/Jaulaucoeng/HOS-Net. 
\ No newline at end of file diff --git a/data/2024/aaai/High-Quality Real-Time Rendering Using Subpixel Sampling Reconstruction b/data/2024/aaai/High-Quality Real-Time Rendering Using Subpixel Sampling Reconstruction new file mode 100644 index 0000000000..72bcdfecc2 --- /dev/null +++ b/data/2024/aaai/High-Quality Real-Time Rendering Using Subpixel Sampling Reconstruction @@ -0,0 +1 @@ +Generating high-quality, realistic rendering images for real-time applications generally requires tracing a few samples-per-pixel (spp) and using deep learning-based approaches to denoise the resulting low-spp images. Existing denoising methods necessitate a substantial time expenditure when rendering at high resolutions due to the physically-based sampling and network inference time burdens. In this paper, we propose a novel Monte Carlo sampling strategy to accelerate the sampling process and a corresponding denoiser, subpixel sampling reconstruction (SSR), to obtain high-quality images. Extensive experiments demonstrate that our method significantly outperforms previous approaches in denoising quality and reduces overall time costs, enabling real-time rendering capabilities at 2K resolution. \ No newline at end of file diff --git a/data/2024/aaai/History Matters: Temporal Knowledge Editing in Large Language Model b/data/2024/aaai/History Matters: Temporal Knowledge Editing in Large Language Model new file mode 100644 index 0000000000..2c382a6bf3 --- /dev/null +++ b/data/2024/aaai/History Matters: Temporal Knowledge Editing in Large Language Model @@ -0,0 +1 @@ +The imperative task of revising or updating the knowledge stored within large language models arises from two distinct sources: intrinsic errors inherent in the model which should be corrected and outdated knowledge due to external shifts in the real world which should be updated. Prevailing efforts in model editing conflate these two distinct categories of edits arising from distinct reasons and directly modify the original knowledge in models into new knowledge. However, we argue that preserving the model's original knowledge remains pertinent. Specifically, if a model's knowledge becomes outdated due to evolving worldly dynamics, it should retain recollection of the historical knowledge while integrating the newfound knowledge. In this work, we introduce the task of Temporal Knowledge Editing (TKE) and establish a benchmark AToKe (Assessment of TempOral Knowledge Editing) to evaluate current model editing methods. We find that while existing model editing methods are effective at making models remember new knowledge, the edited model catastrophically forgets historical knowledge. To address this gap, we propose a simple and general framework termed Multi-Editing with Time Objective (METO) for enhancing existing editing models, which edits both historical and new knowledge concurrently and optimizes the model's prediction for the time of each fact. Our assessments demonstrate that while AToKe is still difficult, METO maintains the effectiveness of learning new knowledge and meanwhile substantially improves the performance of edited models on utilizing historical knowledge. 
\ No newline at end of file diff --git a/data/2024/aaai/Homophily-Related: Adaptive Hybrid Graph Filter for Multi-View Graph Clustering b/data/2024/aaai/Homophily-Related: Adaptive Hybrid Graph Filter for Multi-View Graph Clustering new file mode 100644 index 0000000000..4e99231cbc --- /dev/null +++ b/data/2024/aaai/Homophily-Related: Adaptive Hybrid Graph Filter for Multi-View Graph Clustering @@ -0,0 +1 @@ +Recently there is a growing focus on graph data, and multi-view graph clustering has become a popular area of research interest. Most of the existing methods are only applicable to homophilous graphs, yet the extensive real-world graph data can hardly fulfill the homophily assumption, where the connected nodes tend to belong to the same class. Several studies have pointed out that the poor performance on heterophilous graphs is actually due to the fact that conventional graph neural networks (GNNs), which are essentially low-pass filters, discard information other than the low-frequency information on the graph. Nevertheless, on certain graphs, particularly heterophilous ones, neglecting high-frequency information and focusing solely on low-frequency information impedes the learning of node representations. To break this limitation, our motivation is to perform graph filtering that is closely related to the homophily degree of the given graph, with the aim of fully leveraging both low-frequency and high-frequency signals to learn distinguishable node embedding. In this work, we propose Adaptive Hybrid Graph Filter for Multi-View Graph Clustering (AHGFC). Specifically, a graph joint process and graph joint aggregation matrix are first designed by using the intrinsic node features and adjacency relationship, which makes the low and high-frequency signals on the graph more distinguishable. Then we design an adaptive hybrid graph filter that is related to the homophily degree, which learns the node embedding based on the graph joint aggregation matrix. After that, the node embedding of each view is weighted and fused into a consensus embedding for the downstream task. Experimental results show that our proposed model performs well on six datasets containing homophilous and heterophilous graphs. \ No newline at end of file diff --git a/data/2024/aaai/Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models b/data/2024/aaai/Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models new file mode 100644 index 0000000000..629780096e --- /dev/null +++ b/data/2024/aaai/Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models @@ -0,0 +1 @@ +Recently, Large Language Models (LLMs) have shown impressive abilities in code generation. However, existing LLMs' decoding strategies are designed for Natural Language (NL) generation, overlooking the differences between NL and programming languages (PL). Due to this oversight, a better decoding strategy for code generation remains an open question. In this paper, we conduct the first systematic study to explore a decoding strategy specialized in code generation. With an analysis of loss distributions of code tokens, we find that code tokens can be divided into two categories: challenging tokens that are difficult to predict and confident tokens that can be easily inferred. Among them, the challenging tokens mainly appear at the beginning of a code block. 
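The passage that follows introduces AdapT sampling, which adjusts the temperature per token; as a rough preview, per-token temperature scaling can be sketched like this (the challenging-token test and both temperature values are placeholders, not the paper's settings):

    import torch

    def adaptive_temperature_sample(logits, is_challenging, t_high=1.2, t_low=0.6):
        """Sample one token, using a higher temperature when the position is deemed challenging.

        logits:         (vocab,) next-token logits
        is_challenging: bool flag, e.g. the position starts a code block or the model is uncertain
        """
        temperature = t_high if is_challenging else t_low
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)

    # A crude uncertainty-based flag: treat low-confidence positions as challenging.
    def looks_challenging(logits, confidence_threshold=0.5):
        return torch.softmax(logits, dim=-1).max().item() < confidence_threshold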
Inspired by the above findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling, which dynamically adjusts the temperature coefficient when decoding different tokens. We apply a larger temperature when sampling for challenging tokens, allowing LLMs to explore diverse choices. We employ a smaller temperature for confident tokens avoiding the influence of tail randomness noises. We apply AdapT sampling to LLMs with different sizes and conduct evaluations on two popular datasets. Results show that AdapT sampling significantly outperforms state-of-the-art decoding strategy. \ No newline at end of file diff --git a/data/2024/aaai/How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes b/data/2024/aaai/How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes new file mode 100644 index 0000000000..448ab6235a --- /dev/null +++ b/data/2024/aaai/How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes @@ -0,0 +1 @@ +Question generation (QG) is a natural language processing task with an abundance of potential benefits and use cases in the educational domain. In order for this potential to be realized, QG systems must be designed and validated with pedagogical needs in mind. However, little research has assessed or designed QG approaches with the input of real teachers or students. This paper applies a large language model-based QG approach where questions are generated with learning goals derived from Bloom's taxonomy. The automatically generated questions are used in multiple experiments designed to assess how teachers use them in practice. The results demonstrate that teachers prefer to write quizzes with automatically generated questions, and that such quizzes have no loss in quality compared to handwritten versions. Further, several metrics indicate that automatically generated questions can even improve the quality of the quizzes created, showing the promise for large scale use of QG in the classroom setting. \ No newline at end of file diff --git a/data/2024/aaai/How to Evaluate Behavioral Models b/data/2024/aaai/How to Evaluate Behavioral Models new file mode 100644 index 0000000000..71a8383f1d --- /dev/null +++ b/data/2024/aaai/How to Evaluate Behavioral Models @@ -0,0 +1 @@ +Researchers building behavioral models, such as behavioral game theorists, use experimental data to evaluate predictive models of human behavior. However, there is little agreement about which loss function should be used in evaluations, with error rate, negative log-likelihood, cross-entropy, Brier score, and squared L2 error all being common choices. We attempt to offer a principled answer to the question of which loss functions should be used for this task, formalizing axioms that we argue loss functions should satisfy. We construct a family of loss functions, which we dub ``diagonal bounded Bregman divergences'', that satisfy all of these axioms. These rule out many loss functions used in practice, but notably include squared L2 error; we thus recommend its use for evaluating behavioral models. \ No newline at end of file diff --git a/data/2024/aaai/How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection b/data/2024/aaai/How to Evaluate the Generalization of Detection? 
A Benchmark for Comprehensive Open-Vocabulary Detection new file mode 100644 index 0000000000..266719c68f --- /dev/null +++ b/data/2024/aaai/How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection @@ -0,0 +1 @@ +Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, which do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models all fail on the new tasks except for simple object types, demonstrating the value of the proposed dataset in pinpointing the weakness of current OVD models and guiding future research. Furthermore, the proposed NMS-AP metric is verified by experiments to provide a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at https://github.com/om-ai-lab/OVDEval \ No newline at end of file diff --git a/data/2024/aaai/How to Make Knockout Tournaments More Popular? b/data/2024/aaai/How to Make Knockout Tournaments More Popular? new file mode 100644 index 0000000000..b82ffe82d1 --- /dev/null +++ b/data/2024/aaai/How to Make Knockout Tournaments More Popular? @@ -0,0 +1 @@ +Given a mapping from a set of players to the leaves of a complete binary tree (called a seeding), a knockout tournament is conducted as follows: every round, every two players with a common parent compete against each other, and the winner is promoted to the common parent; then, the leaves are deleted. When only one player remains, it is declared the winner. This is a popular competition format in sports, elections, and decision-making. Over the past decade, it has been studied intensively from both theoretical and practical points of view. Most frequently, the objective is to seed the tournament in a way that ``assists'' (or even guarantees) some particular player to win the competition. We introduce a new objective, which is very sensible from the perspective of the directors of the competition: maximize the profit or popularity of the tournament. Specifically, we associate a ``score'' with every possible match, and aim to seed the tournament to maximize the sum of the scores of the matches that take place. We focus on the case where we assume a total order on the players' strengths, and provide a wide spectrum of results on the computational complexity of the problem. \ No newline at end of file diff --git a/data/2024/aaai/How to Overcome Curse-of-Dimensionality for Out-of-Distribution Detection? b/data/2024/aaai/How to Overcome Curse-of-Dimensionality for Out-of-Distribution Detection? 
new file mode 100644 index 0000000000..4df931f6f1 --- /dev/null +++ b/data/2024/aaai/How to Overcome Curse-of-Dimensionality for Out-of-Distribution Detection? @@ -0,0 +1 @@ +Machine learning models deployed in the wild can be challenged by out-of-distribution (OOD) data from unknown classes. Recent advances in OOD detection rely on distance measures to distinguish samples that are relatively far away from the in-distribution (ID) data. Despite the promise, distance-based methods can suffer from the curse-of-dimensionality problem, which limits their efficacy in high-dimensional feature spaces. To combat this problem, we propose a novel framework, Subspace Nearest Neighbor (SNN), for OOD detection. In training, our method regularizes the model and its feature representation by leveraging the most relevant subset of dimensions (i.e., subspace). The subspace learning yields highly distinguishable distance measures between ID and OOD data. We provide comprehensive experiments and ablations to validate the efficacy of SNN. Compared to the current best distance-based method, SNN reduces the average FPR95 by 15.96% on the CIFAR-100 benchmark. \ No newline at end of file diff --git a/data/2024/aaai/How to Protect Copyright Data in Optimization of Large Language Models? b/data/2024/aaai/How to Protect Copyright Data in Optimization of Large Language Models? new file mode 100644 index 0000000000..38a14792f4 --- /dev/null +++ b/data/2024/aaai/How to Protect Copyright Data in Optimization of Large Language Models? @@ -0,0 +1,3 @@ +Large language models (LLMs) and generative AI have played a transformative role in computer research and applications. Controversy has arisen as to whether these models output copyrighted data, which can occur if the data the models are trained on is copyrighted. LLMs are built on the transformer neural network architecture, which in turn relies on a mathematical computation called Attention that uses the softmax function. + +In this paper, we observe that large language model training and optimization can be seen as a softmax regression problem. We then establish a method of efficiently performing softmax regression, in a way that prevents the regression function from generating copyrighted data. This establishes a theoretical method of training large language models in a way that avoids generating copyrighted data. \ No newline at end of file diff --git a/data/2024/aaai/How to Trade Off the Quantity and Capacity of Teacher Ensemble: Learning Categorical Distribution to Stochastically Employ a Teacher for Distillation b/data/2024/aaai/How to Trade Off the Quantity and Capacity of Teacher Ensemble: Learning Categorical Distribution to Stochastically Employ a Teacher for Distillation new file mode 100644 index 0000000000..5a38f880f9 --- /dev/null +++ b/data/2024/aaai/How to Trade Off the Quantity and Capacity of Teacher Ensemble: Learning Categorical Distribution to Stochastically Employ a Teacher for Distillation @@ -0,0 +1 @@ +We observe two phenomena with respect to quantity and capacity: 1) more teachers are not always better for multi-teacher knowledge distillation, and 2) a stronger teacher is not always better for single-teacher knowledge distillation.
To trade off the quantity and capacity of the teacher ensemble, in this paper, we propose a new distillation paradigm named Dynamic Knowledge Distillation (DynaKD) that learns an adaptive categorical distribution to stochastically employ a teacher from the ensemble at each step, transferring knowledge from the teacher ensemble into the student. DynaKD has three advantages: 1) it preserves the diversity of each teacher via a one-to-one distillation manner instead of several-for-one, 2) it makes the best of a powerful teacher via the multi-level assistant teachers in the ensemble, and 3) it can also dynamically determine the importance of each teacher for various tasks. To verify the effectiveness of the proposed approach, we conduct extensive experiments for BERT compression on the GLUE benchmark. Experimental results show that the proposed approach achieves state-of-the-art scores compared to previous compression approaches on five out of seven downstream tasks, including pushing MRPC F1 and accuracy to 92.2 (a 1.4-point absolute improvement) and RTE accuracy to 76.2 (a 2.8-point absolute improvement). Moreover, we also conduct extensive experiments for image classification on CIFAR-100. Similarly, DynaKD also achieves state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/How to Use the Metropolis Algorithm for Multi-Objective Optimization? b/data/2024/aaai/How to Use the Metropolis Algorithm for Multi-Objective Optimization? new file mode 100644 index 0000000000..1558cc0df1 --- /dev/null +++ b/data/2024/aaai/How to Use the Metropolis Algorithm for Multi-Objective Optimization? @@ -0,0 +1,7 @@ +The Metropolis algorithm can cope with local optima by accepting inferior solutions with suitably small probability. That this can work well was not only observed in empirical research, but also via mathematical runtime analyses on single-objective benchmarks. This paper takes several steps towards understanding, again via theoretical means, whether such advantages can also be obtained in multi-objective optimization. + +The original Metropolis algorithm has two components, one-bit mutation and the acceptance strategy, which allows accepting inferior solutions. When adjusting the acceptance strategy to multi-objective optimization in the way that an inferior solution that is accepted replaces its parent, then the Metropolis algorithm is not very efficient on our multi-objective version of the multimodal DLB benchmark called DLTB. With one-bit mutation, this multi-objective Metropolis algorithm cannot optimize the DLTB problem; with standard bit-wise mutation it needs at least Ω(n^5) time to cover the full Pareto front. In contrast, we show that many other multi-objective optimizers, namely the GSEMO, SMS-EMOA, and NSGA-II, only need time O(n^4). + +When the parent is kept when an inferior point is accepted, the multi-objective Metropolis algorithm with either one-bit or standard bit-wise mutation solves the DLTB problem efficiently, with one-bit mutation experimentally leading to better results than several other algorithms. + +Overall, our work suggests that the general mechanism of the Metropolis algorithm can be interesting in multi-objective optimization, but that the implementation details can have a huge impact on the performance.
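To make the two acceptance variants above concrete, here is a schematic bit-string version (the scalar worsening measure, temperature, and archive handling are simplifications for illustration, not the analyzed algorithms):

    import math
    import random

    def metropolis_mo(objectives, n, steps=10_000, temperature=1.0, keep_parent=True):
        """Schematic multi-objective Metropolis on {0,1}^n with standard bit-wise mutation.

        objectives:  function mapping a bit list to a tuple of values (all to be maximized)
        keep_parent: if True, an accepted inferior offspring does not replace its parent
        """
        parent = [random.randint(0, 1) for _ in range(n)]
        archive = [(tuple(parent), objectives(parent))]
        for _ in range(steps):
            child = [b ^ (random.random() < 1.0 / n) for b in parent]   # flip each bit w.p. 1/n
            f_p, f_c = objectives(parent), objectives(child)
            worsening = sum(max(0.0, p - c) for p, c in zip(f_p, f_c))  # 0 if child is no worse anywhere
            accept = worsening == 0 or random.random() < math.exp(-worsening / temperature)
            if accept:
                archive.append((tuple(child), f_c))
                if not keep_parent or worsening == 0:
                    parent = child
        return archive  # a Pareto filter over the archive would give the approximated front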
\ No newline at end of file diff --git a/data/2024/aaai/HuTuMotion: Human-Tuned Navigation of Latent Motion Diffusion Models with Minimal Feedback b/data/2024/aaai/HuTuMotion: Human-Tuned Navigation of Latent Motion Diffusion Models with Minimal Feedback new file mode 100644 index 0000000000..1ea2c8fa6f --- /dev/null +++ b/data/2024/aaai/HuTuMotion: Human-Tuned Navigation of Latent Motion Diffusion Models with Minimal Feedback @@ -0,0 +1 @@ +We introduce HuTuMotion, an innovative approach for generating natural human motions that navigates latent motion diffusion models by leveraging few-shot human feedback. Unlike existing approaches that sample latent variables from a standard normal prior distribution, our method adapts the prior distribution to better suit the characteristics of the data, as indicated by human feedback, thus enhancing the quality of motion generation. Furthermore, our findings reveal that utilizing few-shot feedback can yield performance levels on par with those attained through extensive human feedback. This discovery emphasizes the potential and efficiency of incorporating few-shot human-guided optimization within latent diffusion models for personalized and style-aware human motion generation applications. The experimental results show the significantly superior performance of our method over existing state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Human-Guided Moral Decision Making in Text-Based Games b/data/2024/aaai/Human-Guided Moral Decision Making in Text-Based Games new file mode 100644 index 0000000000..b708ab5888 --- /dev/null +++ b/data/2024/aaai/Human-Guided Moral Decision Making in Text-Based Games @@ -0,0 +1 @@ +Training reinforcement learning (RL) agents to achieve desired goals while also acting morally is a challenging problem. Transformer-based language models (LMs) have shown some promise in moral awareness, but their use in different contexts is problematic because of the complexity and implicitness of human morality. In this paper, we build on text-based games, which are challenging environments for current RL agents, and propose the HuMAL (Human-guided Morality Awareness Learning) algorithm, which adaptively learns personal values through human-agent collaboration with minimal manual feedback. We evaluate HuMAL on the Jiminy Cricket benchmark, a set of text-based games with various scenes and dense morality annotations, using both simulated and actual human feedback. The experimental results demonstrate that with a small amount of human feedback, HuMAL can improve task performance and reduce immoral behavior in a variety of games and is adaptable to different personal values. \ No newline at end of file diff --git a/data/2024/aaai/Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking b/data/2024/aaai/Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking new file mode 100644 index 0000000000..0b7499083d --- /dev/null +++ b/data/2024/aaai/Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking @@ -0,0 +1 @@ +Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames. Most methods accomplish the task by explicitly or implicitly leveraging strong cues (i.e., spatial and appearance information), which exhibit powerful instance-level discrimination. However, when object occlusion and clustering occur, spatial and appearance information will become ambiguous simultaneously due to the high overlap among objects. 
In this paper, we demonstrate that this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues to compensate for strong cues. Along with velocity direction, we introduce the confidence and height state as potential weak cues. With superior performance, our method still maintains Simple, Online and Real-Time (SORT) characteristics. Also, our method shows strong generalization for diverse trackers and scenarios in a plug-and-play and training-free manner. Significant and consistent improvements are observed when applying our method to 5 different representative trackers. Further, with both strong and weak cues, our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack, where interaction and severe occlusion frequently happen with complex motions. The code and models are available at https://github.com/ymzis69/HybridSORT. \ No newline at end of file diff --git a/data/2024/aaai/Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-Free Multi-Exposure Image Fusion b/data/2024/aaai/Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-Free Multi-Exposure Image Fusion new file mode 100644 index 0000000000..121561d54b --- /dev/null +++ b/data/2024/aaai/Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-Free Multi-Exposure Image Fusion @@ -0,0 +1,2 @@ +Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels. Despite its advancements, the field grapples with challenges, notably the reliance on manual designs for network structures and loss functions, and the constraints of utilizing simulated reference images as ground truths. Consequently, current methodologies often suffer from color distortions and exposure artifacts, further complicating the quest for authentic image representation. In addressing these challenges, this paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for the automatic design of both network structures and loss functions. More specifically, we harness a unique dual-search mechanism rooted in a novel weighted structure refinement architecture search. Besides, a hybrid supervised contrast constraint seamlessly guides and integrates with the searching process, facilitating a more adaptive and comprehensive search for optimal loss functions. We achieve state-of-the-art performance in comparison to various competitive schemes, yielding a 10.61% and 4.38% improvement in Visual Information Fidelity (VIF) +for general and no-reference scenarios, respectively, while providing results with high contrast, rich details and colors. The code is available at https://github.com/RollingPlain/HSDS_MEF. \ No newline at end of file diff --git a/data/2024/aaai/HybridGait: A Benchmark for Spatial-Temporal Cloth-Changing Gait Recognition with Hybrid Explorations b/data/2024/aaai/HybridGait: A Benchmark for Spatial-Temporal Cloth-Changing Gait Recognition with Hybrid Explorations new file mode 100644 index 0000000000..3f272d49d3 --- /dev/null +++ b/data/2024/aaai/HybridGait: A Benchmark for Spatial-Temporal Cloth-Changing Gait Recognition with Hybrid Explorations @@ -0,0 +1 @@ +Existing gait recognition benchmarks mostly include minor clothing variations in laboratory environments, but lack persistent changes in appearance over time and space. 
In this paper, we propose the first in-the-wild benchmark CCGait for cloth-changing gait recognition, which incorporates diverse clothing changes, indoor and outdoor scenes, and multi-modal statistics over 92 days. To further address the coupling effect of clothing and viewpoint variations, we propose a hybrid approach HybridGait that exploits both temporal dynamics and the projected 2D information of 3D human meshes. Specifically, we introduce a Canonical Alignment Spatial-Temporal Transformer (CA-STT) module to encode human joint position-aware features, and fully exploit 3D dense priors via a Silhouette-guided Deformation with 3D-2D Appearance Projection (SilD) strategy. Our contributions are twofold: we provide a challenging benchmark CCGait that captures realistic appearance changes over expanded time and space, and we propose a hybrid framework HybridGait that outperforms prior works on the CCGait and Gait3D benchmarks. Our project page is available at https://github.com/HCVLab/HybridGait. \ No newline at end of file diff --git a/data/2024/aaai/Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection b/data/2024/aaai/Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection new file mode 100644 index 0000000000..f55174e98e --- /dev/null +++ b/data/2024/aaai/Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection @@ -0,0 +1 @@ +Open World Object Detection (OWOD) is a challenging and realistic task that extends beyond the scope of the standard Object Detection task. It involves detecting both known and unknown objects while integrating learned knowledge for future tasks. However, the level of "unknownness" varies significantly depending on the context. For example, a tree is typically considered part of the background in a self-driving scene, but it may be significant in a household context. We argue that this contextual information should already be embedded within the known classes. In other words, there should be a semantic or latent structure relationship between the known and unknown items to be discovered. Motivated by this observation, we propose Hyp-OW, a method that learns and models a hierarchical representation of known items through a SuperClass Regularizer. Leveraging this representation allows us to effectively detect unknown objects using a similarity distance-based relabeling module. Extensive experiments on benchmark datasets demonstrate the effectiveness of Hyp-OW, achieving improvements in both known and unknown detection (up to 6 percent). These findings are particularly pronounced in our newly designed benchmark, where a strong hierarchical structure exists between known and unknown objects. \ No newline at end of file diff --git a/data/2024/aaai/HyperCube: Implicit Field Representations of Voxelized 3D Models (Student Abstract) b/data/2024/aaai/HyperCube: Implicit Field Representations of Voxelized 3D Models (Student Abstract) new file mode 100644 index 0000000000..fd551daf83 --- /dev/null +++ b/data/2024/aaai/HyperCube: Implicit Field Representations of Voxelized 3D Models (Student Abstract) @@ -0,0 +1 @@ +Implicit field representations offer an effective way of generating 3D object shapes. They leverage an implicit decoder (IM-NET) trained to take a 3D point coordinate concatenated with a shape encoding and to output a value indicating whether the point is outside the shape. 
This approach enables the efficient rendering of visually plausible objects but also has some significant limitations, resulting in a cumbersome training procedure and empty spaces within the rendered mesh. In this paper, we introduce a new HyperCube architecture based on interval arithmetic that enables direct processing of 3D voxels, trained using a hypernetwork paradigm to enforce model convergence. The code is available at https://github.com/mproszewska/hypercube. \ No newline at end of file diff --git a/data/2024/aaai/HyperEditor: Achieving Both Authenticity and Cross-Domain Capability in Image Editing via Hypernetworks b/data/2024/aaai/HyperEditor: Achieving Both Authenticity and Cross-Domain Capability in Image Editing via Hypernetworks new file mode 100644 index 0000000000..1c7b489bd8 --- /dev/null +++ b/data/2024/aaai/HyperEditor: Achieving Both Authenticity and Cross-Domain Capability in Image Editing via Hypernetworks @@ -0,0 +1 @@ +Editing real images authentically while also achieving cross-domain editing remains a challenge. Recent studies have focused on converting real images into latent codes and accomplishing image editing by manipulating these codes. However, merely manipulating the latent codes would constrain the edited images to the generator's image domain, hindering the attainment of diverse editing goals. In response, we propose an innovative image editing method called HyperEditor, which utilizes weight factors generated by hypernetworks to reassign the weights of the pre-trained StyleGAN2's generator. Guided by CLIP's cross-modal image-text semantic alignment, this innovative approach enables us to simultaneously accomplish authentic attribute editing and cross-domain style transfer, a capability not realized in previous methods. Additionally, we ascertain that modifying only the weights of specific layers in the generator can yield an equivalent editing result. Therefore, we introduce an adaptive layer selector, enabling our hypernetworks to autonomously identify the layers requiring output weight factors, which can further improve our hypernetworks' efficiency. Extensive experiments on abundant challenging datasets demonstrate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/HyperFast: Instant Classification for Tabular Data b/data/2024/aaai/HyperFast: Instant Classification for Tabular Data new file mode 100644 index 0000000000..1878d98bc9 --- /dev/null +++ b/data/2024/aaai/HyperFast: Instant Classification for Tabular Data @@ -0,0 +1 @@ +Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only in toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need for training a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. 
Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at https://github.com/AI-sandbox/HyperFast. \ No newline at end of file diff --git a/data/2024/aaai/Hyperbolic Graph Diffusion Model b/data/2024/aaai/Hyperbolic Graph Diffusion Model new file mode 100644 index 0000000000..abef8206a4 --- /dev/null +++ b/data/2024/aaai/Hyperbolic Graph Diffusion Model @@ -0,0 +1 @@ +Diffusion generative models (DMs) have achieved promising results in image and graph generation. However, real-world graphs, such as social networks, molecular graphs, and traffic graphs, generally share non-Euclidean topologies and hidden hierarchies. For example, the degree distributions of graphs are mostly power-law distributions. The current latent diffusion model embeds the hierarchical data in a Euclidean space, which leads to distortions and interferes with modeling the distribution. Instead, hyperbolic space has been found to be more suitable for capturing complex hierarchical structures due to its exponential growth property. In order to simultaneously utilize the data generation capabilities of diffusion models and the ability of hyperbolic embeddings to extract latent hierarchical distributions, we propose a novel graph generation method called Hyperbolic Graph Diffusion Model (HGDM), which consists of an auto-encoder to encode nodes into successive hyperbolic embeddings, and a DM that operates in the hyperbolic latent space. HGDM captures the crucial graph structure distributions by constructing a hyperbolic potential node space that incorporates edge information. Extensive experiments show that HGDM achieves better performance on generic graph and molecule generation benchmarks, with a 48% improvement in the quality of graph generation with highly hierarchical structures. \ No newline at end of file diff --git a/data/2024/aaai/Hypercorrelation Evolution for Video Class-Incremental Learning b/data/2024/aaai/Hypercorrelation Evolution for Video Class-Incremental Learning new file mode 100644 index 0000000000..e599005ff1 --- /dev/null +++ b/data/2024/aaai/Hypercorrelation Evolution for Video Class-Incremental Learning @@ -0,0 +1 @@ +Video class-incremental learning aims to recognize new actions while restricting the catastrophic forgetting of old ones, whose representative samples can only be saved in limited memory. Semantically variable subactions are susceptible to class confusion due to data imbalance. While existing methods address the problem by estimating and distilling the spatio-temporal knowledge, we further find that the refinement of hierarchical correlations is crucial for the alignment of spatio-temporal features. To enhance the adaptability to evolved actions, we propose a hierarchical aggregation strategy, in which hierarchical matching matrices are combined and jointly optimized to selectively store and retrieve relevant features from previous tasks. Meanwhile, a correlation refinement mechanism is presented to reinforce the bias on informative exemplars according to the online hypercorrelation distribution. 
Experimental results demonstrate the effectiveness of the proposed method on three standard video class-incremental learning benchmarks, outperforming state-of-the-art methods. Code is available at: https://github.com/Lsen991031/HCE \ No newline at end of file diff --git a/data/2024/aaai/Hypergraph Joint Representation Learning for Hypervertices and Hyperedges via Cross Expansion b/data/2024/aaai/Hypergraph Joint Representation Learning for Hypervertices and Hyperedges via Cross Expansion new file mode 100644 index 0000000000..b73e7a0925 --- /dev/null +++ b/data/2024/aaai/Hypergraph Joint Representation Learning for Hypervertices and Hyperedges via Cross Expansion @@ -0,0 +1 @@ +A hypergraph captures high-order information in structured data and has attracted much attention in machine learning and data mining. Existing approaches mainly learn representations for hypervertices by transforming a hypergraph to a standard graph, or learn representations for hypervertices and hyperedges in separate spaces. In this paper, we propose a hypergraph expansion method to transform a hypergraph to a standard graph while preserving high-order information. Different from previous hypergraph expansion approaches like clique expansion and star expansion, we transform both hypervertices and hyperedges in the hypergraph to vertices in the expanded graph, and construct connections between hypervertices or hyperedges, so that richer relationships can be used in graph learning. Based on the expanded graph, we propose a learning model to embed hypervertices and hyperedges in a joint representation space. Compared with the method of learning separate spaces for hypervertices and hyperedges, our method is able to capture common knowledge involved in hypervertices and hyperedges, and also improve the data efficiency and computational efficiency. To better leverage structure information, we minimize the graph reconstruction loss to preserve the structure information in the model. We perform experiments on both hypervertex classification and hyperedge classification tasks to demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Hypergraph Neural Architecture Search b/data/2024/aaai/Hypergraph Neural Architecture Search new file mode 100644 index 0000000000..3e448b5fc0 --- /dev/null +++ b/data/2024/aaai/Hypergraph Neural Architecture Search @@ -0,0 +1 @@ +In recent years, Hypergraph Neural Networks (HGNNs) have achieved considerable success by manually designing architectures, which are capable of extracting effective patterns with high-order interactions from non-Euclidean data. However, such a mechanism is extremely inefficient, demanding tremendous human effort to tune diverse model parameters. In this paper, we propose a novel Hypergraph Neural Architecture Search (HyperNAS) to automatically design the optimal HGNNs. The proposed model constructs a search space suitable for hypergraphs, and derives hypergraph architectures through differentiable search strategies. A hypergraph structure-aware distance criterion is introduced as a guideline for obtaining an optimal hypergraph architecture via the leave-one-out method. Experimental results for node classification on the benchmark Cora, Citeseer, and Pubmed citation networks and on hypergraph datasets show that HyperNAS outperforms existing HGNN models and graph NAS methods. 
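As a rough illustration of the differentiable search idea described in the HyperNAS abstract above (the candidate operation set, module names, and propagation matrix below are placeholder assumptions of ours, not the paper's actual search space), each edge of a search cell can mix candidate hypergraph operations with softmax-weighted architecture parameters that are learned jointly with the network weights:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConv(nn.Module):            # propagate over the hypergraph, then transform
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
    def forward(self, x, prop):
        return F.relu(self.lin(prop @ x))

class LinearOp(nn.Module):             # transform only, no propagation
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
    def forward(self, x, prop):
        return F.relu(self.lin(x))

class Identity(nn.Module):             # skip connection
    def __init__(self, dim):
        super().__init__()
    def forward(self, x, prop):
        return x

class MixedHyperOp(nn.Module):
    def __init__(self, dim, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))   # architecture parameters
    def forward(self, x, prop):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x, prop) for wi, op in zip(w, self.ops))

dim, n = 16, 8
cell = MixedHyperOp(dim, [HyperConv(dim), LinearOp(dim), Identity(dim)])
x = torch.randn(n, dim)                # node features
prop = torch.rand(n, n)                # stand-in for a normalized hypergraph propagation matrix
out = cell(x, prop)                    # after the search, the op with the largest alpha is kept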
\ No newline at end of file diff --git a/data/2024/aaai/Hypothesis Testing for Class-Conditional Noise Using Local Maximum Likelihood b/data/2024/aaai/Hypothesis Testing for Class-Conditional Noise Using Local Maximum Likelihood new file mode 100644 index 0000000000..3eff765bfb --- /dev/null +++ b/data/2024/aaai/Hypothesis Testing for Class-Conditional Noise Using Local Maximum Likelihood @@ -0,0 +1 @@ +In supervised learning, automatically assessing the quality of the labels before any learning takes place remains an open research question. In certain cases, hypothesis testing procedures have been proposed to assess whether a given instance-label dataset is contaminated with class-conditional label noise, as opposed to uniform label noise. The existing theory builds on the asymptotic properties of the Maximum Likelihood Estimate for parametric logistic regression. However, the parametric assumptions on top of which these approaches are constructed are often too strong and unrealistic in practice. To alleviate this problem, in this paper we propose an alternative path by showing how similar procedures can be followed when the underlying model is a product of Local Maximum Likelihood Estimation that leads to more flexible nonparametric logistic regression models, which in turn are less susceptible to model misspecification. This different view allows for wider applicability of the tests by offering users access to a richer model class. Similarly to existing works, we assume we have access to anchor points which are provided by the users. We introduce the necessary ingredients for the adaptation of the hypothesis tests to the case of nonparametric logistic regression and empirically compare against the parametric approach, presenting both synthetic and real-world case studies and discussing the advantages and limitations of the proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning b/data/2024/aaai/Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning new file mode 100644 index 0000000000..5aac0d91ee --- /dev/null +++ b/data/2024/aaai/Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning @@ -0,0 +1 @@ +Large language models (LLMs) show their powerful automatic reasoning and planning capability with a wealth of semantic knowledge about the human world. However, the grounding problem still hinders the applications of LLMs in the real-world environment. Existing studies try to fine-tune the LLM or utilize pre-defined behavior APIs to bridge the LLMs and the environment, which not only costs huge human effort to customize for every single task but also weakens the generality of LLMs. To autonomously ground the LLM onto the environment, we propose the Hypothesis, Verification, and Induction (HYVIN) framework to automatically and progressively ground the LLM with self-driven skill learning. HYVIN first employs the LLM to propose the hypothesis of sub-goals to achieve tasks and then verifies the feasibility of the hypothesis by interacting with the underlying environment. Once verified, HYVIN can then learn generalized skills with the guidance of these successfully grounded subgoals. These skills can be further utilized to accomplish more complex tasks that fail to pass the verification phase. 
Verified on the famous instruction-following task set BabyAI, HYVIN achieves performance on the most challenging tasks comparable to that of imitation learning methods that cost millions of demonstrations, proving the effectiveness of the learned skills and showing the feasibility and efficiency of our framework. \ No newline at end of file diff --git a/data/2024/aaai/I Open at the Close: A Deep Reinforcement Learning Evaluation of Open Streets Initiatives b/data/2024/aaai/I Open at the Close: A Deep Reinforcement Learning Evaluation of Open Streets Initiatives new file mode 100644 index 0000000000..c1e819182f --- /dev/null +++ b/data/2024/aaai/I Open at the Close: A Deep Reinforcement Learning Evaluation of Open Streets Initiatives @@ -0,0 +1 @@ +The open streets initiative "opens" streets to pedestrians and bicyclists by closing them to cars and trucks. The initiative, adopted by many cities across North America, increases community space in urban environments. But could open streets also make cities safer and less congested? We study this question by framing the choice of which streets to open as a reinforcement learning problem. In order to simulate the impact of opening streets, we first compare models for predicting vehicle collisions given network and temporal data. We find that a recurrent graph neural network, leveraging the graph structure and the short-term temporal dependence of the data, gives the best predictive performance. Then, with the ability to simulate collisions and traffic, we frame a reinforcement learning problem to find which streets to open. We compare the streets in the open streets initiative to those proposed by a Q-learning algorithm. We find that the streets proposed by the Q-learning algorithm have reliably better outcomes, while streets already selected by the open streets initiative have similar outcomes to randomly selected streets. We present our work as a step toward a principled choice of which streets to open for safer and less congested cities. \ No newline at end of file diff --git a/data/2024/aaai/I Prefer Not to Say: Protecting User Consent in Models with Optional Personal Data b/data/2024/aaai/I Prefer Not to Say: Protecting User Consent in Models with Optional Personal Data new file mode 100644 index 0000000000..4f8ffcef9e --- /dev/null +++ b/data/2024/aaai/I Prefer Not to Say: Protecting User Consent in Models with Optional Personal Data @@ -0,0 +1 @@ +We examine machine learning models in a setup where individuals have the choice to share optional personal information with a decision-making system, as seen in modern insurance pricing models. Some users consent to their data being used whereas others object and keep their data undisclosed. In this work, we show that the decision not to share data can be considered as information in itself that should be protected to respect users' privacy. This observation raises the overlooked problem of how to ensure that users who protect their personal data do not suffer any disadvantages as a result. To address this problem, we formalize protection requirements for models which only use the information for which active user consent was obtained. This excludes implicit information contained in the decision to share data or not. We offer the first solution to this problem by proposing the notion of Protected User Consent (PUC), which we prove to be loss-optimal under our protection requirement. 
We observe that privacy and performance are not fundamentally at odds with each other and that it is possible for a decision maker to benefit from additional data while respecting users' consent. To learn PUC-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we analyze the implications of PUC on challenging real datasets, tasks, and models. \ No newline at end of file diff --git a/data/2024/aaai/I-CEE: Tailoring Explanations of Image Classification Models to User Expertise b/data/2024/aaai/I-CEE: Tailoring Explanations of Image Classification Models to User Expertise new file mode 100644 index 0000000000..fd3691ca22 --- /dev/null +++ b/data/2024/aaai/I-CEE: Tailoring Explanations of Image Classification Models to User Expertise @@ -0,0 +1 @@ +Effectively explaining decisions of black-box machine learning models is critical to responsible deployment of AI systems that rely on them. Recognizing their importance, the field of explainable AI (XAI) provides several techniques to generate these explanations. Yet, there is relatively little emphasis on the user (the explainee) in this growing body of work and most XAI techniques generate "one-size-fits-all" explanations. To bridge this gap and move a step closer towards human-centered XAI, we present I-CEE, a framework that provides Image Classification Explanations tailored to User Expertise. Informed by existing work, I-CEE explains the decisions of image classification models by providing the user with an informative subset of training data (i.e., example images), corresponding local explanations, and model decisions. However, unlike prior work, I-CEE models the informativeness of the example images to depend on user expertise, resulting in different examples for different users. We posit that by tailoring the example set to user expertise, I-CEE can better facilitate users' understanding and simulatability of the model. To evaluate our approach, we conduct detailed experiments in both simulation and with human participants (N = 100) on multiple datasets. Experiments with simulated users show that I-CEE improves users' ability to accurately predict the model's decisions (simulatability) compared to baselines, providing promising preliminary results. Experiments with human participants demonstrate that our method significantly improves user simulatability accuracy, highlighting the importance of human-centered XAI. \ No newline at end of file diff --git a/data/2024/aaai/IBCA: An Intelligent Platform for Social Insurance Benefit Qualification Status Assessment b/data/2024/aaai/IBCA: An Intelligent Platform for Social Insurance Benefit Qualification Status Assessment new file mode 100644 index 0000000000..b12f3074c2 --- /dev/null +++ b/data/2024/aaai/IBCA: An Intelligent Platform for Social Insurance Benefit Qualification Status Assessment @@ -0,0 +1 @@ +Social insurance benefit qualification assessment is an important task to ensure that retirees enjoy their benefits according to the regulations. It also plays a key role in curbing social security fraud. In this paper, we report the deployment of the Intelligent Benefit Certification and Analysis (IBCA) platform, an AI-empowered platform for verifying the status of retirees to ensure proper disbursement of funds in Shandong province, China. 
Based on an improved Gated Recurrent Unit (GRU) neural network, IBCA aggregates missing value interpolation, temporal information, and global and local feature extraction to perform accurate retiree survival rate prediction. Based on the predicted results, a reliability assessment mechanism built on a Variational Auto-Encoder (VAE) and Monte-Carlo Dropout (MC Dropout) is then executed to assess the reliability of the predictions. Deployed since November 2019, the IBCA platform has been adopted by 12 cities across Shandong province, handling over 50 terabytes of data. It has empowered human resources and social services, civil affairs, and health care institutions to collaboratively provide high-quality public services. Under the IBCA platform, both the efficiency of resource utilization and the accuracy of benefit qualification assessment have been significantly improved. It has helped Dareway Software Co. Ltd earn over RMB 50 million in revenue. \ No newline at end of file diff --git a/data/2024/aaai/ICAR: Image-Based Complementary Auto Reasoning b/data/2024/aaai/ICAR: Image-Based Complementary Auto Reasoning new file mode 100644 index 0000000000..192c009dd3 --- /dev/null +++ b/data/2024/aaai/ICAR: Image-Based Complementary Auto Reasoning @@ -0,0 +1 @@ +Scene-aware Complementary Item Retrieval (CIR) is a challenging task which requires generating a set of compatible items across domains. Due to its subjectivity, it is difficult to set up a rigorous standard for both data collection and learning objectives. To address this challenging task, we propose a visual compatibility concept, composed of similarity (resemblance in color, geometry, texture, etc.) and complementarity (different items, like a table vs. a chair, completing a group). Based on this notion, we propose a compatibility learning framework, a category-aware Flexible Bidirectional Transformer (FBT), for visual "scene-based set compatibility reasoning" with cross-domain visual similarity input and auto-regressive complementary item generation. The FBT consists of an encoder with flexible masking, a category prediction arm, and an auto-regressive visual embedding prediction arm. The inputs to the FBT are cross-domain visual similarity invariant embeddings, making this framework quite generalizable. Furthermore, our proposed FBT model learns the inter-object compatibility from a large set of scene images in a self-supervised way. Compared with the SOTA methods, this approach achieves up to 5.3% and 9.6% improvement in FITB score and 22.3% and 31.8% SFID improvement on fashion and furniture, respectively. \ No newline at end of file diff --git a/data/2024/aaai/IGAMT: Privacy-Preserving Electronic Health Record Synthesization with Heterogeneity and Irregularity b/data/2024/aaai/IGAMT: Privacy-Preserving Electronic Health Record Synthesization with Heterogeneity and Irregularity new file mode 100644 index 0000000000..33c321f4fc --- /dev/null +++ b/data/2024/aaai/IGAMT: Privacy-Preserving Electronic Health Record Synthesization with Heterogeneity and Irregularity @@ -0,0 +1 @@ +Integrating electronic health records (EHR) into machine learning-driven clinical research and hospital applications is important, as it harnesses extensive and high-quality patient data to enhance outcome predictions and treatment personalization. 
Nonetheless, due to privacy and security concerns, the secondary use of EHR data, primarily for research purposes, is consistently governed and regulated, thereby constraining researchers' access to EHR data. Generating synthetic EHR data with deep learning methods is a viable and promising approach to mitigate privacy concerns, offering not only a supplementary resource for downstream applications but also sidestepping the confidentiality risks associated with real patient data. While prior efforts have concentrated on EHR data synthesis, significant challenges persist in the domain of generating synthetic EHR data: balancing the heterogeneity of real EHR data, including temporal and non-temporal features, addressing missing values and irregular measures, and ensuring the privacy of the real data used for model training. Existing works in this domain have only focused on solving one or two of the aforementioned challenges. In this work, we propose IGAMT, an innovative framework to generate privacy-preserved synthetic EHR data that not only maintains high quality with heterogeneous features, missing values, and irregular measures but also balances the privacy-utility trade-off. Extensive experiments prove that IGAMT significantly outperforms baseline architectures in terms of visual resemblance while achieving comparable performance in downstream applications. Ablation case studies also prove the effectiveness of the techniques applied in IGAMT. \ No newline at end of file diff --git a/data/2024/aaai/IINet: Implicit Intra-inter Information Fusion for Real-Time Stereo Matching b/data/2024/aaai/IINet: Implicit Intra-inter Information Fusion for Real-Time Stereo Matching new file mode 100644 index 0000000000..b2029266dd --- /dev/null +++ b/data/2024/aaai/IINet: Implicit Intra-inter Information Fusion for Real-Time Stereo Matching @@ -0,0 +1 @@ +Recently, there has been a growing interest in 3D CNN-based stereo matching methods due to their remarkable accuracy. However, the high complexity of 3D convolution makes it challenging to strike a balance between accuracy and speed. Notably, explicit 3D volumes contain considerable redundancy. In this study, we delve into a more compact 2D implicit network to eliminate redundancy and boost real-time performance. However, simply replacing explicit 3D networks with 2D implicit networks causes issues that can lead to performance degradation, including the loss of structural information, the quality decline of inter-image information, as well as inaccurate regression caused by low-level features. To address these issues, we first integrate intra-image information to fuse with inter-image information, facilitating propagation guided by structural cues. Subsequently, we introduce the Fast Multi-scale Score Volume (FMSV) and Confidence Based Filtering (CBF) to efficiently acquire accurate multi-scale, noise-free inter-image information. Furthermore, combined with the Residual Context-aware Upsampler (RCU), our Intra-Inter Fusing network is meticulously designed to enhance information transmission at both the feature level and the disparity level, thereby enabling accurate and robust regression. Experimental results affirm the superiority of our network in terms of both speed and accuracy compared to all other fast methods. 
\ No newline at end of file diff --git a/data/2024/aaai/INFORMEDQX: Informed Conflict Detection for Over-Constrained Problems b/data/2024/aaai/INFORMEDQX: Informed Conflict Detection for Over-Constrained Problems new file mode 100644 index 0000000000..6b1079211b --- /dev/null +++ b/data/2024/aaai/INFORMEDQX: Informed Conflict Detection for Over-Constrained Problems @@ -0,0 +1 @@ +Conflict detection is relevant in various application scenarios, ranging from interactive decision-making to the diagnosis of faulty knowledge bases. Conflicts can be regarded as sets of constraints that cause an inconsistency. In many scenarios (e.g., constraint-based configuration), conflicts are repeatedly determined for the same or similar sets of constraints. This misses out on the valuable opportunity for leveraging knowledge reuse and related potential performance improvements, which are extremely important, specifically in interactive constraint-based applications. In this paper, we show how to integrate knowledge reuse concepts into non-intrusive conflict detection. We introduce the InformedQX algorithm, which is a reuse-aware variant of QuickXPlain. The results of a related performance analysis with the Linux-2.6.3.33 configuration knowledge base show significant improvements in terms of runtime performance compared to QuickXPlain. \ No newline at end of file diff --git a/data/2024/aaai/IOFM: Using the Interpolation Technique on the Over-Fitted Models to Identify Clean-Annotated Samples b/data/2024/aaai/IOFM: Using the Interpolation Technique on the Over-Fitted Models to Identify Clean-Annotated Samples new file mode 100644 index 0000000000..3e2941b745 --- /dev/null +++ b/data/2024/aaai/IOFM: Using the Interpolation Technique on the Over-Fitted Models to Identify Clean-Annotated Samples @@ -0,0 +1 @@ +Most recent state-of-the-art algorithms for handling noisy label problems are based on the memorization effect, the phenomenon that deep neural networks (DNNs) memorize clean data before noisy ones. While the memorization effect can be a powerful tool, there are several cases where the memorization effect does not occur. Examples are imbalanced class distributions and heavy contamination of labels. To address this limitation, we introduce a whole new approach called interpolation with the over-fitted model (IOFM), which leverages over-fitted deep neural networks. The IOFM utilizes a new finding about over-fitted DNNs: for a given training sample, its neighborhoods chosen from the feature space are distributed differently in the original input space depending on the cleanness of the target sample. The IOFM has notable features in two aspects: 1) it yields superior results even when the training data are imbalanced or heavily noisy, and 2) since we utilize over-fitted deep neural networks, a fine-tuning procedure to select the optimal training epoch, which is an essential yet sensitive factor for the success of the memorization effect, is not required, and thus the IOFM can be used by non-experts. Through extensive experiments, we show that our method can serve as a promising alternative to existing solutions dealing with noisy labels, offering improved performance even in challenging situations. 
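The following toy sketch illustrates one way to operationalize the finding stated in the IOFM abstract above; the scoring rule, the use of k nearest neighbours, and the selection threshold are illustrative assumptions of ours, not the authors' actual procedure. For each sample, neighbours are taken in the feature space of an over-fitted network, and the spread of those neighbours in the raw input space is used as a (negated) cleanness score:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def clean_scores(feats, inputs, k=10):
    """feats: (N, d) activations from an over-fitted model; inputs: (N, p) raw inputs."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(feats)
    _, idx = nbrs.kneighbors(feats)             # idx[:, 0] is the sample itself
    scores = np.empty(len(feats))
    for i, neigh in enumerate(idx):
        neigh = neigh[1:]                       # drop the sample itself
        spread = np.linalg.norm(inputs[neigh] - inputs[i], axis=1).mean()
        scores[i] = -spread                     # small input-space spread -> likely clean
    return scores

# toy usage: keep the half of the data judged most likely to be clean
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))                  # raw inputs
feats = rng.normal(size=(200, 16))              # stand-in for over-fitted features
s = clean_scores(feats, X)
clean_idx = np.argsort(s)[-100:]                # indices of the presumed clean samples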
\ No newline at end of file diff --git a/data/2024/aaai/IPRemover: A Generative Model Inversion Attack against Deep Neural Network Fingerprinting and Watermarking b/data/2024/aaai/IPRemover: A Generative Model Inversion Attack against Deep Neural Network Fingerprinting and Watermarking new file mode 100644 index 0000000000..4b3d0a7cff --- /dev/null +++ b/data/2024/aaai/IPRemover: A Generative Model Inversion Attack against Deep Neural Network Fingerprinting and Watermarking @@ -0,0 +1 @@ +Training Deep Neural Networks (DNNs) can be expensive when data is difficult to obtain or labeling them requires significant domain expertise. Hence, it is crucial that the Intellectual Property (IP) of DNNs trained on valuable data be protected against IP infringement. DNN fingerprinting and watermarking are two lines of work in DNN IP protection. Recently proposed DNN fingerprinting techniques are able to detect IP infringement while preserving model performance by relying on the key assumption that the decision boundaries of independently trained models are intrinsically different from one another. In contrast, DNN watermarking embeds a watermark in a model and verifies IP infringement if an identical or similar watermark is extracted from a suspect model. The techniques deployed in fingerprinting and watermarking vary significantly because their underlying mechanisms are different. From an adversary's perspective, a successful IP removal attack should defeat both fingerprinting and watermarking. However, to the best of our knowledge, there is no work on such attacks in the literature yet. In this paper, we fill this gap by presenting an IP removal attack that can defeat both fingerprinting and watermarking. We consider the challenging data-free scenario whereby all data is inverted from the victim model. Under this setting, a stolen model only depends on the victim model. Experimental results demonstrate the success of our attack in defeating state-of-the-art DNN fingerprinting and watermarking techniques. This work reveals a novel attack surface that exploits generative model inversion attacks to bypass DNN IP defenses. This threat must be addressed by future defenses for reliable IP protection. \ No newline at end of file diff --git a/data/2024/aaai/IRPruneDet: Efficient Infrared Small Target Detection via Wavelet Structure-Regularized Soft Channel Pruning b/data/2024/aaai/IRPruneDet: Efficient Infrared Small Target Detection via Wavelet Structure-Regularized Soft Channel Pruning new file mode 100644 index 0000000000..af7d8ad172 --- /dev/null +++ b/data/2024/aaai/IRPruneDet: Efficient Infrared Small Target Detection via Wavelet Structure-Regularized Soft Channel Pruning @@ -0,0 +1 @@ +Infrared Small Target Detection (IRSTD) refers to detecting faint targets in infrared images, which has achieved notable progress with the advent of deep learning. However, the drive for improved detection accuracy has led to larger, intricate models with redundant parameters, causing storage and computation inefficiencies. In this pioneering study, we introduce the concept of utilizing network pruning to enhance the efficiency of IRSTD. Due to the challenge posed by low signal-to-noise ratios and the absence of detailed semantic information in infrared images, directly applying existing pruning techniques yields suboptimal performance. To address this, we propose a novel wavelet structure-regularized soft channel pruning method, giving rise to the efficient IRPruneDet model. 
Our approach involves representing the weight matrix in the wavelet domain and formulating a wavelet channel pruning strategy. We incorporate wavelet regularization to induce structural sparsity without incurring extra memory usage. Moreover, we design a soft channel reconstruction method that preserves important target information against premature pruning, thereby ensuring an optimal sparse structure while maintaining overall sparsity. Through extensive experiments on two widely-used benchmarks, our IRPruneDet method surpasses established techniques in both model complexity and accuracy. Specifically, when employing U-net as the baseline network, IRPruneDet achieves a 64.13% reduction in parameters and a 51.19% decrease in FLOPS, while improving IoU from 73.31% to 75.12% and nIoU from 70.92% to 74.30%. The code is available at https://github.com/hd0013/IRPruneDet. \ No newline at end of file diff --git a/data/2024/aaai/ISP-Teacher: Image Signal Process with Disentanglement Regularization for Unsupervised Domain Adaptive Dark Object Detection b/data/2024/aaai/ISP-Teacher: Image Signal Process with Disentanglement Regularization for Unsupervised Domain Adaptive Dark Object Detection new file mode 100644 index 0000000000..fc092e560d --- /dev/null +++ b/data/2024/aaai/ISP-Teacher: Image Signal Process with Disentanglement Regularization for Unsupervised Domain Adaptive Dark Object Detection @@ -0,0 +1 @@ +Object detection in dark conditions has always been a great challenge due to the complex formation process of low-light images. Currently, the mainstream methods usually adopt domain adaptation with a Teacher-Student architecture to solve the dark object detection problem, and they imitate dark conditions by using non-learnable data augmentation strategies on the annotated source daytime images. Note that these methods neglect to model the intrinsic imaging process, i.e. image signal processing (ISP), which is important for camera sensors to generate low-light images. To solve the above problems, in this paper, we propose a novel method named ISP-Teacher for dark object detection by exploring the Teacher-Student architecture from a new perspective (i.e. self-supervised learning based ISP degradation). Specifically, we first design a day-to-night transformation module that is consistent with the ISP pipeline of the camera sensors (ISP-DTM) to make the augmented images look more in line with natural low-light images captured by cameras, and the ISP-related parameters are learned in a self-supervised manner. Moreover, to avoid the conflict between the ISP degradation and detection tasks in a shared encoder, we propose a disentanglement regularization (DR) that minimizes the absolute value of the cosine similarity to disentangle the two tasks and push the two gradient vectors to be as orthogonal as possible. Extensive experiments conducted on two benchmarks show the effectiveness of our method in dark object detection. In particular, ISP-Teacher achieves an improvement of +2.4% AP and +3.3% AP over the SOTA method on the BDD100k and SHIFT datasets, respectively. The code can be found at https://github.com/zhangyin1996/ISP-Teacher. 
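A minimal sketch of the gradient-orthogonality idea behind the disentanglement regularization described above (the toy encoder, heads, losses, and the 0.1 weight below are placeholder assumptions, not the ISP-Teacher implementation): the regularizer is the absolute cosine similarity between the two task gradients on the shared encoder, which pushes them toward orthogonality.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(32, 64)              # shared encoder (placeholder)
det_head = nn.Linear(64, 10)             # detection head (placeholder)
isp_head = nn.Linear(64, 32)             # ISP-degradation head (placeholder)

def flat_grad(loss, params):
    # gradient of `loss` w.r.t. the shared parameters, flattened into one vector;
    # create_graph=True keeps it differentiable so the regularizer itself can be trained
    grads = torch.autograd.grad(loss, params, retain_graph=True, create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

x = torch.randn(8, 32)
labels = torch.randint(0, 10, (8,))
feat = encoder(x)
loss_det = F.cross_entropy(det_head(feat), labels)
loss_isp = F.mse_loss(isp_head(feat), x)          # self-supervised reconstruction stand-in

shared = list(encoder.parameters())
g_det = flat_grad(loss_det, shared)
g_isp = flat_grad(loss_isp, shared)
dr = torch.abs(F.cosine_similarity(g_det, g_isp, dim=0))   # |cos| between task gradients

total_loss = loss_det + loss_isp + 0.1 * dr        # 0.1 is an arbitrary illustrative weight
total_loss.backward()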
\ No newline at end of file diff --git a/data/2024/aaai/IT3D: Improved Text-to-3D Generation with Explicit View Synthesis b/data/2024/aaai/IT3D: Improved Text-to-3D Generation with Explicit View Synthesis new file mode 100644 index 0000000000..592d0f2514 --- /dev/null +++ b/data/2024/aaai/IT3D: Improved Text-to-3D Generation with Explicit View Synthesis @@ -0,0 +1 @@ +Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches. \ No newline at end of file diff --git a/data/2024/aaai/Identifiability of Direct Effects from Summary Causal Graphs b/data/2024/aaai/Identifiability of Direct Effects from Summary Causal Graphs new file mode 100644 index 0000000000..246146ac3a --- /dev/null +++ b/data/2024/aaai/Identifiability of Direct Effects from Summary Causal Graphs @@ -0,0 +1,2 @@ +Dynamic structural causal models (SCMs) are a powerful framework for reasoning in dynamic systems about direct effects, which measure how a change in one variable affects another variable while holding all other variables constant. The causal relations in a dynamic structural causal model can be qualitatively represented with an acyclic full-time causal graph. Assuming linearity and no hidden confounding and given the full-time causal graph, the direct causal effect is always identifiable. However, in many applications such a graph is not available for various reasons; nevertheless, experts have access to the summary causal graph of the full-time causal graph, which represents causal relations between time series while omitting temporal information and allowing cycles. This paper presents a complete identifiability result which characterizes all cases for which the direct effect +is graphically identifiable from a summary causal graph and gives two sound finite adjustment sets that can be used to estimate the direct effect whenever it is identifiable. 
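As a small, purely illustrative example of the linear setting mentioned in the abstract above (the toy structural equations and the choice of {Z} as the adjustment set are ours; the paper's contribution is the graphical criterion deciding when a valid adjustment set exists), once a valid adjustment set is known the direct effect can be read off as a regression coefficient:

import numpy as np

rng = np.random.default_rng(0)
n = 50_000
Z = rng.normal(size=n)                         # common cause of X and Y
X = 0.8 * Z + rng.normal(size=n)
Y = 0.5 * X + 1.2 * Z + rng.normal(size=n)     # true direct effect of X on Y is 0.5

# naive regression of Y on X alone is confounded by Z
b_naive = np.linalg.lstsq(np.c_[X, np.ones(n)], Y, rcond=None)[0][0]

# adjusting for Z recovers the direct effect
b_adj = np.linalg.lstsq(np.c_[X, Z, np.ones(n)], Y, rcond=None)[0][0]
print(round(b_naive, 2), round(b_adj, 2))      # roughly 1.09 (biased) vs 0.50 (direct effect)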
\ No newline at end of file diff --git a/data/2024/aaai/Identification for Tree-Shaped Structural Causal Models in Polynomial Time b/data/2024/aaai/Identification for Tree-Shaped Structural Causal Models in Polynomial Time new file mode 100644 index 0000000000..22c0e7bd52 --- /dev/null +++ b/data/2024/aaai/Identification for Tree-Shaped Structural Causal Models in Polynomial Time @@ -0,0 +1 @@ +Linear structural causal models (SCMs) are used to express and analyze the relationships between random variables. Direct causal effects are represented as directed edges and confounding factors as bidirected edges. Identifying the causal parameters from correlations between the nodes is an open problem in artificial intelligence. In this paper, we study SCMs whose directed component forms a tree. Van der Zander et al. give a PSPACE-algorithm for the identification problem in this case, which is a significant improvement over the general Gröbner basis approach, which has doubly-exponential time complexity in the number of structural parameters. However, they do not show that their algorithm is complete. In this work, we present a randomized polynomial-time algorithm, which solves the identification problem for tree-shaped SCMs. For every structural parameter, our algorithm decides whether it is generically identifiable, generically 2-identifiable, or generically unidentifiable. (No other cases can occur.) In the first two cases, it provides one or two fractional affine square root terms of polynomials (FASTPs) for the corresponding parameter, respectively. In particular, our algorithm is not only polynomial time, but also complete for tree-shaped SCMs. \ No newline at end of file diff --git a/data/2024/aaai/Identification of Causal Structure in the Presence of Missing Data with Additive Noise Model b/data/2024/aaai/Identification of Causal Structure in the Presence of Missing Data with Additive Noise Model new file mode 100644 index 0000000000..b936ec4bb8 --- /dev/null +++ b/data/2024/aaai/Identification of Causal Structure in the Presence of Missing Data with Additive Noise Model @@ -0,0 +1,3 @@ +Missing data are an unavoidable complication frequently encountered in many causal discovery tasks. +When a missing process depends on the missing values themselves (known as self-masking missingness), the recovery of the joint distribution becomes unattainable, and detecting the presence of such self-masking missingness remains a perplexing challenge. Consequently, due to the inability to reconstruct the original distribution and to discern the underlying missingness mechanism, simply applying existing causal discovery methods would lead to wrong conclusions. In this work, we find that recent advances in the additive noise model have the potential for learning causal structure in the presence of self-masking missingness. With this observation, we aim to investigate the identification problem of learning causal structure from missing data under an additive noise model with different missingness mechanisms, where the `no self-masking missingness' assumption can be eliminated appropriately. +Specifically, we first elegantly extend the scope of identifiability of the causal skeleton to the case with weak self-masking missingness (i.e., no other variable could be the cause of the self-masking indicators except itself). 
We further provide the sufficient and necessary identification conditions of the causal direction under the additive noise model and show that the causal structure can be identified up to an IN-equivalent pattern. We finally propose a practical algorithm based on the above theoretical results for learning the causal skeleton and causal direction. Extensive experiments on synthetic and real data demonstrate the efficiency and effectiveness of the proposed algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Identification of Causal Structure with Latent Variables Based on Higher Order Cumulants b/data/2024/aaai/Identification of Causal Structure with Latent Variables Based on Higher Order Cumulants new file mode 100644 index 0000000000..573baceb25 --- /dev/null +++ b/data/2024/aaai/Identification of Causal Structure with Latent Variables Based on Higher Order Cumulants @@ -0,0 +1 @@ +Causal discovery with latent variables is a crucial but challenging task. Despite the emergence of numerous methods aimed at addressing this challenge, they cannot fully identify the structure in which two observed variables are influenced by one latent variable and there might be a directed edge between them. Interestingly, we notice that this structure can be identified through the utilization of higher-order cumulants. By leveraging the higher-order cumulants of non-Gaussian data, we provide an analytical solution for estimating the causal coefficients or their ratios. With the estimated (ratios of) causal coefficients, we propose a novel approach to identify the existence of a causal edge between two observed variables subject to latent variable influence. In case such a causal edge exists, we introduce an asymmetry criterion to determine the causal direction. The experimental results demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching b/data/2024/aaai/Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching new file mode 100644 index 0000000000..272468684f --- /dev/null +++ b/data/2024/aaai/Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching @@ -0,0 +1 @@ +Image-text matching bridges vision and language and is a fundamental task in multimodal intelligence. Its key challenge lies in how to capture visual-semantic relevance. Fine-grained semantic interactions come from fragment alignments between image regions and text words. However, not all fragments contribute to image-text relevance, and many existing methods are devoted to mining the vital ones to measure the relevance accurately. How well an image and a text relate depends on the degree of semantic sharing between them. Treating the degree as an effect and fragments as its possible causes, we define those causes indispensable for the generation of the degree as necessary undertakers, i.e., if any of them did not occur, the relevance would no longer be valid. In this paper, we revisit image-text matching in the causal view and uncover inherent causal properties of relevance generation. 
Then we propose a novel theoretical prototype for estimating the probability-of-necessity of fragments, PN_f, for the degree of semantic sharing by means of causal inference, and further design a Necessary Undertaker Identification Framework (NUIF) for image-text matching, which explicitly formalizes the fragment's contribution to image-text relevance by modeling PN_f in two ways. Extensive experiments show that our method achieves state-of-the-art performance on the Flickr30K and MSCOCO benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Identifying Guarantors of War Veterans Using Robust-SEAL: A Case of the Korean War b/data/2024/aaai/Identifying Guarantors of War Veterans Using Robust-SEAL: A Case of the Korean War new file mode 100644 index 0000000000..106e466f18 --- /dev/null +++ b/data/2024/aaai/Identifying Guarantors of War Veterans Using Robust-SEAL: A Case of the Korean War @@ -0,0 +1 @@ +Most countries provide veterans with various benefits to reward their sacrifice. Unfortunately, many veterans have failed to prove their status due to loss of military records. Thus, some governments allow the verification of those veterans through "buddy statements" obtained from the people who can vouch for the buddy's participation in the war. However, it is still challenging for veterans to find guarantors directly. With this background, we suggest utilizing historical war records of combined operations to increase the pool of potential guarantors for the buddy statements. However, a combined operation network among troops can have missing edges and perturbations in the attributes of the troops due to inaccurate information. In this study, we learn from some recorded interactions, which might be incomplete and noisy, and predict missing linkages among the troops that might have interacted together in the war, by proposing Robust-SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction). It combines two Graph Neural Network (GNN) architectures: a robust Graph Convolutional Network, which considers the uncertainty of node attributes with a probabilistic approach, and SEAL, which improves the expressive power of the GNN with a labeling trick. Our proposed approach was applied to Korean War data with perturbations. For experimentation, we hid some actual interactions and found that Robust-SEAL restores missing interactions better than other GNN-based baselines. \ No newline at end of file diff --git a/data/2024/aaai/Identifying Reasons for Bias: An Argumentation-Based Approach b/data/2024/aaai/Identifying Reasons for Bias: An Argumentation-Based Approach new file mode 100644 index 0000000000..326b7c3127 --- /dev/null +++ b/data/2024/aaai/Identifying Reasons for Bias: An Argumentation-Based Approach @@ -0,0 +1 @@ +As algorithmic decision-making systems become more prevalent in society, ensuring the fairness of these systems is becoming increasingly important. Whilst there has been substantial research in building fair algorithmic decision-making systems, the majority of these methods require access to the training data, including personal characteristics, and are not transparent regarding which individuals are classified unfairly. In this paper, we propose a novel model-agnostic argumentation-based method to determine why an individual is classified differently in comparison to similar individuals. 
Our method uses a quantitative argumentation framework to represent attribute-value pairs of an individual and of those similar to them, and uses a well-known semantics to identify the attribute-value pairs of the individual that contribute most to their different classification. We evaluate our method on two datasets commonly used in the fairness literature and illustrate its effectiveness in the identification of bias. \ No newline at end of file diff --git a/data/2024/aaai/Identifying and Addressing Disparities in Public Libraries with Bayesian Latent Variable Modeling b/data/2024/aaai/Identifying and Addressing Disparities in Public Libraries with Bayesian Latent Variable Modeling new file mode 100644 index 0000000000..f131868e39 --- /dev/null +++ b/data/2024/aaai/Identifying and Addressing Disparities in Public Libraries with Bayesian Latent Variable Modeling @@ -0,0 +1,3 @@ +Public libraries are an essential public good. We ask: are urban library systems providing equitable service to all residents, in terms of the books they have access to and check out? If not, what causes disparities: heterogeneous book collections, resident behavior and access, and/or operational policies? Existing methods leverage only system-level outcome data (such as overall checkouts per branch), and so cannot distinguish between these factors. As a result, it is difficult to use their results to guide interventions to increase equitable access. We propose a Bayesian framework to characterize book checkout behavior across multiple branches of a library system, learning heterogeneous book popularity, overall branch demand, and usage of the online hold system, while controlling for book availability. + +In collaboration with the New York Public Library, we apply our framework to granular data consisting of over 400,000 checkouts during 2022. We first show that our model significantly outperforms baseline methods in predicting checkouts at the book-branch level. Next, we study spatial and socioeconomic disparities. We show that disparities are largely driven by disparate use of the online holds system, which allows library patrons to receive books from any other branch through an online portal. This system thus leads to a large outflow of popular books from branches in lower-income neighborhoods to those in higher-income ones. Finally, we illustrate the use of our model and insights to quantify the impact of potential interventions, such as changing how books are internally routed between branches to fulfill hold requests. \ No newline at end of file diff --git a/data/2024/aaai/Identifying, Mitigating, and Anticipating Bias in Algorithmic Decisions b/data/2024/aaai/Identifying, Mitigating, and Anticipating Bias in Algorithmic Decisions new file mode 100644 index 0000000000..cd8e077b59 --- /dev/null +++ b/data/2024/aaai/Identifying, Mitigating, and Anticipating Bias in Algorithmic Decisions @@ -0,0 +1 @@ +Today's machine learning (ML) applications predominantly adhere to a standard paradigm: the decision maker designs the algorithm by optimizing a model for some objective function. While this has proven to be a powerful approach in many domains, it comes with inherent side effects: the power over the algorithmic outcomes lies solely in the hands of the algorithm designer, and alternative objectives, such as fairness, are often disregarded. This is particularly problematic if the algorithm is used to make consequential decisions that affect people's lives.
My research focuses on developing principled methods to characterize and address the mismatch between these different objectives. \ No newline at end of file diff --git a/data/2024/aaai/Image Captioning with Multi-Context Synthetic Data b/data/2024/aaai/Image Captioning with Multi-Context Synthetic Data new file mode 100644 index 0000000000..0329ad4783 --- /dev/null +++ b/data/2024/aaai/Image Captioning with Multi-Context Synthetic Data @@ -0,0 +1 @@ +Image captioning requires numerous annotated image-text pairs, resulting in substantial annotation costs. Recently, large models (e.g., diffusion models and large language models) have excelled in producing high-quality images and text. This potential can be harnessed to create synthetic image-text pairs for training captioning models. Synthetic data can improve cost and time efficiency in data collection, allow for customization to specific domains, bootstrap generalization capability for zero-shot performance, and circumvent privacy concerns associated with real-world data. However, existing methods struggle to attain satisfactory performance solely through synthetic data. We identify the issue as follows: images generated from simple descriptions mostly capture a solitary perspective with limited context, failing to align with the intricate scenes prevalent in real-world imagery. To tackle this, we present an innovative pipeline that introduces multi-context data generation. Beginning with an initial text corpus, our approach employs a large language model to extract multiple sentences portraying the same scene from diverse viewpoints. These sentences are then condensed into a single sentence with multiple contexts. Subsequently, we generate intricate images using the condensed captions through diffusion models. Our model is exclusively trained on synthetic image-text pairs crafted through this process. The effectiveness of our pipeline is validated through experimental results in both the in-domain and cross-domain settings, where it achieves state-of-the-art performance on well-known datasets such as MSCOCO, Flickr30k, and NoCaps. \ No newline at end of file diff --git a/data/2024/aaai/Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually b/data/2024/aaai/Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually new file mode 100644 index 0000000000..3e43277c66 --- /dev/null +++ b/data/2024/aaai/Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually @@ -0,0 +1 @@ +Social media platforms are being increasingly used by malicious actors to share unsafe content, such as images depicting sexual activity, cyberbullying, and self-harm. Consequently, major platforms use artificial intelligence (AI) and human moderation to obfuscate such images to make them safer. Two critical needs for obfuscating unsafe images are that an accurate rationale for obfuscating image regions must be provided and that the sensitive regions should be obfuscated (e.g., by blurring) for users' safety. This process involves addressing two key problems: (1) obfuscating unsafe images requires the platform to provide an accurate rationale grounded in unsafe image-specific attributes, and (2) the unsafe regions in the image must be minimally obfuscated while still depicting the safe regions.
In this work, we address these key issues by first designing a visual reasoning model (VLM) conditioned on pre-trained unsafe image classifiers to provide an accurate rationale grounded in unsafe image attributes. We then propose a counterfactual explanation algorithm that minimally identifies and obfuscates unsafe regions for safe viewing: it uses an unsafe image classifier attribution matrix to guide segmentation toward a more optimal subregion segmentation, followed by an informed greedy search that determines, based on attribution scores, the minimum number of subregions required to modify the classifier's output. Extensive experiments on uncurated data from social networks emphasize the efficacy of our proposed method. We make our code available at: https://github.com/SecureAIAutonomyLab/ConditionalVLM \ No newline at end of file diff --git a/data/2024/aaai/Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network b/data/2024/aaai/Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network new file mode 100644 index 0000000000..a4de350f14 --- /dev/null +++ b/data/2024/aaai/Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network @@ -0,0 +1 @@ +Scene text recognition is inherently a vision-language task. However, previous works have predominantly focused either on extracting more robust visual features or on designing better language modeling. How to effectively and jointly model vision and language to mitigate heavy reliance on a single modality remains a problem. In this paper, aiming to enhance vision-language reasoning in scene text recognition, we present a balanced, unified and synchronized vision-language reasoning network (BUSNet). Firstly, revisiting the image as a language by balanced concatenation along the length dimension alleviates the issue of over-reliance on vision or language. Secondly, BUSNet learns an ensemble of unified external and internal vision-language models with shared weights by masked modality modeling (MMM). Thirdly, a novel vision-language reasoning module (VLRM) with synchronized vision-language decoding capacity is proposed. Additionally, BUSNet achieves improved performance through iterative reasoning, which utilizes the vision-language prediction as a new language input. Extensive experiments indicate that BUSNet achieves state-of-the-art performance on several mainstream benchmark datasets and more challenging datasets for both synthetic and real training data compared to recent outstanding methods. Code and dataset will be available at https://github.com/jjwei66/BUSNet. \ No newline at end of file diff --git a/data/2024/aaai/ImageCaptioner2: Image Captioner for Image Captioning Bias Amplification Assessment b/data/2024/aaai/ImageCaptioner2: Image Captioner for Image Captioning Bias Amplification Assessment new file mode 100644 index 0000000000..b06de009e6 --- /dev/null +++ b/data/2024/aaai/ImageCaptioner2: Image Captioner for Image Captioning Bias Amplification Assessment @@ -0,0 +1,2 @@ +Most pre-trained learning systems are known to suffer from bias, which typically emerges from the data, the model, or both. Measuring and quantifying bias and its sources is a challenging task and has been extensively studied in image +captioning.
Despite the significant effort in this direction, we observe that existing metrics lack consistency in the inclusion of the visual signal. In this paper, we introduce a new bias assessment metric, dubbed ImageCaptioner2, for image captioning. Instead of measuring the absolute bias in the model or the data, ImageCaptioner2 pays more attention to the bias introduced by the model w.r.t. the data bias, termed bias amplification. Unlike existing methods, which evaluate image captioning algorithms based only on the generated captions, ImageCaptioner2 incorporates the image while measuring the bias. In addition, we design a formulation for measuring the bias of generated captions as prompt-based image captioning instead of using language classifiers. Finally, we apply our ImageCaptioner2 metric across 11 different image captioning architectures on three different datasets, i.e., MS-COCO caption dataset, Artemis V1, and Artemis V2, and on three different protected attributes, i.e., gender, race, and emotions. Consequently, we verify the effectiveness of our ImageCaptioner2 metric by proposing Anonymous-Bench, which is a novel human evaluation paradigm for bias metrics. Our metric shows significant superiority over the recent bias metric, LIC, in terms of human alignment, where the correlation scores are 80% and 54% for our metric and LIC, respectively. The code and more details are available at https://eslambakr.github.io/imagecaptioner2.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/ImageSTEAM: Teacher Professional Development for Integrating Visual Computing into Middle School Lessons b/data/2024/aaai/ImageSTEAM: Teacher Professional Development for Integrating Visual Computing into Middle School Lessons new file mode 100644 index 0000000000..d3e8beeaf3 --- /dev/null +++ b/data/2024/aaai/ImageSTEAM: Teacher Professional Development for Integrating Visual Computing into Middle School Lessons @@ -0,0 +1 @@ +Artificial intelligence (AI) and its teaching in the K-12 grades have been championed as a vital need for the United States due to the technology's future prominence in the 21st century. However, there remain several barriers to effective AI lessons at these age groups, including the broad range of interdisciplinary knowledge needed and the lack of formal training or preparation for teachers to implement these lessons. In this experience report, we present ImageSTEAM, a teacher professional development program for creating lessons surrounding computer vision, machine learning, and computational photography/cameras, targeted at middle school (grades 6-8) classes. Teacher professional development workshops were conducted in the states of Arizona and Georgia from 2021-2023, where lessons were co-created with teachers to introduce various specific visual computing concepts while aligning to state and national standards. In addition, a variety of computer vision and image processing software, including custom-designed Python notebooks, was created for technology activities and demonstrations to be used in the classroom. Educational research showed that teachers improved their self-efficacy and outcomes for concepts in computer vision, machine learning, and artificial intelligence when participating in the program. Results from the professional development workshops highlight key opportunities and challenges in integrating this content into the standard curriculum, the benefits of a co-creation pedagogy, and the positive impact on teachers' and students' learning experiences.
The open-source program curriculum is available at www.imagesteam.org. \ No newline at end of file diff --git a/data/2024/aaai/Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning b/data/2024/aaai/Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..33cf893873 --- /dev/null +++ b/data/2024/aaai/Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at the critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. Particularly, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models. \ No newline at end of file diff --git a/data/2024/aaai/Imitation of Life: A Search Engine for Biologically Inspired Design b/data/2024/aaai/Imitation of Life: A Search Engine for Biologically Inspired Design new file mode 100644 index 0000000000..54d3613294 --- /dev/null +++ b/data/2024/aaai/Imitation of Life: A Search Engine for Biologically Inspired Design @@ -0,0 +1,3 @@ +Biologically Inspired Design (BID), or Biomimicry, is a problem-solving methodology that applies analogies from nature to solve engineering challenges. For example, Speedo engineers designed swimsuits based on shark skin. Finding relevant biological solutions for real-world problems poses significant challenges, both due to the limited biological knowledge engineers and designers typically possess and to the limited BID resources. Existing BID datasets are hand-curated and small, and scaling them up requires costly human annotations. + +In this paper, we introduce BARcode (Biological Analogy Retriever), a search engine for automatically mining bio-inspirations from the web at scale. 
Using advances in natural language understanding and data programming, BARcode identifies potential inspirations for engineering challenges. Our experiments demonstrate that BARcode can retrieve inspirations that are valuable to engineers and designers tackling real-world problems, as well as recover famous historical BID examples. We release data and code; we view BARcode as a step towards addressing the challenges that have historically hindered the practical application of BID to engineering innovation. \ No newline at end of file diff --git a/data/2024/aaai/Impartial Adversarial Distillation: Addressing Biased Data-Free Knowledge Distillation via Adaptive Constrained Optimization b/data/2024/aaai/Impartial Adversarial Distillation: Addressing Biased Data-Free Knowledge Distillation via Adaptive Constrained Optimization new file mode 100644 index 0000000000..9f167f1e76 --- /dev/null +++ b/data/2024/aaai/Impartial Adversarial Distillation: Addressing Biased Data-Free Knowledge Distillation via Adaptive Constrained Optimization @@ -0,0 +1 @@ +Data-Free Knowledge Distillation (DFKD) enables knowledge transfer from a pretrained teacher to a lightweight student without original training data. Existing works are limited by a strong assumption that samples used to pretrain the teacher model are balanced, which is, however, unrealistic for many real-world tasks. In this work, we investigate a pragmatic yet under-explored problem: how to perform DFKD from a teacher model pretrained on imbalanced data. We observe a seemingly counter-intuitive phenomenon, i.e., adversarial DFKD algorithms favour minority classes, while causing a disastrous impact on majority classes. We theoretically prove that a biased teacher could cause severe disparity across different groups of synthetic data in adversarial distillation, which further exacerbates the mode collapse of a generator and consequently degrades the overall accuracy of a distilled student model. To tackle this problem, we propose a class-adaptive regularization method, aiming to encourage impartial representation learning of a generator among different classes under a constrained learning formulation. We devise a primal-dual algorithm to solve the target optimization problem. Through extensive experiments, we show that our method mitigates the biased learning of majority classes in DFKD and improves the overall performance compared with baselines. Code will be available at https://github.com/ldpbuaa/ipad. \ No newline at end of file diff --git a/data/2024/aaai/Implications of Distance over Redistricting Maps: Central and Outlier Maps b/data/2024/aaai/Implications of Distance over Redistricting Maps: Central and Outlier Maps new file mode 100644 index 0000000000..849a3173f2 --- /dev/null +++ b/data/2024/aaai/Implications of Distance over Redistricting Maps: Central and Outlier Maps @@ -0,0 +1 @@ +In representative democracy, a redistricting map is chosen to partition an electorate into districts, each of which elects a representative. A valid redistricting map must satisfy a collection of constraints such as being compact, contiguous, and of almost-equal population. However, these constraints are loose enough to enable an enormous ensemble of valid redistricting maps. This enables a partisan legislature to gerrymander by choosing a map which unfairly favors it.
In this paper, we introduce an interpretable and tractable distance measure over redistricting maps that does not use election results and study its implications over the ensemble of redistricting maps. Specifically, we define a central map which may be considered "most typical" and give a rigorous justification for it by showing that it mirrors the Kemeny ranking in a scenario where we have a committee voting over a collection of redistricting maps to be drawn. We include running time and sample complexity analysis for our algorithms, including some negative results which hold using any algorithm. We further study outlier detection based on this distance measure and show that our framework can detect some gerrymandered maps. More precisely, we show that some maps widely considered to be gerrymandered lie very far away from our central maps in comparison to a large ensemble of valid redistricting maps. Since our distance measure does not rely on election results, this gives a significant advantage in gerrymandering detection that is lacking in all previous methods. \ No newline at end of file diff --git a/data/2024/aaai/Implicit Modeling of Non-rigid Objects with Cross-Category Signals b/data/2024/aaai/Implicit Modeling of Non-rigid Objects with Cross-Category Signals new file mode 100644 index 0000000000..e464ab8d08 --- /dev/null +++ b/data/2024/aaai/Implicit Modeling of Non-rigid Objects with Cross-Category Signals @@ -0,0 +1 @@ +Deep implicit functions (DIFs) have emerged as a potent and articulate means of representing 3D shapes. However, methods modeling object categories or non-rigid entities have mainly focused on single-object scenarios. In this work, we propose MODIF, a multi-object deep implicit function that jointly learns the deformation fields and instance-specific latent codes for multiple objects at once. Our emphasis is on non-rigid, non-interpenetrating entities such as organs. To effectively capture the interrelation between these entities and ensure precise, collision-free representations, our approach facilitates signaling between category-specific fields to adequately rectify shapes. We also introduce novel inter-object supervision: an attraction-repulsion loss is formulated to refine contact regions between objects. Our approach is demonstrated on various medical benchmarks, involving modeling different groups of intricate anatomical entities. Experimental results illustrate that our model can proficiently learn the shape representation of each organ and their relations to others, to the point that shapes missing from unseen instances can be consistently recovered by our method. Finally, MODIF can also propagate semantic information throughout the population via accurate point correspondences. \ No newline at end of file diff --git "a/data/2024/aaai/Improve Robustness of Reinforcement Learning against Observation Perturbations via l\342\210\236 Lipschitz Policy Networks" "b/data/2024/aaai/Improve Robustness of Reinforcement Learning against Observation Perturbations via l\342\210\236 Lipschitz Policy Networks" new file mode 100644 index 0000000000..2785cfc6b0 --- /dev/null +++ "b/data/2024/aaai/Improve Robustness of Reinforcement Learning against Observation Perturbations via l\342\210\236 Lipschitz Policy Networks" @@ -0,0 +1 @@ +Deep Reinforcement Learning (DRL) has achieved remarkable advances in sequential decision tasks. However, recent works have revealed that DRL agents are susceptible to slight perturbations in observations.
This vulnerability raises concerns regarding the effectiveness and robustness of deploying such agents in real-world applications. In this work, we propose a novel robust reinforcement learning method called SortRL, which improves the robustness of DRL policies against observation perturbations from the perspective of the network architecture. We employ a novel architecture for the policy network that incorporates global $l_\infty$ Lipschitz continuity and provide a convenient method to enhance policy robustness based on the output margin. Besides, a training framework is designed for SortRL, which solves given tasks while maintaining robustness against $l_\infty$ bounded perturbations on the observations. Several experiments are conducted to evaluate the effectiveness of our method, including classic control tasks and video games. The results demonstrate that SortRL achieves state-of-the-art robustness performance against different perturbation strengths. \ No newline at end of file diff --git a/data/2024/aaai/Improved Anonymous Multi-Agent Path Finding Algorithm b/data/2024/aaai/Improved Anonymous Multi-Agent Path Finding Algorithm new file mode 100644 index 0000000000..7fe73f1cc3 --- /dev/null +++ b/data/2024/aaai/Improved Anonymous Multi-Agent Path Finding Algorithm @@ -0,0 +1 @@ +We consider an Anonymous Multi-Agent Path-Finding (AMAPF) problem where the set of agents is confined to a graph, a set of goal vertices is given and each of these vertices has to be reached by some agent. The problem is to find an assignment of the goals to the agents as well as the collision-free paths, and we are interested in finding the solution with the optimal makespan. A well-established approach to solve this problem is to reduce it to a special type of a graph search problem, i.e., to the problem of finding a maximum flow on an auxiliary graph induced by the input one. The size of this auxiliary graph may be very large and the search on it may become a bottleneck. To this end, we suggest a specific search algorithm that leverages the idea of exploring the search space not by considering separate search states but rather bulks of them simultaneously. That is, we implicitly compress, store and expand bulks of the search states as single states, which results in a substantial reduction in runtime and memory. Empirically, the resultant AMAPF solver demonstrates superior performance compared to the state-of-the-art competitor and is able to solve all publicly available MAPF instances from the well-known MovingAI benchmark in less than 30 seconds. \ No newline at end of file diff --git a/data/2024/aaai/Improved Bandits in Many-to-One Matching Markets with Incentive Compatibility b/data/2024/aaai/Improved Bandits in Many-to-One Matching Markets with Incentive Compatibility new file mode 100644 index 0000000000..ab0ad6c5af --- /dev/null +++ b/data/2024/aaai/Improved Bandits in Many-to-One Matching Markets with Incentive Compatibility @@ -0,0 +1 @@ +Two-sided matching markets have been widely studied in the literature due to their rich applications. Since participants are usually uncertain about their preferences, online algorithms have recently been adopted to learn them through iterative interactions. An existing work initiates the study of this problem in a many-to-one setting with responsiveness. However, their results are far from optimal and lack guarantees of incentive compatibility.
We first extend an existing algorithm for the one-to-one setting to this more general setting and show it achieves a near-optimal bound for player-optimal regret. Nevertheless, due to the substantial requirement for collaboration, a single player's deviation could lead to a huge increase in its own cumulative rewards and a linear regret for others. In this paper, we aim to enhance the regret bound in many-to-one markets while ensuring incentive compatibility. We first propose the adaptively explore-then-deferred-acceptance (AETDA) algorithm for the responsiveness setting and derive an upper bound for player-optimal stable regret while demonstrating its guarantee of incentive compatibility. This result is a significant improvement over existing works and, to the best of our knowledge, constitutes the first player-optimal guarantee in matching markets that offers such robust assurances. We also consider broader substitutable preferences, one of the most general conditions ensuring the existence of a stable matching, which also covers responsiveness. We devise an online DA (ODA) algorithm and establish an upper bound for the player-pessimal stable regret for this setting. \ No newline at end of file diff --git a/data/2024/aaai/Improved Graph Contrastive Learning for Short Text Classification b/data/2024/aaai/Improved Graph Contrastive Learning for Short Text Classification new file mode 100644 index 0000000000..a076dbb4fd --- /dev/null +++ b/data/2024/aaai/Improved Graph Contrastive Learning for Short Text Classification @@ -0,0 +1,2 @@ +Text classification plays an important role in natural language processing and has many applications in real life. Short text classification, as one of its subtopics, has attracted increasing interest from researchers since it is more challenging due to its semantic sparsity and insufficient labeled data. Recent studies attempt to combine graph learning and contrastive learning to alleviate the above problems in short text classification. Despite their fruitful success, there are still several inherent limitations. First, the generation of augmented views may disrupt the semantic structure within the text and introduce negative effects due to noise permutation. Second, they ignore the clustering-friendly features in unlabeled data and fail to further utilize the prior information in the few valuable labeled data. To this end, we propose a novel model that utilizes improved Graph contrastIve learning for short text classiFicaTion (GIFT). Specifically, we construct a heterogeneous graph containing several component graphs by mining from an internal corpus and introducing an external knowledge graph. Then, we use singular value decomposition to generate augmented views for graph contrastive learning. Moreover, we employ constrained k-means on labeled texts to learn clustering-friendly features, which facilitate cluster-oriented contrastive learning and assist in obtaining better category boundaries. Extensive experimental results show that GIFT significantly outperforms previous state-of-the-art methods. Our code can be found at +https://github.com/KEAML-JLU/GIFT.
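As a rough illustration of the SVD-based view generation mentioned in the GIFT abstract above, the following sketch builds an augmented graph view as a low-rank reconstruction of an adjacency matrix; the function name, the rank hyperparameter, and the toy graph are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def svd_augmented_view(adj: np.ndarray, rank: int = 16) -> np.ndarray:
    """Build an augmented graph view as a low-rank SVD reconstruction of
    an adjacency matrix. Hypothetical helper for illustration only."""
    u, s, vt = np.linalg.svd(adj, full_matrices=False)
    # Keep only the top-`rank` singular components; this preserves the
    # dominant connectivity structure while discarding noisy detail.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

# Toy usage: a random symmetric adjacency matrix over 100 nodes.
rng = np.random.default_rng(0)
a = (rng.random((100, 100)) < 0.05).astype(float)
a = np.maximum(a, a.T)
view = svd_augmented_view(a, rank=16)  # second view for contrastive learning
```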
\ No newline at end of file diff --git a/data/2024/aaai/Improved MLP Point Cloud Processing with High-Dimensional Positional Encoding b/data/2024/aaai/Improved MLP Point Cloud Processing with High-Dimensional Positional Encoding new file mode 100644 index 0000000000..a7913069bc --- /dev/null +++ b/data/2024/aaai/Improved MLP Point Cloud Processing with High-Dimensional Positional Encoding @@ -0,0 +1 @@ +Multi-Layer Perceptron (MLP) models are the bedrock of contemporary point cloud processing. However, their complex network architectures obscure the source of their strength. We first develop an “abstraction and refinement” (ABS-REF) view for the neural modeling of point clouds. This view elucidates that whereas the early models focused on the ABS stage, the more recent techniques devise sophisticated REF stages to attain performance advantage in point cloud processing. We then borrow the concept of “positional encoding” from the transformer literature, and propose a High-dimensional Positional Encoding (HPE) module, which can be readily deployed in MLP-based architectures. We leverage our module to develop a suite of HPENets, which are MLP networks that follow the ABS-REF paradigm, albeit with a sophisticated HPE-based REF stage. The developed technique is extensively evaluated for 3D object classification, object part segmentation, semantic segmentation and object detection. We establish new state-of-the-art results of 87.6 mAcc on ScanObjectNN for object classification, 85.5 class mIoU on ShapeNetPart for object part segmentation, and 72.7 and 78.7 mIoU on the Area-5 and 6-fold experiments with S3DIS for semantic segmentation. The source code for this work is available at https://github.com/zouyanmei/HPENet. \ No newline at end of file diff --git a/data/2024/aaai/Improved Metric Distortion via Threshold Approvals b/data/2024/aaai/Improved Metric Distortion via Threshold Approvals new file mode 100644 index 0000000000..2fdd6f0b1f --- /dev/null +++ b/data/2024/aaai/Improved Metric Distortion via Threshold Approvals @@ -0,0 +1 @@ +We consider a social choice setting in which agents and alternatives are represented by points in a metric space, and the cost of an agent for an alternative is the distance between the corresponding points in the space. The goal is to choose a single alternative to (approximately) minimize the social cost (cost of all agents) or the maximum cost of any agent, when only limited information about the preferences of the agents is given. Previous work has shown that the best possible distortion one can hope to achieve is 3 when access to the ordinal preferences of the agents is given, even when the distances between alternatives in the metric space are known. We improve upon this bound of 3 by designing deterministic mechanisms that exploit a bit of cardinal information. We show that it is possible to achieve distortion 1+sqrt(2) by using the ordinal preferences of the agents, the distances between alternatives, and a threshold approval set per agent that contains all alternatives for which her cost is within an appropriately chosen factor of her cost for her most-preferred alternative. We show that this bound is the best possible for any deterministic mechanism in general metric spaces, and also provide improved bounds for the fundamental case of a line metric.
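The threshold approval set described in the metric distortion abstract above admits a direct sketch: for one agent, keep every alternative whose cost is within a chosen factor of the cost of her most-preferred alternative. The factor value below is a placeholder; the paper chooses it appropriately to obtain the 1+sqrt(2) distortion guarantee.

```python
from typing import Dict, List

def threshold_approval_set(costs: Dict[str, float], factor: float) -> List[str]:
    """Return all alternatives whose cost for this agent is within `factor`
    times her cost for her most-preferred alternative. `factor` is a
    placeholder value, not the paper's appropriately chosen one."""
    best = min(costs.values())
    return [alt for alt, c in costs.items() if c <= factor * best]

# Toy usage with made-up metric costs (distances) for one agent.
agent_costs = {"a": 2.0, "b": 2.5, "c": 7.0}
print(threshold_approval_set(agent_costs, factor=1.5))  # ['a', 'b']
```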
\ No newline at end of file diff --git a/data/2024/aaai/Improving Audio-Visual Segmentation with Bidirectional Generation b/data/2024/aaai/Improving Audio-Visual Segmentation with Bidirectional Generation new file mode 100644 index 0000000000..abc38f04ac --- /dev/null +++ b/data/2024/aaai/Improving Audio-Visual Segmentation with Bidirectional Generation @@ -0,0 +1 @@ +The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the contribution of each modality is implicitly or explicitly modeled. Nevertheless, the interconnections between different modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing the performance of AVS. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes reconstruction errors. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be challenging to capture using conventional optical flow methods. To showcase the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark. As a result, we establish a new state-of-the-art performance level on the AVS benchmark, particularly excelling in the challenging MS3 subset, which involves segmenting multiple sound sources. Code is released at: https://github.com/OpenNLPLab/AVS-bidirectional. \ No newline at end of file diff --git a/data/2024/aaai/Improving Automatic VQA Evaluation Using Large Language Models b/data/2024/aaai/Improving Automatic VQA Evaluation Using Large Language Models new file mode 100644 index 0000000000..5ad8efa3fe --- /dev/null +++ b/data/2024/aaai/Improving Automatic VQA Evaluation Using Large Language Models @@ -0,0 +1 @@ +Eight years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accuracy has been effective so far in the IID evaluation setting. However, our community is undergoing a shift towards open-ended generative models and OOD evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need to develop more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned large language models (LLMs) to build a better VQA metric. We formulate VQA evaluation as an answer-rating task where the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate that the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks. We hope wide adoption of our metric will contribute to better estimating the research progress on the VQA task. We plan to release the evaluation code and collected human judgments.
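A minimal sketch of the answer-rating formulation described in the VQA-evaluation abstract above: assemble a prompt that asks an instruction-tuned LLM to score a candidate answer against a set of reference answers. The prompt wording, the 1-5 scale, and the function name are assumptions for illustration, not the authors' released prompt.

```python
def build_rating_prompt(question: str, references: list, candidate: str) -> str:
    """Assemble an answer-rating prompt for an instruction-tuned LLM.
    The wording and the 1-5 scale are illustrative assumptions."""
    refs = "; ".join(references)
    return (
        "Rate how accurate the candidate answer is, given the question "
        "and the reference answers. Reply with a single number from 1 "
        "(completely wrong) to 5 (fully correct).\n"
        f"Question: {question}\n"
        f"Reference answers: {refs}\n"
        f"Candidate answer: {candidate}\n"
        "Rating:"
    )

prompt = build_rating_prompt(
    "What color is the bus?", ["yellow", "yellow and black"], "it is yellow"
)
# The prompt is then sent to the LLM; the returned rating serves as the
# accuracy score for this candidate answer.
```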
\ No newline at end of file diff --git a/data/2024/aaai/Improving Autonomous Separation Assurance through Distributed Reinforcement Learning with Attention Networks b/data/2024/aaai/Improving Autonomous Separation Assurance through Distributed Reinforcement Learning with Attention Networks new file mode 100644 index 0000000000..511816af68 --- /dev/null +++ b/data/2024/aaai/Improving Autonomous Separation Assurance through Distributed Reinforcement Learning with Attention Networks @@ -0,0 +1 @@ +Advanced Air Mobility (AAM) introduces a new, efficient mode of transportation with the use of vehicle autonomy and electrified aircraft to provide increasingly autonomous transportation between previously underserved markets. Safe and efficient navigation of low altitude aircraft through highly dense environments requires the integration of a multitude of complex observations, such as surveillance, knowledge of vehicle dynamics, and weather. The processing and reasoning on these observations pose challenges due to the various sources of uncertainty in the information while ensuring cooperation with a variable number of aircraft in the airspace. These challenges coupled with the requirement to make safety-critical decisions in real-time rule out the use of conventional separation assurance techniques. We present a decentralized reinforcement learning framework to provide autonomous self-separation capabilities within AAM corridors with the use of speed and vertical maneuvers. The problem is formulated as a Markov Decision Process and solved by developing a novel extension to the sample-efficient, off-policy soft actor-critic (SAC) algorithm. We introduce the use of attention networks for variable-length observation processing and a distributed computing architecture to achieve high training sample throughput as compared to existing approaches. A comprehensive numerical study shows that the proposed framework can ensure safe and efficient separation of aircraft in high density, dynamic environments with various sources of uncertainty. \ No newline at end of file diff --git a/data/2024/aaai/Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning b/data/2024/aaai/Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning new file mode 100644 index 0000000000..7280d0aae7 --- /dev/null +++ b/data/2024/aaai/Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning @@ -0,0 +1 @@ +Although image captioning models have made significant advancements in recent years, the majority of them heavily depend on high-quality datasets containing paired images and texts which are costly to acquire. Previous works leverage the CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings. However, not only does a modality gap exist between CLIP text and image features, but a discrepancy also arises between training and inference due to the unavailability of real-world images, which hinders the cross-modal alignment in text-only captioning. This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs. A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space. 
Furthermore, textual information is gathered to represent image features, yielding image features with richer semantics and bridging the modality gap. To unify training and inference, synthetic image features serve as the training prefix for the language decoder, while real images are used for inference. Additionally, salient objects in images are detected to assist in enhancing the learning of modality alignment. Experimental results demonstrate that our method obtains state-of-the-art performance on benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction b/data/2024/aaai/Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction new file mode 100644 index 0000000000..6e1e2a8bc3 --- /dev/null +++ b/data/2024/aaai/Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction @@ -0,0 +1 @@ +The generative diffusion prior captured from an off-the-shelf denoising diffusion generative model has recently attracted significant interest. However, attempts to adapt diffusion models to noisy inverse problems either fail to achieve satisfactory results or require a few thousand iterations to achieve high-quality reconstructions. In this work, we propose a diffusion-based image restoration with error contraction and error correction (DiffECC) method. Two strategies are introduced to contract the restoration error in the posterior sampling process. First, we combine existing CNN-based approaches with diffusion models to ensure data consistency from the beginning. Second, to amplify the error contraction effects of the noise, a restart sampling algorithm is designed. In the error correction strategy, the estimation-correction idea is proposed on both the data term and the prior term. Solving them iteratively within the diffusion sampling framework leads to superior image generation results. Experimental results for image restoration tasks such as super-resolution (SR), Gaussian deblurring, and motion deblurring demonstrate that our approach can reconstruct high-quality images compared with state-of-the-art sampling-based diffusion models. \ No newline at end of file diff --git a/data/2024/aaai/Improving Distinguishability of Class for Graph Neural Networks b/data/2024/aaai/Improving Distinguishability of Class for Graph Neural Networks new file mode 100644 index 0000000000..0c71255dc3 --- /dev/null +++ b/data/2024/aaai/Improving Distinguishability of Class for Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have received widespread attention and applications due to their excellent performance in graph representation learning. Most existing GNNs can only aggregate 1-hop neighbors in a GNN layer, so they usually stack multiple GNN layers to obtain more information from larger neighborhoods. However, many studies have shown that model performance degrades significantly as the number of GNN layers increases. In this paper, we first introduce the concept of distinguishability of class to indirectly evaluate the learned node representations, and verify the positive correlation between distinguishability of class and model performance. Then, we propose a Graph Neural Network guided by Distinguishability of class (Disc-GNN) to monitor the representation learning, so as to learn better node representations and improve model performance.
Specifically, we first perform inter-layer filtering and initial compensation based on Local Distinguishability of Class (LDC) in each layer, so that the learned node representations have the ability to distinguish different classes. Furthermore, we add a regularization term based on Global Distinguishability of Class (GDC) to achieve global optimization of model performance. Extensive experiments on six real-world datasets have shown the competitive performance of Disc-GNN against state-of-the-art methods on node classification and node clustering tasks. \ No newline at end of file diff --git a/data/2024/aaai/Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction b/data/2024/aaai/Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction new file mode 100644 index 0000000000..7d728a594e --- /dev/null +++ b/data/2024/aaai/Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction @@ -0,0 +1 @@ +In recent years, spectral graph neural networks, characterized by polynomial filters, have garnered increasing attention and have achieved remarkable performance in tasks such as node classification. These models typically assume that eigenvalues of the normalized Laplacian matrix are distinct from each other, thus expecting a polynomial filter to have a high fitting ability. However, this paper empirically observes that normalized Laplacian matrices frequently possess repeated eigenvalues. Moreover, we theoretically establish that the number of distinguishable eigenvalues plays a pivotal role in determining the expressive power of spectral graph neural networks. In light of this observation, we propose an eigenvalue correction strategy that can free polynomial filters from the constraints of repeated eigenvalue inputs. Concretely, the proposed eigenvalue correction strategy makes the distribution of eigenvalues more uniform, thus mitigating repeated eigenvalues and improving the fitting capacity and expressive power of polynomial filters. Extensive experimental results on both synthetic and real-world datasets demonstrate the superiority of our method. \ No newline at end of file diff --git a/data/2024/aaai/Improving Factual Error Correction by Learning to Inject Factual Errors b/data/2024/aaai/Improving Factual Error Correction by Learning to Inject Factual Errors new file mode 100644 index 0000000000..05aedf932e --- /dev/null +++ b/data/2024/aaai/Improving Factual Error Correction by Learning to Inject Factual Errors @@ -0,0 +1 @@ +Factual error correction (FEC) aims to revise factual errors in false claims with minimal editing, making them faithful to the provided evidence. This task is crucial for alleviating the hallucination problem encountered by large language models. Given the lack of paired data (i.e., false claims and their corresponding correct claims), existing methods typically adopt the ‘mask-then-correct’ paradigm. This paradigm relies solely on unpaired false claims and correct claims, thus being referred to as distantly supervised methods. These methods require a masker to explicitly identify factual errors within false claims before revising with a corrector. However, the absence of paired data to train the masker makes accurately pinpointing factual errors within claims challenging. To mitigate this, we propose to improve FEC by Learning to Inject Factual Errors (LIFE), a three-step distantly supervised method: ‘mask-corrupt-correct’.
Specifically, we first train a corruptor using the ‘mask-then-corrupt’ procedure, allowing it to deliberately introduce factual errors into correct text. The corruptor is then applied to correct claims, generating a substantial amount of paired data. After that, we filter out low-quality data and use the remaining data to train a corrector. Notably, our corrector does not require a masker, thus circumventing the bottleneck associated with explicit factual error identification. Our experiments on a public dataset verify the effectiveness of LIFE in two key aspects: Firstly, it outperforms the previous best-performing distantly supervised method by a notable margin of 10.59 points in SARI Final (19.3% improvement). Secondly, even compared to ChatGPT prompted with in-context examples, LIFE maintains an advantage of 7.16 points in SARI Final. \ No newline at end of file diff --git a/data/2024/aaai/Improving Faithfulness in Abstractive Text Summarization with EDUs Using BART (Student Abstract) b/data/2024/aaai/Improving Faithfulness in Abstractive Text Summarization with EDUs Using BART (Student Abstract) new file mode 100644 index 0000000000..396ef673b4 --- /dev/null +++ b/data/2024/aaai/Improving Faithfulness in Abstractive Text Summarization with EDUs Using BART (Student Abstract) @@ -0,0 +1 @@ +Abstractive text summarization uses the summarizer’s own words to capture the main information of a source document in a summary. While it is more challenging to automate than extractive text summarization, recent advancements in deep learning approaches and pre-trained language models have improved its performance. However, abstractive text summarization still has issues such as unfaithfulness. To address this problem, we propose a new approach that utilizes important Elementary Discourse Units (EDUs) to guide BART-based text summarization. Our approach showed improvements in truthfulness and source document coverage in comparison to previous studies. \ No newline at end of file diff --git a/data/2024/aaai/Improving GNN Calibration with Discriminative Ability: Insights and Strategies b/data/2024/aaai/Improving GNN Calibration with Discriminative Ability: Insights and Strategies new file mode 100644 index 0000000000..100f582fd4 --- /dev/null +++ b/data/2024/aaai/Improving GNN Calibration with Discriminative Ability: Insights and Strategies @@ -0,0 +1 @@ +The widespread adoption of Graph Neural Networks (GNNs) has led to an increasing focus on their reliability. To address the issue of underconfidence in GNNs, various calibration methods have been developed, yielding notable reductions in calibration error. However, we observe that existing approaches generally fail to enhance consistently, and in some cases even deteriorate, GNNs' ability to discriminate between correct and incorrect predictions. In this study, we advocate the significance of discriminative ability and the inclusion of relevant evaluation metrics. Our rationale is twofold: 1) Overlooking discriminative ability can inadvertently compromise the overall quality of the model; 2) Leveraging discriminative ability can significantly inform and improve calibration outcomes. Therefore, we thoroughly explore the reasons why existing calibration methods are ineffective for, and can even degrade, the discriminative ability of GNNs. Building upon these insights, we conduct GNN calibration experiments across multiple datasets using a straightforward example model, denoted as DC(GNN).
Its excellent performance confirms the potential of integrating discriminative ability as a key consideration in the calibration of GNNs, thereby establishing a pathway toward more effective and reliable network calibration. \ No newline at end of file diff --git a/data/2024/aaai/Improving Health Information Access in the World's Largest Maternal Mobile Health Program via Bandit Algorithms b/data/2024/aaai/Improving Health Information Access in the World's Largest Maternal Mobile Health Program via Bandit Algorithms new file mode 100644 index 0000000000..3dabd9d56d --- /dev/null +++ b/data/2024/aaai/Improving Health Information Access in the World's Largest Maternal Mobile Health Program via Bandit Algorithms @@ -0,0 +1 @@ +Harnessing the widespread availability of cell phones, many nonprofits have launched mobile health (mHealth) programs to deliver information via voice or text to beneficiaries in underserved communities, with maternal and infant health being a key area of such mHealth programs. Unfortunately, dwindling listenership is a major challenge, requiring targeted interventions using limited resources. This paper focuses on Kilkari, the world's largest mHealth program for maternal and child care -- with over 3 million active subscribers at a time -- launched by India's Ministry of Health and Family Welfare (MoHFW) and run by the non-profit ARMMAN. We present a system called CHAHAK that aims to reduce automated dropouts as well as boost engagement with the program through the strategic allocation of interventions to beneficiaries. Past work in a similar domain has focused on a much smaller-scale mHealth program and used Markovian restless multi-armed bandits to optimize a single limited intervention resource. However, this paper demonstrates the challenges of adopting a Markovian approach in Kilkari; therefore, CHAHAK instead relies on non-Markovian time-series restless bandits and optimizes a layered set of multiple interventions to improve listenership. We use real Kilkari data from the Odisha state in India to show CHAHAK's effectiveness in harnessing multiple interventions to boost listenership, benefiting marginalized communities. When deployed, CHAHAK will assist the largest maternal mHealth program to date. \ No newline at end of file diff --git a/data/2024/aaai/Improving IP Geolocation With Target-Centric IP Graph (Student Abstract) b/data/2024/aaai/Improving IP Geolocation With Target-Centric IP Graph (Student Abstract) new file mode 100644 index 0000000000..a234e8155d --- /dev/null +++ b/data/2024/aaai/Improving IP Geolocation With Target-Centric IP Graph (Student Abstract) @@ -0,0 +1 @@ +Accurate IP geolocation is indispensable for location-aware applications. While recent advances based on router-centric IP graphs are considered cutting-edge, one challenge remains: the prevalence of sparse IP graphs (14.24% with fewer than 10 nodes, 9.73% isolated) limits graph learning. To mitigate this issue, we designate the target host as the central node and aggregate multiple last-hop routers to construct the target-centric IP graph, instead of relying solely on the router with the smallest last-hop latency as in previous works. Experiments on three real-world datasets show that our method significantly improves geolocation accuracy compared to existing baselines.
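A minimal sketch of the target-centric construction described in the IP geolocation abstract above: the target host becomes the central node and every observed last-hop router is attached to it, instead of keeping only the router with the smallest last-hop latency. The adjacency-list format and field names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

def build_target_centric_graph(
    target_ip: str, last_hops: List[Tuple[str, float]]
) -> Dict[str, List[Tuple[str, float]]]:
    """Build a star-shaped IP graph with the target host as the central node
    and every observed last-hop router attached to it (edge weight = last-hop
    latency in ms). Illustrative sketch, not the authors' implementation."""
    graph: Dict[str, List[Tuple[str, float]]] = {target_ip: []}
    for router_ip, latency_ms in last_hops:
        graph[target_ip].append((router_ip, latency_ms))
        graph.setdefault(router_ip, []).append((target_ip, latency_ms))
    return graph

# Toy usage: three last-hop routers observed on different probe paths.
g = build_target_centric_graph(
    "203.0.113.7",
    [("198.51.100.1", 3.2), ("198.51.100.9", 4.8), ("192.0.2.14", 2.9)],
)
```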
\ No newline at end of file diff --git a/data/2024/aaai/Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis b/data/2024/aaai/Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis new file mode 100644 index 0000000000..2d76935904 --- /dev/null +++ b/data/2024/aaai/Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis @@ -0,0 +1 @@ +Large language models (LLMs) offer significant promise as a knowledge source for task learning. Prompt engineering has been shown to be effective for eliciting knowledge from an LLM, but alone it is insufficient for acquiring relevant, situationally grounded knowledge for an embodied agent learning novel tasks. We describe a cognitive-agent approach, STARS, that extends and complements prompt engineering, mitigating its limitations and thus enabling an agent to acquire new task knowledge matched to its native language capabilities, embodiment, environment, and user preferences. The STARS approach is to increase the response space of LLMs and deploy general strategies, embedded within the autonomous agent, to evaluate, repair, and select among candidate responses produced by the LLM. We describe the approach and experiments that show how an agent, by retrieving and evaluating a breadth of responses from the LLM, can achieve 77-94% task completion in one-shot learning without user oversight. The approach achieves 100% task completion when human oversight (such as an indication of preference) is provided. Further, the type of oversight largely shifts from explicit, natural language instruction to simple confirmation/disconfirmation of high-quality responses that have been vetted by the agent before presentation to a user. \ No newline at end of file diff --git a/data/2024/aaai/Improving Neural Network Generalization on Data-Limited Regression with Doubly-Robust Boosting b/data/2024/aaai/Improving Neural Network Generalization on Data-Limited Regression with Doubly-Robust Boosting new file mode 100644 index 0000000000..37ab42260a --- /dev/null +++ b/data/2024/aaai/Improving Neural Network Generalization on Data-Limited Regression with Doubly-Robust Boosting @@ -0,0 +1,7 @@ +Enhancing the generalization performance of neural networks given limited data availability remains a formidable challenge, due to the model selection trade-off between training error and generalization gap. +To handle this challenge, we formulate a posterior optimization problem specifically designed to reduce the generalization error of trained neural networks. +To operationalize this concept, we propose a Doubly-Robust Boosting machine (DRBoost), which consists of a statistical learner and a zero-order optimizer. +The statistical learner reduces the model capacity and thus the generalization gap; the zero-order optimizer minimizes the training error in a gradient-free manner. The two components cooperate to reduce the generalization error of a fully trained neural network in a doubly robust manner. +Furthermore, the statistical learner alleviates the multicollinearity in the discriminative layer and enhances the generalization performance. +The zero-order optimizer eliminates the reliance on gradient calculation and offers more flexibility in learning objective selection. +Experiments demonstrate that DRBoost improves the generalization performance of various prevalent neural network backbones effectively.
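To illustrate the gradient-free role that the DRBoost abstract above assigns to its zero-order optimizer, the sketch below minimizes a training loss by simple random-perturbation search; this is a generic stand-in under stated assumptions, not the DRBoost algorithm itself.

```python
import numpy as np

def zero_order_fit(weights: np.ndarray, loss_fn, steps: int = 200,
                   sigma: float = 0.05, seed: int = 0) -> np.ndarray:
    """Gradient-free minimization of a training loss by random-perturbation
    search: keep a perturbed weight vector whenever it lowers the loss.
    Generic illustration of zero-order optimization only."""
    rng = np.random.default_rng(seed)
    best, best_loss = weights.copy(), loss_fn(weights)
    for _ in range(steps):
        cand = best + sigma * rng.standard_normal(best.shape)
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best

# Toy usage: fit a linear output layer on random features without gradients.
rng = np.random.default_rng(1)
x, y = rng.standard_normal((64, 8)), rng.standard_normal(64)
w = zero_order_fit(np.zeros(8), lambda w: np.mean((x @ w - y) ** 2))
```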
\ No newline at end of file diff --git a/data/2024/aaai/Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge b/data/2024/aaai/Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge new file mode 100644 index 0000000000..be638c6bbb --- /dev/null +++ b/data/2024/aaai/Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge @@ -0,0 +1 @@ +Open Set Recognition (OSR) poses significant challenges in distinguishing known from unknown classes. In OSR, the overconfidence problem has become a persistent obstacle, where visual recognition models often misclassify unknown objects as known objects with high confidence. This issue stems from the fact that visual recognition models often lack the integration of common-sense knowledge, a feature that is naturally present in language-based models but lacking in visual recognition systems. In this paper, we propose a novel approach to enhance OSR performance by distilling common-sense knowledge into visual prompts. Utilizing text prompts that embody common-sense knowledge about known classes, the proposed visual prompt is learned by extracting semantic common-sense features and aligning them with image features from visual recognition models. The unique aspect of this work is the training of individual visual prompts for each class to encapsulate this common-sense knowledge. Our methodology is model-agnostic, capable of enhancing OSR across various visual recognition models, and computationally light as it focuses solely on training the visual prompts. This research introduces a method for addressing OSR, aiming at a more systematic integration of visual recognition systems with common-sense knowledge. The obtained results indicate an enhancement in recognition accuracy, suggesting the applicability of this approach in practical settings. \ No newline at end of file diff --git a/data/2024/aaai/Improving Open-Domain Dialogue Response Generation with Multi-Source Multilingual Commonsense Knowledge b/data/2024/aaai/Improving Open-Domain Dialogue Response Generation with Multi-Source Multilingual Commonsense Knowledge new file mode 100644 index 0000000000..3a82b5ba3f --- /dev/null +++ b/data/2024/aaai/Improving Open-Domain Dialogue Response Generation with Multi-Source Multilingual Commonsense Knowledge @@ -0,0 +1 @@ +Knowledge-grounded Dialogue Response Generation (KRG) can facilitate informative and fidelity dialogues using external knowledge. Prior monolingual works can only use the knowledge of the corresponding native language. Thus, due to the prohibitive costs of collecting and constructing external knowledge bases, the limited scale of accessible external knowledge always constrains the ability of KRG, especially in low-resource language scenarios. To this end, we propose a new task, Multi-Source Multilingual Knowledge-Grounded Response Generation (MMKRG), which simultaneously uses multiple knowledge sources of different languages. We notice that simply combining knowledge of different languages is inefficient due to the Cross-Conflict issue and Cross-Repetition issue. Thus, we propose a novel approach MMK-BART, which uses a simple but elegant Estimate-Cluster-Penalize mechanism to overcome the mentioned issues and adopts the multilingual language model mBART as the backbone. 
Meanwhile, based on the recent multilingual corpus XDailyDialog, we propose an MMKRG dataset MMK-DailyDialog, which has been aligned to the large-scale multilingual commonsense knowledge base ConceptNet and supports four languages (English, Chinese, German, and Italian). Extensive experiments have verified the effectiveness of our dataset and approach in monolingual, cross-lingual, and multilingual scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation b/data/2024/aaai/Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation new file mode 100644 index 0000000000..6008721f64 --- /dev/null +++ b/data/2024/aaai/Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation @@ -0,0 +1 @@ +Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have demonstrated significant potential. These methods predict pixel-level masks by directly matching pixels and phrases. However, they often neglect the modeling of semantic and visual relationships between phrase-level instances, limiting their ability for complex multi-modal reasoning in PNG. To tackle this issue, we propose XPNG, a “differentiation-refinement-localization” reasoning paradigm for accurately locating instances or regions. In XPNG, we introduce a Semantic Context Convolution (SCC) module to leverage semantic priors for generating distinctive features. This well-crafted module employs a combination of dynamic channel-wise convolution and pixel-wise convolution to embed semantic information and establish inter-object relationships guided by semantics. Subsequently, we propose a Visual Context Verification (VCV) module to provide visual cues, eliminating potential space biases introduced by semantics and further refining the visual features generated by the previous module. Extensive experiments on PNG benchmark datasets reveal that our approach achieves state-of-the-art performance, significantly outperforming existing methods by a considerable margin and yielding a 3.9-point improvement in overall metrics. Our codes and results are available at our project webpage: https://github.com/TianyuGoGO/XPNG. \ No newline at end of file diff --git a/data/2024/aaai/Improving Robustness for Joint Optimization of Camera Pose and Decomposed Low-Rank Tensorial Radiance Fields b/data/2024/aaai/Improving Robustness for Joint Optimization of Camera Pose and Decomposed Low-Rank Tensorial Radiance Fields new file mode 100644 index 0000000000..4cabc8a518 --- /dev/null +++ b/data/2024/aaai/Improving Robustness for Joint Optimization of Camera Pose and Decomposed Low-Rank Tensorial Radiance Fields @@ -0,0 +1,13 @@ +In this paper, we propose an algorithm that allows joint refinement of camera pose and scene geometry represented by decomposed low-rank tensor, using only 2D images as supervision. + +First, we conduct a pilot study based on a 1D signal and relate our findings to 3D scenarios, where the naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. + +Moreover, based on the analysis of the frequency spectrum, we propose to apply convolutional Gaussian filters on 2D and 3D radiance fields for a coarse-to-fine training schedule that enables joint camera pose optimization. 
+ +Leveraging the decomposition property in decomposed low-rank tensor, our method achieves an equivalent effect to brute-force 3D convolution with only incurring little computational overhead. + +To further improve the robustness and stability of joint optimization, we also propose techniques of smoothed 2D supervision, randomly scaled kernel parameters, and edge-guided loss mask. + +Extensive quantitative and qualitative evaluations demonstrate that our proposed framework achieves superior performance in novel view synthesis as well as rapid convergence for optimization. + +The source code is available at https://github.com/Nemo1999/Joint-TensoRF. \ No newline at end of file diff --git a/data/2024/aaai/Improving the Adversarial Transferability of Vision Transformers with Virtual Dense Connection b/data/2024/aaai/Improving the Adversarial Transferability of Vision Transformers with Virtual Dense Connection new file mode 100644 index 0000000000..c4e526ced7 --- /dev/null +++ b/data/2024/aaai/Improving the Adversarial Transferability of Vision Transformers with Virtual Dense Connection @@ -0,0 +1 @@ +With the great achievement of vision transformers (ViTs), transformer-based approaches have become the new paradigm for solving various computer vision tasks. However, recent research shows that similar to convolutional neural networks (CNNs), ViTs are still vulnerable to adversarial attacks. To explore the shared deficiency of models with different structures, researchers begin to analyze the cross-structure adversarial transferability, which is still under-explored. Therefore, in this work, we focus on the ViT attacks to improve the cross-structure transferability between the transformer-based and convolution-based models. Previous studies fail to thoroughly investigate the influence of the components inside the ViT models on adversarial transferability, leading to inferior performance. To overcome the drawback, we launch a motivating study by linearly down-scaling the gradients of components inside the ViT models to analyze their influence on adversarial transferability. Based on the motivating study, we find that the gradient of the skip connection most influences transferability and believe that back-propagating gradients from deeper blocks can enhance transferability. Therefore, we propose the Virtual Dense Connection method (VDC). Specifically, without changing the forward pass, we first recompose the original network to add virtual dense connections. Then we back-propagate gradients of deeper Attention maps and Multi-layer Perceptron (MLP) blocks via virtual dense connections when generating adversarial samples. Extensive experiments confirm the superiority of our proposed method over the state-of-the-art baselines, with an 8.2% improvement in transferability between ViT models and a 7.2% improvement in cross-structure transferability from ViTs to CNNs. \ No newline at end of file diff --git a/data/2024/aaai/Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning b/data/2024/aaai/Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning new file mode 100644 index 0000000000..4b4e1a3b76 --- /dev/null +++ b/data/2024/aaai/Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning @@ -0,0 +1 @@ +Knowledge-grounded dialogue (KGD) learns to generate an informative response based on a given dialogue context and external knowledge (e.g., knowledge graphs; KGs). 
Recently, the emergence of large language models (LLMs) and pre-training techniques has brought great success to knowledge-grounded dialogue. However, when building KGD systems in real applications, various real-world noises are inevitable. For example, the dialogue context might involve perturbations such as misspellings and abbreviations. In addition, KGs typically suffer from incompleteness and might also contain erroneous and outdated facts. Such real-world noises pose a challenge to the robustness of KGD systems and hinder their applications in the real world. In this paper, we propose an entity-based contrastive learning framework for improving the robustness of KGD. Specifically, we make use of the entity information in a KGD sample to create both its positive and negative samples, which involve semantically irrelevant and semantically relevant perturbations, respectively. The contrastive learning framework ensures the KGD model is aware of these two types of perturbations and thus can generate informative responses from potentially noisy inputs in real applications. Experimental results on three widely-used benchmark datasets show that our method achieves new state-of-the-art performance in terms of automatic evaluation scores, verifying its effectiveness and potential. Furthermore, we show that our method is able to generate better responses than comparison models in both the noisy and the few-shot settings. \ No newline at end of file diff --git a/data/2024/aaai/In-Hand 3D Object Reconstruction from a Monocular RGB Video b/data/2024/aaai/In-Hand 3D Object Reconstruction from a Monocular RGB Video new file mode 100644 index 0000000000..7e05234a5e --- /dev/null +++ b/data/2024/aaai/In-Hand 3D Object Reconstruction from a Monocular RGB Video @@ -0,0 +1 @@ +Our work aims to reconstruct a 3D object that is held and rotated by a hand in front of a static RGB camera. Previous methods that use implicit neural representations to recover the geometry of a generic hand-held object from multi-view images achieved compelling results in the visible part of the object. However, these methods falter in accurately capturing the shape within the hand-object contact region due to occlusion. In this paper, we propose a novel method that deals with surface reconstruction under occlusion by incorporating priors of 2D occlusion elucidation and physical contact constraints. For the former, we introduce an object amodal completion network to infer the 2D complete mask of objects under occlusion. To ensure the accuracy and view consistency of the predicted 2D amodal masks, we devise a joint optimization method for both amodal mask refinement and 3D reconstruction. For the latter, we impose penetration and attraction constraints on the local geometry in contact regions. We evaluate our approach on the HO3D and HOD datasets and demonstrate that it outperforms the state-of-the-art methods in terms of reconstruction surface quality, with an improvement of 52% on HO3D and 20% on HOD. Project webpage: https://east-j.github.io/ihor.
\ No newline at end of file diff --git a/data/2024/aaai/IncepSeqNet: Advancing Signal Classification with Multi-Shape Augmentation (Student Abstract) b/data/2024/aaai/IncepSeqNet: Advancing Signal Classification with Multi-Shape Augmentation (Student Abstract) new file mode 100644 index 0000000000..7cbe486615 --- /dev/null +++ b/data/2024/aaai/IncepSeqNet: Advancing Signal Classification with Multi-Shape Augmentation (Student Abstract) @@ -0,0 +1 @@ +This work proposes and analyzes IncepSeqNet, a new model combining the Inception Module with the innovative Multi-Shape Augmentation technique. IncepSeqNet excels in feature extraction from sequence signal data consisting of complex numbers to achieve superior classification accuracy across various SNR (Signal-to-Noise Ratio) environments. Experimental results demonstrate that IncepSeqNet outperforms existing models, particularly at low SNR levels. Furthermore, we have confirmed its applicability in practical 5G systems by using real-world signal data. \ No newline at end of file diff --git a/data/2024/aaai/Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding b/data/2024/aaai/Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding new file mode 100644 index 0000000000..043dbc80d7 --- /dev/null +++ b/data/2024/aaai/Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding @@ -0,0 +1 @@ +Incomplete multi-view clustering has become an important research problem, since multi-view data with missing values are ubiquitous in real-world applications. Although great efforts have been made for incomplete multi-view clustering, there are still some challenges: 1) most existing methods do not make full use of multi-view information to deal with missing values; 2) most methods just employ the consistent information within multi-view data but ignore the complementary information; 3) for existing incomplete multi-view clustering methods, incomplete multi-view representation learning and clustering are treated as independent processes, which leads to a performance gap. In this work, we propose a novel Incomplete Contrastive Multi-View Clustering method with high-confidence guiding (ICMVC). Firstly, we propose a multi-view consistency relation transfer plus graph convolutional network to tackle the missing values problem. Secondly, instance-level attention fusion and high-confidence guiding are proposed to exploit the complementary information, while instance-level contrastive learning for latent representation is designed to employ the consistent information. Thirdly, an end-to-end framework is proposed to integrate multi-view missing values handling, multi-view representation learning and clustering assignment for joint optimization. Experiments comparing with state-of-the-art approaches demonstrate the effectiveness and superiority of our method. Our code is publicly available at https://github.com/liunian-Jay/ICMVC. The version with supplementary material can be found at http://arxiv.org/abs/2312.08697.
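For readers unfamiliar with the instance-level contrastive term mentioned above, the following is a minimal InfoNCE-style sketch of a cross-view instance contrastive loss: representations of the same instance under two views form the positive pair, and the other instances in the batch act as negatives. It illustrates the general mechanism only and is not ICMVC's exact formulation.

import torch
import torch.nn.functional as F

def instance_contrastive_loss(z1, z2, temperature=0.5):
    # z1, z2: (N, d) latent codes of the same N instances under two views
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (N, N) cross-view cosine similarities
    targets = torch.arange(z1.size(0))            # positive pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

z_view1, z_view2 = torch.randn(8, 16), torch.randn(8, 16)
print(instance_contrastive_loss(z_view1, z_view2).item())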
\ No newline at end of file diff --git a/data/2024/aaai/Inconsistency-Based Data-Centric Active Open-Set Annotation b/data/2024/aaai/Inconsistency-Based Data-Centric Active Open-Set Annotation new file mode 100644 index 0000000000..7b6488a37e --- /dev/null +++ b/data/2024/aaai/Inconsistency-Based Data-Centric Active Open-Set Annotation @@ -0,0 +1 @@ +Active learning, a method to reduce labeling effort for training deep neural networks, is often limited by the assumption that all unlabeled data belong to known classes. This closed-world assumption fails in practical scenarios with unknown classes in the data, leading to active open-set annotation challenges. Existing methods struggle with this uncertainty. We introduce NEAT, a novel, computationally efficient, data-centric active learning approach for open-set data. NEAT differentiates and labels known classes from a mix of known and unknown classes, using a clusterability criterion and a consistency measure that detects inconsistencies between model predictions and the feature distribution. In contrast to recent learning-centric solutions, NEAT shows superior performance in active open-set annotation, as our experiments confirm. Additional details on the further evaluation metrics, implementation, and architecture of our method can be found in the public document at https://arxiv.org/pdf/2401.04923.pdf. \ No newline at end of file diff --git a/data/2024/aaai/Incorporating Serverless Computing into P2P Networks for ML Training: In-Database Tasks and Their Scalability Implications (Student Abstract) b/data/2024/aaai/Incorporating Serverless Computing into P2P Networks for ML Training: In-Database Tasks and Their Scalability Implications (Student Abstract) new file mode 100644 index 0000000000..7a8a5ecc30 --- /dev/null +++ b/data/2024/aaai/Incorporating Serverless Computing into P2P Networks for ML Training: In-Database Tasks and Their Scalability Implications (Student Abstract) @@ -0,0 +1 @@ +Distributed ML addresses challenges from increasing data and model complexities. Peer-to-peer (P2P) networks in distributed ML offer scalability and fault tolerance. However, they also encounter challenges related to resource consumption and communication overhead as the number of participating peers grows. This research introduces a novel architecture that combines serverless computing with P2P networks for distributed training. Serverless computing enhances this model with parallel processing and cost-effective scalability, suitable for resource-intensive tasks. Preliminary results show that peers can offload expensive computational tasks to serverless platforms. However, their inherent statelessness necessitates strong communication methods, suggesting a pivotal role for databases. To this end, we have enhanced an in-memory database to support ML training tasks. \ No newline at end of file diff --git a/data/2024/aaai/Independence of Irrelevant Alternatives under the Lens of Pairwise Distortion b/data/2024/aaai/Independence of Irrelevant Alternatives under the Lens of Pairwise Distortion new file mode 100644 index 0000000000..7dcdac4731 --- /dev/null +++ b/data/2024/aaai/Independence of Irrelevant Alternatives under the Lens of Pairwise Distortion @@ -0,0 +1,2 @@ +We give a quantitative analysis of the independence of irrelevant alternatives (IIA) axiom.
IIA says that the society's preference between x and y should depend only on individual preferences between x and y: we show that, in several contexts, if the individuals express their preferences about additional (``irrelevant'') alternatives, this information helps to better estimate which of x and y has higher social welfare. +Our contribution is threefold: (1) we provide a new tool to measure the impact of IIA on social welfare (pairwise distortion), based on the well-established notion of voting distortion, (2) we study the average impact of IIA in both general and metric settings, with experiments on synthetic and real data, and (3) we study the worst-case impact of IIA in the 1D-Euclidean metric space. \ No newline at end of file diff --git a/data/2024/aaai/Independency Adversarial Learning for Cross-Modal Sound Separation b/data/2024/aaai/Independency Adversarial Learning for Cross-Modal Sound Separation new file mode 100644 index 0000000000..e8b604fc4c --- /dev/null +++ b/data/2024/aaai/Independency Adversarial Learning for Cross-Modal Sound Separation @@ -0,0 +1 @@ +Sound mixture separation is still challenging due to heavy sound overlapping and disturbance from noise. Unsupervised separation further increases the difficulty. As sound overlapping always hinders accurate sound separation, we propose an Independency Adversarial Learning based Cross-Modal Sound Separation (IAL-CMS) approach, where IAL employs adversarial learning to minimize the correlation of separated sound elements, exploring high sound independence; CMS performs cross-modal sound separation, incorporating audio-visual consistent feature learning and interactive cross-attention learning to emphasize the semantic consistency among cross-modal features. Both audio-visual consistency and audio consistency are kept to guarantee accurate separation. The consistency and sound independence ensure the decomposition of overlapping mixtures into unrelated and distinguishable sound elements. The proposed approach is evaluated on MUSIC, VGGSound, and AudioSet. Extensive experiments certify that our approach outperforms existing approaches in supervised and unsupervised scenarios. \ No newline at end of file diff --git a/data/2024/aaai/IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context b/data/2024/aaai/IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context new file mode 100644 index 0000000000..7bcd503c58 --- /dev/null +++ b/data/2024/aaai/IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context @@ -0,0 +1,17 @@ +Hate speech (HS) is a growing concern in many parts of +the world, including India, where it has led to numerous instances of violence and discrimination. The development of +effective counter-narratives (CNs) is a critical step in combating hate speech, but there is a lack of research in this +area, especially in non-English languages. In this paper, we +introduce a new dataset, IndicCONAN, of counter-narratives +against hate speech in Hindi and Indian English. We propose a scalable human-in-the-loop approach for generating counter-narratives by an auto-regressive language model +through a machine generation - human correction cycle, where +the model uses augmented data from previous cycles to generate new training samples. These newly generated samples +are then reviewed and edited by annotators, leading to further +model refinement.
The dataset consists of over 2,500 examples of counter-narratives each in both English and Hindi corresponding to various hate speeches in the Indian context. We +also present a framework for generating CNs conditioned on +a specific CN type with a mean perplexity of 3.85 for English +and 3.70 for Hindi, a mean toxicity score of 0.04 for English +and 0.06 for Hindi, and a mean diversity of 0.08 for English +and 0.14 for Hindi. Our dataset and framework provide valuable resources for researchers and practitioners working to +combat hate speech in the Indian context. \ No newline at end of file diff --git a/data/2024/aaai/Inducing Clusters Deep Kernel Gaussian Process for Longitudinal Data b/data/2024/aaai/Inducing Clusters Deep Kernel Gaussian Process for Longitudinal Data new file mode 100644 index 0000000000..36443b8583 --- /dev/null +++ b/data/2024/aaai/Inducing Clusters Deep Kernel Gaussian Process for Longitudinal Data @@ -0,0 +1 @@ +We consider the problem of predictive modeling from irregularly and sparsely sampled longitudinal data with unknown, complex correlation structures and abrupt discontinuities. To address these challenges, we introduce a novel inducing clusters longitudinal deep kernel Gaussian Process (ICDKGP). ICDKGP approximates the data generating process by a zero-mean GP with a longitudinal deep kernel that models the unknown complex correlation structure in the data and a deterministic non-zero mean function to model the abrupt discontinuities. To improve the scalability and interpretability of ICDKGP, we introduce inducing clusters corresponding to centers of clusters in the training data. We formulate the training of ICDKGP as a constrained optimization problem and derive its evidence lower bound. We introduce a novel relaxation of the resulting problem which under rather mild assumptions yields a solution with error bounded relative to the original problem. We describe the results of extensive experiments demonstrating that ICDKGP substantially outperforms the state-of-the-art longitudinal methods on data with both smoothly and non-smoothly varying outcomes. \ No newline at end of file diff --git a/data/2024/aaai/Inertial Algorithm with Dry Fraction and Convolutional Sparse Coding for 3D Localization with Light Field Microscopy b/data/2024/aaai/Inertial Algorithm with Dry Fraction and Convolutional Sparse Coding for 3D Localization with Light Field Microscopy new file mode 100644 index 0000000000..35f207c7fc --- /dev/null +++ b/data/2024/aaai/Inertial Algorithm with Dry Fraction and Convolutional Sparse Coding for 3D Localization with Light Field Microscopy @@ -0,0 +1 @@ +Light field microscopy is a high-speed 3D imaging technique that records the light field from multiple angles by the microlens array (MLA), thus allowing us to obtain information about the light source from a single image only. For the fundamental problem of neuron localization, we improve the method of combining a depth-dependent dictionary with sparse coding in this paper. In order to obtain higher localization accuracy and good noise immunity, we propose an inertial proximal gradient acceleration algorithm with dry friction, Fast-IPGDF. By avoiding falling into a local minimum, our algorithm achieves better convergence and converges quite fast, which improves the speed and accuracy of obtaining the localization of the light source based on the matching depth of epipolar plane images (EPI).
We demonstrate the effectiveness of the algorithm for localizing non-scattered fluorescent beads in both noisy and non-noisy environments. The experimental results show that our method can achieve simultaneous localization of multiple point sources and effective localization in noisy environments. Compared to existing studies, our method shows significant improvements in both localization accuracy and speed. \ No newline at end of file diff --git a/data/2024/aaai/Inference and Learning in Dynamic Decision Networks Using Knowledge Compilation b/data/2024/aaai/Inference and Learning in Dynamic Decision Networks Using Knowledge Compilation new file mode 100644 index 0000000000..039fd9d41a --- /dev/null +++ b/data/2024/aaai/Inference and Learning in Dynamic Decision Networks Using Knowledge Compilation @@ -0,0 +1 @@ +Decision making under uncertainty in dynamic environments is a fundamental AI problem in which agents need to determine which decisions (or actions) to make at each time step to maximise their expected utility. Dynamic decision networks (DDNs) are an extension of dynamic Bayesian networks with decisions and utilities. DDNs can be used to compactly represent Markov decision processes (MDPs). We propose a novel algorithm called mapl-cirup that leverages knowledge compilation techniques developed for (dynamic) Bayesian networks to perform inference and gradient-based learning in DDNs. Specifically, we knowledge-compile the Bellman update present in DDNs into dynamic decision circuits and evaluate them within an (algebraic) model counting framework. In contrast to other exact symbolic MDP approaches, we obtain differentiable circuits that enable gradient-based parameter learning. \ No newline at end of file diff --git a/data/2024/aaai/Influential Exemplar Replay for Incremental Learning in Recommender Systems b/data/2024/aaai/Influential Exemplar Replay for Incremental Learning in Recommender Systems new file mode 100644 index 0000000000..0d845558b7 --- /dev/null +++ b/data/2024/aaai/Influential Exemplar Replay for Incremental Learning in Recommender Systems @@ -0,0 +1 @@ +Personalized recommender systems have found widespread applications for effective information filtering. Conventional models engage in knowledge mining within the static setting to reconstruct singular historical data. Nonetheless, the dynamics of real-world environments are in a constant state of flux, rendering acquired model knowledge inadequate for accommodating emergent trends and thus leading to notable recommendation performance decline. Given the typically prohibitive cost of exhaustive model retraining, it has emerged to study incremental learning for recommender systems with ever-growing data. In this paper, we propose an effective model-agnostic framework, namely INFluential Exemplar Replay (INFER). INFER facilitates recommender models in retaining the earlier assimilated knowledge, e.g., users' enduring preferences, while concurrently accommodating evolving trends manifested in users' new interaction behaviors. We commence with a vanilla implementation that centers on identifying the most representative data samples for effective consolidation of early knowledge. Subsequently, we propose an advanced solution, namely INFERONCE, to optimize the computational overhead associated with the vanilla implementation. 
Extensive experiments on four prototypical backbone models, two classic recommendation tasks, and four widely used benchmarks consistently demonstrate the effectiveness of our method as well as its compatibility for extending to several incremental recommender models. \ No newline at end of file diff --git a/data/2024/aaai/Information Design for Congestion Games with Unknown Demand b/data/2024/aaai/Information Design for Congestion Games with Unknown Demand new file mode 100644 index 0000000000..ddd1608c4d --- /dev/null +++ b/data/2024/aaai/Information Design for Congestion Games with Unknown Demand @@ -0,0 +1,4 @@ +We study a novel approach to information design in the standard traffic model of network congestion games. It captures the natural condition that the demand is unknown to the users of the network. A principal (e.g., a mobility service) commits to a signaling strategy, observes the realized demand and sends a (public) signal to agents (i.e., users of the network). Based on the induced belief about the demand, the users then form an equilibrium. We consider the algorithmic goal of the principal: Compute a signaling scheme that minimizes the expected total cost of the induced equilibrium. We concentrate on single-commodity networks and affine cost functions, for which we obtain the following results. + +First, we devise a fully polynomial-time approximation scheme (FPTAS) for the case that the demand can only take two values. It relies on several structural properties of the cost of the induced equilibrium as a function of the updated belief about the distribution of demands. We show that this function is piecewise linear for any number of demands, and monotonic for two demands. +Second, we give a complete characterization of the graph structures for which it is optimal to fully reveal the information about the realized demand. This signaling scheme turns out to be optimal for all cost functions and probability distributions over demands if and only if the graph is series-parallel. Third, we propose an algorithm that computes the optimal signaling scheme for any number of demands whose time complexity is polynomial in the number of supports that occur in a Wardrop equilibrium for some demand. Finally, we conduct a computational study that tests this algorithm on real-world instances. \ No newline at end of file diff --git a/data/2024/aaai/Input Margins Can Predict Generalization Too b/data/2024/aaai/Input Margins Can Predict Generalization Too new file mode 100644 index 0000000000..2a4f5c1a18 --- /dev/null +++ b/data/2024/aaai/Input Margins Can Predict Generalization Too @@ -0,0 +1,2 @@ +Understanding generalization in deep neural networks is an active area of research. A promising avenue of exploration has been that of margin measurements: the shortest distance to the decision boundary for a given sample or its representation internal to the network. While margins have been shown to be correlated with the generalization ability of a model when measured at its hidden representations (hidden margins), no such link between large margins and generalization has been established for input margins. We show that while input margins are not generally predictive of generalization, they can be if the search space is appropriately constrained. +We develop such a measure based on input margins, which we refer to as 'constrained margins'. 
The predictive power of this new measure is demonstrated on the 'Predicting Generalization in Deep Learning' (PGDL) dataset and contrasted with hidden representation margins. We find that constrained margins achieve highly competitive scores and outperform other margin measurements in general. This provides a novel insight on the relationship between generalization and classification margins, and highlights the importance of considering the data manifold for investigations of generalization in DNNs. \ No newline at end of file diff --git a/data/2024/aaai/Inspecting Prediction Confidence for Detecting Black-Box Backdoor Attacks b/data/2024/aaai/Inspecting Prediction Confidence for Detecting Black-Box Backdoor Attacks new file mode 100644 index 0000000000..6ef6cc18ae --- /dev/null +++ b/data/2024/aaai/Inspecting Prediction Confidence for Detecting Black-Box Backdoor Attacks @@ -0,0 +1,2 @@ +Backdoor attacks have been shown to be a serious security threat against deep learning models, and various defenses have been proposed to detect whether a model is backdoored or not. However, as indicated by a recent black-box attack, existing defenses can be easily bypassed by implanting the backdoor in the frequency domain. +To this end, we propose a new defense DTInspector against black-box backdoor attacks, based on a new observation related to the prediction confidence of learning models. That is, to achieve a high attack success rate with a small amount of poisoned data, backdoor attacks usually render a model exhibiting statistically higher prediction confidences on the poisoned samples. We provide both theoretical and empirical evidence for the generality of this observation. DTInspector then carefully examines the prediction confidences of data samples, and decides the existence of backdoor using the shortcut nature of backdoor triggers. Extensive evaluations on six backdoor attacks, four datasets, and three advanced attacking types demonstrate the effectiveness of the proposed defense. \ No newline at end of file diff --git a/data/2024/aaai/Instance-Aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning b/data/2024/aaai/Instance-Aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning new file mode 100644 index 0000000000..432eb7be00 --- /dev/null +++ b/data/2024/aaai/Instance-Aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning @@ -0,0 +1 @@ +Camera-based bird-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field. Under such a paradigm, accurate BEV representation construction relies on reliable depth estimation for multi-camera images. However, existing approaches exhaustively predict depths for every pixel without prioritizing objects, which are precisely the entities requiring detection in the 3D space. To this end, we propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector. First, a category-specific structural priors mining approach is proposed for enhancing the efficacy of monocular depth generation. Besides, a self-boosting learning strategy is further proposed to encourage the model to place more emphasis on challenging objects in computation-expensive temporal stereo matching. Together they provide advanced depth estimation results for high-quality BEV features construction, benefiting the ultimate 3D detection. 
The proposed method achieves state-of-the-art performance on the challenging nuScenes benchmark, and extensive experimental results demonstrate the effectiveness of our designs. \ No newline at end of file diff --git a/data/2024/aaai/Instance-Conditional Timescales of Decay for Non-Stationary Learning b/data/2024/aaai/Instance-Conditional Timescales of Decay for Non-Stationary Learning new file mode 100644 index 0000000000..9377a9f79d --- /dev/null +++ b/data/2024/aaai/Instance-Conditional Timescales of Decay for Non-Stationary Learning @@ -0,0 +1 @@ +Slow concept drift is a ubiquitous, yet under-studied problem in practical machine learning systems. In such settings, although recent data is more indicative of future data, naively prioritizing recent instances runs the risk of losing valuable information from the past. We propose an optimization-driven approach towards balancing instance importance over large training windows. First, we model instance relevance using a mixture of multiple timescales of decay, allowing us to capture rich temporal trends. Second, we learn an auxiliary scorer model that recovers the appropriate mixture of timescales as a function of the instance itself. Finally, we propose a nested optimization objective for learning the scorer, by which it maximizes forward transfer for the learned model. Experiments on a large real-world dataset of 39M photos over a 9-year period show up to 15% relative gains in accuracy compared to other robust learning baselines. We replicate our gains on two collections of real-world datasets for non-stationary learning, and extend our work to continual learning settings where, too, we beat SOTA methods by large margins. \ No newline at end of file diff --git a/data/2024/aaai/Instance-Wise Laplace Mechanism via Deep Reinforcement Learning (Student Abstract) b/data/2024/aaai/Instance-Wise Laplace Mechanism via Deep Reinforcement Learning (Student Abstract) new file mode 100644 index 0000000000..090379b2b7 --- /dev/null +++ b/data/2024/aaai/Instance-Wise Laplace Mechanism via Deep Reinforcement Learning (Student Abstract) @@ -0,0 +1,5 @@ +Recent research has shown a growing interest in per-instance differential privacy (pDP), highlighting the fact that each data instance within a dataset may incur distinct levels of privacy loss. +However, conventional additive noise mechanisms apply identical noise to all query outputs, thereby deteriorating data statistics. +In this study, we propose an instance-wise Laplace mechanism, which adds non-identical Laplace noises to the query output for each data instance. +A challenge arises from the complex interaction of additive noise, where the noise introduced to individual instances impacts the pDP of other instances, adding complexity and rendering straightforward solutions ineffective. +To tackle this problem, we introduce an instance-wise Laplace mechanism algorithm via deep reinforcement learning and validate its ability to better preserve data statistics on a real dataset, compared to the original Laplace mechanism.
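For context on the comparison above, the conventional Laplace mechanism used as the baseline adds identically distributed noise with scale sensitivity/epsilon to every query output; the sketch below shows that baseline together with a hypothetical per-instance variant in which the single scale is replaced by instance-specific scales. The per-instance scales here are arbitrary placeholders for illustration, not the paper's learned policy.

import numpy as np

def laplace_mechanism(query_outputs, sensitivity, epsilon, rng=np.random.default_rng(0)):
    scale = sensitivity / epsilon                            # identical noise scale for every output
    return query_outputs + rng.laplace(loc=0.0, scale=scale, size=query_outputs.shape)

counts = np.array([120.0, 35.0, 8.0])                        # toy query outputs
print(laplace_mechanism(counts, sensitivity=1.0, epsilon=0.5))

# Hypothetical instance-wise variant: one noise scale per output instead of a shared one.
per_instance_scales = np.array([1.0, 2.0, 4.0])
print(counts + np.random.default_rng(1).laplace(0.0, per_instance_scales, counts.shape))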
\ No newline at end of file diff --git a/data/2024/aaai/Integer Is Enough: When Vertical Federated Learning Meets Rounding b/data/2024/aaai/Integer Is Enough: When Vertical Federated Learning Meets Rounding new file mode 100644 index 0000000000..623435474d --- /dev/null +++ b/data/2024/aaai/Integer Is Enough: When Vertical Federated Learning Meets Rounding @@ -0,0 +1,7 @@ +Vertical Federated Learning (VFL) is a solution increasingly used by companies with the same user group but differing features, enabling them to collaboratively train a machine learning model. +VFL ensures that clients exchange intermediate results extracted by their local models, without sharing raw data. +However, in practice, VFL encounters several challenges, such as computational and communication overhead, privacy leakage risk, and adversarial attack. +Our study reveals that the usage of floating-point (FP) numbers is a common factor causing these issues, as they can be redundant and contain too much information. +To address this, we propose a new architecture called rounding layer, which converts intermediate results to integers. +Our theoretical analysis and empirical results demonstrate the benefits of the rounding layer in reducing computation and memory overhead, providing privacy protection, preserving model performance, and mitigating adversarial attacks. +We hope this paper inspires further research into novel architectures to address practical issues in VFL. \ No newline at end of file diff --git a/data/2024/aaai/Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision b/data/2024/aaai/Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision new file mode 100644 index 0000000000..511c266066 --- /dev/null +++ b/data/2024/aaai/Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision @@ -0,0 +1 @@ +Attribution algorithms are frequently employed to explain the decisions of neural network models. Integrated Gradients (IG) is an influential attribution method due to its strong axiomatic foundation. The algorithm is based on integrating the gradients along a path from a reference image to the input image. Unfortunately, it can be observed that gradients computed from regions where the output logit changes minimally along the path provide poor explanations for the model decision, which is called the saturation effect problem. In this paper, we propose an attribution algorithm called integrated decision gradients (IDG). The algorithm focuses on integrating gradients from the region of the path where the model makes its decision, i.e., the portion of the path where the output logit rapidly transitions from zero to its final value. This is practically realized by scaling each gradient by the derivative of the output logit with respect to the path. The algorithm thereby provides a principled solution to the saturation problem. Additionally, we minimize the errors within the Riemann sum approximation of the path integral by utilizing non-uniform subdivisions determined by adaptive sampling. In the evaluation on ImageNet, it is demonstrated that IDG outperforms IG, Left-IG, Guided IG, and adversarial gradient integration both qualitatively and quantitatively using standard insertion and deletion metrics across three common models. 
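A minimal sketch of the weighting idea described above: plain Integrated Gradients averages path gradients uniformly, whereas the decision-weighted variant re-scales each step by how fast the output logit changes along the path. The toy one-dimensional model and the finite-difference weighting below are illustrative assumptions, not the authors' implementation.

import numpy as np

def saturating_logit(x):                                     # toy model whose output saturates
    return np.tanh(4.0 * x)

def path_attribution(x, x0=0.0, steps=64, decision_weighted=False):
    alphas = np.linspace(0.0, 1.0, steps)
    points = x0 + alphas * (x - x0)                          # straight path from baseline to input
    grads = 4.0 * (1.0 - np.tanh(4.0 * points) ** 2)         # d(logit)/d(input) at each path point
    if decision_weighted:
        dlogit_dalpha = np.gradient(saturating_logit(points), alphas)
        weights = np.abs(dlogit_dalpha) / np.sum(np.abs(dlogit_dalpha))  # emphasize the decision region
    else:
        weights = np.full(steps, 1.0 / steps)                # plain IG: uniform Riemann weights
    return (x - x0) * np.sum(weights * grads)

print(path_attribution(2.0))                          # IG-style attribution, diluted by the saturated region
print(path_attribution(2.0, decision_weighted=True))  # weight concentrated where the logit actually changes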
\ No newline at end of file diff --git a/data/2024/aaai/Integrated Systems for Computational Scientific Discovery b/data/2024/aaai/Integrated Systems for Computational Scientific Discovery new file mode 100644 index 0000000000..0ef825aa03 --- /dev/null +++ b/data/2024/aaai/Integrated Systems for Computational Scientific Discovery @@ -0,0 +1,7 @@ +This paper poses the challenge of developing and evaluating integrated +systems for computational scientific discovery. We note some distinguishing +characteristics of discovery tasks, examine eight component abilities, +review previous successes at partial integration, and consider hurdles +the AI research community must leap to transform the vision for +integrated discovery into reality. In closing, we discuss promising +scientific domains in which to test such computational artifacts. \ No newline at end of file diff --git a/data/2024/aaai/Integrating Neural Pathways for Learning in Deep Reinforcement Learning Models b/data/2024/aaai/Integrating Neural Pathways for Learning in Deep Reinforcement Learning Models new file mode 100644 index 0000000000..3dd5897684 --- /dev/null +++ b/data/2024/aaai/Integrating Neural Pathways for Learning in Deep Reinforcement Learning Models @@ -0,0 +1 @@ +Considering that the human brain is the most powerful, generalizable, and energy-efficient computer we know of, it makes the most sense to look to neuroscience for ideas regarding deep learning model improvements. I propose one such idea, augmenting a traditional Advantage-Actor-Critic (A2C) model with additional learning signals akin to those in the brain. Pursuing this direction of research should hopefully result in a new reinforcement learning (RL) control paradigm that can learn from fewer examples, train with greater stability, and possibly consume less energy. \ No newline at end of file diff --git a/data/2024/aaai/Intelligent Calibration for Bias Reduction in Sentiment Corpora Annotation Process b/data/2024/aaai/Intelligent Calibration for Bias Reduction in Sentiment Corpora Annotation Process new file mode 100644 index 0000000000..82071d8623 --- /dev/null +++ b/data/2024/aaai/Intelligent Calibration for Bias Reduction in Sentiment Corpora Annotation Process @@ -0,0 +1 @@ +This paper focuses on the inherent anchoring bias present in sequential review-sentiment corpora annotation processes. It proposes employing a limited subset of meticulously chosen reviews at the outset of the process, as a means of calibration, effectively mitigating the phenomenon. Through extensive experimentation, we validate the phenomenon of sentiment bias in the annotation process and show that its magnitude can be influenced by pre-calibration. Furthermore, we show that the choice of the calibration set matters, hence the need for effective guidelines for choosing the reviews to be included in it. A comparison of annotators' performance with the proposed calibration to annotation processes that do not use calibration or use a randomly-picked calibration set reveals that indeed the calibration set picked is highly effective---it manages to substantially reduce the average absolute error compared to the other cases. Furthermore, the proposed selection guidelines are found to be highly robust in picking an effective calibration set also for domains different from the one based on which these rules were extracted.
\ No newline at end of file diff --git a/data/2024/aaai/Intentional Evolutionary Learning for Untrimmed Videos with Long Tail Distribution b/data/2024/aaai/Intentional Evolutionary Learning for Untrimmed Videos with Long Tail Distribution new file mode 100644 index 0000000000..a84c7e994a --- /dev/null +++ b/data/2024/aaai/Intentional Evolutionary Learning for Untrimmed Videos with Long Tail Distribution @@ -0,0 +1 @@ +Human intention understanding in untrimmed videos aims to watch a natural video and predict what the person’s intention is. Currently, exploration of predicting human intentions in untrimmed videos is far from enough. On the one hand, untrimmed videos with mixed actions and backgrounds have a significant long-tail distribution with concept drift characteristics. On the other hand, most methods can only perceive instantaneous intentions, but cannot determine the evolution of intentions. To solve the above challenges, we propose a loss based on Instance Confidence and Class Accuracy (ICCA), which aims to alleviate the prediction bias caused by the long-tail distribution with concept drift characteristics in video streams. In addition, we propose an intention-oriented evolutionary learning method to determine the intention evolution pattern (from what action to what action) and the time of evolution (when the action evolves). We conducted extensive experiments on two untrimmed video datasets (THUMOS14 and ActivityNET v1.3), and our method has achieved excellent results compared to SOTA methods. The code and supplementary materials are available at https://github.com/Jennifer123www/UntrimmedVideo. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Human-Centric Bias Mitigation b/data/2024/aaai/Interactive Human-Centric Bias Mitigation new file mode 100644 index 0000000000..97bf98a8bf --- /dev/null +++ b/data/2024/aaai/Interactive Human-Centric Bias Mitigation @@ -0,0 +1 @@ +Bias mitigation algorithms differ in their definition of bias and how they go about achieving that objective. Bias mitigation algorithms impact different cohorts differently and allowing end users and data scientists to understand the impact of these differences in order to make informed choices is a relatively unexplored domain. This demonstration presents an interactive bias mitigation pipeline that allows users to understand the cohorts impacted by their algorithm choice and provide feedback in order to provide a bias mitigated pipeline that most aligns with their goals. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Hyperparameter Optimization in Multi-Objective Problems via Preference Learning b/data/2024/aaai/Interactive Hyperparameter Optimization in Multi-Objective Problems via Preference Learning new file mode 100644 index 0000000000..2845168559 --- /dev/null +++ b/data/2024/aaai/Interactive Hyperparameter Optimization in Multi-Objective Problems via Preference Learning @@ -0,0 +1,9 @@ +Hyperparameter optimization (HPO) is important to leverage the full potential of machine learning (ML). +In practice, users are often interested in multi-objective (MO) problems, i.e., optimizing potentially conflicting objectives, like accuracy and energy consumption. +To tackle this, the vast majority of MO-ML algorithms return a Pareto front of non-dominated machine learning models to the user. +Optimizing the hyperparameters of such algorithms is non-trivial as evaluating a hyperparameter configuration entails evaluating the quality of the resulting Pareto front. 
+In the literature, there are known indicators that assess the quality of a Pareto front (e.g., hypervolume, R2) by quantifying different properties (e.g., volume, proximity to a reference point). However, choosing the indicator that leads to the desired Pareto front might be a hard task for a user. In this paper, we propose a human-centered interactive HPO approach tailored towards multi-objective ML, leveraging preference learning to extract desiderata from users that guide the optimization. +Instead of relying on the user guessing the most suitable indicator for their needs, our approach automatically learns an appropriate indicator. +Concretely, we leverage pairwise comparisons of distinct Pareto fronts to learn such an appropriate quality indicator. +Then, we optimize the hyperparameters of the underlying MO-ML algorithm towards this learned indicator using a state-of-the-art HPO approach. +In an experimental study targeting the environmental impact of ML, we demonstrate that our approach leads to substantially better Pareto fronts compared to optimizing based on a wrong indicator pre-selected by the user, and performs comparably when an advanced user knows which indicator to pick. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Mars Image Content-Based Search with Interpretable Machine Learning b/data/2024/aaai/Interactive Mars Image Content-Based Search with Interpretable Machine Learning new file mode 100644 index 0000000000..ec9d298f85 --- /dev/null +++ b/data/2024/aaai/Interactive Mars Image Content-Based Search with Interpretable Machine Learning @@ -0,0 +1 @@ +The NASA Planetary Data System (PDS) hosts millions of images of planets, moons, and other bodies collected throughout many missions. The ever-expanding nature of data and user engagement demands an interpretable content classification system to support scientific discovery and individual curiosity. In this paper, we leverage a prototype-based architecture to enable users to understand and validate the evidence used by a classifier trained on images from the Mars Science Laboratory (MSL) Curiosity rover mission. In addition to providing explanations, we investigate the diversity and correctness of evidence used by the content-based classifier. The work presented in this paper will be deployed on the PDS Image Atlas, replacing its non-interpretable counterpart. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Plan Selection Using Linear Temporal Logic, Disjunctive Action Landmarks, and Natural Language Instruction b/data/2024/aaai/Interactive Plan Selection Using Linear Temporal Logic, Disjunctive Action Landmarks, and Natural Language Instruction new file mode 100644 index 0000000000..3e03b77185 --- /dev/null +++ b/data/2024/aaai/Interactive Plan Selection Using Linear Temporal Logic, Disjunctive Action Landmarks, and Natural Language Instruction @@ -0,0 +1 @@ +We present Lemming – a visualization tool for the interactive selection of plans for a given problem, allowing the user to efficiently whittle down the set of plans and select their plan(s) of choice. We demonstrate four different user experiences for this process, three of them based on the principle of using disjunctive action landmarks as guidance to cut down the set of choice points for the user, and one on the use of linear temporal logic (LTL) to impart additional constraints into the plan set using natural language (NL) instruction.
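For concreteness, a natural-language instruction such as "eventually deliver the package and never enter the restricted area" could be turned into an LTL constraint of roughly the following form; the propositions are hypothetical and only illustrate the kind of formula used to prune the plan set:

\varphi \;=\; \Diamond\, \mathit{deliver\_package} \;\wedge\; \Box\, \neg\, \mathit{enter\_restricted\_area}

Only plans whose action sequences satisfy \varphi would remain selectable after such a constraint is imposed.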
\ No newline at end of file diff --git a/data/2024/aaai/Interactive Theorem Provers: Applications in AI, Opportunities, and Challenges b/data/2024/aaai/Interactive Theorem Provers: Applications in AI, Opportunities, and Challenges new file mode 100644 index 0000000000..462b41ba66 --- /dev/null +++ b/data/2024/aaai/Interactive Theorem Provers: Applications in AI, Opportunities, and Challenges @@ -0,0 +1,3 @@ +Interactive theorem provers (ITPs) are computer programs in which axioms and a conjecture are stated in a formal language, and a user provides the ITP with relatively high-level steps of a formal proof for the conjecture. Then, by invoking automated theorem provers, the ITP tries to generate low-level steps that fill the gaps between the steps provided by the user, thus forming a complete formal proof of the conjecture. The ITP also checks the entire formal proof against the axioms, thus confirming the soundness of all derivations in the formal proof. + +In this talk, I will discuss the existing opportunities and potential benefits to applying ITPs to reason about and verify AI concepts, algorithms, and software. I will also discuss the challenges we have to being able to apply ITPs in AI and reap those benefits. I will do so by discussing a number of my previous projects on the application of ITPs to different AI concepts, algorithms, and software systems. These projects span different areas of planning (classical planning, temporal planning, and planning under uncertainty) as well as algorithms with applications in algorithmic game theory, like general graph matching and online matching. \ No newline at end of file diff --git a/data/2024/aaai/Interactive Visual Task Learning for Robots b/data/2024/aaai/Interactive Visual Task Learning for Robots new file mode 100644 index 0000000000..c3971e0bd1 --- /dev/null +++ b/data/2024/aaai/Interactive Visual Task Learning for Robots @@ -0,0 +1,8 @@ +We present a demonstrable framework for robots to learn novel visual concepts and visual tasks via in-situ linguistic interactions with human users. Previous approaches in computer vision have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts along with their attributes and representations to a concept hierarchy. We extend the approaches that focus on learning visual concept hierarchies and take this ability one step further to demonstrate novel task solving on robots along with the learned visual concepts. +To enable a visual concept learner to solve robotics tasks one-shot, we developed two distinct techniques. +Firstly, we propose a novel approach, Hi-Viscont(HIerarchical VISual CONcept learner for Task), which augments information of a novel concept, that is being taught, to its parent nodes within a concept hierarchy. +This information propagation allows all concepts in a hierarchy to update as novel concepts are taught in a continual learning setting. +Secondly, we represent a visual task as a scene graph with language annotations, allowing us to create novel permutations of a demonstrated task zero-shot in-situ. +Combining the two techniques, we present a demonstration on a real robot that learns visual task and concepts in one-shot from in-situ interactions with human users, and generalize to perform a novel visual task of the same type in zero-shot. 
+As shown by the studies in the main conference paper, our system achieves a success rate of 50% on solving the whole task correctly with generalization where the baseline performs at 14% without any ability to generalize to novel tasks and concepts. +We will demonstrate our working interactive learning pipeline at AAAI 2024 in person with our robot and other required hardware. \ No newline at end of file diff --git a/data/2024/aaai/InterpretARA: Enhancing Hybrid Automatic Readability Assessment with Linguistic Feature Interpreter and Contrastive Learning b/data/2024/aaai/InterpretARA: Enhancing Hybrid Automatic Readability Assessment with Linguistic Feature Interpreter and Contrastive Learning new file mode 100644 index 0000000000..c6377592c8 --- /dev/null +++ b/data/2024/aaai/InterpretARA: Enhancing Hybrid Automatic Readability Assessment with Linguistic Feature Interpreter and Contrastive Learning @@ -0,0 +1 @@ +The hybrid automatic readability assessment (ARA) models that combine deep and linguistic features have recently received rising attention due to their impressive performance. However, the utilization of linguistic features is not fully realized, as ARA models frequently concentrate excessively on numerical values of these features, neglecting valuable structural information embedded within them. This leads to limited contribution of linguistic features in these hybrid ARA models, and in some cases, it may even result in counterproductive outcomes. In this paper, we propose a novel hybrid ARA model named InterpretARA through introducing a linguistic interpreter to better comprehend the structural information contained in linguistic features, and leveraging the contrastive learning that enables the model to understand relative difficulty relationships among texts and thus enhances deep representations. Both document-level and segment-level deep representations are extracted and used for the readability assessment. A series of experiments are conducted over four English corpora and one Chinese corpus to demonstrate the effectiveness of the proposed model. Experimental results show that InterpretARA outperforms state-of-the-art models in most corpora, and the introduced linguistic interpreter can provide more useful information than existing ways for ARA. \ No newline at end of file diff --git a/data/2024/aaai/Interpretability Benchmark for Evaluating Spatial Misalignment of Prototypical Parts Explanations b/data/2024/aaai/Interpretability Benchmark for Evaluating Spatial Misalignment of Prototypical Parts Explanations new file mode 100644 index 0000000000..07e8fe2b68 --- /dev/null +++ b/data/2024/aaai/Interpretability Benchmark for Evaluating Spatial Misalignment of Prototypical Parts Explanations @@ -0,0 +1 @@ +Prototypical parts-based networks are becoming increasingly popular due to their faithful self-explanations. However, their similarity maps are calculated in the penultimate network layer. Therefore, the receptive field of the prototype activation region often depends on parts of the image outside this region, which can lead to misleading interpretations. We name this undesired behavior a spatial explanation misalignment and introduce an interpretability benchmark with a set of dedicated metrics for quantifying this phenomenon. In addition, we propose a method for misalignment compensation and apply it to existing state-of-the-art models. 
We show the expressiveness of our benchmark and the effectiveness of the proposed compensation methodology through extensive empirical studies. \ No newline at end of file diff --git a/data/2024/aaai/Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models b/data/2024/aaai/Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models new file mode 100644 index 0000000000..00ba9d339e --- /dev/null +++ b/data/2024/aaai/Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models @@ -0,0 +1 @@ +Many individuals are likely to face a legal dispute at some point in their lives, but their lack of understanding of how to navigate these complex issues often renders them vulnerable. The advancement of natural language processing opens new avenues for bridging this legal literacy gap through the development of automated legal aid systems. However, existing legal question answering (LQA) approaches often suffer from a narrow scope, being either confined to specific legal domains or limited to brief, uninformative responses. In this work, we propose an end-to-end methodology designed to generate long-form answers to any statutory law questions, utilizing a "retrieve-then-read" pipeline. To support this approach, we introduce and release the Long-form Legal Question Answering (LLeQA) dataset, comprising 1,868 expert-annotated legal questions in the French language, complete with detailed answers rooted in pertinent legal provisions. Our experimental results demonstrate promising performance on automatic evaluation metrics, but a qualitative analysis uncovers areas for refinement. As one of the only comprehensive, expert-annotated long-form LQA datasets, LLeQA has the potential to not only accelerate research towards resolving a significant real-world issue, but also act as a rigorous benchmark for evaluating NLP models in specialized domains. We publicly release our code, data, and models. \ No newline at end of file diff --git a/data/2024/aaai/Interpretable3D: An Ad-Hoc Interpretable Classifier for 3D Point Clouds b/data/2024/aaai/Interpretable3D: An Ad-Hoc Interpretable Classifier for 3D Point Clouds new file mode 100644 index 0000000000..a5debb9d55 --- /dev/null +++ b/data/2024/aaai/Interpretable3D: An Ad-Hoc Interpretable Classifier for 3D Point Clouds @@ -0,0 +1 @@ +3D decision-critical tasks urgently require research on explanations to ensure system reliability and transparency. Extensive explanatory research has been conducted on 2D images, but such research is lacking in the 3D field. Furthermore, the existing explanations for 3D models are post-hoc and can be misleading, as they separate explanations from the original model. To address these issues, we propose an ad-hoc interpretable classifier for 3D point clouds (i.e., Interpretable3D). As an intuitive case-based classifier, Interpretable3D can provide reliable ad-hoc explanations without any embarrassing nuances. It allows users to understand how queries are embedded within past observations in prototype sets. Interpretable3D has two iterative training steps: 1) updating one prototype with the mean of the embeddings within the same sub-class in Prototype Estimation, and 2) penalizing or rewarding the estimated prototypes in Prototype Optimization. The mean of embeddings has a clear statistical meaning, i.e., class sub-centers. Moreover, we update prototypes with their most similar observations in the last few epochs. 
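The Prototype Estimation step just described (each prototype updated to the mean of the embeddings of its sub-class, i.e., a class sub-center) reduces to a simple center computation. The sketch below assumes embeddings and sub-class assignments are already available, and the cosine-similarity classification rule is an illustrative assumption rather than the authors' released code.

```python
import numpy as np

def estimate_prototypes(embeddings, subclass_ids, num_prototypes):
    """Prototype Estimation as described above: each prototype becomes the mean
    of the embeddings assigned to its sub-class (a class sub-center).
    embeddings: (N, D) array; subclass_ids: (N,) ints in [0, num_prototypes)."""
    d = embeddings.shape[1]
    prototypes = np.zeros((num_prototypes, d))
    for k in range(num_prototypes):
        members = embeddings[subclass_ids == k]
        if len(members) > 0:
            prototypes[k] = members.mean(axis=0)  # mean embedding of the sub-class
    return prototypes

def classify(query_embedding, prototypes, prototype_labels):
    """Assign the label of the most similar prototype (cosine similarity, assumed)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return prototype_labels[int(np.argmax(p @ q))]
```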
Finally, Interpretable3D classifies new samples according to prototypes. We evaluate the performance of Interpretable3D on four popular point cloud models: DGCNN, PointNet2, PointMLP, and PointNeXt. Our Interpretable3D demonstrates comparable or superior performance compared to softmax-based black-box models in the tasks of 3D shape classification and part segmentation. Our code is released at: github.com/FengZicai/Interpretable3D. \ No newline at end of file diff --git a/data/2024/aaai/Interpreting Temporal Knowledge Graph Reasoning (Student Abstract) b/data/2024/aaai/Interpreting Temporal Knowledge Graph Reasoning (Student Abstract) new file mode 100644 index 0000000000..b97fc7ba5e --- /dev/null +++ b/data/2024/aaai/Interpreting Temporal Knowledge Graph Reasoning (Student Abstract) @@ -0,0 +1 @@ +Temporal knowledge graph reasoning is an essential task that holds immense value in diverse real-world applications. Existing studies mainly focus on leveraging structural and sequential dependencies, excelling in tasks like entity and link prediction. However, they confront a notable interpretability gap in their predictions, a pivotal facet for comprehending model behavior. In this study, we propose an innovative method, LSGAT, which not only exhibits remarkable precision in entity predictions but also enhances interpretability by identifying pivotal historical events influencing event predictions. LSGAT enables concise explanations for prediction outcomes, offering valuable insights into the otherwise enigmatic "black box" reasoning process. Through an exploration of the implications of the most influential events, it facilitates a deeper understanding of the underlying mechanisms governing predictions. \ No newline at end of file diff --git a/data/2024/aaai/Intersection of Artificial Intelligence and Medical Education (Student Abstract) b/data/2024/aaai/Intersection of Artificial Intelligence and Medical Education (Student Abstract) new file mode 100644 index 0000000000..8a07611bdf --- /dev/null +++ b/data/2024/aaai/Intersection of Artificial Intelligence and Medical Education (Student Abstract) @@ -0,0 +1 @@ +Can advanced AI-driven technologies transform the traditionally arduous educational process in medicine? This study takes a deep dive into how the publicly available OpenAI ChatGPT-3.5 performs in answering board-style questions designed for physicians training to become pathologists. Correctly answering 75% of 543 questions using an engaging and fast-paced format was an impressive performance. It underscores the potential as well as improvement opportunities of using interactive AI in future medical training. \ No newline at end of file diff --git a/data/2024/aaai/Intra- and Inter-group Optimal Transport for User-Oriented Fairness in Recommender Systems b/data/2024/aaai/Intra- and Inter-group Optimal Transport for User-Oriented Fairness in Recommender Systems new file mode 100644 index 0000000000..86f688d6e3 --- /dev/null +++ b/data/2024/aaai/Intra- and Inter-group Optimal Transport for User-Oriented Fairness in Recommender Systems @@ -0,0 +1 @@ +Recommender systems are typically biased toward a small group of users, leading to severe unfairness in recommendation performance, i.e., User-Oriented Fairness (UOF) issue. Existing research on UOF exhibits notable limitations in two phases of recommendation models. In the training phase, current methods fail to tackle the root cause of the UOF issue, which lies in the unfair training process between advantaged and disadvantaged users. 
In the evaluation phase, the current UOF metric lacks the ability to comprehensively evaluate varying cases of unfairness. In this paper, we aim to address the aforementioned limitations and ensure recommendation models treat user groups of varying activity levels equally. In the training phase, we propose a novel Intra- and Inter-GrOup Optimal Transport framework (II-GOOT) to alleviate the data sparsity problem for disadvantaged users and narrow the training gap between advantaged and disadvantaged users. In the evaluation phase, we introduce a novel metric called ?-UOF, which enables the identification and assessment of various cases of UOF. This helps prevent recommendation models from leading to unfavorable fairness outcomes, where both advantaged and disadvantaged users experience subpar recommendation performance. We conduct extensive experiments on three real-world datasets based on four backbone recommendation models to prove the effectiveness of ?-UOF and the efficiency of our proposed II-GOOT. \ No newline at end of file diff --git a/data/2024/aaai/Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning b/data/2024/aaai/Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..6b88cf82d1 --- /dev/null +++ b/data/2024/aaai/Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Efficient collaboration in the centralized training with decentralized execution (CTDE) paradigm remains a challenge in cooperative multi-agent systems. We identify divergent action tendencies among agents as a significant obstacle to CTDE's training efficiency, requiring a large number of training samples to achieve a unified consensus on agents' policies. This divergence stems from the lack of adequate team consensus-related guidance signals during credit assignment in CTDE. To address this, we propose Intrinsic Action Tendency Consistency, a novel approach for cooperative multi-agent reinforcement learning. It integrates intrinsic rewards, obtained through an action model, into a reward-additive CTDE (RA-CTDE) framework. We formulate an action model that enables surrounding agents to predict the central agent's action tendency. Leveraging these predictions, we compute a cooperative intrinsic reward that encourages agents to align their actions with their neighbors' predictions. We establish the equivalence between RA-CTDE and CTDE through theoretical analyses, demonstrating that CTDE's training process can be achieved using N individual targets. Building on this insight, we introduce a novel method to combine intrinsic rewards and RA-CTDE. Extensive experiments on challenging tasks in SMAC, MPE, and GRF benchmarks showcase the improved performance of our method. \ No newline at end of file diff --git a/data/2024/aaai/Intrinsic Phase-Preserving Networks for Depth Super Resolution b/data/2024/aaai/Intrinsic Phase-Preserving Networks for Depth Super Resolution new file mode 100644 index 0000000000..6de3979a22 --- /dev/null +++ b/data/2024/aaai/Intrinsic Phase-Preserving Networks for Depth Super Resolution @@ -0,0 +1 @@ +Depth map super-resolution (DSR) plays an indispensable role in 3D vision. 
We discover a non-trivial spectral phenomenon: the components of high-resolution (HR) and low-resolution (LR) depth maps manifest the same intrinsic phase, and the spectral phase of RGB is a superset of them, which suggests that a phase-aware filter can assist in the precise use of RGB cues. Motivated by this, we propose an intrinsic phase-preserving DSR paradigm, named IPPNet, to fully exploit inter-modality collaboration in a mutually guided way. In a nutshell, a novel Phase-Preserving Filtering Module (PPFM) is developed to generate dynamic phase-aware filters according to the LR depth flow to filter out erroneous noisy components contained in RGB and then conduct depth enhancement via the modulation of the phase-preserved RGB signal. By stacking multiple PPFM blocks, the proposed IPPNet is capable of reaching a highly competitive restoration performance. Extensive experiments on various benchmark datasets, e.g., NYU v2 and RGB-D-D, show SOTA performance and also demonstrate the validity of the proposed phase-preserving scheme. Code: https://github.com/neuralchen/IPPNet/. \ No newline at end of file diff --git a/data/2024/aaai/Introduction to the Special Track on Artificial Intelligence and COVID-19 (Abstract Reprint) b/data/2024/aaai/Introduction to the Special Track on Artificial Intelligence and COVID-19 (Abstract Reprint) new file mode 100644 index 0000000000..662c5c154c --- /dev/null +++ b/data/2024/aaai/Introduction to the Special Track on Artificial Intelligence and COVID-19 (Abstract Reprint) @@ -0,0 +1 @@ +The human race is facing one of the most meaningful public health emergencies in the modern era caused by the COVID-19 pandemic. This pandemic introduced various challenges, from lock-downs with significant economic costs to fundamentally altering the way of life for many people around the world. The battle to understand and control the virus is still at its early stages, yet meaningful insights have already been made. The uncertainty of why some patients are infected and experience severe symptoms, while others are infected but asymptomatic, and others are not infected at all, makes managing this pandemic very challenging. Furthermore, the development of treatments and vaccines relies on knowledge generated from an ever evolving and expanding information space. Given the availability of digital data in the modern era, artificial intelligence (AI) is a meaningful tool for addressing the various challenges introduced by this unexpected pandemic. Some of the challenges include: outbreak prediction, risk modeling including infection and symptom development, testing strategy optimization, drug development, treatment repurposing, vaccine development, and others. \ No newline at end of file diff --git a/data/2024/aaai/Invariant Random Forest: Tree-Based Model Solution for OOD Generalization b/data/2024/aaai/Invariant Random Forest: Tree-Based Model Solution for OOD Generalization new file mode 100644 index 0000000000..5732a52fd2 --- /dev/null +++ b/data/2024/aaai/Invariant Random Forest: Tree-Based Model Solution for OOD Generalization @@ -0,0 +1 @@ +Out-Of-Distribution (OOD) generalization is an essential topic in machine learning. However, recent research has focused only on corresponding methods for neural networks. This paper introduces a novel and effective solution for OOD generalization of decision tree models, named Invariant Decision Tree (IDT). 
IDT enforces a penalty term with regard to the unstable/varying behavior of a split across different environments during the growth of the tree. Its ensemble version, the Invariant Random Forest (IRF), is constructed. Our proposed method is motivated by a theoretical result under mild conditions, and validated by numerical tests with both synthetic and real datasets. The superior performance compared to non-OOD tree models implies that considering OOD generalization for tree models is absolutely necessary and should be given more attention. \ No newline at end of file diff --git a/data/2024/aaai/Inverse Weight-Balancing for Deep Long-Tailed Learning b/data/2024/aaai/Inverse Weight-Balancing for Deep Long-Tailed Learning new file mode 100644 index 0000000000..afe98a2804 --- /dev/null +++ b/data/2024/aaai/Inverse Weight-Balancing for Deep Long-Tailed Learning @@ -0,0 +1 @@ +The performance of deep learning models often degrades rapidly when faced with imbalanced data characterized by a long-tailed distribution. Researchers have found that the fully connected layer trained by cross-entropy loss has large weight-norms for classes with many samples, but not for classes with few samples. How to address the data imbalance problem with both the encoder and the classifier remains an under-researched problem. In this paper, we propose an inverse weight-balancing (IWB) approach to guide model training and alleviate the data imbalance problem in two stages. In the first stage, an encoder and classifier (the fully connected layer) are trained using conventional cross-entropy loss. In the second stage, with a fixed encoder, the classifier is finetuned through an adaptive distribution for IWB in the decision space. Unlike existing inverse image frequency methods that implement a multiplicative margin adjustment transformation in the classification layer, our approach can be interpreted as an adaptive distribution alignment strategy using not only the class-wise number distribution but also the sample-wise difficulty distribution in both encoder and classifier. Experiments show that our method can greatly improve performance on imbalanced datasets such as CIFAR100-LT with different imbalance factors, ImageNet-LT, and iNaturalist 2018. \ No newline at end of file diff --git a/data/2024/aaai/Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following b/data/2024/aaai/Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following new file mode 100644 index 0000000000..02e07550c0 --- /dev/null +++ b/data/2024/aaai/Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following @@ -0,0 +1 @@ +In this paper, we present our finding that prepending a Task-Agnostic Prefix Prompt (TAPP) to the input improves the instruction-following ability of various Large Language Models (LLMs) during inference. TAPP is different from canonical prompts for LLMs in that it is a fixed prompt prepended to the beginning of every input regardless of the target task for zero-shot generalization. We observe that both base LLMs (i.e., not fine-tuned to follow instructions) and instruction-tuned models benefit from TAPP, resulting in 34.58% and 12.26% improvement on average, respectively. This implies that the instruction-following ability of LLMs can be improved during inference time with a fixed prompt constructed with simple heuristics. 
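Mechanically, the TAPP idea just described amounts to prepending one fixed, task-agnostic prefix to every input before inference. The prefix text and the generate call below are placeholders for illustration, not the prompt or API used in the paper.

```python
# Minimal sketch of a Task-Agnostic Prefix Prompt (TAPP): the same fixed prefix
# is prepended to every input, regardless of the target task. Prefix wording
# and the model interface are placeholder assumptions.
TAPP_PREFIX = (
    "Below is an instruction that describes a task. "
    "Read it carefully and respond by following the instruction exactly.\n\n"
)

def build_prompt(user_input: str) -> str:
    """Prepend the fixed prefix; no task-specific prompt engineering involved."""
    return TAPP_PREFIX + user_input

def answer(model, user_input: str) -> str:
    # `model.generate` stands in for whatever text-generation API is available.
    return model.generate(build_prompt(user_input))
```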
We hypothesize that TAPP helps language models better estimate the output distribution by focusing more on the instruction of the target task during inference. In other words, this ability does not seem to be sufficiently activated not only in base LLMs but also in many instruction-fine-tuned LLMs. \ No newline at end of file diff --git a/data/2024/aaai/Investigation into Training Dynamics of Learned Optimizers (Student Abstract) b/data/2024/aaai/Investigation into Training Dynamics of Learned Optimizers (Student Abstract) new file mode 100644 index 0000000000..d16a8e8a80 --- /dev/null +++ b/data/2024/aaai/Investigation into Training Dynamics of Learned Optimizers (Student Abstract) @@ -0,0 +1 @@ +Modern machine learning heavily relies on optimization, and as deep learning models grow more complex and data-hungry, the search for efficient learning becomes crucial. Learned optimizers disrupt traditional handcrafted methods such as SGD and Adam by learning the optimization strategy itself, potentially speeding up training. However, the learned optimizers' dynamics are still not well understood. To remedy this, our work explores their optimization trajectories from the perspective of network architecture symmetries and proposed parameter update distributions. \ No newline at end of file diff --git a/data/2024/aaai/Invisible Backdoor Attack against 3D Point Cloud Classifier in Graph Spectral Domain b/data/2024/aaai/Invisible Backdoor Attack against 3D Point Cloud Classifier in Graph Spectral Domain new file mode 100644 index 0000000000..026c0ace9d --- /dev/null +++ b/data/2024/aaai/Invisible Backdoor Attack against 3D Point Cloud Classifier in Graph Spectral Domain @@ -0,0 +1 @@ +3D point clouds have been widely used in security-critical domains, such as self-driving and 3D face recognition. Backdoor attacks are a serious threat that usually compromise Deep Neural Networks (DNNs) in the training stage. Though a few 3D backdoor attacks are designed to achieve guaranteed attack efficiency, the deformations they introduce can alert human inspectors. To obtain invisible backdoored point clouds, this paper proposes a novel 3D backdoor attack, named IBAPC, which generates the backdoor trigger in the graph spectral domain. Its effectiveness is grounded in a property of graph spectral signals: they induce both the global structure and local points to share responsibility for the resulting deformation in the spatial domain. In detail, a new backdoor implanting function is proposed that transforms the point cloud into a graph spectral signal in which the backdoor trigger is embedded. Then, we design a backdoor training procedure which alternately updates the parameters of the backdoor implanting function and the victim 3D DNN. Finally, the backdoored 3D DNN and its associated backdoor implanting function are obtained upon finishing the backdoor training procedure. Experimental results suggest that IBAPC achieves SOTA attack stealthiness in three respects: objective distance measurement, subjective human evaluation, and graph spectral signal residual. At the same time, it obtains competitive attack efficiency. The code is available at https://github.com/f-lk/IBAPC. \ No newline at end of file diff --git a/data/2024/aaai/Is a Large Language Model a Good Annotator for Event Extraction? b/data/2024/aaai/Is a Large Language Model a Good Annotator for Event Extraction? new file mode 100644 index 0000000000..f2b21e6549 --- /dev/null +++ b/data/2024/aaai/Is a Large Language Model a Good Annotator for Event Extraction? 
@@ -0,0 +1 @@ +Event extraction is an important task in natural language processing that focuses on mining event-related information from unstructured text. Despite considerable advancements, it is still challenging to achieve satisfactory performance in this task, and issues like data scarcity and imbalance obstruct progress. In this paper, we introduce an innovative approach where we employ Large Language Models (LLMs) as expert annotators for event extraction. We strategically include sample data from the training dataset in the prompt as a reference, ensuring alignment between the data distribution of LLM-generated samples and that of the benchmark dataset. This enables us to craft an augmented dataset that complements existing benchmarks, alleviating the challenges of data imbalance and scarcity and thereby enhancing the performance of fine-tuned models. We conducted extensive experiments to validate the efficacy of our proposed method, and we believe that this approach holds great potential for propelling the development and application of more advanced and reliable event extraction systems in real-world scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery b/data/2024/aaai/Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery new file mode 100644 index 0000000000..2c1acdbb04 --- /dev/null +++ b/data/2024/aaai/Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery @@ -0,0 +1,3 @@ +Sparse recovery is ubiquitous in machine learning and signal processing. Due to the NP-hard nature of sparse recovery, existing methods are known to suffer either from restrictive (or even unknown) applicability conditions, or high computational cost. Recently, iterative regularization methods have emerged as a promising fast approach because they can achieve sparse recovery in one pass through early stopping, rather than the tedious grid-search used in the traditional methods. +However, most of those iterative methods are based on the l1 norm which requires restrictive applicability conditions and could fail in many cases. Therefore, achieving sparse recovery with iterative regularization methods under a wider range of conditions has yet to be further explored. +To address this issue, we propose a novel iterative regularization algorithm, IRKSN, based on the k-support norm regularizer rather than the l1 norm. We provide conditions for sparse recovery with IRKSN, and compare them with traditional conditions for recovery with l1 norm regularizers. Additionally, we give an early stopping bound on the model error of IRKSN with explicit constants, achieving the standard linear rate for sparse recovery. Finally, we illustrate the applicability of our algorithm on several experiments, including a support recovery experiment with a correlated design matrix. \ No newline at end of file diff --git a/data/2024/aaai/Iterative Token Evaluation and Refinement for Real-World Super-resolution b/data/2024/aaai/Iterative Token Evaluation and Refinement for Real-World Super-resolution new file mode 100644 index 0000000000..65af8593cf --- /dev/null +++ b/data/2024/aaai/Iterative Token Evaluation and Refinement for Real-World Super-resolution @@ -0,0 +1 @@ +Real-world image super-resolution (RWSR) is a long-standing problem as low-quality (LQ) images often have complex and unidentified degradations. 
Existing methods such as Generative Adversarial Networks (GANs) or continuous diffusion models present their own issues including GANs being difficult to train while continuous diffusion models requiring numerous inference steps. In this paper, we propose an Iterative Token Evaluation and Refinement (ITER) framework for RWSR, which utilizes a discrete diffusion model operating in the discrete token representation space, i.e., indexes of features extracted from a VQGAN codebook pre-trained with high-quality (HQ) images. We show that ITER is easier to train than GANs and more efficient than continuous diffusion models. Specifically, we divide RWSR into two sub-tasks, i.e., distortion removal and texture generation. Distortion removal involves simple HQ token prediction with LQ images, while texture generation uses a discrete diffusion model to iteratively refine the distortion removal output with a token refinement network. In particular, we propose to include a token evaluation network in the discrete diffusion process. It learns to evaluate which tokens are good restorations and helps to improve the iterative refinement results. Moreover, the evaluation network can first check status of the distortion removal output and then adaptively select total refinement steps needed, thereby maintaining a good balance between distortion removal and texture generation. Extensive experimental results show that ITER is easy to train and performs well within just 8 iterative steps. \ No newline at end of file diff --git a/data/2024/aaai/JoLT: Jointly Learned Representations of Language and Time-Series for Clinical Time-Series Interpretation (Student Abstract) b/data/2024/aaai/JoLT: Jointly Learned Representations of Language and Time-Series for Clinical Time-Series Interpretation (Student Abstract) new file mode 100644 index 0000000000..78ca869007 --- /dev/null +++ b/data/2024/aaai/JoLT: Jointly Learned Representations of Language and Time-Series for Clinical Time-Series Interpretation (Student Abstract) @@ -0,0 +1 @@ +Time-series and text data are prevalent in healthcare and frequently co-exist, yet they are typically modeled in isolation. Even studies that jointly model time-series and text, do so by converting time-series to images or graphs. We hypothesize that explicitly modeling time-series jointly with text can improve tasks such as summarization and question answering for time-series data, which have received little attention so far. To address this gap, we introduce JoLT to jointly learn desired representations from pre-trained time-series and text models. JoLT utilizes a Querying Transformer (Q-Former) to align the time-series and text representations. Our experiments on a large real-world electrocardiography dataset for medical time-series summarization show that JoLT outperforms state-of-the-art image captioning approaches. \ No newline at end of file diff --git a/data/2024/aaai/Joint Demosaicing and Denoising for Spike Camera b/data/2024/aaai/Joint Demosaicing and Denoising for Spike Camera new file mode 100644 index 0000000000..8394f63e41 --- /dev/null +++ b/data/2024/aaai/Joint Demosaicing and Denoising for Spike Camera @@ -0,0 +1 @@ +As a neuromorphic camera with high temporal resolution, spike camera can capture dynamic scenes with high-speed motion. Recently, spike camera with a color filter array (CFA) has been developed for color imaging. There are some methods for spike camera demosaicing to reconstruct color images from Bayer-pattern spike streams. 
However, the demosaicing results are degraded by severe noise in spike streams, to which previous works have paid little attention. In this paper, we propose an iterative joint demosaicing and denoising network (SJDD-Net) for spike cameras based on the observation model. Firstly, we design a color spike representation (CSR) to learn latent representation from Bayer-pattern spike streams. In CSR, we propose an offset-sharing deformable convolution module to align temporal features of color channels. Then we develop a spike noise estimator (SNE) to obtain features of the noise distribution. Finally, a color correlation prior (CCP) module is proposed to utilize the color correlation for better details. For training and evaluation, we designed a spike camera simulator to generate Bayer-pattern spike streams with synthesized noise. Besides, we captured real Bayer-pattern spike streams, building, to our knowledge, the first real-world captured dataset. Experimental results show that our method can restore clean images from Bayer-pattern spike streams. The source codes and dataset are available at https://github.com/csycdong/SJDD-Net. \ No newline at end of file diff --git a/data/2024/aaai/Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification b/data/2024/aaai/Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification new file mode 100644 index 0000000000..c939b012e5 --- /dev/null +++ b/data/2024/aaai/Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification @@ -0,0 +1 @@ +Determining the types of neurons within a nervous system plays a significant role in the analysis of brain connectomics and the investigation of neurological diseases. However, utilizing anatomical, physiological, or molecular characteristics of neurons is relatively inefficient and costly. With the advancements in electron microscopy imaging and analysis techniques for brain tissue, we are able to obtain whole-brain connectomes consisting of high-resolution neuronal morphology and connectivity information. However, few models are built based on such data for automated neuron classification. In this paper, we propose NeuNet, a framework that combines morphological information of neurons obtained from the skeleton and topological information between neurons obtained from the neural circuit. Specifically, NeuNet consists of three components, namely the Skeleton Encoder, the Connectome Encoder, and the Readout Layer. The Skeleton Encoder integrates the local information of neurons in a bottom-up manner, applying a one-dimensional convolution to the neural skeleton's point data; the Connectome Encoder uses a graph neural network to capture the topological information of the neural circuit; finally, the Readout Layer fuses the two types of information and outputs classification results. We reprocess and release two new datasets for the neuron classification task from volume electron microscopy (VEM) images of the human brain cortex and the Drosophila brain. Experiments on these two datasets demonstrate the effectiveness of our model with accuracies of 0.9169 and 0.9363, respectively. Code and data are available at: https://github.com/WHUminghui/NeuNet. 
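A rough PyTorch sketch of the three components just described (a 1D-convolutional Skeleton Encoder, a graph-based Connectome Encoder, and a Readout Layer that fuses the two). Layer widths, the adjacency-matrix message passing, and the class count are assumptions for illustration, not the released NeuNet code.

```python
import torch
import torch.nn as nn

class SkeletonEncoder(nn.Module):
    """Bottom-up encoding of a neuron's skeleton points with 1D convolutions."""
    def __init__(self, in_dim=3, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.ReLU())

    def forward(self, points):                    # points: (num_neurons, num_points, 3)
        h = self.conv(points.transpose(1, 2))     # (num_neurons, hidden, num_points)
        return h.max(dim=2).values                # permutation-invariant max pool

class ConnectomeEncoder(nn.Module):
    """One round of adjacency-based message passing over the neural circuit."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lin = nn.Linear(hidden, hidden)

    def forward(self, node_feats, adj):           # node_feats: (num_neurons, hidden)
        agg = adj @ node_feats                    # aggregate neighbour features
        return torch.relu(self.lin(agg) + node_feats)

class ReadoutLayer(nn.Module):
    """Fuse skeleton and circuit features and predict a neuron type."""
    def __init__(self, hidden=64, num_classes=10):
        super().__init__()
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, skel_feats, topo_feats):
        return self.head(torch.cat([skel_feats, topo_feats], dim=-1))
```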
\ No newline at end of file diff --git a/data/2024/aaai/Jointly Improving the Sample and Communication Complexities in Decentralized Stochastic Minimax Optimization b/data/2024/aaai/Jointly Improving the Sample and Communication Complexities in Decentralized Stochastic Minimax Optimization new file mode 100644 index 0000000000..54626d7fb0 --- /dev/null +++ b/data/2024/aaai/Jointly Improving the Sample and Communication Complexities in Decentralized Stochastic Minimax Optimization @@ -0,0 +1 @@ +We propose a novel single-loop decentralized algorithm, DGDA-VR, for solving the stochastic nonconvex strongly-concave minimax problems over a connected network of agents, which are equipped with stochastic first-order oracles to estimate their local gradients. DGDA-VR, incorporating variance reduction, achieves O(ε^−3) oracle complexity and O(ε^−2) communication complexity without resorting to multi-communication rounds – both are optimal, i.e., matching the lower bounds for this class of problems. Since DGDA-VR does not require multiple communication rounds, it is applicable to a broader range of decentralized computational environments. To the best of our knowledge, this is the first distributed method using a single communication round in each iteration to jointly optimize the oracle and communication complexities for the problem considered here. \ No newline at end of file diff --git a/data/2024/aaai/Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action Classification b/data/2024/aaai/Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action Classification new file mode 100644 index 0000000000..0eae1aac04 --- /dev/null +++ b/data/2024/aaai/Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action Classification @@ -0,0 +1 @@ +Tactile signals collected by wearable electronics are essential in modeling and understanding human behavior. One of the main applications of tactile signals is action classification, especially in healthcare and robotics. However, existing tactile classification methods fail to capture the spatial and temporal features of tactile signals simultaneously, which results in sub-optimal performances. In this paper, we design Spatio-Temporal Aware tactility Transformer (STAT) to utilize continuous tactile signals for action classification. We propose spatial and temporal embeddings along with a new temporal pretraining task in our model, which aims to enhance the transformer in modeling the spatio-temporal features of tactile signals. Specifically, the designed temporal pretraining task is to differentiate the time order of tubelet inputs to model the temporal properties explicitly. Experimental results on a public action classification dataset demonstrate that our model outperforms state-of-the-art methods in all metrics. \ No newline at end of file diff --git a/data/2024/aaai/Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons b/data/2024/aaai/Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons new file mode 100644 index 0000000000..cd9ca2d5fa --- /dev/null +++ b/data/2024/aaai/Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons @@ -0,0 +1 @@ +Pre-trained language models (PLMs) contain vast amounts of factual knowledge, but how the knowledge is stored in the parameters remains unclear. 
This paper delves into the complex task of understanding how factual knowledge is stored in multilingual PLMs, and introduces the Architecture-adapted Multilingual Integrated Gradients method, which successfully localizes knowledge neurons more precisely compared to current methods, and is more universal across various architectures and languages. Moreover, we conduct an in-depth exploration of knowledge neurons, leading to the following two important discoveries: (1) The discovery of Language-Independent Knowledge Neurons, which store factual knowledge in a form that transcends language. We design cross-lingual knowledge editing experiments, demonstrating that the PLMs can accomplish this task based on language-independent neurons; (2) The discovery of Degenerate Knowledge Neurons, a novel type of neuron showing that different knowledge neurons can store the same fact. Its property of functional overlap endows the PLMs with a robust mastery of factual knowledge. We design fact-checking experiments, proving that the degenerate knowledge neurons can help the PLMs to detect wrong facts. Experiments corroborate these findings, shedding light on the mechanisms of factual knowledge storage in multilingual PLMs, and contribute valuable insights to the field. The code is available at https://github.com/heng840/AMIG. \ No newline at end of file diff --git a/data/2024/aaai/KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning b/data/2024/aaai/KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning new file mode 100644 index 0000000000..8dfc05d160 --- /dev/null +++ b/data/2024/aaai/KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning @@ -0,0 +1 @@ +Large Language Models (LLMs) have demonstrated impressive performance in natural language processing tasks by leveraging chain of thought (CoT) that enables step-by-step thinking. Extending LLMs with multimodal capabilities has recently attracted interest, but it incurs computational costs and requires substantial hardware resources. To address these challenges, we propose KAM-CoT, a framework that integrates CoT reasoning, Knowledge Graphs (KGs), and multiple modalities for a comprehensive understanding of multimodal tasks. KAM-CoT adopts a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains a deeper contextual understanding, reducing hallucinations and enhancing the quality of answers. This knowledge-augmented CoT reasoning empowers the model to handle questions requiring external context, providing more informed answers. Experimental findings show KAM-CoT outperforms the state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness. 
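The KG grounding idea just described can be pictured, very loosely, as retrieving triples related to a question and injecting them into the reasoning context before the chain of thought is generated. The retrieval rule (entity string match) and the prompt layout below are simplifying assumptions, not KAM-CoT's two-stage training pipeline or its multimodal encoder.

```python
# Toy illustration of grounding a question in knowledge-graph triples before
# chain-of-thought reasoning. Retrieval rule and prompt layout are assumptions.
def retrieve_triples(question, kg_triples, limit=5):
    """kg_triples: iterable of (head, relation, tail) strings."""
    q = question.lower()
    hits = [t for t in kg_triples if t[0].lower() in q or t[2].lower() in q]
    return hits[:limit]

def build_grounded_prompt(question, kg_triples):
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in retrieve_triples(question, kg_triples))
    return (f"Facts from the knowledge graph:\n{facts}\n\n"
            f"Question: {question}\n"
            "Let's reason step by step, then give the final answer.")

kg = [("water", "boils at", "100 degrees Celsius"),
      ("copper", "conducts", "electricity")]
print(build_grounded_prompt("Why does copper wire carry current?", kg))
```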
\ No newline at end of file diff --git a/data/2024/aaai/KAMEL: Knowledge Aware Medical Entity Linkage to Automate Health Insurance Claims Processing b/data/2024/aaai/KAMEL: Knowledge Aware Medical Entity Linkage to Automate Health Insurance Claims Processing new file mode 100644 index 0000000000..d8942d7885 --- /dev/null +++ b/data/2024/aaai/KAMEL: Knowledge Aware Medical Entity Linkage to Automate Health Insurance Claims Processing @@ -0,0 +1 @@ +Automating the processing of health insurance claims to achieve "Straight-Through Processing" is one of the holy grails that all insurance companies pursue. One of the major impediments to this automation is the difficulty in establishing the relationship between the underwriting exclusions that a policy has and the incoming claim's diagnosis information. Typically, policy underwriting exclusions are captured in free-text such as "Respiratory illnesses are excluded due to a pre-existing asthma condition". A medical claim coming from a hospital would have the diagnosis represented using the International Classification of Disease (ICD) codes from the World Health Organization. The complex and labour-intensive task of establishing the relationship between free-text underwriting exclusions in health insurance policies and medical diagnosis codes from health insurance claims is critical for determining whether a claim should be rejected due to underwriting exclusions. In this work, we present a novel framework that leverages both explicit and implicit domain knowledge present in medical ontologies and pre-trained language models respectively, to effectively establish the relationship between free-text describing medical conditions present in underwriting exclusions and the ICD-10CM diagnosis codes in health insurance claims. Termed KAMEL (Knowledge Aware Medical Entity Linkage), our proposed framework addresses the limitations faced by prior approaches when evaluated on real-world health insurance claims data. Our proposed framework has been deployed at several multi-national health insurance providers to automate their health insurance claims processing. \ No newline at end of file diff --git a/data/2024/aaai/KG-TREAT: Pre-training for Treatment Effect Estimation by Synergizing Patient Data with Knowledge Graphs b/data/2024/aaai/KG-TREAT: Pre-training for Treatment Effect Estimation by Synergizing Patient Data with Knowledge Graphs new file mode 100644 index 0000000000..371985bdfd --- /dev/null +++ b/data/2024/aaai/KG-TREAT: Pre-training for Treatment Effect Estimation by Synergizing Patient Data with Knowledge Graphs @@ -0,0 +1 @@ +Treatment effect estimation (TEE) is the task of determining the impact of various treatments on patient outcomes. Current TEE methods fall short due to reliance on limited labeled data and challenges posed by sparse and high-dimensional observational patient data. To address the challenges, we introduce a novel pre-training and fine-tuning framework, KG-TREAT, which synergizes large-scale observational patient data with biomedical knowledge graphs (KGs) to enhance TEE. Unlike previous approaches, KG-TREAT constructs dual-focus KGs and integrates a deep bi-level attention synergy method for in-depth information fusion, enabling distinct encoding of treatment-covariate and outcome-covariate relationships. KG-TREAT also incorporates two pre-training tasks to ensure a thorough grounding and contextualization of patient data and KGs. 
Evaluation on four downstream TEE tasks shows KG-TREAT's superiority over existing methods, with an average improvement of 7% in Area under the ROC Curve (AUC) and 9% in Influence Function-based Precision of Estimating Heterogeneous Effects (IF-PEHE). The effectiveness of our estimated treatment effects is further affirmed by alignment with established randomized clinical trial findings. \ No newline at end of file diff --git a/data/2024/aaai/KGDM: A Diffusion Model to Capture Multiple Relation Semantics for Knowledge Graph Embedding b/data/2024/aaai/KGDM: A Diffusion Model to Capture Multiple Relation Semantics for Knowledge Graph Embedding new file mode 100644 index 0000000000..b24ca9d4bb --- /dev/null +++ b/data/2024/aaai/KGDM: A Diffusion Model to Capture Multiple Relation Semantics for Knowledge Graph Embedding @@ -0,0 +1 @@ +Knowledge graph embedding (KGE) is an efficient and scalable method for knowledge graph completion. However, most existing KGE methods suffer from the challenge of multiple relation semantics, which often degrades their performance. This is because most KGE methods learn fixed continuous vectors for entities (relations) and make deterministic entity predictions to complete the knowledge graph, which hardly captures multiple relation semantics. To tackle this issue, previous works try to learn complex probabilistic embeddings instead of fixed embeddings but suffer from heavy computational complexity. In contrast, this paper proposes a simple yet efficient framework namely the Knowledge Graph Diffusion Model (KGDM) to capture the multiple relation semantics in prediction. Its key idea is to cast the problem of entity prediction into conditional entity generation. Specifically, KGDM estimates the probabilistic distribution of target entities in prediction through Denoising Diffusion Probabilistic Models (DDPM). To bridge the gap between continuous diffusion models and discrete KGs, two learnable embedding functions are defined to map entities and relation to continuous vectors. To consider connectivity patterns of KGs, a Conditional Entity Denoiser model is introduced to generate target entities conditioned on given entities and relations. Extensive experiments demonstrate that KGDM significantly outperforms existing state-of-the-art methods in three benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/KGTS: Contrastive Trajectory Similarity Learning over Prompt Knowledge Graph Embedding b/data/2024/aaai/KGTS: Contrastive Trajectory Similarity Learning over Prompt Knowledge Graph Embedding new file mode 100644 index 0000000000..e5b9607deb --- /dev/null +++ b/data/2024/aaai/KGTS: Contrastive Trajectory Similarity Learning over Prompt Knowledge Graph Embedding @@ -0,0 +1 @@ +Trajectory similarity computation serves as a fundamental functionality of various spatial information applications. Although existing deep learning similarity computation methods offer better efficiency and accuracy than non-learning solutions, they are still immature in trajectory embedding and suffer from poor generality and heavy preprocessing for training. Targeting these limitations, we propose a novel framework named KGTS based on knowledge graph grid embedding, prompt trajectory embedding, and unsupervised contrastive learning for improved trajectory similarity computation. Specifically, we first embed map grids with a GRot embedding method to vigorously grasp the neighbouring relations of grids. 
Then, a prompt trajectory embedding network incorporates the resulting grid embedding and extracts trajectory structure and point order information. It is trained by unsupervised contrastive learning, which not only alleviates the heavy preprocessing burden but also provides exceptional generality with creatively designed strategies for positive sample generation. The prompt trajectory embedding adopts a customized prompt paradigm to mitigate the gap between the grid embedding and the trajectory embedding. Extensive experiments on two real-world trajectory datasets demonstrate the superior performance of KGTS over state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking b/data/2024/aaai/KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking new file mode 100644 index 0000000000..a1ca1e3c2b --- /dev/null +++ b/data/2024/aaai/KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking @@ -0,0 +1 @@ +Our life is populated with articulated objects. Current category-level articulation estimation works largely focus on predicting part-level 6D poses on static point cloud observations. In this paper, we tackle the problem of category-level online robust and real-time 6D pose tracking of articulated objects, where we propose KPA-Tracker, a novel 3D KeyPoint based Articulated object pose Tracker. Given an RGB-D image or a partial point cloud at the current frame as well as the estimated per-part 6D poses from the last frame, our KPA-Tracker can effectively update the poses with learned 3D keypoints between the adjacent frames. Specifically, we first canonicalize the input point cloud and formulate the pose tracking as an inter-frame pose increment estimation task. To learn consistent and separate 3D keypoints for every rigid part, we build KPA-Gen that outputs the high-quality ordered 3D keypoints in an unsupervised manner. During pose tracking on the whole video, we further propose a keypoint-based articulation tracking algorithm that mines keyframes as reference for accurate pose updating. We provide extensive experiments on validating our KPA-Tracker on various datasets ranging from synthetic point cloud observation to real-world scenarios, which demonstrates the superior performance and robustness of the KPA-Tracker. We believe that our work has the potential to be applied in many fields including robotics, embodied intelligence and augmented reality. All the datasets and codes are available at https://github.com/hhhhhar/KPA-Tracker. \ No newline at end of file diff --git a/data/2024/aaai/KeDuSR: Real-World Dual-Lens Super-Resolution via Kernel-Free Matching b/data/2024/aaai/KeDuSR: Real-World Dual-Lens Super-Resolution via Kernel-Free Matching new file mode 100644 index 0000000000..a77d9f47c4 --- /dev/null +++ b/data/2024/aaai/KeDuSR: Real-World Dual-Lens Super-Resolution via Kernel-Free Matching @@ -0,0 +1 @@ +Dual-lens super-resolution (SR) is a practical scenario for reference (Ref) based SR by utilizing the telephoto image (Ref) to assist the super-resolution of the low-resolution wide-angle image (LR input). Different from general RefSR, the Ref in dual-lens SR only covers the overlapped field of view (FoV) area. However, current dual-lens SR methods rarely utilize these specific characteristics and directly perform dense matching between the LR input and Ref. 
Due to the resolution gap between LR and Ref, the matching may miss the best-matched candidate and destroy the consistent structures in the overlapped FoV area. Different from them, we propose to first align the Ref with the center region (namely the overlapped FoV area) of the LR input by combining global warping and local warping to make the aligned Ref be sharp and consistent. Then, we formulate the aligned Ref and LR center as value-key pairs, and the corner region of the LR is formulated as queries. In this way, we propose a kernel-free matching strategy by matching between the LR-corner (query) and LR-center (key) regions, and the corresponding aligned Ref (value) can be warped to the corner region of the target. Our kernel-free matching strategy avoids the resolution gap between LR and Ref, which makes our network have better generalization ability. In addition, we construct a DuSR-Real dataset with (LR, Ref, HR) triples, where the LR and HR are well aligned. Experiments on three datasets demonstrate that our method outperforms the second-best method by a large margin. Our code and dataset are available at https://github.com/ZifanCui/KeDuSR. \ No newline at end of file diff --git a/data/2024/aaai/Keep the Faith: Faithful Explanations in Convolutional Neural Networks for Case-Based Reasoning b/data/2024/aaai/Keep the Faith: Faithful Explanations in Convolutional Neural Networks for Case-Based Reasoning new file mode 100644 index 0000000000..496caa005f --- /dev/null +++ b/data/2024/aaai/Keep the Faith: Faithful Explanations in Convolutional Neural Networks for Case-Based Reasoning @@ -0,0 +1 @@ +Explaining predictions of black-box neural networks is crucial when applied to decision-critical tasks. Thus, attribution maps are commonly used to identify important image regions, despite prior work showing that humans prefer explanations based on similar examples. To this end, ProtoPNet learns a set of class-representative feature vectors (prototypes) for case-based reasoning. During inference, similarities of latent features to prototypes are linearly classified to form predictions and attribution maps are provided to explain the similarity. In this work, we evaluate whether architectures for case-based reasoning fulfill established axioms required for faithful explanations using the example of ProtoPNet. We show that such architectures allow the extraction of faithful explanations. However, we prove that the attribution maps used to explain the similarities violate the axioms. We propose a new procedure to extract explanations for trained ProtoPNets, named ProtoPFaith. Conceptually, these explanations are Shapley values, calculated on the similarity scores of each prototype. They allow to faithfully answer which prototypes are present in an unseen image and quantify each pixel’s contribution to that presence, thereby complying with all axioms. The theoretical violations of ProtoPNet manifest in our experiments on three datasets (CUB-200-2011, Stanford Dogs, RSNA) and five architectures (ConvNet, ResNet, ResNet50, WideResNet50, ResNeXt50). Our experiments show a qualitative difference between the explanations given by ProtoPNet and ProtoPFaith. Additionally, we quantify the explanations with the Area Over the Perturbation Curve, on which ProtoPFaith outperforms ProtoPNet on all experiments by a factor >10^3. 
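Conceptually, the ProtoPFaith explanations just described are Shapley values computed on the similarity scores of each prototype. The sketch below is a generic exact Shapley computation by coalition enumeration (feasible only for a handful of prototypes), with the value function and the toy weighted-similarity logit supplied as assumptions; it is not the authors' implementation.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating coalitions.
    players: list of prototype indices; value_fn(subset) -> model score when
    only the prototypes in `subset` contribute (caller-defined)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):                       # coalition sizes 0 .. n-1
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                gain = value_fn(set(subset) | {p}) - value_fn(set(subset))
                phi[p] += weight * gain
    return phi

# Toy value function: the class logit is a weighted sum of prototype similarities.
sims = {0: 0.9, 1: 0.2, 2: 0.5}
weights = {0: 1.0, 1: 0.5, 2: 0.8}
logit = lambda s: sum(weights[p] * sims[p] for p in s)
print(shapley_values([0, 1, 2], logit))   # each prototype's contribution to the logit
```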
\ No newline at end of file diff --git a/data/2024/aaai/Kepler Light Curve Classification Using Deep Learning and Markov Transition Field (Student Abstract) b/data/2024/aaai/Kepler Light Curve Classification Using Deep Learning and Markov Transition Field (Student Abstract) new file mode 100644 index 0000000000..305536a555 --- /dev/null +++ b/data/2024/aaai/Kepler Light Curve Classification Using Deep Learning and Markov Transition Field (Student Abstract) @@ -0,0 +1,13 @@ +An exoplanet is a planet, which is not a part of our solar system. +Whether life exists in one or more of these exoplanets +has fascinated humans for centuries. NASA’s Kepler Space +Telescope has discovered more than 70% of known exoplanets +in our universe. However, manually determining whether a +Kepler light curve indicates an exoplanet or not becomes infeasible +with the large volume of data. Due to this, we propose +a deep learning-based strategy to automatically classify +a Kepler light curve. More specifically, we first convert the +light curve time series into its corresponding Markov Transition +Field (MTF) image and then classify it. Results show +that the accuracy of the proposed technique is 99.39%, which +is higher than all current state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Kernelized Normalizing Constant Estimation: Bridging Bayesian Quadrature and Bayesian Optimization b/data/2024/aaai/Kernelized Normalizing Constant Estimation: Bridging Bayesian Quadrature and Bayesian Optimization new file mode 100644 index 0000000000..6cc8dd6f05 --- /dev/null +++ b/data/2024/aaai/Kernelized Normalizing Constant Estimation: Bridging Bayesian Quadrature and Bayesian Optimization @@ -0,0 +1 @@ +In this paper, we study the problem of estimating the normalizing constant through queries to the black-box function f, which is the integration of the exponential function of f scaled by a problem parameter lambda. We assume f belongs to a reproducing kernel Hilbert space (RKHS), and show that to estimate the normalizing constant within a small relative error, the level of difficulty depends on the value of lambda: When lambda approaches zero, the problem is similar to Bayesian quadrature (BQ), while when lambda approaches infinity, the problem is similar to Bayesian optimization (BO). More generally, the problem varies between BQ and BO. We find that this pattern holds true even when the function evaluations are noisy, bringing new aspects to this topic. Our findings are supported by both algorithm-independent lower bounds and algorithmic upper bounds, as well as simulation studies conducted on a variety of benchmark functions. \ No newline at end of file diff --git a/data/2024/aaai/Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation b/data/2024/aaai/Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation new file mode 100644 index 0000000000..fdddbca153 --- /dev/null +++ b/data/2024/aaai/Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation @@ -0,0 +1 @@ +Previous 3D hand pose estimation methods primarily rely on a single modality, either RGB or depth, and the comprehensive utilization of the dual modalities has not been extensively explored. RGB and depth data provide complementary information and thus can be fused to enhance the robustness of 3D hand pose estimation. However, there exist two problems for applying existing fusion methods in 3D hand pose estimation: redundancy of dense feature fusion and ambiguity of visual features. 
First, pixel-wise feature interactions introduce high computational costs and wasted computation on invalid pixels. Second, visual features suffer from ambiguity due to color and texture similarities, as well as depth holes and noise caused by frequent hand movements, which interferes with modeling cross-modal correlations. In this paper, we propose Keypoint-Fusion for RGB-D based 3D hand pose estimation, which leverages the unique advantages of the dual modalities to mutually eliminate feature ambiguity, and performs cross-modal feature fusion in a more efficient way. Specifically, we focus cross-modal fusion on sparse yet informative spatial regions (i.e. keypoints). Meanwhile, by explicitly extracting relatively more reliable information as disambiguation evidence, the depth modality provides 3D geometric information for RGB feature pixels, and the RGB modality complements the precise edge information lost due to depth noise. Keypoint-Fusion achieves state-of-the-art performance on two challenging hand datasets, significantly decreasing the error compared with previous single-modal methods. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge Distillation from Single-Task Teachers to Multi-Task Student for End-to-End Autonomous Driving b/data/2024/aaai/Knowledge Distillation from Single-Task Teachers to Multi-Task Student for End-to-End Autonomous Driving new file mode 100644 index 0000000000..8eb66b0b1b --- /dev/null +++ b/data/2024/aaai/Knowledge Distillation from Single-Task Teachers to Multi-Task Student for End-to-End Autonomous Driving @@ -0,0 +1 @@ +In the domain of end-to-end autonomous driving, conventional sensor fusion techniques exhibit inadequacies, particularly when facing challenging scenarios with numerous dynamic agents. Imitation learning caps performance at the level of the expert and suffers from out-of-distribution issues. To overcome these limitations, we propose a transformer-based algorithm designed to fuse diverse representations from RGB-D cameras through knowledge distillation. This approach leverages insights from multi-task teachers to enhance the learning capabilities of single-task students, particularly in a Reinforcement Learning (RL) setting. Our model consists of two primary modules: the perception module, responsible for encoding observation data acquired from RGB-D cameras and performing tasks such as semantic segmentation, semantic depth cloud mapping (SDC), ego vehicle speed estimation, and traffic light state recognition; and the control module, which decodes these features, incorporating additional data, including a rough simulator for static and dynamic environments, to anticipate waypoints within a latent feature space. Vehicular controls (e.g., steering, throttle, and brake) are obtained directly from measurement features and environmental states using the RL agent and are further refined by a PID algorithm that dynamically follows waypoints. The model undergoes rigorous evaluation and comparative analysis on the CARLA simulator across various scenarios, encompassing normal to adversarial conditions. Our code is available at https://github.com/pagand/e2etransfuser/ to facilitate future studies. 
\ No newline at end of file diff --git a/data/2024/aaai/Knowledge Enhanced Representation Learning for Drug Discovery b/data/2024/aaai/Knowledge Enhanced Representation Learning for Drug Discovery new file mode 100644 index 0000000000..d137c24c71 --- /dev/null +++ b/data/2024/aaai/Knowledge Enhanced Representation Learning for Drug Discovery @@ -0,0 +1 @@ +Recent research on predicting the binding affinity between drug molecules and proteins uses representations learned, through unsupervised learning techniques, from large databases of molecule SMILES and protein sequences. While these representations have significantly enhanced the predictions, they are usually based on a limited set of modalities, and they do not exploit available knowledge about existing relations among molecules and proteins. Our study reveals that enhanced representations, derived from multimodal knowledge graphs describing relations among molecules and proteins, lead to state-of-the-art results in well-established benchmarks (first place in the leaderboard for the Therapeutics Data Commons benchmark "Drug-Target Interaction Domain Generalization Benchmark", with an improvement of 8 points with respect to the previous best result). Moreover, our results significantly surpass those achieved in standard benchmarks by using conventional pre-trained representations that rely only on sequence or SMILES data. We release our multimodal knowledge graphs, which integrate data from seven public data sources and contain over 30 million triples. Pretrained models from our proposed graphs and the benchmark task source code are also released. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge Graph Error Detection with Contrastive Confidence Adaption b/data/2024/aaai/Knowledge Graph Error Detection with Contrastive Confidence Adaption new file mode 100644 index 0000000000..169e1413aa --- /dev/null +++ b/data/2024/aaai/Knowledge Graph Error Detection with Contrastive Confidence Adaption @@ -0,0 +1 @@ +Knowledge graphs (KGs) often contain various errors. Previous works on detecting errors in KGs mainly rely on triplet embedding from graph structure. We conduct an empirical study and find that these works struggle to discriminate noise from semantically-similar correct triplets. In this paper, we propose a KG error detection model, CCA, that integrates both textual and graph structural information from triplet reconstruction to better distinguish semantics. We design interactive contrastive learning to capture the differences between textual and structural patterns. Furthermore, we construct realistic datasets with semantically-similar noise and adversarial noise. Experimental results demonstrate that CCA outperforms state-of-the-art baselines, especially on semantically-similar noise and adversarial noise. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge Guided Semi-supervised Learning for Quality Assessment of User Generated Videos b/data/2024/aaai/Knowledge Guided Semi-supervised Learning for Quality Assessment of User Generated Videos new file mode 100644 index 0000000000..6a84e170f6 --- /dev/null +++ b/data/2024/aaai/Knowledge Guided Semi-supervised Learning for Quality Assessment of User Generated Videos @@ -0,0 +1 @@ +Perceptual quality assessment of user generated content (UGC) videos is challenging due to the requirement of large-scale human-annotated videos for training. 
In this work, we address this challenge by first designing a self-supervised Spatio-Temporal Visual Quality Representation Learning (ST-VQRL) framework to generate robust quality-aware features for videos. Then, we propose a dual-model based Semi Supervised Learning (SSL) method specifically designed for the Video Quality Assessment (SSL-VQA) task, through a novel knowledge transfer of quality predictions between the two models. Our SSL-VQA method uses the ST-VQRL backbone to produce robust performance across various VQA datasets, including cross-database settings, despite being learned with limited human-annotated videos. Our model improves state-of-the-art performance by around 10% when trained only with limited data, and by around 15% when unlabelled data is also used in SSL. Source codes and checkpoints are available at https://github.com/Shankhanil006/SSL-VQA. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge Transfer via Compact Model in Federated Learning (Student Abstract) b/data/2024/aaai/Knowledge Transfer via Compact Model in Federated Learning (Student Abstract) new file mode 100644 index 0000000000..5ac9b1bbff --- /dev/null +++ b/data/2024/aaai/Knowledge Transfer via Compact Model in Federated Learning (Student Abstract) @@ -0,0 +1 @@ +Communication overhead remains a significant challenge in federated learning due to frequent global model updates. Essentially, the update of the global model can be viewed as knowledge transfer. We aim to transfer more knowledge through a compact model while reducing communication overhead. In our study, we introduce a federated learning framework where clients pre-train large models locally and the server initializes a compact model for communication. This compact model should be light in size but still have enough knowledge to refine the global model effectively. We facilitate the knowledge transfer from local to global models based on pre-training outcomes. Our experiments show that our approach significantly reduces communication overhead without sacrificing accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Aware Explainable Reciprocal Recommendation b/data/2024/aaai/Knowledge-Aware Explainable Reciprocal Recommendation new file mode 100644 index 0000000000..a4d773eb75 --- /dev/null +++ b/data/2024/aaai/Knowledge-Aware Explainable Reciprocal Recommendation @@ -0,0 +1 @@ +Reciprocal recommender systems (RRS) have been widely used in online platforms such as online dating and recruitment. They can simultaneously fulfill the needs of both parties involved in the recommendation process. Due to the inherent nature of the task, interaction data is relatively sparse compared to other recommendation tasks. Existing works mainly address this issue through content-based recommendation methods. However, these methods often implicitly model textual information from a unified perspective, making it challenging to capture the distinct intentions held by each party, which further leads to limited performance and a lack of interpretability. In this paper, we propose a Knowledge-Aware Explainable Reciprocal Recommender System (KAERR), which models metapaths between two parties independently, considering their respective perspectives and requirements. Various metapaths are fused using an attention-based mechanism, where the attention weights unveil dual-perspective preferences and provide recommendation explanations for both parties. 
Extensive experiments on two real-world datasets from diverse scenarios demonstrate that the proposed model outperforms state-of-the-art baselines, while also delivering compelling reasons for recommendations to both parties. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Aware Neuron Interpretation for Scene Classification b/data/2024/aaai/Knowledge-Aware Neuron Interpretation for Scene Classification new file mode 100644 index 0000000000..b3de82a2ec --- /dev/null +++ b/data/2024/aaai/Knowledge-Aware Neuron Interpretation for Scene Classification @@ -0,0 +1 @@ +Although neural models have achieved remarkable performance, they still face skepticism due to their lack of transparency. To this end, model prediction explanation is attracting more and more attention. However, current methods rarely incorporate external knowledge and still suffer from three limitations: (1) Neglecting concept completeness: merely selecting concepts may not be sufficient for prediction. (2) Lacking concept fusion: semantically-equivalent concepts are not merged. (3) Difficulty in manipulating model behavior: explanations are not verified on the original model. To address these issues, we propose a novel knowledge-aware neuron interpretation framework to explain model predictions for image scene classification. Specifically, for concept completeness, we present core concepts of a scene based on a knowledge graph, ConceptNet, to gauge the completeness of concepts. Our method, incorporating complete concepts, effectively provides better prediction explanations compared to baselines. Furthermore, for concept fusion, we introduce a knowledge graph-based method known as Concept Filtering, which yields a gain of over 23 percentage points on neuron behaviors for neuron interpretation. At last, we propose Model Manipulation, which aims to study whether the core concepts based on ConceptNet could be employed to manipulate model behavior. The results show that core concepts can effectively improve the performance of the original model by over 26%. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Aware Parameter Coaching for Personalized Federated Learning b/data/2024/aaai/Knowledge-Aware Parameter Coaching for Personalized Federated Learning new file mode 100644 index 0000000000..2fcf2e5473 --- /dev/null +++ b/data/2024/aaai/Knowledge-Aware Parameter Coaching for Personalized Federated Learning @@ -0,0 +1 @@ +Personalized Federated Learning (pFL) can effectively exploit the non-IID data from distributed clients by customizing personalized models. Existing pFL methods either simply take the local model as a whole for aggregation or require significant training overhead to induce the inter-client personalized weights, and thus clients cannot efficiently exploit the mutually relevant knowledge from each other. In this paper, we propose a knowledge-aware parameter coaching scheme where each client can swiftly and granularly refer to the parameters of other clients to guide local training, whereby accurate personalized client models can be efficiently produced without contradictory knowledge. Specifically, a novel regularizer is designed to conduct layer-wise parameter coaching via a relation cube, which is constructed based on the knowledge represented by the layered parameters among all clients. Then, we develop an optimization method to update the relation cube and the parameters of each client. 
It is theoretically demonstrated that the convergence of the proposed method can be guaranteed under both convex and non-convex settings. Extensive experiments are conducted over various datasets, which show that the proposed method can achieve better performance compared with the state-of-the-art baselines in terms of accuracy and convergence speed. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Enhanced Historical Document Segmentation and Recognition b/data/2024/aaai/Knowledge-Enhanced Historical Document Segmentation and Recognition new file mode 100644 index 0000000000..2d965ca497 --- /dev/null +++ b/data/2024/aaai/Knowledge-Enhanced Historical Document Segmentation and Recognition @@ -0,0 +1 @@ +Optical Character Recognition (OCR) of historical document images remains a challenging task because of the distorted input images, extensive number of uncommon characters, and the scarcity of labeled data, which impedes modern deep learning-based OCR techniques from achieving good recognition accuracy. Meanwhile, there exists a substantial amount of expert knowledge that can be utilized in this task. However, such knowledge is usually complicated and could only be accurately expressed with formal languages such as first-order logic (FOL), which is difficult to be directly integrated into deep learning models. This paper proposes KESAR, a novel Knowledge-Enhanced Document Segmentation And Recognition method for historical document images based on the Abductive Learning (ABL) framework. The segmentation and recognition models are enhanced by incorporating background knowledge for character extraction and prediction, followed by an efficient joint optimization of both models. We validate the effectiveness of KESAR on historical document datasets. The experimental results demonstrate that our method can simultaneously utilize knowledge-driven reasoning and data-driven learning, which outperforms the current state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Knowledge-Powered Recommendation for an Improved Diet Water Footprint b/data/2024/aaai/Knowledge-Powered Recommendation for an Improved Diet Water Footprint new file mode 100644 index 0000000000..2599e3ce08 --- /dev/null +++ b/data/2024/aaai/Knowledge-Powered Recommendation for an Improved Diet Water Footprint @@ -0,0 +1 @@ +According to WWF, 1.1 billion people lack access to water, and 2.7 billion experience water scarcity at least one month a year. By 2025, two-thirds of the world's population may be facing water shortages. This highlights the urgency of managing water usage efficiently, especially in water-intensive sectors like food. This paper proposes a recommendation engine, powered by knowledge graphs, aiming to facilitate sustainable and healthy food consumption. The engine recommends ingredient substitutes in user recipes that improve nutritional value and reduce environmental impact, particularly water footprint. The system architecture includes source identification, information extraction, schema alignment, knowledge graph construction, and user interface development. The research offers a promising tool for promoting healthier eating habits and contributing to water conservation efforts. 
\ No newline at end of file diff --git a/data/2024/aaai/Kumaraswamy Wavelet for Heterophilic Scene Graph Generation b/data/2024/aaai/Kumaraswamy Wavelet for Heterophilic Scene Graph Generation new file mode 100644 index 0000000000..8b1b055b2e --- /dev/null +++ b/data/2024/aaai/Kumaraswamy Wavelet for Heterophilic Scene Graph Generation @@ -0,0 +1 @@ +Graph neural networks (GNNs) have demonstrated their capabilities in the field of scene graph generation (SGG) by updating node representations from neighboring nodes. This process can be viewed as a form of low-pass filtering in the spatial domain, which smooths node feature representations and retains commonalities among nodes. However, spatial GNNs do not work well in the case of heterophilic SGG, in which fine-grained predicates are always connected to a large number of coarse-grained predicates. Blind smoothing undermines the discriminative information of the fine-grained predicates, resulting in failure to predict them accurately. To address the heterophily, our key idea is to design tailored filters via the wavelet transform in the spectral domain. First, we prove rigorously that when the heterophily on the scene graph increases, the spectral energy gradually shifts towards the high-frequency part. Inspired by this observation, we subsequently propose the Kumaraswamy Wavelet Graph Neural Network (KWGNN). KWGNN leverages complementary multi-group Kumaraswamy wavelets to cover all frequency bands. Finally, KWGNN adaptively generates band-pass filters and then integrates the filtering results to better accommodate varying levels of smoothness on the graph. Comprehensive experiments on the Visual Genome and Open Images datasets show that our method achieves state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation b/data/2024/aaai/LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation new file mode 100644 index 0000000000..f077a31295 --- /dev/null +++ b/data/2024/aaai/LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation @@ -0,0 +1 @@ +Recently, an enormous amount of research has emerged on multimodal knowledge graph completion (MKGC), which seeks to extract knowledge from multimodal data and predict the most plausible missing facts to complete a given multimodal knowledge graph (MKG). However, existing MKGC approaches largely ignore that visual information may introduce noise and lead to uncertainty when added to traditional KG embeddings, because the contribution of each associated image to an entity differs across link scenarios. Moreover, treating each triple independently when learning entity embeddings leads to the loss of local structural and whole-graph information. To address these challenges, we propose a novel link aware fusion and aggregation based multimodal knowledge graph completion model named LAFA, which is composed of a link aware fusion module and a link aware aggregation module. The link aware fusion module alleviates the noise of irrelevant visual information by calculating the importance between an entity and its associated images in different link scenarios, and fuses the visual and structural embeddings according to this importance through our proposed modality embedding fusion mechanism. 
The link aware aggregation module assigns neighbor structural information to a given central entity by calculating the importance between the entity and its neighbors, and aggregating the fused embeddings through a linear combination according to this importance. Extensive experiments on standard datasets validate that LAFA can obtain state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/LAMM: Label Alignment for Multi-Modal Prompt Learning b/data/2024/aaai/LAMM: Label Alignment for Multi-Modal Prompt Learning new file mode 100644 index 0000000000..5ebc915bcd --- /dev/null +++ b/data/2024/aaai/LAMM: Label Alignment for Multi-Modal Prompt Learning @@ -0,0 +1 @@ +With the success of pre-trained visual-language (VL) models such as CLIP in visual representation tasks, transferring pre-trained models to downstream tasks has become a crucial paradigm. Recently, the prompt tuning paradigm, which draws inspiration from natural language processing (NLP), has made significant progress in the VL field. However, preceding methods mainly focus on constructing prompt templates for text and visual inputs, neglecting the gap in class label representations between the VL models and downstream tasks. To address this challenge, we introduce an innovative label alignment method named LAMM, which can dynamically adjust the category embeddings of downstream datasets through end-to-end training. Moreover, to achieve a more appropriate label distribution, we propose a hierarchical loss, encompassing the alignment of the parameter space, feature space, and logits space. We conduct experiments on 11 downstream vision datasets and demonstrate that our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios, exhibiting an average accuracy improvement of 2.31% compared to the state-of-the-art methods on 16 shots. Moreover, our method outperforms other prompt tuning methods in continual learning. Importantly, our method is synergistic with existing prompt tuning methods and can boost performance on top of them. Our code and dataset will be publicly available at https://github.com/gaojingsheng/LAMM. \ No newline at end of file diff --git a/data/2024/aaai/LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training b/data/2024/aaai/LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training new file mode 100644 index 0000000000..4555c7a40b --- /dev/null +++ b/data/2024/aaai/LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training @@ -0,0 +1 @@ +Paraphrases are texts that convey the same meaning while using different words or sentence structures. They can be used as an automatic data augmentation tool for many Natural Language Processing tasks, especially when dealing with low-resource languages, where data shortage is a significant problem. To generate a paraphrase in multilingual settings, previous studies have leveraged knowledge from the machine translation field, i.e., forming a paraphrase through zero-shot machine translation in the same language. Despite good performance in human evaluation, those methods still require parallel translation datasets, thus making them inapplicable to languages that do not have parallel corpora. 
To mitigate this problem, we propose the first unsupervised multilingual paraphrasing model, LAMPAT (Low-rank Adaptation for Multilingual Paraphrasing using Adversarial Training), for which a monolingual dataset is sufficient to generate human-like and diverse sentences. Our experiments show that our method not only works well for English but also generalizes to unseen languages. Data and code are available at https://github.com/phkhanhtrinh23/LAMPAT. \ No newline at end of file diff --git a/data/2024/aaai/LDMVFI: Video Frame Interpolation with Latent Diffusion Models b/data/2024/aaai/LDMVFI: Video Frame Interpolation with Latent Diffusion Models new file mode 100644 index 0000000000..0abc1d2bd4 --- /dev/null +++ b/data/2024/aaai/LDMVFI: Video Frame Interpolation with Latent Diffusion Models @@ -0,0 +1 @@ +Existing works on video frame interpolation (VFI) mostly employ deep neural networks that are trained by minimizing the L1, L2, or deep feature space distance (e.g. VGG loss) between their outputs and ground-truth frames. However, recent works have shown that these metrics are poor indicators of perceptual VFI quality. Towards developing perceptually-oriented VFI methods, in this work we propose latent diffusion model-based VFI, LDMVFI. LDMVFI approaches VFI from a generative perspective by formulating it as a conditional generation problem. As the first effort to address VFI using latent diffusion models, we rigorously benchmark our method on common test sets used in the existing VFI literature. Our quantitative experiments and user study indicate that LDMVFI is able to interpolate video content with favorable perceptual quality compared to the state of the art, even in the high-resolution regime. Our code is available at https://github.com/danier97/LDMVFI. \ No newline at end of file diff --git a/data/2024/aaai/LDS2AE: Local Diffusion Shared-Specific Autoencoder for Multimodal Remote Sensing Image Classification with Arbitrary Missing Modalities b/data/2024/aaai/LDS2AE: Local Diffusion Shared-Specific Autoencoder for Multimodal Remote Sensing Image Classification with Arbitrary Missing Modalities new file mode 100644 index 0000000000..b3047d35a4 --- /dev/null +++ b/data/2024/aaai/LDS2AE: Local Diffusion Shared-Specific Autoencoder for Multimodal Remote Sensing Image Classification with Arbitrary Missing Modalities @@ -0,0 +1 @@ +Recent research on the joint classification of multimodal remote sensing data has achieved great success. However, due to the limitations imposed by imaging conditions, the case of missing modalities often occurs in practice. Most previous researchers regard classification under different missing modalities as independent tasks. They train a specific classification model for each fixed missing modality by extracting a multimodal joint representation, which cannot handle the classification of arbitrary (including multiple and random) missing modalities. In this work, we propose a local diffusion shared-specific autoencoder (LDS2AE), which solves the classification of arbitrary missing modalities with a single model. The LDS2AE captures the data distribution of different modalities to learn a multimodal shared feature for classification by designing a novel local diffusion autoencoder which consists of a modality-shared encoder and several modality-specific decoders. 
The modality-shared encoder is designed to extract the multimodal shared feature by employing the same parameters to map multimodal data into a shared subspace. The modality-specific decoders use the multimodal shared feature to reconstruct the image of each modality, which encourages the shared feature to capture the unique information of different modalities. In addition, we incorporate masked training into the diffusion autoencoder to achieve local diffusion, which significantly reduces the training cost of the model. The approach is tested on widely-used multimodal remote sensing datasets, demonstrating the effectiveness of the proposed LDS2AE in addressing the classification of arbitrary missing modalities. The code is available at https://github.com/Jiahuiqu/LDS2AE. \ No newline at end of file diff --git a/data/2024/aaai/LERE: Learning-Based Low-Rank Matrix Recovery with Rank Estimation b/data/2024/aaai/LERE: Learning-Based Low-Rank Matrix Recovery with Rank Estimation new file mode 100644 index 0000000000..deae0c9211 --- /dev/null +++ b/data/2024/aaai/LERE: Learning-Based Low-Rank Matrix Recovery with Rank Estimation @@ -0,0 +1 @@ +A fundamental task in computer vision, Low-Rank Matrix Recovery (LRMR) focuses on precisely recovering the inherent low-rank structure from incomplete data and/or corrupted measurements, given that the rank is known a priori or accurately estimated. However, it remains challenging for existing rank estimation methods to accurately estimate the rank of an ill-conditioned matrix. Also, existing LRMR optimization methods are heavily dependent on the chosen parameters, and are therefore difficult to adapt to different situations. Addressing these issues, a novel LEarning-based low-rank matrix recovery method with Rank Estimation (LERE) is proposed. More specifically, by considering the characteristics of the Gerschgorin disk's center and radius, a new heuristic decision rule in the Gerschgorin Disk Theorem is significantly enhanced so that the low-rank boundary can be exactly located, which leads to a marked improvement in the accuracy of rank estimation. According to the estimated rank, we select row and column sub-matrices from the observation matrix by uniformly random sampling. A 17-iteration feedforward-recurrent-mixed neural network is then adapted to learn the parameters in the sub-matrix recovery process. Finally, using the correlation between the row and column sub-matrices, LERE successfully recovers the underlying low-rank matrix. Overall, LERE is more efficient and robust than existing LRMR methods. Experimental results demonstrate that LERE surpasses state-of-the-art (SOTA) methods. The code for this work is accessible at https://github.com/zhengqinxu/LERE. \ No newline at end of file diff --git a/data/2024/aaai/LERMO: A Novel Web Game for AI-Enhanced Sign Language Recognition b/data/2024/aaai/LERMO: A Novel Web Game for AI-Enhanced Sign Language Recognition new file mode 100644 index 0000000000..cbb383cefd --- /dev/null +++ b/data/2024/aaai/LERMO: A Novel Web Game for AI-Enhanced Sign Language Recognition @@ -0,0 +1 @@ +Sign language is a visual and gestural communication system used by deaf and hearing-impaired people. Despite the numerous deep learning methods proposed for automatic interpretation, a gap persists in developing applications that effectively utilize these models to assist sign language studies and inclusion. We introduce LERMO (https://lermo.app/), a web game merging machine learning and gamification to enhance sign language fingerspelling. 
Inspired by Wordle™, LERMO offers an interactive word-guessing game where users can play using a video camera. We create a new dataset of labeled landmark fingerspelling and design our model for the speed and efficiency needed to run in a web browser. We survey approximately 40 users, who find LERMO user-friendly and innovative. Of those, 95% believe LERMO could be used to enhance fingerspelling skills. \ No newline at end of file diff --git a/data/2024/aaai/LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition b/data/2024/aaai/LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition new file mode 100644 index 0000000000..372f46bb2d --- /dev/null +++ b/data/2024/aaai/LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition @@ -0,0 +1 @@ +The Vision Transformer (ViT) excels in accuracy when handling high-resolution images, yet it confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements. To address this, we present the Localization and Focus Vision Transformer (LF-ViT). This model operates by strategically curtailing computational demands without impinging on performance. In the Localization phase, a reduced-resolution image is processed; if a definitive prediction remains elusive, our pioneering Neighborhood Global Class Attention (NGCA) mechanism is triggered, effectively identifying and spotlighting class-discriminative regions based on initial findings. Subsequently, in the Focus phase, the designated region of the original image is used to enhance recognition. Uniquely, LF-ViT employs consistent parameters across both phases, ensuring seamless end-to-end optimization. Our empirical tests affirm LF-ViT's prowess: it remarkably decreases DeiT-S's FLOPs by 63% and concurrently amplifies throughput twofold. The code of this project is at https://github.com/edgeai1/LF-ViT.git. \ No newline at end of file diff --git a/data/2024/aaai/LGMRec: Local and Global Graph Learning for Multimodal Recommendation b/data/2024/aaai/LGMRec: Local and Global Graph Learning for Multimodal Recommendation new file mode 100644 index 0000000000..8100ece8e5 --- /dev/null +++ b/data/2024/aaai/LGMRec: Local and Global Graph Learning for Multimodal Recommendation @@ -0,0 +1 @@ +Multimodal recommendation has gradually become the infrastructure of online media platforms, enabling them to provide personalized service to users through joint modeling of users' historical behaviors (e.g., purchases, clicks) and items' various modalities (e.g., visual and textual). The majority of existing studies typically focus on utilizing modal features or modal-related graph structures to learn users' local interests. Nevertheless, these approaches encounter two limitations: (1) Shared updates of user ID embeddings result in the consequential coupling between collaboration and multimodal signals; (2) Lack of exploration into robust global user interests to alleviate the sparse interaction problems faced by local interest modeling. To address these issues, we propose a novel Local and Global Graph Learning-guided Multimodal Recommender (LGMRec), which jointly models local and global user interests. Specifically, we present a local graph embedding module to independently learn collaborative-related and modality-related embeddings of users and items with local topological relations. 
Moreover, a global hypergraph embedding module is designed to capture global user and item embeddings by modeling insightful global dependency relations. The global embeddings acquired within the hypergraph embedding space can then be combined with two decoupled local embeddings to improve the accuracy and robustness of recommendations. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our LGMRec over various state-of-the-art recommendation baselines, showcasing its effectiveness in modeling both local and global user interests. \ No newline at end of file diff --git a/data/2024/aaai/LION: Implicit Vision Prompt Tuning b/data/2024/aaai/LION: Implicit Vision Prompt Tuning new file mode 100644 index 0000000000..9a84335de1 --- /dev/null +++ b/data/2024/aaai/LION: Implicit Vision Prompt Tuning @@ -0,0 +1,5 @@ +Despite recent promising performances across a range of vision tasks, vision Transformers still suffer from high computational costs. +Recently, vision prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. +However, the efficiency and effectiveness of existing models are still far from satisfactory due to the parameter cost of extensive prompt blocks and tricky prompt framework designs. +In this paper, we propose a light-weight prompt framework named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models with stable low memory costs for various complex tasks. +In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained backbone, with its parameters frozen. Moreover, according to the lottery ticket hypothesis, we further prune the parameters to relieve the computation burden in the implicit layers. Various experiments have validated that our LION obtains promising performances on a wide range of datasets. Most importantly, LION reduces the number of training parameters by up to 11.5% while obtaining higher performance than the state-of-the-art VPT, especially under challenging scenes. Furthermore, we find that our proposed LION has excellent generalization performance, making it an easy way to boost transfer learning in the future. \ No newline at end of file diff --git a/data/2024/aaai/LLM vs Small Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model b/data/2024/aaai/LLM vs Small Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model new file mode 100644 index 0000000000..be74780867 --- /dev/null +++ b/data/2024/aaai/LLM vs Small Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model @@ -0,0 +1 @@ +Personality detection aims to detect one's personality traits underlying social media posts. One challenge of this task is the scarcity of ground-truth personality traits, which are collected from self-report questionnaires. Most existing methods learn post features directly by fine-tuning pre-trained language models under the supervision of limited personality labels. This leads to inferior quality of post features and consequently affects the performance. In addition, they treat personality traits as one-hot classification labels, overlooking the semantic information within them. In this paper, we propose a large language model (LLM) based text augmentation enhanced personality detection model, which distills the LLM's knowledge to enhance the small model for personality detection, even when the LLM fails in this task. 
Specifically, we enable the LLM to generate post analyses (augmentations) from the semantic, sentiment, and linguistic aspects, which are critical for personality detection. By using contrastive learning to pull them together in the embedding space, the post encoder can better capture the psycho-linguistic information within the post representations, thus improving personality detection. Furthermore, we utilize the LLM to enrich the information of personality labels to enhance detection performance. Experimental results on the benchmark datasets demonstrate that our model outperforms the state-of-the-art methods on personality detection. \ No newline at end of file diff --git a/data/2024/aaai/LLM-Powered Synthetic Environments for Self-Driving Scenarios b/data/2024/aaai/LLM-Powered Synthetic Environments for Self-Driving Scenarios new file mode 100644 index 0000000000..4d88791d16 --- /dev/null +++ b/data/2024/aaai/LLM-Powered Synthetic Environments for Self-Driving Scenarios @@ -0,0 +1,2 @@ +This paper outlines a proposal exploring the potential use of Large Language Models (LLMs), particularly GPT-4, in crafting realistic synthetic environments for self-driving scenarios. The envisioned approach involves dynamic scene generation within game engines, leveraging LLMs to introduce challenging elements for autonomous vehicles. The proposed evaluation process outlines assessments such as realistic testing, safety metrics, and user interaction, aiming to set the stage for potential improvements in self-driving system performance. +The paper aims to contribute to the AI field by discussing how LLMs could be utilized to create valuable testing grounds for autonomous vehicles, potentially fostering the development of more robust self-driving technology. The envisioned impact is the eventual enhancement of road safety and the possible acceleration of the adoption of autonomous vehicles, paving the way for a future with safer and more efficient transportation. \ No newline at end of file diff --git a/data/2024/aaai/LLMEval: A Preliminary Study on How to Evaluate Large Language Models b/data/2024/aaai/LLMEval: A Preliminary Study on How to Evaluate Large Language Models new file mode 100644 index 0000000000..51b877a84c --- /dev/null +++ b/data/2024/aaai/LLMEval: A Preliminary Study on How to Evaluate Large Language Models @@ -0,0 +1,10 @@ +Recently, the evaluation of Large Language Models has emerged as a popular area of research. +The three crucial questions for LLM evaluation are "what, where, and how to evaluate". +However, the existing research mainly focuses on the first two questions, which are basically what tasks to give the LLM during testing and what kind of knowledge it should deal with. +As for the third question, which is about what standards to use, the types of evaluators, how to score, and how to rank, there has not been much discussion. +In this paper, we analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourced, and public annotators as well as GPT-4, with different scoring methods and ranking systems. +We propose a new dataset, LLMEval, and conduct evaluations on 20 LLMs. +A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results. +We perform comparisons and analyses of different settings and draw 10 conclusions that can provide insights for evaluating LLMs in the future. 
The dataset and the results are publicly available at +https://github.com/llmeval. +The version with the appendix is publicly available at https://arxiv.org/abs/2312.07398. \ No newline at end of file diff --git a/data/2024/aaai/LLMGuard: Guarding against Unsafe LLM Behavior b/data/2024/aaai/LLMGuard: Guarding against Unsafe LLM Behavior new file mode 100644 index 0000000000..821153865c --- /dev/null +++ b/data/2024/aaai/LLMGuard: Guarding against Unsafe LLM Behavior @@ -0,0 +1,2 @@ +Although the rise of Large Language Models (LLMs) in enterprise settings brings new opportunities and capabilities, it also brings challenges, such as the risk of generating inappropriate, biased, or misleading content that violates regulations and can raise legal concerns. +To alleviate this, we present "LLMGuard", a tool that monitors user interactions with an LLM application and flags content against specific behaviours or conversation topics. To do this robustly, LLMGuard employs an ensemble of detectors. \ No newline at end of file diff --git a/data/2024/aaai/LLMRG: Improving Recommendations through Large Language Model Reasoning Graphs b/data/2024/aaai/LLMRG: Improving Recommendations through Large Language Model Reasoning Graphs new file mode 100644 index 0000000000..1278b97b5e --- /dev/null +++ b/data/2024/aaai/LLMRG: Improving Recommendations through Large Language Model Reasoning Graphs @@ -0,0 +1 @@ +Recommendation systems aim to provide users with relevant suggestions, but often lack interpretability and fail to capture higher-level semantic relationships between user behaviors and profiles. In this paper, we propose a novel approach that leverages large language models (LLMs) to construct personalized reasoning graphs. These graphs link a user's profile and behavioral sequences through causal and logical inferences, representing the user's interests in an interpretable way. Our approach, LLM reasoning graphs (LLMRG), has four components: chained graph reasoning, divergent extension, self-verification and scoring, and knowledge base self-improvement. The resulting reasoning graph is encoded using graph neural networks, and serves as additional input to improve conventional recommender systems, without requiring extra user or item information. Our approach demonstrates how LLMs can enable more logical and interpretable recommender systems through personalized reasoning graphs. LLMRG allows recommendations to benefit from both engineered recommendation systems and LLM-derived reasoning graphs. We demonstrate the effectiveness of LLMRG on benchmarks and real-world scenarios in enhancing base recommendation models. \ No newline at end of file diff --git a/data/2024/aaai/LMD: Faster Image Reconstruction with Latent Masking Diffusion b/data/2024/aaai/LMD: Faster Image Reconstruction with Latent Masking Diffusion new file mode 100644 index 0000000000..0c32b250ea --- /dev/null +++ b/data/2024/aaai/LMD: Faster Image Reconstruction with Latent Masking Diffusion @@ -0,0 +1 @@ +As a class of fruitful approaches, diffusion probabilistic models (DPMs) have shown excellent advantages in high-resolution image reconstruction. On the other hand, masked autoencoders (MAEs), as popular self-supervised vision learners, have demonstrated simpler and more effective image reconstruction and transfer capabilities on downstream tasks. 
However, they all incur extremely high training costs, either due to inherent high temporal dependence (i.e., excessively long diffusion steps) or due to artificially low spatial dependence (i.e., a human-formulated high mask ratio, such as 0.75). To this end, this paper presents LMD, a faster image reconstruction framework with Latent Masking Diffusion. First, we propose to project and reconstruct images in latent space through a pre-trained variational autoencoder, which is theoretically more efficient than in the pixel-based space. Then, we combine the advantages of MAEs and DPMs to design a progressive masking diffusion model, which gradually increases the masking proportion by three different schedulers and reconstructs the latent features from simple to difficult, without sequentially performing denoising diffusion as in DPMs or using a fixed high masking ratio as in MAEs, so as to alleviate the problem of high training time consumption. Our approach allows for learning high-capacity models and accelerates their training (by 3x or more) while barely reducing the original accuracy. Inference speed on downstream tasks also significantly outperforms that of previous approaches. \ No newline at end of file diff --git a/data/2024/aaai/LR-XFL: Logical Reasoning-Based Explainable Federated Learning b/data/2024/aaai/LR-XFL: Logical Reasoning-Based Explainable Federated Learning new file mode 100644 index 0000000000..da2385d221 --- /dev/null +++ b/data/2024/aaai/LR-XFL: Logical Reasoning-Based Explainable Federated Learning @@ -0,0 +1 @@ +Federated learning (FL) is an emerging approach for training machine learning models collaboratively while preserving data privacy. The need for privacy protection makes it difficult for FL models to achieve global transparency and explainability. To address this limitation, we incorporate logic-based explanations into FL by proposing the Logical Reasoning-based eXplainable Federated Learning (LR-XFL) approach. Under LR-XFL, FL clients create local logic rules based on their local data and send them, along with model updates, to the FL server. The FL server connects the local logic rules through a proper logical connector that is derived based on properties of the client data, without requiring access to the raw data. In addition, the server also aggregates the local model updates with weight values determined by the quality of the clients’ local data as reflected by their uploaded logic rules. The results show that LR-XFL outperforms the most relevant baseline by 1.19%, 5.81% and 5.41% in terms of classification accuracy, rule accuracy and rule fidelity, respectively. The explicit rule evaluation and expression under LR-XFL enable human experts to validate and correct the rules on the server side, hence improving the global FL model’s robustness to errors. It has the potential to enhance the transparency of FL models for areas like healthcare and finance where both data privacy and explainability are important. 
\ No newline at end of file diff --git a/data/2024/aaai/LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network b/data/2024/aaai/LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network new file mode 100644 index 0000000000..123b0802f8 --- /dev/null +++ b/data/2024/aaai/LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network @@ -0,0 +1 @@ +Recently, regression-based methods, which predict parameterized text shapes for text localization, have gained popularity in scene text detection. However, the existing parameterized text shape methods still have limitations in modeling arbitrary-shaped texts due to ignoring the utilization of text-specific shape information. Moreover, the time consumption of the entire pipeline has been largely overlooked, leading to a suboptimal overall inference speed. To address these issues, we first propose a novel parameterized text shape method based on low-rank approximation. Unlike other shape representation methods that employ data-irrelevant parameterization, our approach utilizes singular value decomposition and reconstructs the text shape using a few eigenvectors learned from labeled text contours. By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation. Next, we propose a dual assignment scheme for speed acceleration. It adopts a sparse assignment branch to accelerate the inference speed, and meanwhile, provides ample supervised signals for training through a dense assignment branch. Building upon these designs, we implement an accurate and efficient arbitrary-shaped text detector named LRANet. Extensive experiments are conducted on several challenging benchmarks, demonstrating the superior accuracy and efficiency of LRANet compared to state-of-the-art methods. Code is available at: https://github.com/ychensu/LRANet.git \ No newline at end of file diff --git a/data/2024/aaai/LSTKC: Long Short-Term Knowledge Consolidation for Lifelong Person Re-identification b/data/2024/aaai/LSTKC: Long Short-Term Knowledge Consolidation for Lifelong Person Re-identification new file mode 100644 index 0000000000..ada193f5f9 --- /dev/null +++ b/data/2024/aaai/LSTKC: Long Short-Term Knowledge Consolidation for Lifelong Person Re-identification @@ -0,0 +1 @@ +Lifelong person re-identification (LReID) aims to train a unified model from diverse data sources step by step. The severe domain gaps between different training steps result in catastrophic forgetting in LReID, and existing methods mainly rely on data replay and knowledge distillation techniques to handle this issue. However, the former solution needs to store historical exemplars which inevitably impedes data privacy. The existing knowledge distillation-based models usually retain all the knowledge of the learned old models without any selections, which will inevitably include erroneous and detrimental knowledge that severely impacts the learning performance of the new model. To address these issues, we propose an exemplar-free LReID method named LongShort Term Knowledge Consolidation (LSTKC) that contains a Rectification-based Short-Term Knowledge Transfer module (R-STKT) and an Estimation-based Long-Term Knowledge Consolidation module (E-LTKC). 
For each learning iteration within one training step, R-STKT aims to filter and rectify the erroneous knowledge contained in the old model and transfer the rectified knowledge to facilitate the short-term learning of the new model. Meanwhile, once one training step is finished, E-LTKC proposes to further consolidate the learned long-term knowledge via adaptively fusing the parameters of models from different steps. Consequently, experimental results show that our LSTKC exceeds the state-of-the-art methods by 6.3%/9.4% and 7.9%/4.5%, 6.4%/8.0% and 9.0%/5.5% average mAP/R@1 on seen and unseen domains under two different training orders of the challenging LReID benchmark respectively. \ No newline at end of file diff --git a/data/2024/aaai/LaMAR: Laplacian Pyramid for Multimodal Adaptive Super Resolution (Student Abstract) b/data/2024/aaai/LaMAR: Laplacian Pyramid for Multimodal Adaptive Super Resolution (Student Abstract) new file mode 100644 index 0000000000..6cc8add2bc --- /dev/null +++ b/data/2024/aaai/LaMAR: Laplacian Pyramid for Multimodal Adaptive Super Resolution (Student Abstract) @@ -0,0 +1 @@ +Recent advances in image-to-image translation involve the integration of non-visual imagery in deep models. Non-visual sensors, although more costly, often produce low-resolution images. To combat this, methods using RGB images to enhance the resolution of these modalities have been introduced. Fusing these modalities to achieve high-resolution results demands models with millions of parameters and extended inference times. We present LaMAR, a lightweight model. It employs Laplacian image pyramids combined with a low-resolution thermal image for Guided Thermal Super Resolution. By decomposing the RGB image into a Laplacian pyramid, LaMAR preserves image details and avoids high-resolution feature map computations, ensuring efficiency. With faster inference times and fewer parameters, our model demonstrates state-of-the-art results. \ No newline at end of file diff --git a/data/2024/aaai/LaViP: Language-Grounded Visual Prompting b/data/2024/aaai/LaViP: Language-Grounded Visual Prompting new file mode 100644 index 0000000000..a4d005dd37 --- /dev/null +++ b/data/2024/aaai/LaViP: Language-Grounded Visual Prompting @@ -0,0 +1 @@ +We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks. By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder, eliminating the need to modify or add to the model's parameters. Due to this design choice, our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained. We will empirically demonstrate that, compared to prior art, grounding visual prompts with language enhances both the accuracy and speed of adaptation. Moreover, our algorithm excels in base-to-novel class generalization, overcoming limitations of visual prompting and exhibiting the capacity to generalize beyond seen classes. We thoroughly assess and evaluate our method across a variety of image recognition datasets, such as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning situations, including few-shot adaptation, base-to-novel class generalization, and transfer learning. 
\ No newline at end of file diff --git a/data/2024/aaai/Label Attentive Distillation for GNN-Based Graph Classification b/data/2024/aaai/Label Attentive Distillation for GNN-Based Graph Classification new file mode 100644 index 0000000000..0b6bb10aa6 --- /dev/null +++ b/data/2024/aaai/Label Attentive Distillation for GNN-Based Graph Classification @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have emerged as a powerful tool for modeling graph-structured data, exhibiting remarkable potential in applications such as social networks, recommendation systems, and molecular structures. However, the conventional GNNs perform node-level feature aggregation from neighbors without considering graph-label information, which leads to the misaligned embedding problem that may cause a detrimental effect on graph-level tasks such as graph classification. In this paper, we propose a novel label-attentive distillation method called LAD-GNN for graph representation learning to solve this problem. It alternatively trains a teacher model and a student GNN with a distillation-based approach. In the teacher model, a label-attentive encoder is proposed to encode the label information fusing with the node features to generate ideal embedding. In the student model, the ideal embedding is used as intermediate supervision to urge the student GNN to learn class-friendly node embedding to facilitate graph-level tasks. Generally, LAD-GNN is an enhanced GNN training approach that can be incorporated with arbitrary GNN backbone to improve performance without significant increase of computational cost. Extensive experiments with 7 GNN backbones based on 10 benchmark datasets show that LAD-GNN improves the SOTA GNNs in graph classification accuracy. The source codes of LAD-GNN are publicly available on https://github.com/XiaobinHong/LAD-GNN. \ No newline at end of file diff --git a/data/2024/aaai/Label-Efficient Few-Shot Semantic Segmentation with Unsupervised Meta-Training b/data/2024/aaai/Label-Efficient Few-Shot Semantic Segmentation with Unsupervised Meta-Training new file mode 100644 index 0000000000..0c0f214bd6 --- /dev/null +++ b/data/2024/aaai/Label-Efficient Few-Shot Semantic Segmentation with Unsupervised Meta-Training @@ -0,0 +1 @@ +The goal of this paper is to alleviate the training cost for few-shot semantic segmentation (FSS) models. Despite that FSS in nature improves model generalization to new concepts using only a handful of test exemplars, it relies on strong supervision from a considerable amount of labeled training data for base classes. However, collecting pixel-level annotations is notoriously expensive and time-consuming, and small-scale training datasets convey low information density that limits test-time generalization. To resolve the issue, we take a pioneering step towards label-efficient training of FSS models from fully unlabeled training data, or additionally a few labeled samples to enhance the performance. This motivates an approach based on a novel unsupervised meta-training paradigm. In particular, the approach first distills pre-trained unsupervised pixel embedding into compact semantic clusters from which a massive number of pseudo meta-tasks is constructed. To mitigate the noise in the pseudo meta-tasks, we further advocate a robust Transformer-based FSS model with a novel prototype-based cross-attention design. 
Extensive experiments have been conducted on two standard benchmarks, i.e., PASCAL-5i and COCO-20i, and the results show that our method produces impressive performance without any annotations, and is comparable to fully supervised competitors even using only 20% of the annotations. Our code is available at: https://github.com/SSSKYue/UMTFSS. \ No newline at end of file diff --git a/data/2024/aaai/Labels Need Prompts Too: Mask Matching for Natural Language Understanding Tasks b/data/2024/aaai/Labels Need Prompts Too: Mask Matching for Natural Language Understanding Tasks new file mode 100644 index 0000000000..6e412163ad --- /dev/null +++ b/data/2024/aaai/Labels Need Prompts Too: Mask Matching for Natural Language Understanding Tasks @@ -0,0 +1 @@ +Textual label names (descriptions) are typically semantically rich in many natural language understanding (NLU) tasks. In this paper, we incorporate the prompting methodology, which is widely used to enrich model input, into the label side for the first time. Specifically, we propose a Mask Matching method, which equips an input with a prompt and its label with another, and then makes predictions by matching their mask representations. We evaluate our method extensively on 8 NLU tasks with 14 datasets. The experimental results show that Mask Matching significantly outperforms its counterparts of fine-tuning and conventional prompt-tuning, setting up state-of-the-art performances in several datasets. Mask Matching is particularly good at handling NLU tasks with large label counts and informative label names. As pioneering efforts that investigate the label-side prompt, we also discuss open issues for future study. \ No newline at end of file diff --git a/data/2024/aaai/LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement b/data/2024/aaai/LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement new file mode 100644 index 0000000000..f1a2e2029c --- /dev/null +++ b/data/2024/aaai/LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement @@ -0,0 +1,3 @@ +Understanding road structures is crucial for autonomous driving. Intricate road structures are often depicted using lane graphs, which include centerline curves and connections forming a Directed Acyclic Graph (DAG). Accurate extraction of lane graphs relies on precisely estimating vertex and edge information within the DAG. +Recent research highlights Transformer-based language models' impressive sequence prediction abilities, making them effective for learning graph representations when graph data are encoded as sequences. However, existing studies focus mainly on modeling vertices explicitly, leaving edge information simply embedded in the network. +Consequently, these approaches fall short in the task of lane graph extraction. To address this, we introduce LaneGraph2Seq, a novel approach for lane graph extraction. It leverages a language model with vertex-edge encoding and connectivity enhancement. Our serialization strategy includes a vertex-centric depth-first traversal and a concise edge-based partition sequence. Additionally, we use classifier-free guidance combined with nucleus sampling to improve lane connectivity. We validate our method on prominent datasets, nuScenes and Argoverse 2, showcasing consistent and compelling results. 
Our LaneGraph2Seq approach demonstrates superior performance compared to state-of-the-art techniques in lane graph extraction. \ No newline at end of file diff --git a/data/2024/aaai/Language-Guided Transformer for Federated Multi-Label Classification b/data/2024/aaai/Language-Guided Transformer for Federated Multi-Label Classification new file mode 100644 index 0000000000..f44d534487 --- /dev/null +++ b/data/2024/aaai/Language-Guided Transformer for Federated Multi-Label Classification @@ -0,0 +1 @@ +Federated Learning (FL) is an emerging paradigm that enables multiple users to collaboratively train a robust model in a privacy-preserving manner without sharing their private data. Most existing approaches of FL only consider traditional single-label image classification, ignoring the impact when transferring the task to multi-label image classification. Nevertheless, it is still challenging for FL to deal with user heterogeneity in their local data distribution in the real-world FL scenario, and this issue becomes even more severe in multi-label image classification. Inspired by the recent success of Transformers in centralized settings, we propose a novel FL framework for multi-label classification. Since partial label correlation may be observed by local clients during training, direct aggregation of locally updated models would not produce satisfactory performances. Thus, we propose a novel FL framework of Language-Guided Transformer (FedLGT) to tackle this challenging task, which aims to exploit and transfer knowledge across different clients for learning a robust global model. Through extensive experiments on various multi-label datasets (e.g., FLAIR, MS-COCO, etc.), we show that our FedLGT is able to achieve satisfactory performance and outperforms standard FL techniques under multi-label FL scenarios. Code is available at https://github.com/Jack24658735/FedLGT. \ No newline at end of file diff --git a/data/2024/aaai/Large Language Models Are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales b/data/2024/aaai/Large Language Models Are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales new file mode 100644 index 0000000000..561f9a28a6 --- /dev/null +++ b/data/2024/aaai/Large Language Models Are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales @@ -0,0 +1 @@ +Machine reasoning has made great progress in recent years owing to large language models (LLMs). In the clinical domain, however, most NLP-driven projects mainly focus on clinical classification or reading comprehension, and under-explore clinical reasoning for disease diagnosis due to the expensive rationale annotation with clinicians. In this work, we present a "reasoning-aware" diagnosis framework that rationalizes the diagnostic process via prompt-based learning in a time- and labor-efficient manner, and learns to reason over the prompt-generated rationales. Specifically, we address the clinical reasoning for disease diagnosis, where the LLM generates diagnostic rationales providing its insight on presented patient data and the reasoning path towards the diagnosis, namely Clinical Chain-of-Thought (Clinical CoT). We empirically demonstrate LLMs/LMs' ability of clinical reasoning via extensive experiments and analyses on both rationale generation and disease diagnosis in various settings. 
We further propose a novel set of criteria for evaluating machine-generated rationales' potential for real-world clinical settings, facilitating and benefiting future research in this area. \ No newline at end of file diff --git a/data/2024/aaai/Large Language Models Are Neurosymbolic Reasoners b/data/2024/aaai/Large Language Models Are Neurosymbolic Reasoners new file mode 100644 index 0000000000..37ecb64f06 --- /dev/null +++ b/data/2024/aaai/Large Language Models Are Neurosymbolic Reasoners @@ -0,0 +1 @@ +A wide range of real-world applications is characterized by its symbolic nature, necessitating a strong capability for symbolic reasoning. This paper investigates the potential application of Large Language Models (LLMs) as symbolic reasoners. We focus on text-based games, significant benchmarks for agents with natural language capabilities, particularly in symbolic tasks like math, map reading, sorting, and applying common sense in text-based worlds. To facilitate these agents, we propose an LLM agent designed to tackle symbolic challenges and achieve in-game objectives. We begin by initializing the LLM agent and informing it of its role. The agent then receives observations and a set of valid actions from the text-based games, along with a specific symbolic module. With these inputs, the LLM agent chooses an action and interacts with the game environments. Our experimental results demonstrate that our method significantly enhances the capability of LLMs as automated agents for symbolic reasoning, and our LLM agent is effective in text-based games involving symbolic tasks, achieving an average performance of 88% across all tasks. \ No newline at end of file diff --git a/data/2024/aaai/Large Language Models as Planning Domain Generators (Student Abstract) b/data/2024/aaai/Large Language Models as Planning Domain Generators (Student Abstract) new file mode 100644 index 0000000000..16df4ed635 --- /dev/null +++ b/data/2024/aaai/Large Language Models as Planning Domain Generators (Student Abstract) @@ -0,0 +1,14 @@ +The creation of planning models, and in particular domain +models, is among the last bastions of tasks that require +extensive manual labor in AI planning; it is desirable to simplify +this process for the sake of making planning more +accessible. To this end, we investigate whether large language +models (LLMs) can be used to generate planning domain models +from textual descriptions. We propose a novel task for this +as well as a means of automated evaluation for generated +domains by comparing the sets of plans for domain instances. +Finally, we perform an empirical analysis of 7 large language +models, including coding and chat models across 9 different +planning domains. Our results show that LLMs, particularly +larger ones, exhibit some level of proficiency in generating +correct planning domains from natural language descriptions. \ No newline at end of file diff --git a/data/2024/aaai/Large Occluded Human Image Completion via Image-Prior Cooperating b/data/2024/aaai/Large Occluded Human Image Completion via Image-Prior Cooperating new file mode 100644 index 0000000000..fcd64409b8 --- /dev/null +++ b/data/2024/aaai/Large Occluded Human Image Completion via Image-Prior Cooperating @@ -0,0 +1 @@ +The completion of large occluded human body images poses a unique challenge for general image completion methods. The complex shape variations of human bodies make it difficult to establish a consistent understanding of their structures. 
Furthermore, as human vision is highly sensitive to human bodies, even slight artifacts can significantly compromise image fidelity. To address these challenges, we propose a large occluded human image completion (LOHC) model based on a novel image-prior cooperative completion strategy. Our model leverages human segmentation maps as a prior, and completes the image and prior simultaneously. Compared to the widely adopted prior-then-image completion strategy for object completion, this cooperative completion process fosters more effective interaction between the prior and image information. Our model consists of two stages. The first stage is a transformer-based auto-regressive network that predicts the overall structure of the missing area by generating a coarse completed image at a lower resolution. The second stage is a convolutional network that refines the coarse image. As the coarse result may not always be accurate, we propose a Dynamic Fusion Module (DFM) to selectively fuse the useful features from the coarse image with the original input at spatial and channel levels. Through extensive experiments, we demonstrate our method’s superior performance compared to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Large-Scale Multi-Robot Coverage Path Planning via Local Search b/data/2024/aaai/Large-Scale Multi-Robot Coverage Path Planning via Local Search new file mode 100644 index 0000000000..d4652c8d87 --- /dev/null +++ b/data/2024/aaai/Large-Scale Multi-Robot Coverage Path Planning via Local Search @@ -0,0 +1,2 @@ +We study graph-based Multi-Robot Coverage Path Planning (MCPP) that aims to compute coverage paths for multiple robots to cover all vertices of a given 2D grid terrain graph G. Existing graph-based MCPP algorithms first compute a tree cover on G---a forest of multiple trees that cover all vertices---and then employ the Spanning Tree Coverage (STC) paradigm to generate coverage paths on the decomposed graph D of the terrain graph G by circumnavigating the edges of the computed trees, aiming to optimize the makespan (i.e., the maximum coverage path cost among all robots). 
\ No newline at end of file diff --git a/data/2024/aaai/Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization b/data/2024/aaai/Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization new file mode 100644 index 0000000000..9a2bd90df1 --- /dev/null +++ b/data/2024/aaai/Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization @@ -0,0 +1 @@ +Distributionally robust optimization (DRO) is a powerful framework for training robust models against data distribution shifts. This paper focuses on constrained DRO, which has an explicit characterization of the robustness level. Existing studies on constrained DRO mostly focus on convex loss function, and exclude the practical and challenging case with non-convex loss function, e.g., neural network. This paper develops a stochastic algorithm and its performance analysis for non-convex constrained DRO. The computational complexity of our stochastic algorithm at each iteration is independent of the overall dataset size, and thus is suitable for large-scale applications. We focus on the general Cressie-Read family divergence defined uncertainty set which includes chi^2-divergences as a special case. We prove that our algorithm finds an epsilon-stationary point with an improved computational complexity than existing methods. Our method also applies to the smoothed conditional value at risk (CVaR) DRO. \ No newline at end of file diff --git a/data/2024/aaai/Latent Diffusion Transformer for Probabilistic Time Series Forecasting b/data/2024/aaai/Latent Diffusion Transformer for Probabilistic Time Series Forecasting new file mode 100644 index 0000000000..197bb040ed --- /dev/null +++ b/data/2024/aaai/Latent Diffusion Transformer for Probabilistic Time Series Forecasting @@ -0,0 +1 @@ +The probability prediction of multivariate time series is a notoriously challenging but practical task. This research proposes to condense high-dimensional multivariate time series forecasting into a problem of latent space time series generation, to improve the expressiveness of each timestamp and make forecasting more manageable. To solve the problem that the existing work is hard to extend to high-dimensional multivariate time series, we present a latent multivariate time series diffusion framework called Latent Diffusion Transformer (LDT), which consists of a symmetric statistics-aware autoencoder and a diffusion-based conditional generator, to implement this idea. Through careful design, the time series autoencoder can compress multivariate timestamp patterns into a concise latent representation by considering dynamic statistics. Then, the diffusion-based conditional generator is able to efficiently generate realistic multivariate timestamp values on a continuous latent space under a novel self-conditioning guidance which is modeled in a non-autoregressive way. Extensive experiments demonstrate that our model achieves state-of-the-art performance on many popular high-dimensional multivariate time series datasets. \ No newline at end of file diff --git a/data/2024/aaai/Latent Space Editing in Transformer-Based Flow Matching b/data/2024/aaai/Latent Space Editing in Transformer-Based Flow Matching new file mode 100644 index 0000000000..91c9998d96 --- /dev/null +++ b/data/2024/aaai/Latent Space Editing in Transformer-Based Flow Matching @@ -0,0 +1 @@ +This paper strives for image editing via generative models. 
Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but their latent structure and editing ability are as of yet unknown. Hence, we adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call u-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. Our code will be publicly available at https://taohu.me/lfm/ \ No newline at end of file diff --git a/data/2024/aaai/LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction b/data/2024/aaai/LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction new file mode 100644 index 0000000000..326b5ce84f --- /dev/null +++ b/data/2024/aaai/LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction @@ -0,0 +1 @@ +Data contamination in evaluation is getting increasingly prevalent with the emergence of language models pre-trained on super large, automatically crawled corpora. This problem leads to significant challenges in the accurate assessment of model capabilities and generalisations. In this paper, we propose LatestEval, an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations. LatestEval avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models. We develop the LatestEval automated pipeline to 1) gather the latest texts; 2) identify key information, and 3) construct questions targeting the information while removing the existing answers from the context. This encourages models to infer the answers themselves based on the remaining context, rather than just copy-paste. Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks, suggesting a significantly reduced risk of data contamination and leading to a more robust evaluation. Data and code are publicly available at: https://github.com/liyucheng09/LatestEval. 
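To make the LatestEval test-construction step above concrete, here is a toy Python sketch under strong simplifying assumptions: a single regex for four-digit years stands in for "key information", and a fixed cloze-style question replaces the paper's question generation. None of the function or variable names come from the LatestEval codebase.

```python
import re

def make_cloze_item(passage: str):
    """Pick a four-digit year as the 'key information', remove it from the
    context, and ask a question answerable only from the remaining text."""
    match = re.search(r"\b(19|20)\d{2}\b", passage)
    if match is None:
        return None
    answer = match.group(0)
    context = passage[:match.start()] + "____" + passage[match.end():]
    question = "Which year fills the blank in the passage above?"
    return {"context": context, "question": question, "answer": answer}

# Toy usage on a made-up, recently published sentence.
print(make_cloze_item("The observatory, completed in 2023, sits on a ridge above the valley."))
```

The actual pipeline additionally restricts passages to a recent publication window and targets richer kinds of key information than dates; this sketch only illustrates the remove-the-answer idea.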
\ No newline at end of file diff --git a/data/2024/aaai/Layer Attack Unlearning: Fast and Accurate Machine Unlearning via Layer Level Attack and Knowledge Distillation b/data/2024/aaai/Layer Attack Unlearning: Fast and Accurate Machine Unlearning via Layer Level Attack and Knowledge Distillation new file mode 100644 index 0000000000..3ad6913c2c --- /dev/null +++ b/data/2024/aaai/Layer Attack Unlearning: Fast and Accurate Machine Unlearning via Layer Level Attack and Knowledge Distillation @@ -0,0 +1 @@ +Recently, serious concerns have been raised about the privacy issues related to training datasets in machine learning algorithms when they include personal data. Various regulations in different countries, including the GDPR, grant individuals the right to have their personal data erased, known as ‘the right to be forgotten’ or ‘the right to erasure’. However, there has been less research on effectively and practically deleting the requested personal data from the training set while not jeopardizing the overall machine learning performance. In this work, we propose a novel machine unlearning paradigm at the layer level called layer attack unlearning, which is highly accurate and fast compared to existing machine unlearning algorithms. We introduce the Partial-PGD algorithm to efficiently locate the samples to forget. In addition, inspired by the Forward-Forward algorithm, we use only the last layer of the model for the unlearning process. Lastly, we use Knowledge Distillation (KD) to reliably learn the decision boundaries from the teacher using soft label information to improve accuracy. We conducted extensive experiments with SOTA machine unlearning models and demonstrated the effectiveness of our approach for accuracy and end-to-end unlearning performance. \ No newline at end of file diff --git a/data/2024/aaai/Layer Collaboration in the Forward-Forward Algorithm b/data/2024/aaai/Layer Collaboration in the Forward-Forward Algorithm new file mode 100644 index 0000000000..53bcc41896 --- /dev/null +++ b/data/2024/aaai/Layer Collaboration in the Forward-Forward Algorithm @@ -0,0 +1 @@ +Backpropagation, which uses the chain rule, is the de-facto standard algorithm for optimizing neural networks nowadays. Recently, Hinton (2022) proposed the forward-forward algorithm, a promising alternative that optimizes neural nets layer-by-layer, without propagating gradients throughout the network. Although such an approach has several advantages over backpropagation and shows promising results, the fact that each layer is being trained independently limits the optimization process. Specifically, it prevents the network's layers from collaborating to learn complex and rich features. In this work, we study layer collaboration in the forward-forward algorithm. We show that the current version of the forward-forward algorithm is suboptimal when considering information flow in the network, resulting in a lack of collaboration between layers of the network. We propose an improved version that supports layer collaboration to better utilize the network structure, while not requiring any additional assumptions or computations. We empirically demonstrate the efficacy of the proposed version when considering both information flow and objective metrics. Additionally, we provide a theoretical motivation for the proposed method, inspired by functional entropy theory. 
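For readers unfamiliar with the layer-local objective referenced above, here is a minimal numpy sketch of the per-layer "goodness" score from the forward-forward algorithm (Hinton 2022): each layer is trained so that positive samples have goodness above a threshold and negative samples below it. The layer-collaboration mechanism proposed in the paper is not shown, and the loss form below is one common choice rather than the paper's exact formulation.

```python
import numpy as np

def layer_goodness(weights, x):
    """Goodness of one layer: sum of squared ReLU activations per sample."""
    h = np.maximum(x @ weights, 0.0)
    return np.sum(h ** 2, axis=-1), h

def ff_layer_loss(weights, x_pos, x_neg, theta=2.0):
    """Logistic loss pushing positive goodness above theta and negative below it."""
    g_pos, _ = layer_goodness(weights, x_pos)
    g_neg, _ = layer_goodness(weights, x_neg)
    loss_pos = np.log1p(np.exp(-(g_pos - theta)))   # want g_pos > theta
    loss_neg = np.log1p(np.exp(g_neg - theta))      # want g_neg < theta
    return np.mean(loss_pos) + np.mean(loss_neg)
```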
\ No newline at end of file diff --git a/data/2024/aaai/Layer Compression of Deep Networks with Straight Flows b/data/2024/aaai/Layer Compression of Deep Networks with Straight Flows new file mode 100644 index 0000000000..fa6867a658 --- /dev/null +++ b/data/2024/aaai/Layer Compression of Deep Networks with Straight Flows @@ -0,0 +1,7 @@ +Very deep neural networks lead to significantly better performance on various real tasks. However, such depth usually causes slow inference and makes models hard to deploy on real-world devices. How to reduce the number of layers to save memory and accelerate inference is therefore an appealing topic. + In this work, we introduce an intermediate objective, a continuous-time network, before distilling deep networks into shallow networks. + First, we distill a given deep network into a continuous-time neural flow model, which can be discretized with an ODE solver and whose inference requires passing through the network multiple times. + By forcing the flow transport trajectory to be straight lines, we find that it is easier to compress the infinite-step model into a one-step neural flow model, which only requires passing through the flow model once. + Second, we refine the one-step flow model together with the final head layer via knowledge distillation; finally, we can replace the given deep network with this one-step flow network. + Empirically, we demonstrate that our method outperforms direct distillation and other baselines on different model architectures (e.g., ResNet, ViT) on image classification and semantic segmentation tasks. + We also show that our distilled model naturally serves as an early-exit dynamic inference model. \ No newline at end of file diff --git a/data/2024/aaai/Layer-Wise Representation Fusion for Compositional Generalization b/data/2024/aaai/Layer-Wise Representation Fusion for Compositional Generalization new file mode 100644 index 0000000000..d2e5d74c7e --- /dev/null +++ b/data/2024/aaai/Layer-Wise Representation Fusion for Compositional Generalization @@ -0,0 +1 @@ +Existing neural models have been shown to struggle with compositional generalization (CG), i.e., the ability to systematically generalize to unseen compositions of seen components. A key reason for failure on CG is that the syntactic and semantic representations of sequences in the uppermost layers of both the encoder and decoder are entangled. However, previous work concentrates on separating the learning of syntax and semantics instead of exploring the reasons behind the representation entanglement (RE) problem in order to solve it. We explain why it exists by analyzing the representation evolving mechanism from the bottom to the top of the Transformer layers. We find that the ``shallow'' residual connections within each layer fail to fuse previous layers' information effectively, leading to information forgetting between layers and, in turn, the RE problem. Inspired by this, we propose LRF, a novel Layer-wise Representation Fusion framework for CG, which learns to fuse previous layers' information back into the encoding and decoding process effectively by introducing a fuse-attention module at each encoder and decoder layer. LRF achieves promising results on two realistic benchmarks, empirically demonstrating the effectiveness of our proposal. Codes are available at https://github.com/thinkaboutzero/LRF. 
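A rough sketch of the layer-wise fusion idea described above, not the exact LRF module: instead of relying only on a plain residual connection, the current hidden state attends over the outputs of all previous layers and the result is folded back in. Shapes, scaling, and the additive combination are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_previous_layers(current, history):
    """current: (d,) hidden state of the present layer.
    history: list of (d,) outputs from all earlier layers.
    Returns a fused representation via simple dot-product attention."""
    H = np.stack(history)                         # (L, d) stack of earlier layers
    scores = H @ current / np.sqrt(current.size)  # similarity of each layer to the current state
    weights = softmax(scores)                     # attention over previous layers
    fused = weights @ H                           # weighted sum of earlier representations
    return current + fused                        # fold the fused memory back in
```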
\ No newline at end of file diff --git a/data/2024/aaai/Learn How to See: Collaborative Embodied Learning for Object Detection and Camera Adjusting b/data/2024/aaai/Learn How to See: Collaborative Embodied Learning for Object Detection and Camera Adjusting new file mode 100644 index 0000000000..38e54cedd5 --- /dev/null +++ b/data/2024/aaai/Learn How to See: Collaborative Embodied Learning for Object Detection and Camera Adjusting @@ -0,0 +1 @@ +Passive object detectors, trained on large-scale static datasets, often overlook the feedback from object detection to image acquisition. Embodied vision and active detection mitigate this issue by interacting with the environment. Nevertheless, the materialization of activeness hinges on resource-intensive data collection and annotation. To tackle these challenges, we propose a collaborative student-teacher framework. Technically, a replay buffer is built based on the trajectory data to encapsulate the relationship of state, action, and reward. In addition, the student network diverges from reinforcement learning by redefining sequential decision pathways using a GPT structure enriched with causal self-attention. Moreover, the teacher network establishes a subtle state-reward mapping based on adjacent benefit differences, providing reliable rewards for student adaptively self-tuning with the vast unlabeled replay buffer data. Additionally, an innovative yet straightforward benefit reference value is proposed within the teacher network, adding to its effectiveness and simplicity. Leveraging a flexible replay buffer and embodied collaboration between teacher and student, the framework learns to see before detection with shallower features and shorter inference steps. Experiments highlight significant advantages of our algorithm over state-of-the-art detectors. The code is released at https://github.com/lydonShen/STF. \ No newline at end of file diff --git a/data/2024/aaai/Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation b/data/2024/aaai/Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation new file mode 100644 index 0000000000..24fb67f6ed --- /dev/null +++ b/data/2024/aaai/Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation @@ -0,0 +1 @@ +We propose a novel unsupervised method to autoregressively generate videos from a single frame and a sparse motion input. Our trained model can generate unseen realistic object-to-object interactions. Although our model has never been given the explicit segmentation and motion of each object in the scene during training, it is able to implicitly separate their dynamics and extents. Key components in our method are the randomized conditioning scheme, the encoding of the input motion control, and the randomized and sparse sampling to enable generalization to out of distribution but realistic correlations. Our model, which we call YODA, has therefore the ability to move objects without physically touching them. Through extensive qualitative and quantitative evaluations on several datasets, we show that YODA is on par with or better than state of the art video generation prior work in terms of both controllability and video quality. 
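To make the "sparse motion input" mentioned above concrete, here is a toy encoding one might use; the paper's actual conditioning differs in detail, and the channel layout below is an assumption. User drags are rasterized into a sparse flow map plus a validity mask that a video generator could consume as conditioning.

```python
import numpy as np

def encode_sparse_motion(drags, hw=(64, 64)):
    """drags: list of ((y, x), (dy, dx)) pixel positions with displacement vectors.
    Returns a (3, H, W) array: two flow channels and one validity-mask channel."""
    h, w = hw
    control = np.zeros((3, h, w), dtype=np.float32)
    for (y, x), (dy, dx) in drags:
        control[0, y, x] = dy
        control[1, y, x] = dx
        control[2, y, x] = 1.0   # marks where a control signal is present
    return control

# Two hypothetical drags: push one object right, another slightly up.
control = encode_sparse_motion([((10, 12), (0.0, 5.0)), ((40, 30), (-3.0, 0.0))])
```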
\ No newline at end of file diff --git a/data/2024/aaai/Learn to Follow: Decentralized Lifelong Multi-Agent Pathfinding via Planning and Learning b/data/2024/aaai/Learn to Follow: Decentralized Lifelong Multi-Agent Pathfinding via Planning and Learning new file mode 100644 index 0000000000..dee14d1af6 --- /dev/null +++ b/data/2024/aaai/Learn to Follow: Decentralized Lifelong Multi-Agent Pathfinding via Planning and Learning @@ -0,0 +1 @@ +The Multi-agent Pathfinding (MAPF) problem generally asks for a set of conflict-free paths for a set of agents confined to a graph and is typically solved in a centralized fashion. Conversely, in this work, we investigate the decentralized MAPF setting, in which the central controller that possesses all the information on the agents' locations and goals is absent and the agents have to sequentially decide the actions on their own without having access to the full state of the environment. We focus on the practically important lifelong variant of MAPF, which involves continuously assigning new goals to the agents upon arrival at the previous ones. To address this complex problem, we propose a method that integrates two complementary approaches: planning with heuristic search and reinforcement learning through policy optimization. Planning is utilized to construct and re-plan individual paths. We enhance our planning algorithm with a dedicated technique tailored to avoid congestion and increase the throughput of the system. We employ reinforcement learning to discover the collision avoidance policies that effectively guide the agents along the paths. The policy is implemented as a neural network and is effectively trained without any reward-shaping or external guidance. We evaluate our method on a wide range of setups, comparing it to state-of-the-art solvers. The results show that our method consistently outperforms the learnable competitors, showing higher throughput and better ability to generalize to maps that were unseen at the training stage. Moreover, our solver outperforms a rule-based one in terms of throughput and is an order of magnitude faster than a state-of-the-art search-based solver. The code is available at https://github.com/AIRI-Institute/learn-to-follow. \ No newline at end of file diff --git a/data/2024/aaai/Learning Accurate and Bidirectional Transformation via Dynamic Embedding Transportation for Cross-Domain Recommendation b/data/2024/aaai/Learning Accurate and Bidirectional Transformation via Dynamic Embedding Transportation for Cross-Domain Recommendation new file mode 100644 index 0000000000..dfd0e69ec1 --- /dev/null +++ b/data/2024/aaai/Learning Accurate and Bidirectional Transformation via Dynamic Embedding Transportation for Cross-Domain Recommendation @@ -0,0 +1,2 @@ +With the rapid development of Internet and Web techniques, Cross-Domain Recommendation (CDR) models have been widely explored for resolving the data-sparsity +and cold-start problems. Meanwhile, most CDR models need to utilize explicit domain-shareable information (e.g., overlapped users or items) for knowledge transfer across domains. However, this assumption may not always be satisfied, since users and items are often entirely non-overlapping in real practice. The performance of many previous works will be severely impaired when such domain-shareable information is not available. 
To address the aforementioned issues, we propose the Joint Preference Exploration and Dynamic Embedding Transportation model (JPEDET) in this paper which is a novel framework for solving the CDR problem when users and items are non-overlapped. JPEDET includes two main modules, i.e., joint preference exploration module and dynamic embedding transportation module. The joint preference exploration module aims to fuse rating and review information for modelling user preferences. The dynamic embedding transportation module is set to share knowledge via neural ordinary equations for dual transformation across domains. Moreover, we innovatively propose the dynamic transport flow equipped with linear interpolation guidance on barycentric Wasserstein path for achieving accurate and bidirectional transformation. Our empirical study on Amazon datasets demonstrates that JPEDET significantly outperforms the state-of-the-art models under the CDR setting. \ No newline at end of file diff --git a/data/2024/aaai/Learning Broadcast Protocols b/data/2024/aaai/Learning Broadcast Protocols new file mode 100644 index 0000000000..0fdc37a52e --- /dev/null +++ b/data/2024/aaai/Learning Broadcast Protocols @@ -0,0 +1 @@ +The problem of learning a computational model from examples has been receiving growing attention. For the particularly challenging problem of learning models of distributed systems, existing results are restricted to models with a fixed number of interacting processes. In this work we look for the first time (to the best of our knowledge) at the problem of learning a distributed system with an arbitrary number of processes, assuming only that there exists a cutoff, i.e., a number of processes that is sufficient to produce all observable behaviors. Specifically, we consider fine broadcast protocols, these are broadcast protocols (BPs) with a finite cutoff and no hidden states. We provide a learning algorithm that can infer a correct BP from a sample that is consistent with a fine BP, and a minimal equivalent BP if the sample is sufficiently complete. On the negative side we show that (a) characteristic sets of exponential size are unavoidable, (b) the consistency problem for fine BPs is NP hard, and (c) that fine BPs are not polynomially predictable. \ No newline at end of file diff --git a/data/2024/aaai/Learning Cluster-Wise Anchors for Multi-View Clustering b/data/2024/aaai/Learning Cluster-Wise Anchors for Multi-View Clustering new file mode 100644 index 0000000000..97ee5a9ff7 --- /dev/null +++ b/data/2024/aaai/Learning Cluster-Wise Anchors for Multi-View Clustering @@ -0,0 +1 @@ +Due to its effectiveness and efficiency, anchor based multi-view clustering (MVC) has recently attracted much attention. Most existing approaches try to adaptively learn anchors to construct an anchor graph for clustering. However, they generally focus on improving the diversity among anchors by using orthogonal constraint and ignore the underlying semantic relations, which may make the anchors not representative and discriminative enough. To address this problem, we propose an adaptive Cluster-wise Anchor learning based MVC method, CAMVC for short. We first make an anchor cluster assumption that supposes the prior cluster structure of target anchors by pre-defining a consensus cluster indicator matrix. 
Based on the prior knowledge, an explicit cluster structure of latent anchors is enforced by learning diverse cluster centroids, which can explore both inter-cluster diversity and intra-cluster consistency of anchors, and improve the discriminability of the subspace representation. Extensive experimental results demonstrate the effectiveness and superiority of our proposed method compared with some state-of-the-art MVC approaches. \ No newline at end of file diff --git a/data/2024/aaai/Learning Continuous Implicit Field with Local Distance Indicator for Arbitrary-Scale Point Cloud Upsampling b/data/2024/aaai/Learning Continuous Implicit Field with Local Distance Indicator for Arbitrary-Scale Point Cloud Upsampling new file mode 100644 index 0000000000..2dfd31c4ce --- /dev/null +++ b/data/2024/aaai/Learning Continuous Implicit Field with Local Distance Indicator for Arbitrary-Scale Point Cloud Upsampling @@ -0,0 +1 @@ +Point cloud upsampling aims to generate dense and uniformly distributed point sets from a sparse point cloud, which plays a critical role in 3D computer vision. Previous methods typically split a sparse point cloud into several local patches, upsample patch points, and merge all upsampled patches. However, these methods often produce holes, outliers, or non-uniformity due to the splitting and merging process, which does not maintain consistency among local patches. To address these issues, we propose a novel approach that learns an unsigned distance field guided by local priors for point cloud upsampling. Specifically, we train a local distance indicator (LDI) that predicts the unsigned distance from a query point to a local implicit surface. Utilizing the learned LDI, we learn an unsigned distance field to represent the sparse point cloud with patch consistency. At inference time, we randomly sample queries around the sparse point cloud, and project these query points onto the zero-level set of the learned implicit field to generate a dense point cloud. We justify that the implicit field is naturally continuous, which inherently enables the application of arbitrary-scale upsampling without necessarily retraining for various scales. We conduct comprehensive experiments on both synthetic data and real scans, and report state-of-the-art results under widely used benchmarks. Project page: https://lisj575.github.io/APU-LDI \ No newline at end of file diff --git a/data/2024/aaai/Learning Deformable Hypothesis Sampling for Accurate PatchMatch Multi-View Stereo b/data/2024/aaai/Learning Deformable Hypothesis Sampling for Accurate PatchMatch Multi-View Stereo new file mode 100644 index 0000000000..5268346274 --- /dev/null +++ b/data/2024/aaai/Learning Deformable Hypothesis Sampling for Accurate PatchMatch Multi-View Stereo @@ -0,0 +1 @@ +This paper introduces a learnable Deformable Hypothesis Sampler (DeformSampler) to address the challenging issue of noisy depth estimation in faithful PatchMatch multi-view stereo (MVS). We observe that the heuristic depth hypothesis sampling modes employed by PatchMatch MVS solvers are insensitive to (i) the piece-wise smooth distribution of depths across the object surface and (ii) the implicit multi-modal distribution of depth prediction probabilities along the ray direction on the surface points. 
Accordingly, we develop DeformSampler to learn distribution-sensitive sample spaces to (i) propagate depths consistent with the scene's geometry across the object surface and (ii) fit a Laplace Mixture model that approaches the point-wise probabilities distribution of the actual depths along the ray direction. We integrate DeformSampler into a learnable PatchMatch MVS system to enhance depth estimation in challenging areas, such as piece-wise discontinuous surface boundaries and weakly-textured regions. Experimental results on DTU and Tanks & Temples datasets demonstrate its superior performance and generalization capabilities compared to state-of-the-art competitors. Code is available at https://github.com/Geo-Tell/DS-PMNet. \ No newline at end of file diff --git a/data/2024/aaai/Learning Dense Correspondence for NeRF-Based Face Reenactment b/data/2024/aaai/Learning Dense Correspondence for NeRF-Based Face Reenactment new file mode 100644 index 0000000000..effd764212 --- /dev/null +++ b/data/2024/aaai/Learning Dense Correspondence for NeRF-Based Face Reenactment @@ -0,0 +1 @@ +Face reenactment is challenging due to the need to establish dense correspondence between various face representations for motion transfer. Recent studies have utilized Neural Radiance Field (NeRF) as fundamental representation, which further enhanced the performance of multi-view face reenactment in photo-realism and 3D consistency. However, establishing dense correspondence between different face NeRFs is non-trivial, because implicit representations lack ground-truth correspondence annotations like mesh-based 3D parametric models (e.g., 3DMM) with index-aligned vertexes. Although aligning 3DMM space with NeRF-based face representations can realize motion control, it is sub-optimal for their limited face-only modeling and low identity fidelity. Therefore, we are inspired to ask: Can we learn the dense correspondence between different NeRF-based face representations without a 3D parametric model prior? To address this challenge, we propose a novel framework, which adopts tri-planes as fundamental NeRF representation and decomposes face tri-planes into three components: canonical tri-planes, identity deformations, and motion. In terms of motion control, our key contribution is proposing a Plane Dictionary (PlaneDict) module, which efficiently maps the motion conditions to a linear weighted addition of learnable orthogonal plane bases. To the best of our knowledge, our framework is the first method that achieves one-shot multi-view face reenactment without a 3D parametric model prior. Extensive experiments demonstrate that we produce better results in fine-grained motion control and identity preservation than previous methods. \ No newline at end of file diff --git a/data/2024/aaai/Learning Diffusions under Uncertainty b/data/2024/aaai/Learning Diffusions under Uncertainty new file mode 100644 index 0000000000..ca083f8435 --- /dev/null +++ b/data/2024/aaai/Learning Diffusions under Uncertainty @@ -0,0 +1 @@ +To infer a diffusion network based on observations from historical diffusion processes, existing approaches assume that observation data contain exact occurrence time of each node infection, or at least the eventual infection statuses of nodes in each diffusion process. They determine potential influence relationships between nodes by identifying frequent sequences, or statistical correlations, among node infections. 
In some real-world settings, such as the spread of epidemics, tracing exact infection times is often infeasible due to a high cost; even obtaining precise infection statuses of nodes is a challenging task, since observable symptoms such as headaches only partially reveal a node’s true status. In this work, we investigate how to effectively infer a diffusion network from observation data with uncertainty. Provided with only probabilistic information about node infection statuses, we formulate the problem of diffusion network inference as a constrained nonlinear regression w.r.t. the probabilistic data. An alternating maximization method is designed to solve this regression problem iteratively, and the improvement of solution quality in each iteration can be theoretically guaranteed. Empirical studies are conducted on both synthetic and real-world networks, and the results verify the effectiveness and efficiency of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Learning Discrete-Time Major-Minor Mean Field Games b/data/2024/aaai/Learning Discrete-Time Major-Minor Mean Field Games new file mode 100644 index 0000000000..ab4f38aff3 --- /dev/null +++ b/data/2024/aaai/Learning Discrete-Time Major-Minor Mean Field Games @@ -0,0 +1 @@ +Recent techniques based on Mean Field Games (MFGs) allow the scalable analysis of multi-player games with many similar, rational agents. However, standard MFGs remain limited to homogeneous players that weakly influence each other, and cannot model major players that strongly influence other players, severely limiting the class of problems that can be handled. We propose a novel discrete-time version of major-minor MFGs (M3FGs), along with a learning algorithm based on fictitious play and partitioning the probability simplex. Importantly, M3FGs generalize MFGs with common noise and can handle not only random exogenous environment states but also major players. A key challenge is that the mean field is stochastic and not deterministic as in standard MFGs. Our theoretical investigation verifies both the M3FG model and its algorithmic solution, showing firstly the well-posedness of the M3FG model starting from a finite game of interest, and secondly convergence and approximation guarantees of the fictitious play algorithm. Then, we empirically verify the obtained theoretical results, ablating some of the theoretical assumptions made, and show successful equilibrium learning in three example problems. Overall, we establish a learning framework for a novel and broad class of tractable games. \ No newline at end of file diff --git a/data/2024/aaai/Learning Discriminative Noise Guidance for Image Forgery Detection and Localization b/data/2024/aaai/Learning Discriminative Noise Guidance for Image Forgery Detection and Localization new file mode 100644 index 0000000000..c3cb3f5423 --- /dev/null +++ b/data/2024/aaai/Learning Discriminative Noise Guidance for Image Forgery Detection and Localization @@ -0,0 +1 @@ +This study introduces a new method for detecting and localizing image forgery by focusing on manipulation traces within the noise domain. We posit that nearly invisible noise in RGB images carries tampering traces, useful for distinguishing and locating forgeries. However, the advancement of tampering technology complicates the direct application of noise for forgery detection, as the noise inconsistency between forged and authentic regions is not fully exploited. 
To tackle this, we develop a two-step discriminative noise-guided approach to explicitly enhance the representation and use of noise inconsistencies, thereby fully exploiting noise information to improve the accuracy and robustness of forgery detection. Specifically, we first enhance the noise discriminability of forged regions compared to authentic ones using a de-noising network and a statistics-based constraint. Then, we merge a model-driven guided filtering mechanism with a data-driven attention mechanism to create a learnable and differentiable noise-guided filter. This sophisticated filter allows us to maintain the edges of forged regions learned from the noise. Comprehensive experiments on multiple datasets demonstrate that our method can reliably detect and localize forgeries, surpassing existing state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Learning Diverse Risk Preferences in Population-Based Self-Play b/data/2024/aaai/Learning Diverse Risk Preferences in Population-Based Self-Play new file mode 100644 index 0000000000..680fadbafe --- /dev/null +++ b/data/2024/aaai/Learning Diverse Risk Preferences in Population-Based Self-Play @@ -0,0 +1 @@ +Among the remarkable successes of Reinforcement Learning (RL), self-play algorithms have played a crucial role in solving competitive games. However, current self-play RL methods commonly optimize the agent to maximize the expected win-rates against its current or historical copies, resulting in a limited strategy style and a tendency to get stuck in local optima. To address this limitation, it is important to improve the diversity of policies, allowing the agent to break stalemates and enhance its robustness when facing with different opponents. In this paper, we present a novel perspective to promote diversity by considering that agents could have diverse risk preferences in the face of uncertainty. To achieve this, we introduce a novel reinforcement learning algorithm called Risk-sensitive Proximal Policy Optimization (RPPO), which smoothly interpolates between worst-case and best-case policy learning, enabling policy learning with desired risk preferences. Furthermore, by seamlessly integrating RPPO with population-based self-play, agents in the population optimize dynamic risk-sensitive objectives using experiences gained from playing against diverse opponents. Our empirical results demonstrate that our method achieves comparable or superior performance in competitive games and, importantly, leads to the emergence of diverse behavioral modes. Code is available at https://github.com/Jackory/RPBT. \ No newline at end of file diff --git a/data/2024/aaai/Learning Domain-Independent Heuristics for Grounded and Lifted Planning b/data/2024/aaai/Learning Domain-Independent Heuristics for Grounded and Lifted Planning new file mode 100644 index 0000000000..812d8b4740 --- /dev/null +++ b/data/2024/aaai/Learning Domain-Independent Heuristics for Grounded and Lifted Planning @@ -0,0 +1 @@ +We present three novel graph representations of planning tasks suitable for learning domain-independent heuristics using Graph Neural Networks (GNNs) to guide search. In particular, to mitigate the issues caused by large grounded GNNs we present the first method for learning domain-independent heuristics with only the lifted representation of a planning task. 
We also provide a theoretical analysis of the expressiveness of our models, showing that some are more powerful than STRIPS-HGN, the only other existing model for learning domain-independent heuristics. Our experiments show that our heuristics generalise to much larger problems than those in the training set, vastly surpassing STRIPS-HGN heuristics. \ No newline at end of file diff --git a/data/2024/aaai/Learning Efficient and Robust Multi-Agent Communication via Graph Information Bottleneck b/data/2024/aaai/Learning Efficient and Robust Multi-Agent Communication via Graph Information Bottleneck new file mode 100644 index 0000000000..35e7023409 --- /dev/null +++ b/data/2024/aaai/Learning Efficient and Robust Multi-Agent Communication via Graph Information Bottleneck @@ -0,0 +1 @@ +Efficient communication learning among agents has been shown to be crucial for cooperative multi-agent reinforcement learning (MARL), as it can promote the action coordination of agents and ultimately improve performance. Graph neural networks (GNNs) provide a general paradigm for communication learning, which considers agents and communication channels as nodes and edges in a graph, with action selection corresponding to node labeling. Under such a paradigm, an agent aggregates information from neighbor agents, which can reduce uncertainty in local decision-making and induce implicit action coordination. However, this communication paradigm is vulnerable to adversarial attacks and noise, and how to learn robust and efficient communication under perturbations has largely not been studied. To this end, this paper introduces a novel Multi-Agent communication mechanism via Graph Information bottleneck (MAGI), which can optimally balance the robustness and expressiveness of the message representation learned by agents. This communication mechanism is aimed at learning the minimal sufficient message representation for an agent by maximizing the mutual information (MI) between the message representation and the selected action, and simultaneously constraining the MI between the message representation and the agent feature. Empirical results demonstrate that MAGI is more robust and efficient than state-of-the-art GNN-based MARL methods. \ No newline at end of file diff --git a/data/2024/aaai/Learning Encodings for Constructive Neural Combinatorial Optimization Needs to Regret b/data/2024/aaai/Learning Encodings for Constructive Neural Combinatorial Optimization Needs to Regret new file mode 100644 index 0000000000..da3d43a180 --- /dev/null +++ b/data/2024/aaai/Learning Encodings for Constructive Neural Combinatorial Optimization Needs to Regret @@ -0,0 +1 @@ +Deep-reinforcement-learning (DRL) based neural combinatorial optimization (NCO) methods have demonstrated efficiency without relying on the guidance of optimal solutions. As the most mainstream among them, the learning constructive heuristic (LCH) achieves high-quality solutions through a rapid autoregressive solution construction process. However, these LCH-based methods are deficient in convergence, and there is still a performance gap compared to optimal solutions. Intuitively, learning to regret some steps in the solution construction process is helpful for training efficiency and network representations. This article proposes a novel regret-based mechanism for an advanced solution construction process. Our method can be applied as a plug-in to any existing LCH-based DRL-NCO method. 
Experimental results demonstrate the capability of our work to enhance the performance of various NCO models. Results also show that the proposed LCH-Regret outperforms the previous modification methods on several typical combinatorial optimization problems. The code and Supplementary File are available at https://github.com/SunnyR7/LCH-Regret. \ No newline at end of file diff --git a/data/2024/aaai/Learning Explicit Contact for Implicit Reconstruction of Hand-Held Objects from Monocular Images b/data/2024/aaai/Learning Explicit Contact for Implicit Reconstruction of Hand-Held Objects from Monocular Images new file mode 100644 index 0000000000..40feb8be9e --- /dev/null +++ b/data/2024/aaai/Learning Explicit Contact for Implicit Reconstruction of Hand-Held Objects from Monocular Images @@ -0,0 +1 @@ +Reconstructing hand-held objects from monocular RGB images is an appealing yet challenging task. In this task, contacts between hands and objects provide important cues for recovering the 3D geometry of the hand-held objects. Though recent works have employed implicit functions to achieve impressive progress, they ignore formulating contacts in their frameworks, which results in producing less realistic object meshes. In this work, we explore how to model contacts in an explicit way to benefit the implicit reconstruction of hand-held objects. Our method consists of two components: explicit contact prediction and implicit shape reconstruction. In the first part, we propose a new subtask of directly estimating 3D hand-object contacts from a single image. The part-level and vertex-level graph-based transformers are cascaded and jointly learned in a coarse-to-fine manner for more accurate contact probabilities. In the second part, we introduce a novel method to diffuse estimated contact states from the hand mesh surface to nearby 3D space and leverage diffused contact probabilities to construct the implicit neural representation for the manipulated object. Benefiting from estimating the interaction patterns between the hand and the object, our method can reconstruct more realistic object meshes, especially for object parts that are in contact with hands. Extensive experiments on challenging benchmarks show that the proposed method outperforms the current state of the arts by a great margin. Our code is publicly available at https://junxinghu.github.io/projects/hoi.html. \ No newline at end of file diff --git a/data/2024/aaai/Learning Fair Policies for Multi-Stage Selection Problems from Observational Data b/data/2024/aaai/Learning Fair Policies for Multi-Stage Selection Problems from Observational Data new file mode 100644 index 0000000000..4d69602d28 --- /dev/null +++ b/data/2024/aaai/Learning Fair Policies for Multi-Stage Selection Problems from Observational Data @@ -0,0 +1 @@ +We consider the problem of learning fair policies for multi-stage selection problems from observational data. This problem arises in several high-stakes domains such as company hiring, loan approval, or bail decisions where outcomes (e.g., career success, loan repayment, recidivism) are only observed for those selected. We propose a multi-stage framework that can be augmented with various fairness constraints, such as demographic parity or equal opportunity. This problem is a highly intractable infinite chance-constrained program involving the unknown joint distribution of covariates and outcomes. 
Motivated by the potential impact of selection decisions on people’s lives and livelihoods, we propose to focus on interpretable linear selection rules. Leveraging tools from causal inference and sample average approximation, we obtain an asymptotically consistent solution to this selection problem by solving a mixed binary conic optimization problem, which can be solved using standard off-the-shelf solvers. We conduct extensive computational experiments on a variety of datasets adapted from the UCI repository on which we show that our proposed approaches can achieve an 11.6% improvement in precision and a 38% reduction in the measure of unfairness compared to the existing selection policy. \ No newline at end of file diff --git a/data/2024/aaai/Learning GAI-Decomposable Utility Models for Multiattribute Decision Making b/data/2024/aaai/Learning GAI-Decomposable Utility Models for Multiattribute Decision Making new file mode 100644 index 0000000000..3a4572f56e --- /dev/null +++ b/data/2024/aaai/Learning GAI-Decomposable Utility Models for Multiattribute Decision Making @@ -0,0 +1 @@ +We propose an approach to learn a multiattribute utility function to model, explain or predict the value system of a Decision Maker. The main challenge of the modelling task is to describe human values and preferences in the presence of interacting attributes while keeping the utility function as simple as possible. We focus on the generalized additive decomposable utility model which allows interactions between attributes while preserving some additive decomposability of the evaluation model. We present a learning approach able to identify the factors of interacting attributes and to learn the utility functions defined on these factors. This approach relies on the determination of a sparse representation of the ANOVA decomposition of the multiattribute utility function using multiple kernel learning. It applies to both continuous and discrete attributes. Numerical tests are performed to demonstrate the practical efficiency of the learning approach. \ No newline at end of file diff --git a/data/2024/aaai/Learning Generalizable and Composable Abstractions for Transfer in Reinforcement Learning b/data/2024/aaai/Learning Generalizable and Composable Abstractions for Transfer in Reinforcement Learning new file mode 100644 index 0000000000..274704ab90 --- /dev/null +++ b/data/2024/aaai/Learning Generalizable and Composable Abstractions for Transfer in Reinforcement Learning @@ -0,0 +1 @@ +Reinforcement Learning (RL) in complex environments presents many challenges: agents require learning concise representations of both environments and behaviors for efficient reasoning and generalizing experiences to new, unseen situations. However, RL approaches can be sample-inefficient and difficult to scale, especially in long-horizon sparse reward settings. To address these issues, the goal of my doctoral research is to develop methods that automatically construct semantically meaningful state and temporal abstractions for efficient transfer and generalization. In my work, I develop hierarchical approaches for learning transferable, generalizable knowledge in the form of symbolically represented options, as well as for integrating search techniques with RL to solve new problems by efficiently composing the learned options. Empirical results show that the resulting approaches effectively learn and transfer knowledge, achieving superior sample efficiency compared to SOTA methods while also enhancing interpretability. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning Generalized Medical Image Segmentation from Decoupled Feature Queries b/data/2024/aaai/Learning Generalized Medical Image Segmentation from Decoupled Feature Queries new file mode 100644 index 0000000000..208b180dfa --- /dev/null +++ b/data/2024/aaai/Learning Generalized Medical Image Segmentation from Decoupled Feature Queries @@ -0,0 +1,5 @@ +Domain generalized medical image segmentation requires models to learn from multiple source domains and generalize well to arbitrary unseen target domains. Such a task is both technically challenging and clinically practical, due to the domain shift problem (i.e., images are collected from different hospitals and scanners). Existing methods focus on either learning shape-invariant representation or reaching consensus among the source domains. An ideal generalized representation is supposed to show similar pattern responses within the same channel for cross-domain images. +However, to deal with the significant distribution discrepancy, the network tends to capture similar patterns by multiple channels, while different cross-domain patterns are also allowed to rest in the same channel. +To address this issue, we propose to leverage channel-wise decoupled deep features as queries. With the aid of the cross-attention mechanism, the long-range dependency between deep and shallow features can be fully mined via self-attention and then guides the learning of the generalized representation. Besides, a relaxed deep whitening transformation is proposed to learn channel-wise decoupled features in a feasible way. The proposed decoupled +feature query (DFQ) scheme can be seamlessly integrated into the Transformer segmentation model in an end-to-end manner. +Extensive experiments show its state-of-the-art performance, notably outperforming the runner-up by 1.31% and 1.98% with the DSC metric on generalized fundus and prostate benchmarks, respectively. Source code is available at https://github.com/BiQiWHU/DFQ. \ No newline at end of file diff --git a/data/2024/aaai/Learning Generalized Segmentation for Foggy-Scenes by Bi-directional Wavelet Guidance b/data/2024/aaai/Learning Generalized Segmentation for Foggy-Scenes by Bi-directional Wavelet Guidance new file mode 100644 index 0000000000..2e0b3ca8fb --- /dev/null +++ b/data/2024/aaai/Learning Generalized Segmentation for Foggy-Scenes by Bi-directional Wavelet Guidance @@ -0,0 +1,12 @@ +Learning scene semantics that can be well generalized to foggy conditions is important for safety-crucial applications such as autonomous driving. +Existing methods need both annotated clear images and foggy images to train a curriculum domain adaptation model. +Unfortunately, these methods can only generalize to the target foggy domain that has been seen in the training stage, but the foggy domains vary a lot in both urban-scene styles and fog styles. +In this paper, we propose to learn scene segmentation well generalized to foggy-scenes under the domain generalization setting, which does not involve any foggy images in the training stage and can generalize to arbitrary unseen foggy scenes. +We argue that an ideal segmentation model that can be well generalized to foggy-scenes needs to simultaneously enhance the content, de-correlate the urban-scene style and de-correlate the fog style.
+As the content (e.g., scene semantic) rests more in low-frequency features while the style of urban-scene and fog rests more in high-frequency features, we propose a novel bi-directional wavelet guidance (BWG) mechanism to realize the above three objectives in a divide-and-conquer manner. +With the aid of Haar wavelet transformation, +the low frequency component is concentrated on the content enhancement self-attention, while the high frequency component is shifted to the style and fog self-attention for de-correlation purpose. +It is integrated into existing mask-level Transformer segmentation pipelines in a learnable fashion. +Large-scale experiments are conducted on four foggy-scene segmentation datasets under a variety of interesting settings. +The proposed method significantly outperforms existing directly-supervised, curriculum domain adaptation and domain generalization segmentation methods. +Source code is available at https://github.com/BiQiWHU/BWG. \ No newline at end of file diff --git a/data/2024/aaai/Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models b/data/2024/aaai/Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models new file mode 100644 index 0000000000..fdb8e4dea8 --- /dev/null +++ b/data/2024/aaai/Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models @@ -0,0 +1 @@ +Prompt learning has become a prevalent strategy for adapting vision-language foundation models to downstream tasks. As large language models (LLMs) have emerged, recent studies have explored the use of category-related descriptions as input to enhance prompt effectiveness. Nevertheless, conventional descriptions fall short of structured information that effectively represents the interconnections among entities or attributes linked to a particular category. To address this limitation and prioritize harnessing structured knowledge, this paper advocates for leveraging LLMs to build a graph for each description to model the entities and attributes describing the category, as well as their correlations. Preexisting prompt tuning methods exhibit inadequacies in managing this structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which enables simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts modeling overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Extensive experiments demonstrate that our HPT shows strong effectiveness and generalizes much better than existing SOTA methods. Our code is available at https://github.com/Vill-Lab/2024-AAAI-HPT. \ No newline at end of file diff --git a/data/2024/aaai/Learning Hybrid Dynamics Models with Simulator-Informed Latent States b/data/2024/aaai/Learning Hybrid Dynamics Models with Simulator-Informed Latent States new file mode 100644 index 0000000000..14b89e1c5c --- /dev/null +++ b/data/2024/aaai/Learning Hybrid Dynamics Models with Simulator-Informed Latent States @@ -0,0 +1 @@ +Dynamics model learning deals with the task of inferring unknown dynamics from measurement data and predicting the future behavior of the system. 
A typical approach to address this problem is to train recurrent models. However, predictions with these models are often not physically meaningful. Further, they suffer from deteriorated behavior over time due to accumulating errors. Often, simulators built on first principles are available that are physically meaningful by design. However, modeling simplifications typically cause inaccuracies in these models. Consequently, hybrid modeling is an emerging trend that aims to combine the best of both worlds. In this paper, we propose a new approach to hybrid modeling, where we inform the latent states of a learned model via a black-box simulator. This allows us to control the predictions via the simulator, preventing them from accumulating errors. This is especially challenging since, in contrast to previous approaches, access to the simulator's latent states is not available. We tackle the task by leveraging observers, a well-known concept from control theory, inferring unknown latent states from observations and dynamics over time. In our learning-based setting, we jointly learn the dynamics and an observer that infers the latent states via the simulator. Thus, the simulator constantly corrects the latent states, compensating for modeling mismatch caused by learning. To maintain flexibility, we train an RNN-based residuum for the latent states that cannot be informed by the simulator. \ No newline at end of file diff --git "a/data/2024/aaai/Learning Image Demoir\303\251ing from Unpaired Real Data" "b/data/2024/aaai/Learning Image Demoir\303\251ing from Unpaired Real Data" new file mode 100644 index 0000000000..14385c9555 --- /dev/null +++ "b/data/2024/aaai/Learning Image Demoir\303\251ing from Unpaired Real Data" @@ -0,0 +1 @@ +This paper focuses on addressing the issue of image demoiréing. Unlike the large volume of existing studies that rely on learning from paired real data, we attempt to learn a demoiréing model from unpaired real data, i.e., moiré images associated with irrelevant clean images. The proposed method, referred to as Unpaired Demoiréing (UnDeM), synthesizes pseudo moiré images from unpaired datasets, generating pairs with clean images for training demoiréing models. To achieve this, we divide real moiré images into patches and group them in compliance with their moiré complexity. We introduce a novel moiré generation framework to synthesize moiré images with diverse moiré features, resembling real moiré patches, and details akin to real moiré-free images. Additionally, we introduce an adaptive denoising method to eliminate the low-quality pseudo moiré images that adversely impact the learning of demoiréing models. We conduct extensive experiments on the commonly-used FHDMi and UHDM datasets. Results show that our UnDeM performs better than existing methods when using existing demoiréing models such as MBCNN and ESDNet-L. Code: https://github.com/zysxmu/UnDeM. \ No newline at end of file diff --git a/data/2024/aaai/Learning Invariant Inter-pixel Correlations for Superpixel Generation b/data/2024/aaai/Learning Invariant Inter-pixel Correlations for Superpixel Generation new file mode 100644 index 0000000000..6b5cc0f3c3 --- /dev/null +++ b/data/2024/aaai/Learning Invariant Inter-pixel Correlations for Superpixel Generation @@ -0,0 +1 @@ +Deep superpixel algorithms have made remarkable strides by substituting hand-crafted features with learnable ones.
Nevertheless, we observe that existing deep superpixel methods, serving as mid-level representation operations, remain sensitive to the statistical properties (e.g., color distribution, high-level semantics) embedded within the training dataset. Consequently, learnable features exhibit constrained discriminative capability, resulting in unsatisfactory pixel grouping performance, particularly in untrainable application scenarios. To address this issue, we propose the Content Disentangle Superpixel (CDS) algorithm to selectively separate the invariant inter-pixel correlations and statistical properties, i.e., style noise. Specifically, we first construct auxiliary modalities that are homologous to the original RGB image but have substantial stylistic variations. Then, driven by mutual information, we propose the local-grid correlation alignment across modalities to reduce the distribution discrepancy of adaptively selected features and learn invariant inter-pixel correlations. Afterwards, we perform global-style mutual information minimization to enforce the separation of invariant content and training data styles. The experimental results on four benchmark datasets demonstrate the superiority of our approach over existing state-of-the-art methods, regarding boundary adherence, generalization, and efficiency. Code and pre-trained model are available at https://github.com/rookiie/CDSpixel. \ No newline at end of file diff --git a/data/2024/aaai/Learning MDL Logic Programs from Noisy Data b/data/2024/aaai/Learning MDL Logic Programs from Noisy Data new file mode 100644 index 0000000000..46ea7a8942 --- /dev/null +++ b/data/2024/aaai/Learning MDL Logic Programs from Noisy Data @@ -0,0 +1 @@ +Many inductive logic programming approaches struggle to learn programs from noisy data. To overcome this limitation, we introduce an approach that learns minimal description length programs from noisy data, including recursive programs. Our experiments on several domains, including drug design, game playing, and program synthesis, show that our approach can outperform existing approaches in terms of predictive accuracy and scale to moderate amounts of noise. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multi-Modal Cross-Scale Deformable Transformer Network for Unregistered Hyperspectral Image Super-resolution b/data/2024/aaai/Learning Multi-Modal Cross-Scale Deformable Transformer Network for Unregistered Hyperspectral Image Super-resolution new file mode 100644 index 0000000000..3162fd3393 --- /dev/null +++ b/data/2024/aaai/Learning Multi-Modal Cross-Scale Deformable Transformer Network for Unregistered Hyperspectral Image Super-resolution @@ -0,0 +1 @@ +Hyperspectral image super-resolution (HSI-SR) is a technology to improve the spatial resolution of HSI. Existing fusion-based SR methods have shown great performance, but still have some problems as follows: 1) existing methods assume that the auxiliary image providing spatial information is strictly registered with the HSI, but images are difficult to register finely due to the shooting platforms, shooting viewpoints and the influence of atmospheric turbulence; 2) most of the methods are based on convolutional neural networks (CNNs), which are effective for local features but cannot utilize global features. To this end, we propose a multi-modal cross-scale deformable transformer network (M2DTN) to achieve unregistered HSI-SR.
Specifically, we formulate a spectrum-preserving, spatial-guided registration-SR unified model (SSRU) from the view of realistic degradation scenarios. According to SSRU, we propose a multi-modal registration deformable module (MMRD) to align features between different modalities by a deformation field. In order to efficiently utilize the unique information between different modalities, we design a multi-scale feature transformer (MSFT) to emphasize the spatial-spectral features at different scales. In addition, we propose the cross-scale feature aggregation module (CSFA) to accurately reconstruct the HSI by aggregating feature information at different scales. Experiments show that M2DTN outperforms state-of-the-art HSI-SR methods. Code is available at https://github.com/Jiahuiqu/M2DTN. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multi-Object Positional Relationships via Emergent Communication b/data/2024/aaai/Learning Multi-Object Positional Relationships via Emergent Communication new file mode 100644 index 0000000000..92b1f098a0 --- /dev/null +++ b/data/2024/aaai/Learning Multi-Object Positional Relationships via Emergent Communication @@ -0,0 +1 @@ +The study of emergent communication has been dedicated to interactive artificial intelligence. While existing work focuses on communication about single objects or complex image scenes, we argue that communicating relationships between multiple objects is important in more realistic tasks, but understudied. In this paper, we try to fill this gap and focus on emergent communication about positional relationships between two objects. We train agents in the referential game where observations contain two objects, and find that generalization is the major problem when the positional relationship is involved. The key factor affecting the generalization ability of the emergent language is the input variation between Speaker and Listener, which is realized by a random image generator in our work. Further, we find that the learned language can generalize well in a new multi-step MDP task where the positional relationship describes the goal, and performs better than raw-pixel images as well as pre-trained image features, verifying the strong generalization ability of discrete sequences. We also show that language transfer from the referential game performs better in the new task than learning language directly in this task, implying the potential benefits of pre-training in referential games. All in all, our experiments demonstrate the viability and merit of having agents learn to communicate positional relationships between multiple objects through emergent communication. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Gronding b/data/2024/aaai/Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Gronding new file mode 100644 index 0000000000..1f6cd466b0 --- /dev/null +++ b/data/2024/aaai/Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Gronding @@ -0,0 +1 @@ +Weakly Supervised temporal Article Grounding (WSAG) is a challenging and practical task in video understanding. Specifically, given a video and a relevant article, whose sentences are at different semantic scales, WSAG aims to localize corresponding video segments for all “groundable” sentences.
Compared to other grounding tasks, e.g., localizing one target segment with respect to a given sentence query, WSAG confronts an essential obstacle rooted in the intricate multi-scale information inherent within both textual and visual modalities. Existing methods overlook the modeling and alignment of such structured information present in multi-scale video segments and hierarchical textual content. To this end, we propose a Multi-Scale Video-Text Correspondence Learning (MVTCL) framework, which enhances the grounding performance in complex scenes by modeling multi-scale semantic correspondence both within and between modalities. Specifically, MVTCL initially aggregates video content spanning distinct temporal scales and leverages hierarchical textual relationships in both temporal and semantic dimensions via a semantic calibration module. Then a multi-scale contrastive learning module is introduced to generate more discriminative representations by selecting typical contexts and performing inter-video contrastive learning. Through the multi-scale semantic calibration architecture and supervision design, our method achieves new state-of-the-art performance on existing WSAG benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multi-Task Sparse Representation Based on Fisher Information b/data/2024/aaai/Learning Multi-Task Sparse Representation Based on Fisher Information new file mode 100644 index 0000000000..aec32bd342 --- /dev/null +++ b/data/2024/aaai/Learning Multi-Task Sparse Representation Based on Fisher Information @@ -0,0 +1 @@ +Multi-task learning deals with multiple related tasks simultaneously by sharing knowledge. In a typical deep multi-task learning model, all tasks use the same feature space and share the latent knowledge. If the tasks are weakly correlated or some features are negatively correlated, sharing all knowledge often leads to negative knowledge transfer among tasks. To overcome this issue, this paper proposes a Fisher sparse multi-task learning method. It can obtain a sparse sharing representation for each task. In such a way, tasks share features on a sparse subspace. Our method can ensure that the knowledge transferred among tasks is beneficial. Specifically, we first propose a sparse deep multi-task learning model, and then introduce a Fisher sparse module into traditional deep multi-task learning to learn the sparse variables of each task. By alternately updating the neural network parameters and sparse variables, a sparse sharing representation can be learned for each task. In addition, in order to reduce the computational overhead, a heuristic method is used to estimate the Fisher information of neural network parameters. Experimental results show that, compared with other methods, our proposed method can improve the performance for all tasks, and has high sparsity in multi-task learning. \ No newline at end of file diff --git a/data/2024/aaai/Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing b/data/2024/aaai/Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing new file mode 100644 index 0000000000..9c82f2ab69 --- /dev/null +++ b/data/2024/aaai/Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing @@ -0,0 +1 @@ +The current neuron reconstruction pipeline for electron microscopy (EM) data usually includes automatic image segmentation followed by extensive human expert proofreading.
In this work, we aim to reduce human workload by predicting connectivity between over-segmented neuron pieces, taking both microscopy image and 3D morphology features into account, similar to the human proofreading workflow. To this end, we first construct a dataset, named FlyTracing, that contains millions of pairwise connections of segments spanning the whole fly brain, which is three orders of magnitude larger than existing datasets for neuron segment connection. To learn sophisticated biological imaging features from the connectivity annotations, we propose a novel connectivity-aware contrastive learning method to generate dense volumetric EM image embeddings. The learned embeddings can be easily incorporated with any point or voxel-based morphological representations for automatic neuron tracing. Extensive comparisons of different combination schemes of image and morphological representation in identifying split errors across the whole fly brain demonstrate the superiority of the proposed approach, especially for the locations that contain severe imaging artifacts, such as missing sections and misalignment. The dataset and code are available at https://github.com/Levishery/Flywire-Neuron-Tracing. \ No newline at end of file diff --git a/data/2024/aaai/Learning Neuro-Symbolic Abstractions for Robot Planning and Learning b/data/2024/aaai/Learning Neuro-Symbolic Abstractions for Robot Planning and Learning new file mode 100644 index 0000000000..01a62f6ffa --- /dev/null +++ b/data/2024/aaai/Learning Neuro-Symbolic Abstractions for Robot Planning and Learning @@ -0,0 +1 @@ +Although state-of-the-art hierarchical robot planning algorithms allow robots to efficiently compute long-horizon motion plans for achieving user-desired tasks, these methods typically rely upon environment-dependent state and action abstractions that need to be hand-designed by experts. On the other hand, non-hierarchical robot planning approaches fail to compute solutions for complex tasks that require reasoning over a long horizon. My research addresses these problems by proposing an approach for learning abstractions and developing hierarchical planners that efficiently use learned abstractions to boost robot planning performance and provide strong guarantees of reliability. \ No newline at end of file diff --git a/data/2024/aaai/Learning Not to Regret b/data/2024/aaai/Learning Not to Regret new file mode 100644 index 0000000000..c8962c1467 --- /dev/null +++ b/data/2024/aaai/Learning Not to Regret @@ -0,0 +1,7 @@ +The literature on game-theoretic equilibrium finding predominantly focuses on single games or their repeated play. +Nevertheless, numerous real-world scenarios feature playing a game sampled from a distribution of similar, but not identical games, such as playing poker with different public cards or trading correlated assets on the stock market. +As these similar games feature similar equilibria, we investigate a way to accelerate equilibrium finding on such a distribution. +We present a novel ``learning not to regret'' framework, enabling us to meta-learn a regret minimizer tailored to a specific distribution. +Our key contribution, Neural Predictive Regret Matching, is uniquely meta-learned to converge rapidly for the chosen distribution of games, while having regret minimization guarantees on any game. +We validated our algorithms' faster convergence on a distribution of river poker games.
+Our experiments show that the meta-learned algorithms outpace their non-meta-learned counterparts, achieving more than tenfold improvements. \ No newline at end of file diff --git a/data/2024/aaai/Learning Only When It Matters: Cost-Aware Long-Tailed Classification b/data/2024/aaai/Learning Only When It Matters: Cost-Aware Long-Tailed Classification new file mode 100644 index 0000000000..db58debb71 --- /dev/null +++ b/data/2024/aaai/Learning Only When It Matters: Cost-Aware Long-Tailed Classification @@ -0,0 +1,2 @@ +Most current long-tailed classification approaches assume the cost-agnostic scenario, where the training distribution of classes is long-tailed while the testing distribution of classes is balanced. Meanwhile, the misclassification costs of all instances are the same. On the other hand, in many real-world applications, it is more proper to assume that the training and testing distributions of classes are the same, while the misclassification cost of tail-class instances is varied. In this work, we model such a scenario as cost-aware long-tailed classification, in which the identification of high-cost tail instances and focusing learning on them thereafter is essential. In consequence, we propose the learning strategy of augmenting new instances based on adaptive region partition in the feature space. We conduct theoretical analysis to show that under the assumption +that the feature-space distance and the misclassification cost are correlated, the identification of high-cost tail instances can be realized by building region partitions with a low variance of risk within each region. The resulting AugARP approach could significantly outperform baseline approaches on both benchmark datasets and real-world product sales datasets. \ No newline at end of file diff --git a/data/2024/aaai/Learning Pattern-Based Extractors from Natural Language and Knowledge Graphs: Applying Large Language Models to Wikipedia and Linked Open Data b/data/2024/aaai/Learning Pattern-Based Extractors from Natural Language and Knowledge Graphs: Applying Large Language Models to Wikipedia and Linked Open Data new file mode 100644 index 0000000000..37cd766ea1 --- /dev/null +++ b/data/2024/aaai/Learning Pattern-Based Extractors from Natural Language and Knowledge Graphs: Applying Large Language Models to Wikipedia and Linked Open Data @@ -0,0 +1 @@ +Seq-to-seq transformer models have recently been successfully used for relation extraction, showing their flexibility, effectiveness, and scalability on that task. In this context, knowledge graphs aligned with Wikipedia such as DBpedia and Wikidata give us the opportunity to leverage existing texts and corresponding RDF graphs in order to extract, from these texts, the knowledge that is missing in the corresponding graphs and meanwhile improve their coverage. The goal of my thesis is to learn efficient extractors targeting specific RDF patterns and to do so by leveraging the latest language models and the dual base formed by Wikipedia on the one hand, and DBpedia and Wikidata on the other hand. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning Performance Maximizing Ensembles with Explainability Guarantees b/data/2024/aaai/Learning Performance Maximizing Ensembles with Explainability Guarantees new file mode 100644 index 0000000000..974e061327 --- /dev/null +++ b/data/2024/aaai/Learning Performance Maximizing Ensembles with Explainability Guarantees @@ -0,0 +1 @@ +In this paper, we propose a method for the optimal allocation of observations between an intrinsically explainable glass box model and a black box model. An optimal allocation is defined as one which, for any given explainability level (i.e. the proportion of observations for which the explainable model is the prediction function), maximizes the performance of the ensemble on the underlying task, and maximizes performance of the explainable model on the observations allocated to it, subject to the maximal ensemble performance condition. The proposed method is shown to produce such explainability optimal allocations on a benchmark suite of tabular datasets across a variety of explainable and black box model types. These learned allocations are found to consistently maintain ensemble performance at very high explainability levels (explaining 74% of observations on average), and in some cases even outperform both the component explainable and black box models while improving explainability. \ No newline at end of file diff --git a/data/2024/aaai/Learning Persistent Community Structures in Dynamic Networks via Topological Data Analysis b/data/2024/aaai/Learning Persistent Community Structures in Dynamic Networks via Topological Data Analysis new file mode 100644 index 0000000000..6ac989a8fd --- /dev/null +++ b/data/2024/aaai/Learning Persistent Community Structures in Dynamic Networks via Topological Data Analysis @@ -0,0 +1 @@ +Dynamic community detection methods often lack effective mechanisms to ensure temporal consistency, hindering the analysis of network evolution. In this paper, we propose a novel deep graph clustering framework with temporal consistency regularization on inter-community structures, inspired by the concept of minimal network topological changes within short intervals. Specifically, to address the representation collapse problem, we first introduce MFC, a matrix factorization-based deep graph clustering algorithm that preserves node embedding. Based on static clustering results, we construct probabilistic community networks and compute their persistence homology, a robust topological measure, to assess structural similarity between them. Moreover, a novel neural network regularization TopoReg is introduced to ensure the preservation of topological similarity between inter-community structures over time intervals. Our approach enhances temporal consistency and clustering accuracy on real-world datasets with both fixed and varying numbers of communities. It is also a pioneering application of TDA in temporally persistent community detection, offering an insightful contribution to the field of network analysis. Code and data are available at the public git repository: https://github.com/kundtx/MFC-TopoReg.
\ No newline at end of file diff --git a/data/2024/aaai/Learning Planning Domains from Non-redundant Fully-Observed Traces: Theoretical Foundations and Complexity Analysis b/data/2024/aaai/Learning Planning Domains from Non-redundant Fully-Observed Traces: Theoretical Foundations and Complexity Analysis new file mode 100644 index 0000000000..e5ebd0c33c --- /dev/null +++ b/data/2024/aaai/Learning Planning Domains from Non-redundant Fully-Observed Traces: Theoretical Foundations and Complexity Analysis @@ -0,0 +1,8 @@ +Domain learning is the task of finding an action model that can explain given observed plan executions, so-called traces. +It allows us to automate the identification of actions' preconditions and effects instead of relying on hand-modeled expert knowledge. +While previous research has put forth various techniques and covers multiple planning formalisms, the theoretical foundations of domain learning are still in their infancy. + +We investigate the most basic setting, that is, grounded classical planning without negative preconditions or conditional effects, with full observability of the state variables. +The given traces are assumed to be justified in the sense that either no single action or no set of actions can be removed without violating correctness of the plan. +Furthermore, we might be given additional constraints in the form of a propositional logical formula. +We show the consequences of these assumptions for the computational complexity of identifying a satisfactory planning domain. \ No newline at end of file diff --git a/data/2024/aaai/Learning Random Noise Salient Feature Fusion Siamese Network for Low-Resolution Object Tracking (Student Abstract) b/data/2024/aaai/Learning Random Noise Salient Feature Fusion Siamese Network for Low-Resolution Object Tracking (Student Abstract) new file mode 100644 index 0000000000..ecb95907e5 --- /dev/null +++ b/data/2024/aaai/Learning Random Noise Salient Feature Fusion Siamese Network for Low-Resolution Object Tracking (Student Abstract) @@ -0,0 +1 @@ +Despite Siamese trackers’ substantial potential, they offer sub-optimal tracking performance in low-resolution (LR) contexts. We introduce a Random Noise Salient Feature Fusion Learning Network to address this issue. This method integrates random noise-infused feature maps into a similarity-learning matching model. This integration acts as an effective regularization technique, enhancing the network’s generalization capabilities in LR environments. Additionally, by integrating attention mechanisms, we enhance the discriminative ability of the network, assigning more weights to important features. This directs the network’s focus toward the most salient regions of the feature map, ensuring improved accuracy without a significant increase in parameter overhead, and maintaining a high operating speed. To validate the effectiveness of our method, we performed qualitative and quantitative comparisons with state-of-the-art (SOTA) trackers. \ No newline at end of file diff --git a/data/2024/aaai/Learning Real-World Image De-weathering with Imperfect Supervision b/data/2024/aaai/Learning Real-World Image De-weathering with Imperfect Supervision new file mode 100644 index 0000000000..825c592df0 --- /dev/null +++ b/data/2024/aaai/Learning Real-World Image De-weathering with Imperfect Supervision @@ -0,0 +1 @@ +Real-world image de-weathering aims at removing various undesirable weather-related artifacts.
Owing to the impossibility of capturing image pairs concurrently, existing real-world de-weathering datasets often exhibit inconsistent illumination, position, and textures between the ground-truth images and the input degraded images, resulting in imperfect supervision. Such non-ideal supervision negatively affects the training process of learning-based de-weathering methods. In this work, we attempt to address the problem with a unified solution for various inconsistencies. Specifically, inspired by information bottleneck theory, we first develop a Consistent Label Constructor (CLC) to generate a pseudo-label as consistent as possible with the input degraded image while removing most weather-related degradation. In particular, multiple adjacent frames of the current input are also fed into CLC to enhance the pseudo-label. Then we combine the original imperfect labels and pseudo-labels to jointly supervise the de-weathering model by the proposed Information Allocation Strategy (IAS). During testing, only the de-weathering model is used for inference. Experiments on two real-world de-weathering datasets show that our method helps existing de-weathering models achieve better performance. Code is available at https://github.com/1180300419/imperfect-deweathering. \ No newline at end of file diff --git a/data/2024/aaai/Learning Reduced Fluid Dynamics b/data/2024/aaai/Learning Reduced Fluid Dynamics new file mode 100644 index 0000000000..6c391ebaac --- /dev/null +++ b/data/2024/aaai/Learning Reduced Fluid Dynamics @@ -0,0 +1 @@ +Predicting the state evolution of ultra high-dimensional, time-reversible fluid dynamic systems is a crucial but computationally expensive task. Existing physics-informed neural networks either incur high inference cost or cannot preserve the time-reversible nature of the underlying dynamical system. We propose a model-based approach to identify low-dimensional, time-reversible, nonlinear fluid dynamic systems. Our method utilizes the symplectic structure of reduced Eulerian fluids and uses stochastic Riemannian optimization to obtain a low-dimensional basis that minimizes the expected trajectory-wise dimension-reduction error over a given distribution of initial conditions. We show that such minimization is well-defined since the reduced trajectories are differentiable with respect to the subspace bases over the entire Grassmannian manifold, under proper choices of timestep sizes and numerical integrators. Finally, we propose a loss function measuring the trajectory-wise discrepancy between the original and reduced models. By tensor precomputation, we show that gradient information of such a loss function can be evaluated efficiently over a long trajectory without time-integrating the high-dimensional dynamic system. Through evaluations on a range of simulation benchmarks, we show that our method reduces the discrepancy by 50-90 percent over conventional reduced models and we outperform PINNs by exactly preserving the time reversibility. \ No newline at end of file diff --git a/data/2024/aaai/Learning Representations for Robust Human-Robot Interaction b/data/2024/aaai/Learning Representations for Robust Human-Robot Interaction new file mode 100644 index 0000000000..1fd2673394 --- /dev/null +++ b/data/2024/aaai/Learning Representations for Robust Human-Robot Interaction @@ -0,0 +1 @@ +For robots to robustly and flexibly interact with humans, they need to acquire skills to use across scenarios.
One way to enable the generalization of skills is to learn representations that are useful for downstream tasks. Learning a representation for interactions requires an understanding of what (e.g., objects) as well as how (e.g., actions, controls, and manners) to interact with. However, most existing language or visual representations mainly focus on objects. To enable robust human-robot interactions, we need a representation that is not just grounded at the object level but to reason at the action level. The ability to reason about an agent’s own actions and other’s actions will be crucial for long-tail interactions. My research focuses on leveraging the compositional nature of language and reward functions to learn representations that generalize to novel scenarios. Together with the information from multiple modalities, the learned representation can reason about task progress, future behaviors, and the goals/beliefs of an agent. The above ideas have been demonstrated in my research on building robots to understand language and engage in social interactions. \ No newline at end of file diff --git a/data/2024/aaai/Learning Representations on the Unit Sphere: Investigating Angular Gaussian and Von Mises-Fisher Distributions for Online Continual Learning b/data/2024/aaai/Learning Representations on the Unit Sphere: Investigating Angular Gaussian and Von Mises-Fisher Distributions for Online Continual Learning new file mode 100644 index 0000000000..9eff975333 --- /dev/null +++ b/data/2024/aaai/Learning Representations on the Unit Sphere: Investigating Angular Gaussian and Von Mises-Fisher Distributions for Online Continual Learning @@ -0,0 +1 @@ +We use the maximum a posteriori estimation principle for learning representations distributed on the unit sphere. We propose to use the angular Gaussian distribution, which corresponds to a Gaussian projected on the unit-sphere and derive the associated loss function. We also consider the von Mises-Fisher distribution, which is the conditional of a Gaussian in the unit-sphere. The learned representations are pushed toward fixed directions, which are the prior means of the Gaussians; allowing for a learning strategy that is resilient to data drift. This makes it suitable for online continual learning, which is the problem of training neural networks on a continuous data stream, where multiple classification tasks are presented sequentially so that data from past tasks are no longer accessible, and data from the current task can be seen only once. To address this challenging scenario, we propose a memory-based representation learning technique equipped with our new loss functions. Our approach does not require negative data or knowledge of task boundaries and performs well with smaller batch sizes while being computationally efficient. We demonstrate with extensive experiments that the proposed method outperforms the current state-of-the-art methods on both standard evaluation scenarios and realistic scenarios with blurry task boundaries. For reproducibility, we use the same training pipeline for every compared method and share the code at https://github.com/Nicolas1203/ocl-fd. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning Robust Rationales for Model Explainability: A Guidance-Based Approach b/data/2024/aaai/Learning Robust Rationales for Model Explainability: A Guidance-Based Approach new file mode 100644 index 0000000000..e4d449453d --- /dev/null +++ b/data/2024/aaai/Learning Robust Rationales for Model Explainability: A Guidance-Based Approach @@ -0,0 +1 @@ +Selective rationalization can be regarded as a straightforward self-explaining approach for enhancing model explainability in natural language processing tasks. It aims to provide explanations that are more accessible and understandable to non-technical users by first selecting subsets of input texts as rationales and then predicting based on chosen subsets. However, existing methods that follow this select-then-predict framework may suffer from the rationalization degeneration problem, resulting in sub-optimal or unsatisfactory rationales that do not align with human judgments. This problem may further lead to rationalization failure, resulting in meaningless rationales that ultimately undermine people's trust in the rationalization model. To address these challenges, we propose a Guidance-based Rationalization method (G-RAT) that effectively improves robustness against failure situations and the quality of rationales by using a guidance module to regularize selections and distributions. Experimental results on two synthetic settings prove that our method is robust to the rationalization degeneration and failure problems, while the results on two real datasets show its effectiveness in providing rationales in line with human judgments. The source code is available at https://github.com/shuaibo919/g-rat. \ No newline at end of file diff --git a/data/2024/aaai/Learning Safe Action Models with Partial Observability b/data/2024/aaai/Learning Safe Action Models with Partial Observability new file mode 100644 index 0000000000..3187d3d37a --- /dev/null +++ b/data/2024/aaai/Learning Safe Action Models with Partial Observability @@ -0,0 +1,5 @@ +A common approach for solving planning problems is to model them in a formal language such as the Planning Domain Definition Language (PDDL), and then use an appropriate PDDL planner. +Several algorithms for learning PDDL models from observations have been proposed but plans created with these learned models may not be sound. +We propose two algorithms for learning PDDL models that are guaranteed to be safe to use even when given observations that include partially observable states. +We analyze these algorithms theoretically, characterizing the sample complexity each algorithm requires to guarantee probabilistic completeness. +We also show experimentally that our algorithms are often better than FAMA, a state-of-the-art PDDL learning algorithm. \ No newline at end of file diff --git a/data/2024/aaai/Learning Small Decision Trees for Data of Low Rank-Width b/data/2024/aaai/Learning Small Decision Trees for Data of Low Rank-Width new file mode 100644 index 0000000000..c8e482387c --- /dev/null +++ b/data/2024/aaai/Learning Small Decision Trees for Data of Low Rank-Width @@ -0,0 +1,11 @@ +We consider the NP-hard problem of finding a smallest decision tree +representing a classification instance in terms of a partially defined +Boolean function. Small decision trees are desirable to provide an +interpretable model for the given data. 
We show that the problem is +fixed-parameter tractable when parameterized by the rank-width of the +incidence graph of the given classification instance. Our algorithm +proceeds by dynamic programming using an NLC decomposition obtained +from a rank-width decomposition. The key to the algorithm is a +succinct representation of partial solutions. This allows us to limit +the space and time requirements for each dynamic programming step in +terms of the parameter. \ No newline at end of file diff --git a/data/2024/aaai/Learning Small Decision Trees with Few Outliers: A Parameterized Perspective b/data/2024/aaai/Learning Small Decision Trees with Few Outliers: A Parameterized Perspective new file mode 100644 index 0000000000..b0ca8c55d2 --- /dev/null +++ b/data/2024/aaai/Learning Small Decision Trees with Few Outliers: A Parameterized Perspective @@ -0,0 +1,2 @@ +Decision trees is a fundamental tool in machine learning for representing, classifying, and generalizing data. It is desirable to construct ``small'' decision trees, by minimizing either the size (s) or the depth (d) of the decision tree (DT). Recently, the parameterized complexity of Decision Tree Learning has attracted a lot of attention. +We consider a generalization of Decision Tree Learning where given a classification instance E and an integer t, the task is to find a ``small'' DT that disagrees with E in at most t examples. We consider two problems: DTSO and DTDO, where the goal is to construct a DT minimizing s and d, respectively. We first establish that both DTSO and DTDO are W[1]-hard when parameterized by s+y and d+y, respectively, where y is the maximum number of features in which two differently labeled examples can differ. We complement this result by showing that these problems become FPT if we include the parameter t. We also consider the kernelization complexity of these problems and establish several positive and negative results for both DTSO and DTDO. \ No newline at end of file diff --git a/data/2024/aaai/Learning Spatially Collaged Fourier Bases for Implicit Neural Representation b/data/2024/aaai/Learning Spatially Collaged Fourier Bases for Implicit Neural Representation new file mode 100644 index 0000000000..2b42f65509 --- /dev/null +++ b/data/2024/aaai/Learning Spatially Collaged Fourier Bases for Implicit Neural Representation @@ -0,0 +1 @@ +Existing approaches to Implicit Neural Representation (INR) can be interpreted as a global scene representation via a linear combination of Fourier bases of different frequencies. However, such universal basis functions can limit the representation capability in local regions where a specific component is unnecessary, resulting in unpleasant artifacts. To this end, we introduce a learnable spatial mask that effectively dispatches distinct Fourier bases into respective regions. This translates into collaging Fourier patches, thus enabling an accurate representation of complex signals. Comprehensive experiments demonstrate the superior reconstruction quality of the proposed approach over existing baselines across various INR tasks, including image fitting, video representation, and 3D shape representation. Our method outperforms all other baselines, improving the image fitting PSNR by over 3dB and 3D reconstruction to 98.81 IoU and 0.0011 Chamfer Distance. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning Subject-Aware Cropping by Outpainting Professional Photos b/data/2024/aaai/Learning Subject-Aware Cropping by Outpainting Professional Photos new file mode 100644 index 0000000000..e9be4c99c5 --- /dev/null +++ b/data/2024/aaai/Learning Subject-Aware Cropping by Outpainting Professional Photos @@ -0,0 +1 @@ +How to frame (or crop) a photo often depends on the image subject and its context; e.g., a human portrait. Recent works have defined the subject-aware image cropping task as a nuanced and practical version of image cropping. We propose a weakly-supervised approach (GenCrop) to learn what makes a high-quality, subject-aware crop from professional stock images. Unlike supervised prior work, GenCrop requires no new manual annotations beyond the existing stock image collection. The key challenge in learning from this data, however, is that the images are already cropped and we do not know what regions were removed. Our insight is to combine a library of stock images with a modern, pre-trained text-to-image diffusion model. The stock image collection provides diversity, and its images serve as pseudo-labels for a good crop. The text-image diffusion model is used to out-paint (i.e., outward inpainting) realistic uncropped images. Using this procedure, we are able to automatically generate a large dataset of cropped-uncropped training pairs to train a cropping model. Despite being weakly-supervised, GenCrop is competitive with state-of-the-art supervised methods and significantly better than comparable weakly-supervised baselines on quantitative and qualitative evaluation metrics. \ No newline at end of file diff --git a/data/2024/aaai/Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection b/data/2024/aaai/Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection new file mode 100644 index 0000000000..b70899fa73 --- /dev/null +++ b/data/2024/aaai/Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection @@ -0,0 +1 @@ +Class-incremental object detection (CIOD) is a real-world desired capability, requiring an object detector to continuously adapt to new tasks without forgetting learned ones, with the main challenge being catastrophic forgetting. Many methods based on distillation and replay have been proposed to alleviate this problem. However, they typically learn on a pure visual backbone, neglecting the powerful representation capabilities of textual cues, which to some extent limits their performance. In this paper, we propose task-aware language-image representation to mitigate catastrophic forgetting, introducing a new paradigm for language-image-based CIOD. First of all, we demonstrate the significant advantage of language-image detectors in mitigating catastrophic forgetting. Secondly, we propose a learning task-aware language-image representation method that overcomes the existing drawback of directly utilizing the language-image detector for CIOD. More specifically, we learn the language-image representation of different tasks through an insulating approach in the training stage, while using the alignment scores produced by task-specific language-image representation in the inference stage. Through our proposed method, language-image detectors can be more practical for CIOD. 
We conduct extensive experiments on COCO 2017 and Pascal VOC 2007 and demonstrate that the proposed method achieves state-of-the-art results under various CIOD settings. \ No newline at end of file diff --git a/data/2024/aaai/Learning Temporal Resolution in Spectrogram for Audio Classification b/data/2024/aaai/Learning Temporal Resolution in Spectrogram for Audio Classification new file mode 100644 index 0000000000..6ed24475f3 --- /dev/null +++ b/data/2024/aaai/Learning Temporal Resolution in Spectrogram for Audio Classification @@ -0,0 +1 @@ +The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods using the fixed temporal resolution, the DiffRes-based method can achieve the equivalent or better classification accuracy with at least 25% computational cost reduction. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost. \ No newline at end of file diff --git a/data/2024/aaai/Learning Time Slot Preferences via Mobility Tree for Next POI Recommendation b/data/2024/aaai/Learning Time Slot Preferences via Mobility Tree for Next POI Recommendation new file mode 100644 index 0000000000..ae6e69336b --- /dev/null +++ b/data/2024/aaai/Learning Time Slot Preferences via Mobility Tree for Next POI Recommendation @@ -0,0 +1 @@ +The next Point-of-Interest (POI) recommendation task aims to provide a dynamic ranking of POIs based on users' current check-in trajectories. The recommendation performance of this task is contingent upon a comprehensive understanding of users' personalized behavioral patterns through Location-based Social Networks (LBSNs) data. While prior studies have adeptly captured sequential patterns and transitional relationships within users' check-in trajectories, a noticeable gap persists in devising a mechanism for discerning specialized behavioral patterns during distinct time slots, such as noon, afternoon, or evening. In this paper, we introduce an innovative data structure termed the ``Mobility Tree'', tailored for hierarchically describing users' check-in records. The Mobility Tree encompasses multi-granularity time slot nodes to learn user preferences across varying temporal periods. Meanwhile, we propose the Mobility Tree Network (MTNet), a multitask framework for personalized preference learning based on Mobility Trees.
We develop a four-step node interaction operation to propagate feature information from the leaf nodes to the root node. Additionally, we adopt a multitask training strategy to push the model towards learning a robust representation. The comprehensive experimental results demonstrate the superiority of MTNet over eleven state-of-the-art next POI recommendation models across three real-world LBSN datasets, substantiating the efficacy of time slot preference learning facilitated by the Mobility Tree. \ No newline at end of file diff --git a/data/2024/aaai/Learning Ultrametric Trees for Optimal Transport Regression b/data/2024/aaai/Learning Ultrametric Trees for Optimal Transport Regression new file mode 100644 index 0000000000..c4287a48c6 --- /dev/null +++ b/data/2024/aaai/Learning Ultrametric Trees for Optimal Transport Regression @@ -0,0 +1 @@ +Optimal transport provides a metric which quantifies the dissimilarity between probability measures. For measures supported in discrete metric spaces, finding the optimal transport distance has cubic time complexity in the size of the space. However, measures supported on trees admit a closed-form optimal transport that can be computed in linear time. In this paper, we aim to find an optimal tree structure for a given discrete metric space so that the tree-Wasserstein distance approximates the optimal transport distance in the original space. One of our key ideas is to cast the problem in ultrametric spaces. This helps us optimize over the space of ultrametric trees --- a mixed-discrete and continuous optimization problem --- via projected gradient descent over the space of ultrametric matrices. During optimization, we project the parameters to the ultrametric space via a hierarchical minimum spanning tree algorithm, equivalent to the closest projection to ultrametrics under the supremum norm. Experimental results on real datasets show that our approach outperforms previous approaches (e.g., Flowtree, Quadtree) in approximating optimal transport distances. Finally, experiments on synthetic data generated on ground truth trees show that our algorithm can accurately uncover the underlying trees. \ No newline at end of file diff --git a/data/2024/aaai/Learning Uncertainty-Aware Temporally-Extended Actions b/data/2024/aaai/Learning Uncertainty-Aware Temporally-Extended Actions new file mode 100644 index 0000000000..3c44fe2ebd --- /dev/null +++ b/data/2024/aaai/Learning Uncertainty-Aware Temporally-Extended Actions @@ -0,0 +1 @@ +In reinforcement learning, temporal abstraction in the action space, exemplified by action repetition, is a technique to facilitate policy learning through extended actions. However, a primary limitation in previous studies of action repetition is its potential to degrade performance, particularly when sub-optimal actions are repeated. This issue often negates the advantages of action repetition. To address this, we propose a novel algorithm named Uncertainty-aware Temporal Extension (UTE). UTE employs ensemble methods to accurately measure uncertainty during action extension. This feature allows policies to strategically choose between emphasizing exploration or adopting an uncertainty-averse approach, tailored to their specific needs. We demonstrate the effectiveness of UTE through experiments in Gridworld and Atari 2600 environments. Our findings show that UTE outperforms existing action repetition algorithms, effectively mitigating their inherent limitations and significantly enhancing policy learning efficiency.
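To make the uncertainty-aware extension concrete, here is a small sketch of how an ensemble might score candidate repetition lengths: the spread of the ensemble's return estimates serves as the uncertainty term, and the sign of the coefficient switches between exploration-seeking and uncertainty-averse behavior. The interface is invented for illustration and is not the authors' code.

```python
import numpy as np

def extension_score(q_ensemble, state, action, repeat, lam=1.0):
    """q_ensemble: list of callables q(state, action, repeat) -> estimated return
    of repeating `action` for `repeat` steps from `state`.
    lam > 0 adds an exploration bonus; lam < 0 is uncertainty-averse."""
    estimates = np.array([q(state, action, repeat) for q in q_ensemble])
    return estimates.mean() + lam * estimates.std()

def choose_repetition(q_ensemble, state, action, max_repeat=8, lam=-0.5):
    scores = [extension_score(q_ensemble, state, action, j, lam)
              for j in range(1, max_repeat + 1)]
    return int(np.argmax(scores)) + 1   # chosen repetition length

# toy ensemble: noisy estimators that prefer repeating roughly 3 steps
rng = np.random.default_rng(0)
ensemble = [lambda s, a, j: -(j - 3) ** 2 + rng.normal(scale=0.5) for _ in range(5)]
print(choose_repetition(ensemble, state=None, action=0))
```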
\ No newline at end of file diff --git a/data/2024/aaai/Learning Visual Abstract Reasoning through Dual-Stream Networks b/data/2024/aaai/Learning Visual Abstract Reasoning through Dual-Stream Networks new file mode 100644 index 0000000000..c447c89e14 --- /dev/null +++ b/data/2024/aaai/Learning Visual Abstract Reasoning through Dual-Stream Networks @@ -0,0 +1 @@ +Visual abstract reasoning tasks present challenges for deep neural networks, exposing limitations in their capabilities. In this work, we present a neural network model that addresses the challenges posed by Raven’s Progressive Matrices (RPM). Inspired by the two-stream hypothesis of visual processing, we introduce the Dual-stream Reasoning Network (DRNet), which utilizes two parallel branches to capture image features. On top of the two streams, a reasoning module first learns to merge the high-level features of the same image. Then, it employs a rule extractor to handle combinations involving the eight context images and each candidate image, extracting discrete abstract rules and utilizing a multilayer perceptron (MLP) to make predictions. Empirical results demonstrate that the proposed DRNet achieves state-of-the-art average performance across multiple RPM benchmarks. Furthermore, DRNet demonstrates robust generalization capabilities, even extending to various out-of-distribution scenarios. The dual streams within DRNet serve distinct functions by addressing local or spatial information. They are then integrated into the reasoning module, leveraging abstract rules to facilitate the execution of visual reasoning tasks. These findings indicate that the dual-stream architecture could play a crucial role in visual abstract reasoning. \ No newline at end of file diff --git a/data/2024/aaai/Learning from Ambiguous Demonstrations with Self-Explanation Guided Reinforcement Learning b/data/2024/aaai/Learning from Ambiguous Demonstrations with Self-Explanation Guided Reinforcement Learning new file mode 100644 index 0000000000..3091585232 --- /dev/null +++ b/data/2024/aaai/Learning from Ambiguous Demonstrations with Self-Explanation Guided Reinforcement Learning @@ -0,0 +1 @@ +Our work aims at efficiently leveraging ambiguous demonstrations for the training of a reinforcement learning (RL) agent. An ambiguous demonstration can usually be interpreted in multiple ways, which severely hinders the RL agent from learning stably and efficiently. Since an optimal demonstration may also suffer from being ambiguous, previous works that combine RL and learning from demonstration (RLfD works) may not work well. Inspired by how humans handle such situations, we propose to use self-explanation (an agent generates explanations for itself) to recognize valuable high-level relational features as an interpretation of why a successful trajectory is successful. This way, the agent can leverage the explained important relations as guidance for its RL learning. Our main contribution is to propose the Self-Explanation for RL from Demonstrations (SERLfD) framework, which can overcome the limitations of existing RLfD works. Our experimental results show that an RLfD model can be improved by using our SERLfD framework in terms of training stability and performance. To foster further research in self-explanation-guided robot learning, we have made our demonstrations and code publicly accessible at https://github.com/YantianZha/SERLfD.
For a deeper understanding of our work, interested readers can refer to our arXiv version at https://arxiv.org/pdf/2110.05286.pdf, including an accompanying appendix. \ No newline at end of file diff --git a/data/2024/aaai/Learning from Failure: Improving Meeting Summarization without Good Samples b/data/2024/aaai/Learning from Failure: Improving Meeting Summarization without Good Samples new file mode 100644 index 0000000000..3e895aa2ff --- /dev/null +++ b/data/2024/aaai/Learning from Failure: Improving Meeting Summarization without Good Samples @@ -0,0 +1 @@ +Existing methods for aligning language models with various human needs rely heavily on high-quality, task-specific data. However, the industrial deployment of task-specific language models often encounters challenges in the availability of appropriate training samples. Taking meeting summarization as an example, public datasets are scarce, and private corpora are also hard to obtain due to privacy issues or resource-demanding annotation. To improve meeting summarization in the absence of positively-rated (i.e., ``good'') samples, we propose Score Tuning, a cold-start tuning framework that leverages bad samples of distinguishable degrees to incrementally enhance the performance of summary generation without an initial presence of good samples. Our method utilizes asynchronous and numerical human feedback that measures the quality of generated summaries. Formulating data into triplets of (transcript, summary, score), our approach instructs a pre-trained model to learn the association between summary qualities and human-rated scores and hence to generate better summaries corresponding to higher scores. The experimental results show that our method is effective in improving meeting summarization on both English and Chinese corpora while requiring less annotated data and fewer training resources compared to existing alignment methods. Additionally, we also preliminarily explore the transferability of our approach to machine translation tasks and demonstrate its potential for future development and usage in other domains. \ No newline at end of file diff --git a/data/2024/aaai/Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration b/data/2024/aaai/Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration new file mode 100644 index 0000000000..bb9e0602f3 --- /dev/null +++ b/data/2024/aaai/Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration @@ -0,0 +1 @@ +Contrastive learning has emerged as a prevailing paradigm for high-level vision tasks, which, by introducing proper negative samples, has also been exploited for low-level vision tasks to achieve a compact optimization space to account for their ill-posed nature. However, existing methods rely on manually predefined and task-oriented negatives, which often exhibit pronounced task-specific biases. To address this challenge, our paper introduces an innovative method termed 'learning from history', which dynamically generates negative samples from the target model itself. Our approach, named Model Contrastive Learning for Image Restoration (MCLIR), rejuvenates latency models as negative models, making it compatible with diverse image restoration tasks. We propose the Self-Prior guided Negative loss (SPN) to enable it. This approach significantly enhances existing models when retrained with the proposed model contrastive paradigm.
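A hedged sketch of the "negatives from the model's own history" idea: alongside a standard reconstruction loss, the output of a frozen earlier checkpoint acts as the negative that the current prediction is pushed away from. The exact form below is an assumption for illustration, not the paper's SPN loss.

```python
import numpy as np

def model_contrastive_loss(pred, target, negative, alpha=0.1, eps=1e-8):
    """pred:     restoration from the current model
    target:   ground-truth clean image
    negative: restoration from a frozen historical (earlier) checkpoint
    The second term rewards being closer to the target than to the negative."""
    recon = np.abs(pred - target).mean()                                # L1 reconstruction
    contrast = np.abs(pred - target).mean() / (np.abs(pred - negative).mean() + eps)
    return recon + alpha * contrast

rng = np.random.default_rng(0)
gt = rng.random((8, 8))
old = gt + rng.normal(0, 0.3, gt.shape)   # older, worse restoration as the negative
new = gt + rng.normal(0, 0.1, gt.shape)   # current, better restoration
print(model_contrastive_loss(new, gt, old))
```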
The results show significant improvements in image restoration across various tasks and architectures. For example, models retrained with SPN outperform the original FFANet and DehazeFormer by 3.41 and 0.57 dB on the RESIDE indoor dataset for image dehazing. Similarly, they achieve notable improvements of 0.47 dB on SPA-Data over IDT for image deraining and 0.12 dB on Manga109 for a 4x scale super-resolution over lightweight SwinIR, respectively. Code and retrained models are available at https://github.com/Aitical/MCLIR. \ No newline at end of file diff --git a/data/2024/aaai/Learning from an Infant's Visual Experience b/data/2024/aaai/Learning from an Infant's Visual Experience new file mode 100644 index 0000000000..10be12afcd --- /dev/null +++ b/data/2024/aaai/Learning from an Infant's Visual Experience @@ -0,0 +1 @@ +Infants see a selective view of the world: they see some objects with high frequency and from a wide range of viewpoints (e.g., their toys during playing) while a much larger set of objects are seen much more rarely and from limited viewpoints (e.g., objects they see outdoors). Extensive, repeated visual experiences with a small number of objects during infancy plays a big role in the development of human visual skills. Internet-style datasets that are commonly used in computer vision research do not contain the regularities that result from such repeated, structured experiences with a few objects. This has led to a dearth of models that learn by exploiting these regularities. In my PhD dissertation, I use deep learning models to investigate how regularities in an infant's visual experience can be leveraged for visual representation learning. \ No newline at end of file diff --git a/data/2024/aaai/Learning in Online Principal-Agent Interactions: The Power of Menus b/data/2024/aaai/Learning in Online Principal-Agent Interactions: The Power of Menus new file mode 100644 index 0000000000..cc3734c249 --- /dev/null +++ b/data/2024/aaai/Learning in Online Principal-Agent Interactions: The Power of Menus @@ -0,0 +1 @@ +We study a ubiquitous learning challenge in online principal-agent problems during which the principal learns the agent's private information from the agent's revealed preferences in historical interactions. This paradigm includes important special cases such as pricing and contract design, which have been widely studied in recent literature. However, existing work considers the case where the principal can only choose a single strategy at every round to interact with the agent and then observe the agent's revealed preference through their actions. In this paper, we extend this line of study to allow the principal to offer a menu of strategies to the agent and learn additionally from observing the agent's selection from the menu. We provide a thorough investigation of several online principal-agent problem settings and characterize their sample complexities, accompanied by the corresponding algorithms we have developed. We instantiate this paradigm to several important design problems — including Stackelberg (security) games, contract design, and information design. Finally, we also explore the connection between our findings and existing results about online learning in Stackelberg games, and we offer a solution that can overcome a key hard instance of previous work. 
\ No newline at end of file diff --git a/data/2024/aaai/Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise b/data/2024/aaai/Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise new file mode 100644 index 0000000000..7d84728937 --- /dev/null +++ b/data/2024/aaai/Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise @@ -0,0 +1 @@ +This paper considers learning the hidden causal network of a linear networked dynamical system (NDS) from the time series data at some of its nodes -- partial observability. The dynamics of the NDS are driven by colored noise that generates spurious associations across pairs of nodes, rendering the problem much harder. To address the challenge of noise correlation and partial observability, we assign to each pair of nodes a feature vector computed from the time series data of observed nodes. The feature embedding is engineered to yield structural consistency: there exists an affine hyperplane that consistently partitions the set of features, separating the feature vectors corresponding to connected pairs of nodes from those corresponding to disconnected pairs. The causal inference problem is thus addressed via clustering the designed features. We demonstrate with simple baseline supervised methods the competitive performance of the proposed causal inference mechanism under broad connectivity regimes and noise correlation levels, including a real world network. Further, we devise novel technical guarantees of structural consistency for linear NDS under the considered regime. \ No newline at end of file diff --git a/data/2024/aaai/Learning the Topology and Behavior of Discrete Dynamical Systems b/data/2024/aaai/Learning the Topology and Behavior of Discrete Dynamical Systems new file mode 100644 index 0000000000..315ad6b3db --- /dev/null +++ b/data/2024/aaai/Learning the Topology and Behavior of Discrete Dynamical Systems @@ -0,0 +1 @@ +Discrete dynamical systems are commonly used to model the spread of contagions on real-world networks. Under the PAC framework, existing research has studied the problem of learning the behavior of a system, assuming that the underlying network is known. In this work, we focus on a more challenging setting: to learn both the behavior and the underlying topology of a black-box system. We show that, in general, this learning problem is computationally intractable. On the positive side, we present efficient learning methods under the PAC model when the underlying graph of the dynamical system belongs to certain classes. Further, we examine a relaxed setting where the topology of an unknown system is partially observed. For this case, we develop an efficient PAC learner to infer the system and establish the sample complexity. Lastly, we present a formal analysis of the expressive power of the hypothesis class of dynamical systems where both the topology and behavior are unknown, using the well-known Natarajan dimension formalism. Our results provide a theoretical foundation for learning both the topology and behavior of discrete dynamical systems. 
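For readers unfamiliar with the setting, a tiny threshold-based discrete dynamical system on a known graph, together with a naive learner that infers each node's threshold from observed transitions; both the model class and the inference rule are simplified stand-ins for the PAC learners discussed above.

```python
import numpy as np

def step(adj, state, thresholds):
    """Synchronous update: node i becomes active iff its number of active
    neighbours reaches its threshold."""
    active_neighbours = adj @ state
    return (active_neighbours >= thresholds).astype(int)

def infer_thresholds(adj, transitions, n):
    """Naive learner: for each node, the smallest active-neighbour count
    that was ever followed by that node activating."""
    guess = np.full(n, np.inf)
    for before, after in transitions:
        counts = adj @ before
        for i in range(n):
            if after[i] == 1:
                guess[i] = min(guess[i], counts[i])
    return guess

# 4-node path graph with ground-truth thresholds
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
true_t = np.array([0, 1, 1, 1])
s = np.array([1, 0, 0, 0])
trace = []
for _ in range(3):
    nxt = step(adj, s, true_t)
    trace.append((s, nxt))
    s = nxt
print(infer_thresholds(adj, trace, 4))   # recovers [0. 1. 1. 1.] on this toy trace
```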
\ No newline at end of file diff --git a/data/2024/aaai/Learning to Approximate Adaptive Kernel Convolution on Graphs b/data/2024/aaai/Learning to Approximate Adaptive Kernel Convolution on Graphs new file mode 100644 index 0000000000..83375febf6 --- /dev/null +++ b/data/2024/aaai/Learning to Approximate Adaptive Kernel Convolution on Graphs @@ -0,0 +1 @@ +Various Graph Neural Networks (GNNs) have been successful in analyzing data in non-Euclidean spaces; however, they have limitations such as oversmoothing, i.e., information becomes excessively averaged as the number of hidden layers increases. The issue stems from the intrinsic formulation of conventional graph convolution, where the nodal features are aggregated from a direct neighborhood per layer across all nodes in the graph. As setting a different number of hidden layers per node is infeasible, recent works leverage a diffusion kernel to redefine the graph structure and incorporate information from farther nodes. Unfortunately, such approaches suffer from heavy diagonalization of a graph Laplacian or learning a large transform matrix. In this regard, we propose a diffusion learning framework where the range of feature aggregation is controlled by the scale of a diffusion kernel. For efficient computation, we derive closed-form derivatives of approximations of the graph convolution with respect to the scale, so that the node-wise range can be adaptively learned. With a downstream classifier, the entire framework is made trainable in an end-to-end manner. Our model is tested on various standard datasets for node-wise classification, achieving state-of-the-art performance, and it is also validated on real-world brain network data for graph classification to demonstrate its practicality for Alzheimer's disease classification. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Build Solutions in Stochastic Matching Problems Using Flows (Student Abstract) b/data/2024/aaai/Learning to Build Solutions in Stochastic Matching Problems Using Flows (Student Abstract) new file mode 100644 index 0000000000..18c0bbeb8a --- /dev/null +++ b/data/2024/aaai/Learning to Build Solutions in Stochastic Matching Problems Using Flows (Student Abstract) @@ -0,0 +1 @@ +Generative Flow Networks, known as GFlowNets, have been introduced in recent times, presenting an exciting possibility for neural networks to model distributions across various data structures. In this paper, we broaden their applicability to encompass scenarios where the data structures are optimal solutions of a combinatorial problem. Concretely, we propose the use of GFlowNets to learn the distribution of optimal solutions for kidney exchange problems (KEPs), a generalized form of matching problems involving cycles. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Learn Better Visual Prompts b/data/2024/aaai/Learning to Learn Better Visual Prompts new file mode 100644 index 0000000000..bc54e25c53 --- /dev/null +++ b/data/2024/aaai/Learning to Learn Better Visual Prompts @@ -0,0 +1 @@ +Prompt tuning provides a low-cost way of adapting vision-language models (VLMs) to various downstream vision tasks without requiring updates to the huge set of pre-trained parameters. Dispensing with the conventional manual crafting of prompts, the recent prompt tuning method of Context Optimization (CoOp) introduces adaptable vectors as text prompts.
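To ground the discussion of adaptable prompt vectors, a minimal CoOp-style sketch: shared learnable context vectors are prepended to each class-name embedding, encoded, and scored against the image feature by cosine similarity. The tiny "encoder" here is a random placeholder standing in for CLIP's, so only the data flow is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, n_cls = 16, 4, 3

ctx = rng.normal(scale=0.02, size=(n_ctx, d))    # learnable context vectors
class_tokens = rng.normal(size=(n_cls, 1, d))    # frozen class-name embeddings
W_text = rng.normal(size=(d, d)) / np.sqrt(d)    # stand-in for the text encoder

def encode_text(prompt):                          # prompt: (n_ctx + 1, d)
    return np.tanh(prompt.mean(axis=0) @ W_text)

def logits(image_feat):
    text_feats = np.stack([encode_text(np.concatenate([ctx, class_tokens[c]]))
                           for c in range(n_cls)])
    text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)
    img = image_feat / np.linalg.norm(image_feat)
    return 100.0 * text_feats @ img               # temperature-scaled cosine similarities

print(logits(rng.normal(size=d)))                 # only `ctx` would be trained
```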
Nevertheless, several previous works point out that CoOp-based approaches tend to overfit to the base classes and generalize poorly to novel classes. In this paper, we argue that prompt tuning works well only on the base classes because of the limited capacity of the adaptable vectors. The scale of the pre-trained model is hundreds of times that of the adaptable vector, so the learned vector has a very limited ability to absorb the knowledge of novel classes. To minimize this excessive overfitting of textual knowledge on the base classes, we view prompt tuning as learning to learn (LoL) and learn the prompt in the manner of meta-learning: the training strategy of dividing the base classes into many different subclasses can fully exert the limited capacity of prompt tuning and thus transfer its power to recognizing the novel classes. To be specific, we initially perform fine-tuning on the base classes based on the CoOp method for pre-trained CLIP. Subsequently, predicated on the fine-tuned CLIP model, we carry out further fine-tuning in an N-way K-shot manner from the perspective of meta-learning on the base classes. We finally apply the learned textual vector and VLM to unseen classes. Extensive experiments on benchmark datasets validate the efficacy of our meta-learning-informed prompt tuning, affirming its role as a robust optimization strategy for VLMs. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Learn in Interactive Constraint Acquisition b/data/2024/aaai/Learning to Learn in Interactive Constraint Acquisition new file mode 100644 index 0000000000..cd89ed7f6d --- /dev/null +++ b/data/2024/aaai/Learning to Learn in Interactive Constraint Acquisition @@ -0,0 +1,4 @@ +Constraint Programming (CP) has been successfully used to model and solve complex combinatorial problems. However, modeling is often not trivial and requires expertise, which is a bottleneck to wider adoption. In Constraint Acquisition (CA), the goal is to assist the user by automatically learning the model. +In (inter)active CA, this is done by interactively posting queries to the user, e.g., does this partial solution satisfy your (unspecified) constraints or not. +While interactive CA methods learn the constraints, the learning is related to symbolic concept learning, as the goal is to learn an exact representation. +However, a large number of queries is required to learn the model, which is a major limitation. In this paper, we aim to alleviate this limitation by tightening the connection between CA and Machine Learning (ML) by, for the first time in interactive CA, exploiting statistical ML methods. We propose to use probabilistic classification models to guide interactive CA queries to the most promising parts. We discuss how to train classifiers to predict whether a candidate expression from the bias is a constraint of the problem or not, using both relation-based and scope-based features. We then show how the predictions can be used in all layers of interactive CA: the query generation, the scope finding, and the lowest-level constraint finding. We experimentally evaluate our proposed methods using different classifiers and show that our methods greatly outperform the state of the art, decreasing the number of queries needed to converge by up to 72%.
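A small sketch of this kind of statistical guidance, assuming a generic scikit-learn classifier: candidate expressions from the bias are featurized with simple relation- and scope-based features, the classifier is fitted on expressions whose status is already known, and its probabilities rank which candidates the next queries should target. The features and names are illustrative, not the paper's exact design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(cand):
    """cand: (relation_id, scope) -- toy relation-based and scope-based features."""
    relation_id, scope = cand
    return [relation_id, len(scope), max(scope) - min(scope)]

# expressions whose constraint/non-constraint status is already known
known = [((0, (1, 2)), 1), ((0, (4, 5)), 1), ((1, (1, 7)), 0), ((1, (2, 9)), 0)]
X = np.array([featurize(c) for c, _ in known])
y = np.array([label for _, label in known])

clf = LogisticRegression().fit(X, y)

bias = [(0, (7, 8)), (1, (0, 6)), (0, (2, 3))]    # remaining candidate expressions
proba = clf.predict_proba(np.array([featurize(c) for c in bias]))[:, 1]
ranked = sorted(zip(bias, proba), key=lambda t: -t[1])
print(ranked)                                      # query the most promising candidates first
```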
\ No newline at end of file diff --git a/data/2024/aaai/Learning to Manipulate Artistic Images b/data/2024/aaai/Learning to Manipulate Artistic Images new file mode 100644 index 0000000000..6b5c40c2ce --- /dev/null +++ b/data/2024/aaai/Learning to Manipulate Artistic Images @@ -0,0 +1 @@ +Recent advancement in computer vision has significantly lowered the barriers to artistic creation. Exemplar-based image translation methods have attracted much attention due to flexibility and controllability. However, these methods hold assumptions regarding semantics or require semantic information as the input, while accurate semantics is not easy to obtain in artistic images. Besides, these methods suffer from cross-domain artifacts due to training data prior and generate imprecise structure due to feature compression in the spatial domain. In this paper, we propose an arbitrary Style Image Manipulation Network (SIM-Net), which leverages semantic-free information as guidance and a region transportation strategy in a self-supervised manner for image generation. Our method balances computational efficiency and high resolution to a certain extent. Moreover, our method facilitates zero-shot style image manipulation. Both qualitative and quantitative experiments demonstrate the superiority of our method over state-of-the-art methods.Code is available at https://github.com/SnailForce/SIM-Net. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Optimize Permutation Flow Shop Scheduling via Graph-Based Imitation Learning b/data/2024/aaai/Learning to Optimize Permutation Flow Shop Scheduling via Graph-Based Imitation Learning new file mode 100644 index 0000000000..ce119c42ac --- /dev/null +++ b/data/2024/aaai/Learning to Optimize Permutation Flow Shop Scheduling via Graph-Based Imitation Learning @@ -0,0 +1 @@ +The permutation flow shop scheduling (PFSS), aiming at finding the optimal permutation of jobs, is widely used in manufacturing systems. When solving large-scale PFSS problems, traditional optimization algorithms such as heuristics could hardly meet the demands of both solution accuracy and computational efficiency, thus learning-based methods have recently garnered more attention. Some work attempts to solve the problems by reinforcement learning methods, which suffer from slow convergence issues during training and are still not accurate enough regarding the solutions. To that end, we propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately. Moreover, in order to extract better feature representations of input jobs, we incorporate the graph structure as the encoder. The extensive experiments reveal that our proposed model obtains significant promotion and presents excellent generalizability in large-scale problems with up to 1000 jobs. Compared to the state-of-the-art reinforcement learning method, our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average. The code is available at: https://github.com/longkangli/PFSS-IL. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Pivot as a Smart Expert b/data/2024/aaai/Learning to Pivot as a Smart Expert new file mode 100644 index 0000000000..dc3c49d317 --- /dev/null +++ b/data/2024/aaai/Learning to Pivot as a Smart Expert @@ -0,0 +1 @@ +Linear programming has been practically solved mainly by simplex and interior point methods. 
Compared with the weakly polynomial complexity obtained by the interior point methods, the existence of strongly polynomial bounds for the length of the pivot path generated by the simplex methods remains a mystery. In this paper, we propose two novel pivot experts that leverage both global and local information of the linear programming instances for the primal simplex method and show their excellent performance numerically. The experts can be regarded as a benchmark to evaluate the performance of classical pivot rules, although they are hard to directly implement. To tackle this challenge, we employ a graph convolutional neural network model, trained via imitation learning, to mimic the behavior of the pivot expert. Our pivot rule, learned empirically, displays a significant advantage over conventional methods in various linear programming problems, as demonstrated through a series of rigorous experiments. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Prompt Knowledge Transfer for Open-World Continual Learning b/data/2024/aaai/Learning to Prompt Knowledge Transfer for Open-World Continual Learning new file mode 100644 index 0000000000..f383fab7dd --- /dev/null +++ b/data/2024/aaai/Learning to Prompt Knowledge Transfer for Open-World Continual Learning @@ -0,0 +1 @@ +This paper studies the problem of continual learning in an open-world scenario, referred to as Open-world Continual Learning (OwCL). OwCL is increasingly rising while it is highly challenging in two-fold: i) learning a sequence of tasks without forgetting knowns in the past, and ii) identifying unknowns (novel objects/classes) in the future. Existing OwCL methods suffer from the adaptability of task-aware boundaries between knowns and unknowns, and do not consider the mechanism of knowledge transfer. In this work, we propose Pro-KT, a novel prompt-enhanced knowledge transfer model for OwCL. Pro-KT includes two key components: (1) a prompt bank to encode and transfer both task-generic and task-specific knowledge, and (2) a task-aware open-set boundary to identify unknowns in the new tasks. Experimental results using two real-world datasets demonstrate that the proposed Pro-KT outperforms the state-of-the-art counterparts in both the detection of unknowns and the classification of knowns markedly. Code released at https://github.com/YujieLi42/Pro-KT. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Rank in Generative Retrieval b/data/2024/aaai/Learning to Rank in Generative Retrieval new file mode 100644 index 0000000000..11d46d2591 --- /dev/null +++ b/data/2024/aaai/Learning to Rank in Generative Retrieval @@ -0,0 +1 @@ +Generative retrieval stands out as a promising new paradigm in text retrieval that aims to generate identifier strings of relevant passages as the retrieval target. This generative paradigm taps into powerful generative language models, distinct from traditional sparse or dense retrieval methods. However, only learning to generate is insufficient for generative retrieval. Generative retrieval learns to generate identifiers of relevant passages as an intermediate goal and then converts predicted identifiers into the final passage rank list. The disconnect between the learning objective of autoregressive models and the desired passage ranking target leads to a learning gap. To bridge this gap, we propose a learning-to-rank framework for generative retrieval, dubbed LTRGR. 
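As a sketch of what a rank loss over generated passage identifiers can look like in such a framework, here is a margin-based ranking loss on autoregressive identifier scores for a relevant and an irrelevant passage; the scoring function and margin are placeholders rather than LTRGR's exact formulation.

```python
import numpy as np

def sequence_score(token_logprobs):
    """Autoregressive score of a passage identifier = sum of its token log-probs."""
    return float(np.sum(token_logprobs))

def margin_rank_loss(pos_logprobs, neg_logprobs, margin=1.0):
    """Penalize the model when the relevant identifier does not outscore the
    irrelevant one by at least `margin`."""
    s_pos = sequence_score(pos_logprobs)
    s_neg = sequence_score(neg_logprobs)
    return max(0.0, margin - (s_pos - s_neg))

pos = np.log([0.9, 0.8, 0.7])   # token probabilities of the relevant identifier
neg = np.log([0.6, 0.5, 0.4])   # token probabilities of an irrelevant identifier
print(margin_rank_loss(pos, neg, margin=2.0))
```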
LTRGR enables generative retrieval to learn to rank passages directly, optimizing the autoregressive model toward the final passage ranking target via a rank loss. This framework only requires an additional learning-to-rank training phase to enhance current generative retrieval systems and does not add any burden to the inference stage. We conducted experiments on three public benchmarks, and the results demonstrate that LTRGR achieves state-of-the-art performance among generative retrieval methods. The code and checkpoints are released at https://github.com/liyongqi67/LTRGR. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Reweight for Generalizable Graph Neural Network b/data/2024/aaai/Learning to Reweight for Generalizable Graph Neural Network new file mode 100644 index 0000000000..d1adb89583 --- /dev/null +++ b/data/2024/aaai/Learning to Reweight for Generalizable Graph Neural Network @@ -0,0 +1,8 @@ +Graph Neural Networks (GNNs) show promising results for graph tasks. However, existing GNNs' generalization ability will degrade when there exist distribution shifts between testing and training graph data. +The fundamental reason for the severe degeneration is that most GNNs are designed based on the I.I.D hypothesis. In such a setting, GNNs tend to exploit subtle statistical correlations existing in the training set for predictions, even though it is a spurious correlation. +In this paper, we study the problem of the generalization ability of GNNs on Out-Of-Distribution (OOD) settings. +To solve this problem, we propose the Learning to Reweight for Generalizable Graph Neural Network (L2R-GNN) to enhance the generalization ability for achieving satisfactory performance on unseen testing graphs that have different distributions with training graphs. +We propose a novel nonlinear graph decorrelation method, which can substantially improve the out-of-distribution generalization ability and compares favorably to previous methods in restraining the over-reduced sample size. +The variables of graph representation are clustered based on the stability of their correlations, and graph decorrelation method learns weights to remove correlations between the variables of different clusters rather than any two variables. +Besides, we introduce an effective stochastic algorithm based on bi-level optimization for the L2R-GNN framework, which enables simultaneously learning the optimal weights and GNN parameters, and avoids the over-fitting issue. +Experiments show that L2R-GNN greatly outperforms baselines on various graph prediction benchmarks under distribution shifts. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Stop Cut Generation for Efficient Mixed-Integer Linear Programming b/data/2024/aaai/Learning to Stop Cut Generation for Efficient Mixed-Integer Linear Programming new file mode 100644 index 0000000000..ee744b141a --- /dev/null +++ b/data/2024/aaai/Learning to Stop Cut Generation for Efficient Mixed-Integer Linear Programming @@ -0,0 +1 @@ +Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), as they significantly tighten the dual bounds and improve the solving performance. A key problem for cuts is when to stop cuts generation, which is important for the efficiency of solving MILPs. However, many modern MILP solvers employ hard-coded heuristics to tackle this problem, which tends to neglect underlying patterns among MILPs from certain applications. 
To address this challenge, we formulate the cuts generation stopping problem as a reinforcement learning problem and propose a novel hybrid graph representation model (HYGRO) to learn effective stopping strategies. An appealing feature of HYGRO is that it can effectively capture both the dynamic and static features of MILPs, enabling dynamic decision-making for the stopping strategies. To the best of our knowledge, HYGRO is the first data-driven method to tackle the cuts generation stopping problem. By integrating our approach with modern solvers, experiments demonstrate that HYGRO significantly improves the efficiency of solving MILPs compared to competitive baselines, achieving up to 31% improvement. \ No newline at end of file diff --git a/data/2024/aaai/Learning to Unlearn: Instance-Wise Unlearning for Pre-trained Classifiers b/data/2024/aaai/Learning to Unlearn: Instance-Wise Unlearning for Pre-trained Classifiers new file mode 100644 index 0000000000..56a1926617 --- /dev/null +++ b/data/2024/aaai/Learning to Unlearn: Instance-Wise Unlearning for Pre-trained Classifiers @@ -0,0 +1 @@ +Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand in deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks towards adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining the predictive performance across remaining data. To this end, we consider instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods only require the pre-trained model and data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of remaining data while unlearning given instances in both single-task and continual unlearning scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Learning with Noisy Labels Using Hyperspherical Margin Weighting b/data/2024/aaai/Learning with Noisy Labels Using Hyperspherical Margin Weighting new file mode 100644 index 0000000000..2d03fa1f67 --- /dev/null +++ b/data/2024/aaai/Learning with Noisy Labels Using Hyperspherical Margin Weighting @@ -0,0 +1 @@ +Datasets often include noisy labels, but learning from them is difficult. Since mislabeled examples usually have larger loss values in training, the small-loss trick is regarded as a standard metric to identify the clean example from the training set for better performance. Nonetheless, this proposal ignores that some clean but hard-to-learn examples also generate large losses. They could be misidentified by this criterion. 
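For context, the small-loss criterion that this abstract argues against fits in a couple of lines: keep the fraction of samples with the smallest current loss as the presumed-clean set. The sketch below is that baseline, not the proposed IAM/HMW method.

```python
import numpy as np

def small_loss_selection(losses, keep_ratio=0.7):
    """Return indices of the `keep_ratio` fraction of samples with the smallest loss,
    treated as 'clean' -- clean-but-hard examples with large loss get discarded."""
    k = int(len(losses) * keep_ratio)
    return np.argsort(losses)[:k]

losses = np.array([0.2, 2.5, 0.4, 3.1, 0.9, 2.8])   # large losses: noisy OR hard
print(small_loss_selection(losses))
```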
In this paper, we propose a new metric called the Integrated Area Margin (IAM), which is superior to the traditional small-loss trick, particularly in recognizing the clean but hard-to-learn examples. According to the IAM, we further offer the Hyperspherical Margin Weighting (HMW) approach. It is a new sample weighting strategy that restructures the importance of each example. It should be highlighted that our approach is universal and can strengthen various methods in this field. Experiments on both benchmark and real-world datasets indicate that our HMW outperforms many state-of-the-art approaches in learning with noisy label tasks. Codes are available at https://github.com/Zhangshuojackpot/HMW. \ No newline at end of file diff --git a/data/2024/aaai/Learning-Augmented Online Algorithm for Two-Level Ski-Rental Problem b/data/2024/aaai/Learning-Augmented Online Algorithm for Two-Level Ski-Rental Problem new file mode 100644 index 0000000000..e50fc21a0a --- /dev/null +++ b/data/2024/aaai/Learning-Augmented Online Algorithm for Two-Level Ski-Rental Problem @@ -0,0 +1 @@ +In this paper, we study the two-level ski-rental problem, where a user needs to fulfill a sequence of demands for multiple items by choosing one of the three payment options: paying for the on-demand usage (i.e., rent), buying individual items (i.e., single purchase), and buying all the items (i.e., combo purchase). Without knowing future demands, the user aims to minimize the total cost (i.e., the sum of the rental, single purchase, and combo purchase costs) by balancing the trade-off between the expensive upfront costs (for purchase) and the potential future expenses (for rent). We first design a robust online algorithm (RDTSR) that offers a worst-case performance guarantee. While online algorithms are robust against the worst-case scenarios, they are often overly cautious and thus suffer a poor average performance in typical scenarios. On the other hand, Machine Learning (ML) algorithms typically show promising average performance in various applications but lack worst-case performance guarantees. To harness the benefits of both methods, we develop a learning-augmented algorithm (LADTSR) by integrating ML predictions into the robust online algorithm, which outperforms the robust online algorithm under accurate predictions while ensuring worst-case performance guarantees even when predictions are inaccurate. Finally, we conduct numerical experiments on both synthetic and real-world trace data to corroborate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Leaving the Nest: Going beyond Local Loss Functions for Predict-Then-Optimize b/data/2024/aaai/Leaving the Nest: Going beyond Local Loss Functions for Predict-Then-Optimize new file mode 100644 index 0000000000..09b2be03d2 --- /dev/null +++ b/data/2024/aaai/Leaving the Nest: Going beyond Local Loss Functions for Predict-Then-Optimize @@ -0,0 +1 @@ +Predict-then-Optimize is a framework for using machine learning to perform decision-making under uncertainty. The central research question it asks is, "How can we use the structure of a decision-making task to tailor ML models for that specific task?" To this end, recent work has proposed learning task-specific loss functions that capture this underlying structure. However, current approaches make restrictive assumptions about the form of these losses and their impact on ML model behavior. 
These assumptions both lead to approaches with high computational cost, and when they are violated in practice, poor performance. In this paper, we propose solutions to these issues, avoiding the aforementioned assumptions and utilizing the ML model's features to increase the sample efficiency of learning loss functions. We empirically show that our method achieves state-of-the-art results in four domains from the literature, often requiring an order of magnitude fewer samples than comparable methods from past work. Moreover, our approach outperforms the best existing method by nearly 200% when the localness assumption is broken. \ No newline at end of file diff --git a/data/2024/aaai/Less Is More: Label Recommendation for Weakly Supervised Point Cloud Semantic Segmentation b/data/2024/aaai/Less Is More: Label Recommendation for Weakly Supervised Point Cloud Semantic Segmentation new file mode 100644 index 0000000000..3b0d2fc8a6 --- /dev/null +++ b/data/2024/aaai/Less Is More: Label Recommendation for Weakly Supervised Point Cloud Semantic Segmentation @@ -0,0 +1 @@ +Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor scenes, we are one of the first to propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds. Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbones. Specifically, we leverage geometry patterns in outdoor scenes to have a heuristic pre-segmentation to reduce the manual labeling and jointly design the learning targets with the labeling process. In the learning step, we leverage prototype learning to get more descriptive point embeddings and use multi-scan distillation to exploit richer semantics from temporally aggregated point clouds to boost the performance of single-scan models. Evaluated on the SemanticKITTI and the nuScenes datasets, we show that our proposed method outperforms existing label-efficient methods. With extremely limited human annotations (e.g., 0.1% point labels), our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels. \ No newline at end of file diff --git a/data/2024/aaai/Let All Be Whitened: Multi-Teacher Distillation for Efficient Visual Retrieval b/data/2024/aaai/Let All Be Whitened: Multi-Teacher Distillation for Efficient Visual Retrieval new file mode 100644 index 0000000000..d8838b98af --- /dev/null +++ b/data/2024/aaai/Let All Be Whitened: Multi-Teacher Distillation for Efficient Visual Retrieval @@ -0,0 +1 @@ +Visual retrieval aims to search for the most relevant visual items, e.g., images and videos, from a candidate gallery with a given query item. Accuracy and efficiency are two competing objectives in retrieval tasks. Instead of crafting a new method pursuing further improvement on accuracy, in this paper we propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval. 
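The "Whiten" in Whiten-MTD refers to whitening each teacher's outputs before fusing them; as a generic reference point (not the paper's exact procedure), a ZCA-style whitening and fusion might look like this:

```python
import numpy as np

def whiten(x, eps=1e-5):
    """x: (n_samples, dim) outputs from one teacher. Returns whitened features
    with zero mean and (approximately) identity covariance."""
    x = x - x.mean(axis=0, keepdims=True)
    cov = (x.T @ x) / (len(x) - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    w = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T   # ZCA transform
    return x @ w

rng = np.random.default_rng(0)
teacher_a = rng.normal(size=(256, 8)) * 5.0    # teachers on very different scales ...
teacher_b = rng.normal(size=(256, 8)) * 0.1
fused = 0.5 * (whiten(teacher_a) + whiten(teacher_b))   # ... fused after whitening
print(np.cov(fused, rowvar=False).round(2))
```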
Furthermore, we discover that the similarities obtained by different retrieval models are diversified and incommensurable, which makes it challenging to jointly distill knowledge from multiple models. Therefore, we propose to whiten the output of teacher models before fusion, which enables effective multi-teacher distillation for retrieval models. Whiten-MTD is conceptually simple and practically effective. Extensive experiments on two landmark image retrieval datasets and one video retrieval dataset demonstrate the effectiveness of our proposed method, and its good balance of retrieval performance and efficiency. Our source code is released at https://github.com/Maryeon/whiten_mtd. \ No newline at end of file diff --git a/data/2024/aaai/Let There Be Sound: Reconstructing High Quality Speech from Silent Videos b/data/2024/aaai/Let There Be Sound: Reconstructing High Quality Speech from Silent Videos new file mode 100644 index 0000000000..6e81ffa426 --- /dev/null +++ b/data/2024/aaai/Let There Be Sound: Reconstructing High Quality Speech from Silent Videos @@ -0,0 +1 @@ +The goal of this work is to reconstruct high quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, resulting in a mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to better solve the aforementioned problem, we employ a flow based post-net which captures and refines the details of the generated speech. We perform extensive experiments on two datasets, and demonstrate that our method achieves the generation quality close to that of real human utterance, outperforming existing methods in terms of speech naturalness and intelligibility by a large margin. Synthesised samples are available at our demo page: https://mm.kaist.ac.kr/projects/LTBS. \ No newline at end of file diff --git a/data/2024/aaai/Levenshtein Distance Embedding with Poisson Regression for DNA Storage b/data/2024/aaai/Levenshtein Distance Embedding with Poisson Regression for DNA Storage new file mode 100644 index 0000000000..eed0e1a7cd --- /dev/null +++ b/data/2024/aaai/Levenshtein Distance Embedding with Poisson Regression for DNA Storage @@ -0,0 +1 @@ +Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming the Levenshtein distance between sequences of fixed length following a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. 
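The Poisson assumption yields a simple training objective, sketched below: the embedding distance between two sequences is treated as the rate of a Poisson distribution, and the loss is the Poisson negative log-likelihood of the observed Levenshtein distance (dropping the constant log k! term). The embedding map itself is left abstract; this is an illustration of the idea rather than the paper's exact objective.

```python
import numpy as np

def poisson_nll(rate, k, eps=1e-8):
    """Negative log-likelihood of observing edit distance k under Poisson(rate),
    dropping the constant log(k!) term."""
    rate = np.maximum(rate, eps)
    return rate - k * np.log(rate)

def pairwise_loss(emb_x, emb_y, lev_dist):
    rate = np.abs(emb_x - emb_y).sum()    # embedding distance used as the Poisson rate
    return poisson_nll(rate, lev_dist)

rng = np.random.default_rng(0)
print(pairwise_loss(rng.random(16), rng.random(16), lev_dist=4))
```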
Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Leverage the Explainability of Transformer Models to Improve the DNA 5-Methylcytosine Identification (Student Abstract) b/data/2024/aaai/Leverage the Explainability of Transformer Models to Improve the DNA 5-Methylcytosine Identification (Student Abstract) new file mode 100644 index 0000000000..113a7ef8f6 --- /dev/null +++ b/data/2024/aaai/Leverage the Explainability of Transformer Models to Improve the DNA 5-Methylcytosine Identification (Student Abstract) @@ -0,0 +1 @@ +DNA methylation is an epigenetic mechanism for regulating gene expression, and it plays an important role in many biological processes. While methylation sites can be identified using laboratory techniques, much work is being done on developing computational approaches using machine learning. Here, we present a deep-learning algorithm for determining the 5-methylcytosine status of a DNA sequence. We propose an ensemble framework that treats the self-attention score as an explicit feature that is added to the encoder layer generated by fine-tuned language models. We evaluate the performance of the model under different data distribution scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision b/data/2024/aaai/Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision new file mode 100644 index 0000000000..5545dd314e --- /dev/null +++ b/data/2024/aaai/Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision @@ -0,0 +1 @@ +Computer vision models have been known to encode harmful biases, leading to the potentially unfair treatment of historically marginalized groups, such as people of color. However, there remains a lack of datasets balanced along demographic traits that can be used to evaluate the downstream fairness of these models. In this work, we demonstrate that diffusion models can be leveraged to create such a dataset. We first use a diffusion model to generate a large set of images depicting various occupations. Subsequently, each image is edited using inpainting to generate multiple variants, where each variant refers to a different perceived race. Using this dataset, we benchmark several vision-language models on a multi-class occupation classification task. We find that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that several misclassifications are suggestive of racial biases. We measure a model’s downstream fairness by computing the standard deviation in the probability of predicting the true occupation label across the different identity groups. Using this fairness metric, we find significant disparities between the evaluated vision-and-language models. We hope that our work demonstrates the potential value of diffusion methods for fairness evaluations. 
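The downstream fairness metric described above is simple enough to state in code: for a given occupation, take the model's probability of the true label within each perceived identity group and report the standard deviation across the group means (group names below are placeholders).

```python
import numpy as np

def fairness_std(true_label_probs_by_group):
    """true_label_probs_by_group: dict group -> list of P(true occupation label)
    for that group's images. A lower std means more uniform treatment across groups."""
    group_means = [np.mean(p) for p in true_label_probs_by_group.values()]
    return float(np.std(group_means))

probs = {"group_a": [0.82, 0.79, 0.85],
         "group_b": [0.61, 0.58, 0.66],
         "group_c": [0.74, 0.71, 0.77]}
print(fairness_std(probs))
```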
\ No newline at end of file diff --git a/data/2024/aaai/Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection b/data/2024/aaai/Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection new file mode 100644 index 0000000000..0ea4e3b008 --- /dev/null +++ b/data/2024/aaai/Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection @@ -0,0 +1 @@ +Training high-accuracy 3D detectors necessitates massive labeled 3D annotations with 7 degree-of-freedom, which is laborious and time-consuming. Therefore, the form of point annotations is proposed to offer significant prospects for practical applications in 3D detection, which is not only more accessible and less expensive but also provides strong spatial information for object localization. In this paper, we empirically discover that it is non-trivial to merely adapt Point-DETR to its 3D form, encountering two main bottlenecks: 1) it fails to encode strong 3D prior into the model, and 2) it generates low-quality pseudo labels in distant regions due to the extreme sparsity of LiDAR points. To overcome these challenges, we introduce Point-DETR3D, a teacher-student framework for weakly semi-supervised 3D detection, designed to fully capitalize on point-wise supervision within a constrained instance-wise annotation budget. Different from Point-DETR which encodes 3D positional information solely through a point encoder, we propose an explicit positional query initialization strategy to enhance the positional prior. Considering the low quality of pseudo labels at distant regions produced by the teacher model, we enhance the detector's perception by incorporating dense imagery data through a novel Cross-Modal Deformable RoI Fusion (D-RoI). Moreover, an innovative point-guided self-supervised learning technique is proposed to allow for fully exploiting point priors, even in student models. Extensive experiments on representative nuScenes dataset demonstrate our Point-DETR3D obtains significant improvements compared to previous works. Notably, with only 5% of labeled data, Point-DETR3D achieves over 90% performance of its fully supervised counterpart. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Local Variance for Pseudo-Label Selection in Semi-supervised Learning b/data/2024/aaai/Leveraging Local Variance for Pseudo-Label Selection in Semi-supervised Learning new file mode 100644 index 0000000000..254065daee --- /dev/null +++ b/data/2024/aaai/Leveraging Local Variance for Pseudo-Label Selection in Semi-supervised Learning @@ -0,0 +1,2 @@ +Semi-supervised learning algorithms that use pseudo-labeling have become increasingly popular for improving model performance by utilizing both labeled and unlabeled data. +In this paper, we offer a fresh perspective on the selection of pseudo-labels, inspired by theoretical insights. We suggest that pseudo-labels with a high degree of local variance are more prone to inaccuracies. Based on this premise, we introduce the Local Variance Match (LVM) method, which aims to optimize the selection of pseudo-labels in semi-supervised learning (SSL) tasks. Our methodology is validated through a series of experiments on widely-used image classification datasets, such as CIFAR-10, CIFAR-100, and SVHN, spanning various labeled data quantity scenarios. 
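A rough sketch of the local-variance idea: for each unlabeled sample, measure how much the predicted class probabilities vary over its nearest neighbours in feature space, and filter out pseudo-labels whose neighbourhoods disagree strongly. The neighbourhood definition and threshold are illustrative choices, not the paper's exact criterion.

```python
import numpy as np

def local_variance(features, probs, k=5):
    """features: (n, d) embeddings; probs: (n, c) predicted class probabilities.
    Returns, per sample, the mean variance of class probabilities over its k nearest neighbours."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]
    return probs[knn].var(axis=1).mean(axis=-1)

def select_pseudo_labels(features, probs, k=5, var_threshold=0.06):
    keep = local_variance(features, probs, k) < var_threshold
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))
probs = rng.dirichlet(np.ones(3), size=50)
idx, labels = select_pseudo_labels(feats, probs)
print(len(idx), labels[:5])
```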
The empirical findings show that the LVM method substantially outpaces current SSL techniques, achieving state-of-the-art results in many of these scenarios. For instance, we observed an error rate of 5.41% on CIFAR-10 with a single label for each class, 35.87% on CIFAR-100 when using four labels per class, and 1.94% on SVHN with four labels for each class. Notably, the standout error rate of 5.41% is less than 1% shy of the performance in a fully-supervised learning environment. In experiments on ImageNet with 100k labeled data, the LVM also reached state-of-the-art outcomes. Additionally, the efficacy of the LVM method is further validated by its stellar performance in speech recognition experiments. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Normalization Layer in Adapters with Progressive Learning and Adaptive Distillation for Cross-Domain Few-Shot Learning b/data/2024/aaai/Leveraging Normalization Layer in Adapters with Progressive Learning and Adaptive Distillation for Cross-Domain Few-Shot Learning new file mode 100644 index 0000000000..1b6e1b0282 --- /dev/null +++ b/data/2024/aaai/Leveraging Normalization Layer in Adapters with Progressive Learning and Adaptive Distillation for Cross-Domain Few-Shot Learning @@ -0,0 +1 @@ +Cross-domain few-shot learning presents a formidable challenge, as models must be trained on base classes and then tested on novel classes from various domains with only a few samples at hand. While prior approaches have primarily focused on parameter-efficient methods of using adapters, they often overlook two critical issues: shifts in batch statistics and noisy sample statistics arising from domain discrepancy variations. In this paper, we introduce Leveraging Normalization Layer in Adapters with Progressive Learning and Adaptive Distillation (ProLAD), marking two principal contributions. First, our methodology utilizes two separate adapters: one devoid of a normalization layer, which is more effective for similar domains, and another embedded with a normalization layer, designed to leverage the batch statistics of the target domain, thus proving effective for dissimilar domains. Second, to address the pitfalls of noisy statistics, we deploy two strategies: a progressive training of the two adapters and an adaptive distillation technique derived from features determined by the model solely with the adapter devoid of a normalization layer. Through this adaptive distillation, our approach functions as a modulator, controlling the primary adapter for adaptation, based on each domain. Evaluations on standard cross-domain few-shot learning benchmarks confirm that our technique outperforms existing state-of-the-art methodologies. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Opposite Gender Interaction Ratio as a Path towards Fairness in Online Dating Recommendations Based on User Sexual Orientation b/data/2024/aaai/Leveraging Opposite Gender Interaction Ratio as a Path towards Fairness in Online Dating Recommendations Based on User Sexual Orientation new file mode 100644 index 0000000000..2572749234 --- /dev/null +++ b/data/2024/aaai/Leveraging Opposite Gender Interaction Ratio as a Path towards Fairness in Online Dating Recommendations Based on User Sexual Orientation @@ -0,0 +1 @@ +Online dating platforms have gained widespread popularity as a means for individuals to seek potential romantic relationships. 
While recommender systems have been designed to improve the user experience in dating platforms by providing personalized recommendations, increasing concerns about fairness have encouraged the development of fairness-aware recommender systems from various perspectives (e.g., gender and race). However, sexual orientation, which plays a significant role in finding a satisfying relationship, is under-investigated. To fill this crucial gap, we propose a novel metric, Opposite Gender Interaction Ratio (OGIR), as a way to investigate potential unfairness for users with varying preferences towards the opposite gender. We empirically analyze a real online dating dataset and observe existing recommender algorithms could suffer from group unfairness according to OGIR. We further investigate the potential causes for such gaps in recommendation quality, which lead to the challenges of group quantity imbalance and group calibration imbalance. Ultimately, we propose a fair recommender system based on re-weighting and re-ranking strategies to respectively mitigate these associated imbalance challenges. Experimental results demonstrate both strategies improve fairness while their combination achieves the best performance towards maintaining model utility while improving fairness. \ No newline at end of file diff --git a/data/2024/aaai/Leveraging Partial Symmetry for Multi-Agent Reinforcement Learning b/data/2024/aaai/Leveraging Partial Symmetry for Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..a8c4e425c4 --- /dev/null +++ b/data/2024/aaai/Leveraging Partial Symmetry for Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Incorporating symmetry as an inductive bias into multi-agent reinforcement learning (MARL) has led to improvements in generalization, data efficiency, and physical consistency. While prior research has succeeded in using perfect symmetry prior, the realm of partial symmetry in the multi-agent domain remains unexplored. To fill in this gap, we introduce the partially symmetric Markov game, a new subclass of the Markov game. We then theoretically show that the performance error introduced by utilizing symmetry in MARL is bounded, implying that the symmetry prior can still be useful in MARL even in partial symmetry situations. Motivated by this insight, we propose the Partial Symmetry Exploitation (PSE) framework that is able to adaptively incorporate symmetry prior in MARL under different symmetry-breaking conditions. Specifically, by adaptively adjusting the exploitation of symmetry, our framework is able to achieve superior sample efficiency and overall performance of MARL algorithms. Extensive experiments are conducted to demonstrate the superior performance of the proposed framework over baselines. Finally, we implement the proposed framework in real-world multi-robot testbed to show its superiority. 
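To make the OGIR idea from the online-dating abstract above concrete, here is a minimal sketch. The abstract does not give a formula, so we assume OGIR is the share of a user's past interactions that are with opposite-gender users, and we proxy group unfairness by the gap in mean recommendation quality across OGIR bins; the function names, binning, and quality metric are illustrative assumptions, not the paper's definitions.

```python
from collections import defaultdict

def ogir(user_gender, interacted_genders):
    """Assumed definition: share of a user's interactions that are with the opposite gender."""
    if not interacted_genders:
        return 0.0
    opposite = sum(1 for g in interacted_genders if g != user_gender)
    return opposite / len(interacted_genders)

def group_quality_gap(users, quality, n_bins=4):
    """Bin users by OGIR and report mean recommendation quality per bin.

    users:   dict user_id -> (gender, list of genders of interacted partners)
    quality: dict user_id -> recommendation quality for that user (e.g., recall@k)
    Returns per-bin mean quality and the max-min gap across bins (an unfairness proxy).
    """
    bins = defaultdict(list)
    for uid, (gender, partners) in users.items():
        r = ogir(gender, partners)
        b = min(int(r * n_bins), n_bins - 1)
        bins[b].append(quality[uid])
    means = {b: sum(v) / len(v) for b, v in bins.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap

# Toy example with three users.
users = {
    1: ("F", ["M", "M", "M"]),   # high OGIR
    2: ("M", ["M", "M", "F"]),   # low OGIR
    3: ("F", ["F", "M"]),        # mid OGIR
}
quality = {1: 0.40, 2: 0.15, 3: 0.30}
print(group_quality_gap(users, quality))
```

Re-weighting or re-ranking strategies of the kind mentioned above would then aim to shrink the reported gap while keeping overall quality high.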
\ No newline at end of file diff --git a/data/2024/aaai/Liberating Seen Classes: Boosting Few-Shot and Zero-Shot Text Classification via Anchor Generation and Classification Reframing b/data/2024/aaai/Liberating Seen Classes: Boosting Few-Shot and Zero-Shot Text Classification via Anchor Generation and Classification Reframing new file mode 100644 index 0000000000..90e8246152 --- /dev/null +++ b/data/2024/aaai/Liberating Seen Classes: Boosting Few-Shot and Zero-Shot Text Classification via Anchor Generation and Classification Reframing @@ -0,0 +1 @@ +Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance via transferring knowledge from seen classes to unseen classes, they are still limited by two issues: (1) inherent dissimilarities among classes make transferring features learned from seen classes to unseen classes both difficult and inefficient; (2) rare labeled novel samples usually cannot provide enough supervision signals to enable the model to adjust from the source distribution to the target distribution, especially for complicated scenarios. To alleviate the above issues, we propose a simple and effective strategy for few-shot and zero-shot text classification. We aim to liberate the model from the confines of seen classes, thereby enabling it to predict unseen categories without the necessity of training on seen classes. Specifically, to mine more relevant unseen-category knowledge, we utilize a large pre-trained language model to generate pseudo novel samples, and select the most representative ones as category anchors. After that, we convert the multi-class classification task into a binary classification task and use the similarities of query-anchor pairs for prediction to fully leverage the limited supervision signals. Extensive experiments on six widely used public datasets show that our proposed method can outperform other strong baselines significantly in few-shot and zero-shot tasks, even without using any seen class samples. \ No newline at end of file diff --git a/data/2024/aaai/Lifting by Image - Leveraging Image Cues for Accurate 3D Human Pose Estimation b/data/2024/aaai/Lifting by Image - Leveraging Image Cues for Accurate 3D Human Pose Estimation new file mode 100644 index 0000000000..2242b9efdc --- /dev/null +++ b/data/2024/aaai/Lifting by Image - Leveraging Image Cues for Accurate 3D Human Pose Estimation @@ -0,0 +1 @@ +The "lifting from 2D pose" method has been the dominant approach to 3D Human Pose Estimation (3DHPE) due to the powerful visual analysis ability of 2D pose estimators. As is widely known, there exists a depth ambiguity problem when estimating solely from 2D pose, where one 2D pose can be mapped to multiple 3D poses. Intuitively, the rich semantic and texture information in images can contribute to a more accurate "lifting" procedure. Yet, existing research encounters two primary challenges. Firstly, the distribution of image data in 3D motion capture datasets is too narrow because of the laboratory capture environment, which leads to poor generalization ability of methods trained with image information. Secondly, effective strategies for leveraging image information are lacking. In this paper, we give new insight into the cause of the poor generalization problem and the effectiveness of image features. Based on that, we propose an advanced framework. Specifically, the framework consists of two stages.
First, we enable the keypoints to query and select the beneficial features from all image patches. To reduce the keypoints' attention to inconsequential background features, we design a novel Pose-guided Transformer Layer, which adaptively limits the updates to unimportant image patches. Then, through a designed Adaptive Feature Selection Module, we prune less significant image patches from the feature map. In the second stage, we allow the keypoints to further emphasize the retained critical image features. This progressive learning approach prevents further training on insignificant image features. Experimental results show that our model achieves state-of-the-art performance on both the Human3.6M dataset and the MPI-INF-3DHP dataset. \ No newline at end of file diff --git a/data/2024/aaai/LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack b/data/2024/aaai/LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack new file mode 100644 index 0000000000..f344338ec0 --- /dev/null +++ b/data/2024/aaai/LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack @@ -0,0 +1 @@ +Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks adopt internal model information (gradients or confidence scores) to generate adversarial examples. However, this information is unavailable in the real world. Therefore, we focus on a more realistic and challenging setting, named hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then utilize complex heuristic algorithms to optimize the adversarial perturbation. These methods require many model queries, and the attack success rate is restricted by the adversary initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate word importance ranking, and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models and some defense methods, and results indicate that adversarial examples remain a significant threat to large language models. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training. \ No newline at end of file diff --git a/data/2024/aaai/Limitations of Face Image Generation b/data/2024/aaai/Limitations of Face Image Generation new file mode 100644 index 0000000000..df4b2457a7 --- /dev/null +++ b/data/2024/aaai/Limitations of Face Image Generation @@ -0,0 +1 @@ +Text-to-image diffusion models have achieved widespread popularity due to their unprecedented image generation capability. In particular, their ability to synthesize and modify human faces has spurred research into using generated face images in both training data augmentation and model performance assessments. In this paper, we study the efficacy and shortcomings of generative models in the context of face generation.
Utilizing a combination of qualitative and quantitative measures, including embedding-based metrics and user studies, we present a framework to audit the characteristics of generated faces conditioned on a set of social attributes. We apply our framework to faces generated through state-of-the-art text-to-image diffusion models. We identify several limitations of face image generation that include faithfulness to the text prompt, demographic disparities, and distributional shifts. Furthermore, we present an analytical model that provides insights into how training data selection contributes to the performance of generative models. Our survey data and analytics code can be found online at https://github.com/wi-pi/Limitations_of_Face_Generation \ No newline at end of file diff --git a/data/2024/aaai/Limited Query Graph Connectivity Test b/data/2024/aaai/Limited Query Graph Connectivity Test new file mode 100644 index 0000000000..18df13ddca --- /dev/null +++ b/data/2024/aaai/Limited Query Graph Connectivity Test @@ -0,0 +1,7 @@ +We propose a combinatorial optimisation model called Limited Query Graph Connectivity Test. We consider a graph whose edges have two possible states (On/Off). The edges' states are hidden initially. We can query an edge to reveal its state. Given a source s and a destination t, we aim to test s−t connectivity by identifying either a path (consisting of only On edges) or a cut (consisting of only Off edges). We are limited to B queries, after which we stop regardless of whether graph connectivity is established. We aim to design a query policy that minimizes the expected number of queries. + +Our model is mainly motivated by a cyber security use case where we need to establish whether attack paths exist in a given network, between a source (i.e., a compromised user node) and a destination (i.e., a high-privilege admin node). Edge query is resolved by manual effort from the IT admin, which is the motivation behind query minimization. + +Our model is highly related to Stochastic Boolean Function Evaluation (SBFE). There are two existing exact algorithms for SBFE that are prohibitively expensive. We propose a significantly more scalable exact algorithm. While previous exact algorithms only scale for trivial graphs (i.e., past works experimented on at most 20 edges), we empirically demonstrate that our algorithm is scalable for a wide range of much larger practical graphs (i.e., graphs representing Windows domain networks with tens of thousands of edges). + +We also propose three heuristics. Our best-performing heuristic limits the planning horizon of the exact algorithm. The other two are based on reinforcement learning (RL) and Monte Carlo tree search (MCTS). We also derive an algorithm for computing the performance lower bound. Experimentally, we show that all our heuristics are near optimal. The heuristic building on the exact algorithm outperforms all other heuristics, surpassing RL, MCTS and eight existing heuristics ported from SBFE and related literature.
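The problem statement above is easy to simulate. The sketch below implements the setting (hidden On/Off edge states, a query budget B, and certification by either an all-On path or an all-Off cut) together with a naive policy that repeatedly queries the first unknown edge on a candidate s-t path. This baseline policy is our own illustration and is not the exact algorithm or any of the heuristics proposed in the paper.

```python
from collections import deque

def bfs_path(adj, allowed, s, t):
    """Return a path from s to t using only edges whose state is in `allowed`, or None."""
    parent = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v, state in adj[u].items():
            if state in allowed and v not in parent:
                parent[v] = u
                q.append(v)
    return None

def greedy_path_policy(edges, true_state, s, t, budget):
    """Naive query policy for the limited-query s-t connectivity test.

    edges:      iterable of (u, v) pairs
    true_state: dict (u, v) -> 'on'/'off', hidden from the policy until queried
    Repeatedly pick a candidate s-t path through On/unknown edges and query its
    first unknown edge; stop once a live path or a cut is certified, or the
    budget is exhausted.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, {})[v] = 'unknown'
        adj.setdefault(v, {})[u] = 'unknown'
    adj.setdefault(s, {})
    adj.setdefault(t, {})
    queries = 0
    while True:
        if bfs_path(adj, {'on'}, s, t):
            return 'path', queries                  # certified connected via On edges
        candidate = bfs_path(adj, {'on', 'unknown'}, s, t)
        if candidate is None:
            return 'cut', queries                   # revealed Off edges already form a cut
        if queries == budget:
            return 'undecided', queries
        # Query the first unknown edge on the candidate path.
        for u, v in zip(candidate, candidate[1:]):
            if adj[u][v] == 'unknown':
                state = true_state[(u, v)] if (u, v) in true_state else true_state[(v, u)]
                adj[u][v] = adj[v][u] = state
                queries += 1
                break

# Toy instance: two parallel s-t routes, the upper one is broken.
edges = [('s', 'a'), ('a', 't'), ('s', 'b'), ('b', 't')]
truth = {('s', 'a'): 'on', ('a', 't'): 'off', ('s', 'b'): 'on', ('b', 't'): 'on'}
print(greedy_path_policy(edges, truth, 's', 't', budget=4))
```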
\ No newline at end of file diff --git a/data/2024/aaai/Limited-Supervised Multi-Label Learning with Dependency Noise b/data/2024/aaai/Limited-Supervised Multi-Label Learning with Dependency Noise new file mode 100644 index 0000000000..b264647ef9 --- /dev/null +++ b/data/2024/aaai/Limited-Supervised Multi-Label Learning with Dependency Noise @@ -0,0 +1 @@ +Limited-supervised multi-label learning (LML) leverages weak or noisy supervision for multi-label classification model training over data with label noise, which contain missing labels and/or redundant labels. Existing studies usually solve LML problems by assuming that label noise is independent of the input features and class labels, while ignoring the fact that noisy labels may depend on the input features (instance-dependent) and the classes (label-dependent) in many real-world applications. In this paper, we propose limited-supervised Multi-label Learning with Dependency Noise (MLDN) to simultaneously identify the instance-dependent and label-dependent label noise by factorizing the noise matrix as the outputs of a mapping from the feature and label representations. Meanwhile, we regularize the problem with a manifold constraint on the noise matrix to preserve local relationships and uncover the manifold structure. Theoretically, we bound the noise recovery error for the resulting problem. We solve the problem by using a first-order scheme based on the proximal operator, whose convergence rate is at least sub-linear. Extensive experiments conducted on various datasets demonstrate the superiority of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Linear-Time Algorithms for Front-Door Adjustment in Causal Graphs b/data/2024/aaai/Linear-Time Algorithms for Front-Door Adjustment in Causal Graphs new file mode 100644 index 0000000000..6cdc32a54e --- /dev/null +++ b/data/2024/aaai/Linear-Time Algorithms for Front-Door Adjustment in Causal Graphs @@ -0,0 +1 @@ +Causal effect estimation from observational data is a fundamental task in empirical sciences. It becomes particularly challenging when unobserved confounders are involved in a system. This paper focuses on front-door adjustment – a classic technique which, using observed mediators, allows identifying causal effects even in the presence of unobserved confounding. While the statistical properties of the front-door estimation are quite well understood, its algorithmic aspects remained unexplored for a long time. In 2022, Jeong, Tian, and Bareinboim presented the first polynomial-time algorithm for finding sets satisfying the front-door criterion in a given directed acyclic graph (DAG), with an O(n³(n+m)) run time, where n denotes the number of variables and m the number of edges of the causal graph. In our work, we give the first linear-time, i.e., O(n+m), algorithm for this task, which thus reaches the asymptotically optimal time complexity. This result implies an algorithm that enumerates all front-door adjustment sets with O(n(n+m)) delay, again improving previous work by a factor of n³. Moreover, we provide the first linear-time algorithm for finding a minimal front-door adjustment set. We offer implementations of our algorithms in multiple programming languages to facilitate practical usage and empirically validate their feasibility, even for large graphs.
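For context on what a front-door adjustment set is used for once it has been found, the classical front-door adjustment formula (due to Pearl), stated here for discrete variables, is the following; the notation is ours, and the algorithmic contribution above concerns finding the mediator set M, not evaluating this expression.

```latex
% Front-door adjustment: if M satisfies the front-door criterion relative to (X, Y),
% the causal effect of X on Y is identifiable from observational data as
\[
  P\big(y \mid \mathrm{do}(x)\big)
  \;=\; \sum_{m} P(m \mid x) \sum_{x'} P\big(y \mid x', m\big)\, P(x').
\]
```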
\ No newline at end of file diff --git a/data/2024/aaai/Linear-Time Verification of Data-Aware Processes Modulo Theories via Covers and Automata b/data/2024/aaai/Linear-Time Verification of Data-Aware Processes Modulo Theories via Covers and Automata new file mode 100644 index 0000000000..96ea57390a --- /dev/null +++ b/data/2024/aaai/Linear-Time Verification of Data-Aware Processes Modulo Theories via Covers and Automata @@ -0,0 +1 @@ +The need to model and analyse dynamic systems operating over complex data is ubiquitous in AI and neighboring areas, in particular business process management. Analysing such data-aware systems is a notoriously difficult problem, as they are intrinsically infinite-state. Existing approaches work for specific datatypes, and/or limit themselves to the verification of safety properties. In this paper, we lift both such limitations, studying for the first time linear-time verification for so-called data-aware processes modulo theories (DMTs), from the foundational and practical point of view. The DMT model is very general, as it supports processes operating over variables that can store arbitrary types of data, ranging over infinite domains and equipped with domain-specific predicates. Specifically, we provide four contributions. First, we devise a semi-decision procedure for linear-time verification of DMTs, which works for a very large class of datatypes obeying to mild model-theoretic assumptions. The procedure relies on a unique combination of automata-theoretic and cover computation techniques to respectively deal with linear-time properties and datatypes. Second, we identify an abstract, semantic property that guarantees the existence of a faithful finite-state abstraction of the original system, and show that our method becomes a decision procedure in this case. Third, we identify concrete, checkable classes of systems that satisfy this property, generalising several results in the literature. Finally, we present an implementation and an experimental evaluation over a benchmark of real-world data-aware business processes. \ No newline at end of file diff --git a/data/2024/aaai/Link Prediction in Multilayer Networks via Cross-Network Embedding b/data/2024/aaai/Link Prediction in Multilayer Networks via Cross-Network Embedding new file mode 100644 index 0000000000..3a2a7f2a30 --- /dev/null +++ b/data/2024/aaai/Link Prediction in Multilayer Networks via Cross-Network Embedding @@ -0,0 +1 @@ +Link prediction is a fundamental task in network analysis, with the objective of predicting missing or potential links. While existing studies have mainly concentrated on single networks, it is worth noting that numerous real-world networks exhibit interconnectedness. For example, individuals often register on various social media platforms to access diverse services, such as chatting, tweeting, blogging, and rating movies. These platforms share a subset of users and are termed multilayer networks. The interlayer links in such networks hold valuable information that provides more comprehensive insights into the network structure. To effectively exploit this complementary information and enhance link prediction in the target network, we propose a novel cross-network embedding method. This method aims to represent different networks in a shared latent space, preserving proximity within single networks as well as consistency across multilayer networks. Specifically, nodes can aggregate messages from aligned nodes in other layers. 
Extensive experiments conducted on real-world datasets demonstrate the superior performance of our proposed method for link prediction in multilayer networks. \ No newline at end of file diff --git a/data/2024/aaai/Live and Learn: Continual Action Clustering with Incremental Views b/data/2024/aaai/Live and Learn: Continual Action Clustering with Incremental Views new file mode 100644 index 0000000000..355e80493c --- /dev/null +++ b/data/2024/aaai/Live and Learn: Continual Action Clustering with Incremental Views @@ -0,0 +1 @@ +Multi-view action clustering leverages the complementary information from different camera views to enhance the clustering performance. Although existing approaches have achieved significant progress, they assume all camera views are available in advance, which is impractical when the camera view is incremental over time. Besides, learning the invariant information among multiple camera views is still a challenging issue, especially in continual learning scenario. Aiming at these problems, we propose a novel continual action clustering (CAC) method, which is capable of learning action categories in a continual learning manner. To be specific, we first devise a category memory library, which captures and stores the learned categories from historical views. Then, as a new camera view arrives, we only need to maintain a consensus partition matrix, which can be updated by leveraging the incoming new camera view rather than keeping all of them. Finally, a three-step alternate optimization is proposed, in which the category memory library and consensus partition matrix are optimized. The empirical experimental results on 6 realistic multi-view action collections demonstrate the excellent clustering performance and time/space efficiency of the CAC compared with 15 state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Local Consistency Guidance: Personalized Stylization Method of Face Video (Student Abstract) b/data/2024/aaai/Local Consistency Guidance: Personalized Stylization Method of Face Video (Student Abstract) new file mode 100644 index 0000000000..3de66a6895 --- /dev/null +++ b/data/2024/aaai/Local Consistency Guidance: Personalized Stylization Method of Face Video (Student Abstract) @@ -0,0 +1 @@ +Face video stylization aims to convert real face videos into specified reference styles. While one-shot methods perform well in single-image stylization, ensuring continuity between frames and retaining the original facial expressions present challenges in video stylization. To address these issues, our approach employs a personalized diffusion model with pixel-level control. We propose Local Consistency Guidance(LCG) strategy, composed of local-cross attention and local style transfer, to ensure temporal consistency. This framework enables the synthesis of high-quality stylized face videos with excellent temporal continuity. \ No newline at end of file diff --git a/data/2024/aaai/Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding b/data/2024/aaai/Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding new file mode 100644 index 0000000000..a932da4ee7 --- /dev/null +++ b/data/2024/aaai/Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding @@ -0,0 +1 @@ +This paper for the first time leverages multi-modal videos for weakly-supervised temporal video grounding. 
As labeling the video moment is labor-intensive and subjective, the weakly-supervised approaches have gained increasing attention in recent years. However, these approaches could inherently compromise performance due to inadequate supervision. Therefore, to tackle this challenge, we for the first time pay attention to exploiting complementary information extracted from multi-modal videos (e.g., RGB frames, optical flows), where richer supervision is naturally introduced in the weakly-supervised context. Our motivation is that by integrating different modalities of the videos, the model is learned from synergic supervision and thereby can attain superior generalization capability. However, addressing multiple modalities would also inevitably introduce additional computational overhead, and might become inapplicable if a particular modality is inaccessible. To solve this issue, we adopt a novel route: building a multi-modal distillation algorithm to capitalize on the multi-modal knowledge as supervision for model training, while still being able to work with only a single-modal input during inference. As such, we can utilize the benefits brought by the supplementary nature of multiple modalities, without compromising the applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a sophisticated teacher model to learn collaboratively from the multi-modal videos. Then we identify two sorts of knowledge from the teacher model, i.e., temporal boundaries and semantic activation maps. And we devise a local-global distillation algorithm to transfer this knowledge to a student model with single-modal input at both local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with/without multi-modal inputs. \ No newline at end of file diff --git a/data/2024/aaai/Locality Preserving Refinement for Shape Matching with Functional Maps b/data/2024/aaai/Locality Preserving Refinement for Shape Matching with Functional Maps new file mode 100644 index 0000000000..4bf8cb5266 --- /dev/null +++ b/data/2024/aaai/Locality Preserving Refinement for Shape Matching with Functional Maps @@ -0,0 +1 @@ +In this paper, we address nonrigid shape matching with outliers by a novel and effective pointwise map refinement method, termed Locality Preserving Refinement. For accurate pointwise conversion from a given functional map, our method formulates a two-step procedure. Firstly, starting with noisy point-to-point correspondences, we identify inliers by leveraging the neighborhood support, which yields a closed-form solution with linear time complexity. After obtaining reliable correspondences for the inliers, we refine the pointwise correspondences for outliers using local linear embedding, which operates in an adaptive spectral similarity space to further eliminate the ambiguities that are difficult to handle in the functional space. By refining pointwise correspondences with local consistency, thus embedding geometric constraints into functional spaces, our method achieves considerable improvement in accuracy with linearithmic time and space cost. Extensive experiments on public benchmarks demonstrate the superiority of our method over the state-of-the-art methods. Our code is publicly available at https://github.com/XiaYifan1999/LOPR.
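As background for the refinement method above, the sketch below shows the standard nearest-neighbour conversion from a functional map to a noisy point-to-point map, which is the kind of initial correspondence such refinement methods start from and clean up. The direction of the map C and the basis conventions follow one common choice and are assumptions on our part; the eigenbases in the toy usage are random stand-ins for real Laplace-Beltrami bases.

```python
import numpy as np
from scipy.spatial import cKDTree

def functional_map_to_p2p(C, evecs1, evecs2):
    """Baseline conversion of a functional map to pointwise correspondences.

    C:      (k, k) functional map taking spectral coefficients on shape 1 to shape 2
    evecs1: (n1, k) eigenbasis of shape 1
    evecs2: (n2, k) eigenbasis of shape 2
    Each vertex of shape 2 is matched to the shape-1 vertex whose transported
    spectral embedding is nearest (the usual noisy map that refinement improves).
    """
    transported = evecs1 @ C.T      # (n1, k): shape-1 embeddings mapped into shape-2's basis
    tree = cKDTree(transported)
    _, p2p = tree.query(evecs2)     # for each shape-2 vertex, nearest shape-1 vertex
    return p2p                      # (n2,) indices into shape 1

# Toy usage with random stand-ins for real eigenbases.
rng = np.random.default_rng(0)
k, n1, n2 = 20, 500, 480
C = np.eye(k) + 0.01 * rng.normal(size=(k, k))
phi1 = rng.normal(size=(n1, k))
phi2 = rng.normal(size=(n2, k))
print(functional_map_to_p2p(C, phi1, phi2)[:10])
```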
\ No newline at end of file diff --git a/data/2024/aaai/Locally Rainbow Paths b/data/2024/aaai/Locally Rainbow Paths new file mode 100644 index 0000000000..5f62b23915 --- /dev/null +++ b/data/2024/aaai/Locally Rainbow Paths @@ -0,0 +1 @@ +We introduce the algorithmic problem of finding a locally rainbow path of length l connecting two distinguished vertices s and t in a vertex-colored directed graph. Herein, a path is locally rainbow if between any two visits of equally colored vertices, the path traverses consecutively at least r differently colored vertices. This problem generalizes the well-known problem of finding a rainbow path. It finds natural applications whenever there are different types of resources that must be protected from overuse, such as crop sequence optimization or production process scheduling. We show that the problem is computationally intractable even if r=2 or if one looks for a locally rainbow path among the shortest paths. On the positive side, if one looks for a path that takes only a short detour (i.e., it is slightly longer than the shortest path) and if r is small, the problem can be solved efficiently. Indeed, the running time of the respective algorithm is near-optimal unless the ETH fails. \ No newline at end of file diff --git a/data/2024/aaai/LogoStyleFool: Vitiating Video Recognition Systems via Logo Style Transfer b/data/2024/aaai/LogoStyleFool: Vitiating Video Recognition Systems via Logo Style Transfer new file mode 100644 index 0000000000..07cef35645 --- /dev/null +++ b/data/2024/aaai/LogoStyleFool: Vitiating Video Recognition Systems via Logo Style Transfer @@ -0,0 +1 @@ +Video recognition systems are vulnerable to adversarial examples. Recent studies show that style transfer-based and patch-based unrestricted perturbations can effectively improve attack efficiency. These attacks, however, face two main challenges: 1) Adding large stylized perturbations to all pixels reduces the naturalness of the video and such perturbations can be easily detected. 2) Patch-based video attacks are not extensible to targeted attacks due to the limited search space of reinforcement learning that has been widely used in video attacks recently. In this paper, we focus on the video black-box setting and propose a novel attack framework named LogoStyleFool by adding a stylized logo to the clean video. We separate the attack into three stages: style reference selection, reinforcement-learning-based logo style transfer, and perturbation optimization. We solve the first challenge by scaling down the perturbation range to a regional logo, while the second challenge is addressed by complementing an optimization stage after reinforcement learning. Experimental results substantiate the overall superiority of LogoStyleFool over three state-of-the-art patch-based attacks in terms of attack performance and semantic preservation. Meanwhile, LogoStyleFool still maintains its performance against two existing patch-based defense methods. We believe that our research is beneficial in increasing the attention of the security community to such subregional style transfer attacks.
\ No newline at end of file diff --git a/data/2024/aaai/Long-Tailed Learning as Multi-Objective Optimization b/data/2024/aaai/Long-Tailed Learning as Multi-Objective Optimization new file mode 100644 index 0000000000..d2b0a250d6 --- /dev/null +++ b/data/2024/aaai/Long-Tailed Learning as Multi-Objective Optimization @@ -0,0 +1 @@ +Real-world data is extremely imbalanced and presents a long-tailed distribution, resulting in models biased towards classes with sufficient samples and performing poorly on rare classes. Recent methods propose to rebalance classes, but they face the seesaw dilemma (increasing performance on tail classes may decrease that of head classes, and vice versa). In this paper, we argue that the seesaw dilemma is derived from the gradient imbalance of different classes, in which the gradients of inappropriate classes are treated as important for updating, thus prone to overcompensation or undercompensation on tail classes. To achieve ideal compensation, we formulate long-tailed recognition as a multi-objective optimization problem, which fairly respects the contributions of head and tail classes simultaneously. For efficiency, we propose a Gradient-Balancing Grouping (GBG) strategy to gather the classes with similar gradient directions, thus approximately making every update follow a Pareto descent direction. Our GBG method drives classes with similar gradient directions to form a more representative gradient and provides ideal compensation to the tail classes. Moreover, we conduct extensive experiments on commonly used benchmarks in long-tailed learning and demonstrate the superiority of our method over existing SOTA methods. Our code is released at https://github.com/WickyLee1998/GBG_v1. \ No newline at end of file diff --git a/data/2024/aaai/Long-Tailed Partial Label Learning by Head Classifier and Tail Classifier Cooperation b/data/2024/aaai/Long-Tailed Partial Label Learning by Head Classifier and Tail Classifier Cooperation new file mode 100644 index 0000000000..d755ea0516 --- /dev/null +++ b/data/2024/aaai/Long-Tailed Partial Label Learning by Head Classifier and Tail Classifier Cooperation @@ -0,0 +1 @@ +In partial label learning (PLL), each instance is associated with a set of candidate labels, among which only one is correct. Traditional PLL methods almost all implicitly assume that the distribution of the classes is balanced. However, in real-world applications, the distribution of the classes is imbalanced or long-tailed, leading to the long-tailed partial label learning problem. Previous methods solve this problem mainly by improving the ability to learn the tail classes, which sacrifices the performance of the head classes, while keeping the performance of the head classes may in turn degrade that of the tail classes. Therefore, in this paper, we construct two classifiers, i.e., a head classifier for keeping the performance of dominant classes and a tail classifier for improving the performance of the tail classes. Then, we propose a classifier weight estimation module to automatically estimate the shot belongingness (head class or tail class) of the samples and allocate the weights for the head classifier and tail classifier when making predictions. This cooperation improves the prediction ability for both the head classes and the tail classes. Experiments on the benchmarks demonstrate that the proposed approach improves accuracy over the SOTA methods by a substantial margin.
Code and data are available at: https://github.com/pruirui/HTC-LTPLL. \ No newline at end of file diff --git a/data/2024/aaai/Long-Term Fair Decision Making through Deep Generative Models b/data/2024/aaai/Long-Term Fair Decision Making through Deep Generative Models new file mode 100644 index 0000000000..d2378f2983 --- /dev/null +++ b/data/2024/aaai/Long-Term Fair Decision Making through Deep Generative Models @@ -0,0 +1 @@ +This paper studies long-term fair machine learning which aims to mitigate group disparity over the long term in sequential decision-making systems. To define long-term fairness, we leverage the temporal causal graph and use the 1-Wasserstein distance between the interventional distributions of different demographic groups at a sufficiently large time step as the quantitative metric. Then, we propose a three-phase learning framework where the decision model is trained on high-fidelity data generated by a deep generative model. We formulate the optimization problem as a performative risk minimization and adopt the repeated gradient descent algorithm for learning. The empirical evaluation shows the efficacy of the proposed method using both synthetic and semi-synthetic datasets. \ No newline at end of file diff --git a/data/2024/aaai/Long-Term Safe Reinforcement Learning with Binary Feedback b/data/2024/aaai/Long-Term Safe Reinforcement Learning with Binary Feedback new file mode 100644 index 0000000000..1d27e95489 --- /dev/null +++ b/data/2024/aaai/Long-Term Safe Reinforcement Learning with Binary Feedback @@ -0,0 +1 @@ +Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although there has been a surge of safe RL algorithms proposed in recent years, most existing work typically 1) relies on receiving numeric safety feedback; 2) does not guarantee safety during the learning process; 3) limits the problem to a priori known, deterministic transition dynamics; and/or 4) assume the existence of a known safe policy for any states. Addressing the issues mentioned above, we thus propose Long-term Binary-feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize rewards while guaranteeing long-term safety that an agent executes only safe state-action pairs throughout each episode with high probability. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and conservatively takes only a safe action at every time step while inferring its effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL guarantees the long-term safety constraint, with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward. \ No newline at end of file diff --git a/data/2024/aaai/Lost Domain Generalization Is a Natural Consequence of Lack of Training Domains b/data/2024/aaai/Lost Domain Generalization Is a Natural Consequence of Lack of Training Domains new file mode 100644 index 0000000000..28a4a53580 --- /dev/null +++ b/data/2024/aaai/Lost Domain Generalization Is a Natural Consequence of Lack of Training Domains @@ -0,0 +1 @@ +We show a hardness result for the number of training domains required to achieve a small population error in the test domain. 
Although many domain generalization algorithms have been developed under various domain-invariance assumptions, there is significant evidence to indicate that out-of-distribution (o.o.d.) test accuracy of state-of-the-art o.o.d. algorithms is on par with empirical risk minimization and random guess on the domain generalization benchmarks such as DomainBed. In this work, we analyze its cause and attribute the lost domain generalization to the lack of training domains. We show that, in a minimax lower bound fashion, any learning algorithm that outputs a classifier with an ε excess error to the Bayes optimal classifier requires at least poly(1/ε) number of training domains, even though the number of training data sampled from each training domain is large. Experiments on the DomainBed benchmark demonstrate that o.o.d. test accuracy is monotonically increasing as the number of training domains increases. Our result sheds light on the intrinsic hardness of domain generalization and suggests benchmarking o.o.d. algorithms by the datasets with a sufficient number of training domains. \ No newline at end of file diff --git a/data/2024/aaai/Low Category Uncertainty and High Training Potential Instance Learning for Unsupervised Domain Adaptation b/data/2024/aaai/Low Category Uncertainty and High Training Potential Instance Learning for Unsupervised Domain Adaptation new file mode 100644 index 0000000000..5677b84cca --- /dev/null +++ b/data/2024/aaai/Low Category Uncertainty and High Training Potential Instance Learning for Unsupervised Domain Adaptation @@ -0,0 +1 @@ +Recently, instance contrastive learning achieves good results in unsupervised domain adaptation. It reduces the distances between positive samples and the anchor, increases the distances between negative samples and the anchor, and learns discriminative feature representations for target samples. However, most recent methods for identifying positive and negative samples are based on whether the pseudo-labels of samples and the pseudo-label of the anchor correspond to the same class. Due to the lack of target labels, many uncertain data are mistakenly labeled during the training process, and many low training potential data are also utilized. To address these problems, we propose Low Category Uncertainty and High Training Potential Instance Learning for Unsupervised Domain Adaptation (LUHP). We first propose a weight to measure the category uncertainty of the target sample. We can effectively filter the samples near the decision boundary through category uncertainty thresholds which are calculated by weights. Then we propose a new loss to focus on samples with high training potential. Finally, for anchors with low category uncertainty, we propose a sample reuse strategy to make the model more robust. We demonstrate the effectiveness of LUHP by showing the results of four datasets widely used in unsupervised domain adaptation. \ No newline at end of file diff --git a/data/2024/aaai/Low-Distortion Clustering with Ordinal and Limited Cardinal Information b/data/2024/aaai/Low-Distortion Clustering with Ordinal and Limited Cardinal Information new file mode 100644 index 0000000000..559ecdd291 --- /dev/null +++ b/data/2024/aaai/Low-Distortion Clustering with Ordinal and Limited Cardinal Information @@ -0,0 +1,3 @@ +Motivated by recent work in computational social choice, we extend the metric distortion framework to clustering problems. 
Given a set of n agents located in an underlying metric space, our goal is to partition them into k clusters, optimizing some social cost objective. The metric space is defined by a distance function d between the agent locations. Information about d is available only implicitly via n rankings, through which each agent ranks all other agents in terms of their distance from her. Still, even though no cardinal information (i.e., the exact distance values) is available, we would like to evaluate clustering algorithms in terms of social cost objectives that are defined using d. This is done using the notion of distortion, which measures how far from optimality a clustering can be, taking into account all underlying metrics that are consistent with the ordinal information available. + +Unfortunately, the most important clustering objectives (e.g., those used in the well-known k-median and k-center problems) do not admit algorithms with finite distortion. To sidestep this disappointing fact, we follow two alternative approaches: We first explore whether resource augmentation can be beneficial. We consider algorithms that use more than k clusters but compare their social cost to that of the optimal k-clusterings. We show that using exponentially (in terms of k) many clusters, we can get low (constant or logarithmic) distortion for the k-center and k-median objectives. Interestingly, such an exponential blowup is shown to be necessary. More importantly, we explore whether limited cardinal information can be used to obtain better results. Somewhat surprisingly, for k-median and k-center, we show that a number of queries that is polynomial in k and only logarithmic in n (i.e., only sublinear in the number of agents for the most relevant scenarios in practice) is enough to get constant distortion. \ No newline at end of file diff --git a/data/2024/aaai/Low-Latency Space-Time Supersampling for Real-Time Rendering b/data/2024/aaai/Low-Latency Space-Time Supersampling for Real-Time Rendering new file mode 100644 index 0000000000..7ded326559 --- /dev/null +++ b/data/2024/aaai/Low-Latency Space-Time Supersampling for Real-Time Rendering @@ -0,0 +1 @@ +With the rise of real-time rendering and the evolution of display devices, there is a growing demand for post-processing methods that offer high-resolution content at a high frame rate. Existing techniques often suffer from quality and latency issues due to the disjointed treatment of frame supersampling and extrapolation. In this paper, we recognize the shared context and mechanisms between frame supersampling and extrapolation, and present a novel framework, Space-time Supersampling (STSS). By integrating them into a unified framework, STSS can improve the overall quality with lower latency. To implement an efficient architecture, we treat aliasing and warping holes in a unified way as reshading regions and put forth two key components to compensate for these regions, namely Random Reshading Masking (RRM) and Efficient Reshading Module (ERM). Extensive experiments demonstrate that our approach achieves superior visual fidelity compared to state-of-the-art (SOTA) methods. Notably, the performance is achieved within only 4ms, saving up to 75% of time against the conventional two-stage pipeline that necessitates 17ms.
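For the metric-distortion clustering abstract above, the notion of distortion it relies on can be stated explicitly. The following is the standard worst-case formulation written in our own notation, since the abstract only describes it informally.

```latex
% Metric distortion, adapted to clustering: an algorithm A sees only the ordinal
% profile sigma induced by an unknown metric d, and its distortion is the
% worst-case ratio between the social cost of its output clustering and the
% optimal k-clustering cost, over all metrics consistent with sigma.
\[
  \mathrm{dist}(A) \;=\; \sup_{\sigma}\;\;\sup_{d \text{ consistent with } \sigma}\;
  \frac{\mathrm{cost}_{d}\big(A(\sigma)\big)}{\min_{\mathcal{C}} \mathrm{cost}_{d}(\mathcal{C})},
\]
% where the minimum ranges over all partitions C of the agents into k clusters.
```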
\ No newline at end of file diff --git a/data/2024/aaai/Low-Light Face Super-resolution via Illumination, Structure, and Texture Associated Representation b/data/2024/aaai/Low-Light Face Super-resolution via Illumination, Structure, and Texture Associated Representation new file mode 100644 index 0000000000..54609cf62a --- /dev/null +++ b/data/2024/aaai/Low-Light Face Super-resolution via Illumination, Structure, and Texture Associated Representation @@ -0,0 +1 @@ +Capturing human faces at night or in dimly lit environments has become common practice, accompanied by complex low-light and low-resolution degradations. However, the existing face super-resolution (FSR) technologies and derived cascaded schemes are inadequate to recover credible textures. In this paper, we propose a novel approach that decomposes the restoration task into face structural fidelity maintenance and texture consistency learning. The former aims to enhance the quality of face images while improving the structural fidelity, while the latter focuses on eliminating perturbations and artifacts caused by low-light degradation and reconstruction. Based on this, we develop a novel low-light low-resolution face super-resolution framework. Our method consists of two steps: an illumination correction face super-resolution network (IC-FSRNet) for relighting the face and recovering the structural information, and a detail enhancement model (DENet) for improving facial details, thus making them more visually appealing and easier to analyze. As the relighted regions could provide complementary information to boost face super-resolution and vice versa, we introduce mutual learning to harness the informative components from relighted regions and reconstruction, and achieve iterative refinement. In addition, DENet, equipped with a diffusion probabilistic model, is built to further improve face image visual quality. Experiments demonstrate that the proposed joint optimization framework achieves significant improvements in reconstruction quality and perceptual quality over existing two-stage sequential solutions. Code is available at https://github.com/wcy-cs/IC-FSRDENet. \ No newline at end of file diff --git a/data/2024/aaai/Low-Rank Kernel Tensor Learning for Incomplete Multi-View Clustering b/data/2024/aaai/Low-Rank Kernel Tensor Learning for Incomplete Multi-View Clustering new file mode 100644 index 0000000000..aed24d6eb6 --- /dev/null +++ b/data/2024/aaai/Low-Rank Kernel Tensor Learning for Incomplete Multi-View Clustering @@ -0,0 +1 @@ +Incomplete Multiple Kernel Clustering algorithms aim to learn a common latent representation from pre-constructed incomplete multiple kernels derived from the original data, followed by k-means for clustering. They have attracted intensive attention due to their high computational efficiency. However, our observation reveals that the imputation of these approaches for each kernel ignores the influence of other incomplete kernels. In light of this, we present a novel method called Low-Rank Kernel Tensor Learning for Incomplete Multiple Views Clustering (LRKT-IMVC) to address the above issue. Specifically, LRKT-IMVC first introduces the concept of kernel tensor to explore the inter-view correlations, and then the low-rank kernel tensor constraint is used to further capture the consistency information to impute missing kernel elements, thereby improving the quality of clustering.
Moreover, we carefully design an alternating optimization method with promising convergence to solve the resulting optimization problem. The proposed method is compared with recent advances in experiments with different missing ratios on seven well-known datasets, demonstrating its effectiveness and the advantages of the proposed interpolation method. \ No newline at end of file diff --git a/data/2024/aaai/Lyapunov-Stable Deep Equilibrium Models b/data/2024/aaai/Lyapunov-Stable Deep Equilibrium Models new file mode 100644 index 0000000000..bbbf6a725e --- /dev/null +++ b/data/2024/aaai/Lyapunov-Stable Deep Equilibrium Models @@ -0,0 +1 @@ +Deep equilibrium (DEQ) models have emerged as a promising class of implicit layer models, which abandon traditional depth by solving for the fixed points of a single nonlinear layer. Despite their success, the stability of the fixed points for these models remains poorly understood. By considering DEQ models as nonlinear dynamic systems, we propose a robust DEQ model named LyaDEQ with guaranteed provable stability via Lyapunov theory. The crux of our method is ensuring the Lyapunov stability of the DEQ model's fixed points, which enables the proposed model to resist minor initial perturbations. To avoid poor adversarial defense due to Lyapunov-stable fixed points being located near each other, we orthogonalize the layers after the Lyapunov stability module to separate different fixed points. We evaluate LyaDEQ models under well-known adversarial attacks, and experimental results demonstrate significant improvement in robustness. Furthermore, we show that the LyaDEQ model can be combined with other defense methods, such as adversarial training, to achieve even better adversarial robustness. \ No newline at end of file diff --git a/data/2024/aaai/M-BEV: Masked BEV Perception for Robust Autonomous Driving b/data/2024/aaai/M-BEV: Masked BEV Perception for Robust Autonomous Driving new file mode 100644 index 0000000000..ec809b4a13 --- /dev/null +++ b/data/2024/aaai/M-BEV: Masked BEV Perception for Robust Autonomous Driving @@ -0,0 +1 @@ +3D perception is a critical problem in autonomous driving. Recently, the Bird’s-Eye-View (BEV) approach has attracted extensive attention, due to low-cost deployment and desirable vision detection capacity. However, the existing models ignore a realistic scenario during the driving procedure, i.e., one or more view cameras may fail, which largely deteriorates their performance. To tackle this problem, we propose a generic Masked BEV (M-BEV) perception framework, which can effectively improve robustness to this challenging scenario, by random masking and reconstructing camera views in the end-to-end training. More specifically, we develop a novel Masked View Reconstruction (MVR) module in our M-BEV. It mimics various missing cases by randomly masking features of different camera views, then leverages the original features of these views as self-supervision and reconstructs the masked ones with the distinct spatio-temporal context across camera views. Via such a plug-and-play MVR, our M-BEV is capable of learning the missing views from the remaining ones, and thus well generalized for robust view recovery and accurate perception during testing.
We perform extensive experiments on the popular NuScenes benchmark, where our framework can significantly boost the 3D perception performance of state-of-the-art models in various missing-view cases, e.g., in the absence of the back view, our M-BEV improves the PETRv2 model by 10.3% mAP. \ No newline at end of file diff --git a/data/2024/aaai/M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis b/data/2024/aaai/M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis new file mode 100644 index 0000000000..2a983feaec --- /dev/null +++ b/data/2024/aaai/M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis @@ -0,0 +1 @@ +Document layout analysis is a crucial step for intelligent document understanding. However, many existing methods primarily focus on the visual aspects and overlook the textual features of documents. Although document pre-trained models utilize multi-modal features during the pre-training phase, they tend to operate as a unimodal pipeline when it comes to layout analysis tasks. Furthermore, current multi-modal methods perform worse than unimodal detectors on complex layout analysis datasets. To address these limitations, we propose an effective and pluggable multi-modal fusion approach named M2Doc, which fuses visual and textual features for better layout detection. M2Doc contains two pluggable multi-modal fusion modules, early-fusion and late-fusion, which align and fuse visual and textual features at the pixel level and block level. Benefitting from the concision and effectiveness of M2Doc, it can be easily applied to various detectors for better layout detection, including two-stage and end-to-end object detectors. Our experimental results demonstrate significant performance improvements in detectors equipped with M2Doc on datasets such as DocLayNet (+11.3 mAP) and M6Doc (+1.9 mAP). Furthermore, through the integration of the DINO detector with M2Doc, we achieve state-of-the-art results on DocLayNet (89.0 mAP), M6Doc (69.9 mAP), and PubLayNet (95.5 mAP). The code will be publicly released at https://github.com/johnning2333/M2Doc. \ No newline at end of file diff --git a/data/2024/aaai/M2SD: Multiple Mixing Self-Distillation for Few-Shot Class-Incremental Learning b/data/2024/aaai/M2SD: Multiple Mixing Self-Distillation for Few-Shot Class-Incremental Learning new file mode 100644 index 0000000000..a7a3e6beba --- /dev/null +++ b/data/2024/aaai/M2SD: Multiple Mixing Self-Distillation for Few-Shot Class-Incremental Learning @@ -0,0 +1 @@ +Few-shot Class-incremental learning (FSCIL) is a challenging task in machine learning that aims to recognize new classes from a limited number of instances while preserving the ability to classify previously learned classes without retraining the entire model. This presents challenges in updating the model with new classes using limited training data, particularly in balancing the acquisition of new knowledge with the retention of the old. We propose a novel method named Multiple Mixing Self-Distillation (M2SD) during the training phase to address these issues. Specifically, we propose a dual-branch structure that facilitates the expansion of the entire feature space to accommodate new classes. Furthermore, we introduce a feature enhancement component that can pass additional enhanced information back to the base network by self-distillation, resulting in improved classification performance upon adding new classes.
After training, we discard both structures, leaving only the primary network to classify new class instances. Extensive experiments demonstrate that our approach achieves superior performance over previous state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy b/data/2024/aaai/M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy new file mode 100644 index 0000000000..c577520660 --- /dev/null +++ b/data/2024/aaai/M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy @@ -0,0 +1 @@ +Training state-of-the-art (SOTA) deep models often requires extensive data, resulting in substantial training and storage costs. To address these challenges, dataset condensation has been developed to learn a small synthetic set that preserves essential information from the original large-scale dataset. Nowadays, optimization-oriented methods have been the primary approach in the field of dataset condensation for achieving SOTA results. However, the bi-level optimization process hinders the practical application of such methods to realistic and larger datasets. To enhance condensation efficiency, previous works proposed Distribution-Matching (DM) as an alternative, which significantly reduces the condensation cost. Nonetheless, current DM-based methods still yield results that lag behind those of SOTA optimization-oriented methods. In this paper, we argue that existing DM-based methods overlook the higher-order alignment of the distributions, which may lead to sub-optimal matching results. Inspired by this, we present a novel DM-based method named M3D for dataset condensation by Minimizing the Maximum Mean Discrepancy between feature representations of the synthetic and real images. By embedding their distributions in a reproducing kernel Hilbert space, we align all orders of moments of the distributions of real and synthetic images, resulting in a more generalized condensed set. Notably, our method even surpasses the SOTA optimization-oriented method IDC on the high-resolution ImageNet dataset. Extensive analysis is conducted to verify the effectiveness of the proposed method. Source codes are available at https://github.com/Hansong-Zhang/M3D. \ No newline at end of file diff --git a/data/2024/aaai/M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking b/data/2024/aaai/M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking new file mode 100644 index 0000000000..b1fab635bf --- /dev/null +++ b/data/2024/aaai/M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking @@ -0,0 +1 @@ +3D Single Object Tracking (SOT) stands as a forefront task of computer vision, proving essential for applications like autonomous driving. Sparse and occluded data in scene point clouds introduce variations in the appearance of tracked objects, adding complexity to the task. In this research, we unveil M3SOT, a novel 3D SOT framework, which synergizes multiple input frames (template sets), multiple receptive fields (continuous contexts), and multiple solution spaces (distinct tasks) in ONE model. Remarkably, M3SOT pioneers in modeling temporality, contexts, and tasks directly from point clouds, revisiting a perspective on the key factors influencing SOT. To this end, we design a transformer-based network centered on point cloud targets in the search area, aggregating diverse contextual representations and propagating target cues by employing historical frames.
As M3SOT spans varied processing perspectives, we've streamlined the network—trimming its depth and optimizing its structure—to ensure a lightweight and efficient deployment for SOT applications. We posit that, backed by practical construction, M3SOT sidesteps the need for complex frameworks and auxiliary components to deliver sterling results. Extensive experiments on benchmarks such as KITTI, nuScenes, and Waymo Open Dataset demonstrate that M3SOT achieves state-of-the-art performance at 38 FPS. Our code and models are available at https://github.com/ywu0912/TeamCode.git. \ No newline at end of file diff --git a/data/2024/aaai/MA-Net: Rethinking Neural Unit in the Light of Astrocytes b/data/2024/aaai/MA-Net: Rethinking Neural Unit in the Light of Astrocytes new file mode 100644 index 0000000000..382e13aa17 --- /dev/null +++ b/data/2024/aaai/MA-Net: Rethinking Neural Unit in the Light of Astrocytes @@ -0,0 +1 @@ +Networks based on the artificial neuron (N-N) model have accomplished extraordinary success on various vision tasks. However, as a simplification of the mammalian neuron model, their structure is locked during training, resulting in overfitting and over-parameterization. The astrocyte, newly explored by biologists, can adaptively modulate neuronal communication by inserting itself between neurons. The communication between the astrocyte and the neuron is bidirectional and shows the potential to alleviate issues raised by unidirectional communication in the N-N model. In this paper, we first elaborate on the artificial Multi-Astrocyte-Neuron (MA-N) model, which enriches the functionality of the artificial neuron model. Our MA-N model is formulated at both the astrocyte and neuron levels, mimicking the bidirectional communication with temporal and joint mechanisms. Then, we construct the MA-Net network with the MA-N model, whose neural connections can be continuously and adaptively modulated during training. Experiments show that our MA-Net sets a new state of the art on multiple tasks while significantly reducing its parameters by connection optimization. \ No newline at end of file diff --git a/data/2024/aaai/MANDREL: Modular Reinforcement Learning Pipelines for Material Discovery b/data/2024/aaai/MANDREL: Modular Reinforcement Learning Pipelines for Material Discovery new file mode 100644 index 0000000000..bbec9f6b61 --- /dev/null +++ b/data/2024/aaai/MANDREL: Modular Reinforcement Learning Pipelines for Material Discovery @@ -0,0 +1 @@ +AI-driven materials discovery is evolving rapidly with new approaches and pipelines for experimentation and design. However, the pipelines are often designed in isolation. We introduce a modular reinforcement learning framework for inter-operable experimentation and design of tailored, novel molecular species. The framework unifies reinforcement learning (RL) pipelines and allows the mixing and matching of choices for the underlying chemical action space, molecular representation, desired molecular properties, and RL algorithm. Our demo showcases the framework's capabilities applied to benchmark problems like quantitative estimate of drug-likeness and PLogP, as well as the design of novel small molecule solvents for carbon capture. 
\ No newline at end of file diff --git "a/data/2024/aaai/MAPTree: Beating \"Optimal\" Decision Trees with Bayesian Decision Trees" "b/data/2024/aaai/MAPTree: Beating \"Optimal\" Decision Trees with Bayesian Decision Trees" new file mode 100644 index 0000000000..1b5b1ad7ca --- /dev/null +++ "b/data/2024/aaai/MAPTree: Beating \"Optimal\" Decision Trees with Bayesian Decision Trees" @@ -0,0 +1 @@ +Decision trees remain one of the most popular machine learning models today, largely due to their out-of-the-box performance and interpretability. In this work, we present a Bayesian approach to decision tree induction via maximum a posteriori inference of a posterior distribution over trees. We first demonstrate a connection between maximum a posteriori inference of decision trees and AND/OR search. Using this connection, we propose an AND/OR search algorithm, dubbed MAPTree, which is able to recover the maximum a posteriori tree. Lastly, we demonstrate the empirical performance of the maximum a posteriori tree both on synthetic data and in real world settings. On 16 real world datasets, MAPTree either outperforms baselines or demonstrates comparable performance but with much smaller trees. On a synthetic dataset, MAPTree also demonstrates greater robustness to noise and better generalization than existing approaches. Finally, MAPTree recovers the maxiumum a posteriori tree faster than existing sampling approaches and, in contrast with those algorithms, is able to provide a certificate of optimality. The code for our experiments is available at https://github.com/ThrunGroup/maptree. \ No newline at end of file diff --git a/data/2024/aaai/MCA: Moment Channel Attention Networks b/data/2024/aaai/MCA: Moment Channel Attention Networks new file mode 100644 index 0000000000..e1c8af74f7 --- /dev/null +++ b/data/2024/aaai/MCA: Moment Channel Attention Networks @@ -0,0 +1 @@ +Channel attention mechanisms endeavor to recalibrate channel weights to enhance representation abilities of networks. However, mainstream methods often rely solely on global average pooling as the feature squeezer, which significantly limits the overall potential of models. In this paper, we investigate the statistical moments of feature maps within a neural network. Our findings highlight the critical role of high-order moments in enhancing model capacity. Consequently, we introduce a flexible and comprehensive mechanism termed Extensive Moment Aggregation (EMA) to capture the global spatial context. Building upon this mechanism, we propose the Moment Channel Attention (MCA) framework, which efficiently incorporates multiple levels of moment-based information while minimizing additional computation costs through our Cross Moment Convolution (CMC) module. The CMC module via channel-wise convolution layer to capture multiple order moment information as well as cross channel features. The MCA block is designed to be lightweight and easily integrated into a variety of neural network architectures. Experimental results on classical image classification, object detection, and instance segmentation tasks demonstrate that our proposed method achieves state-of-the-art results, outperforming existing channel attention methods. 
\ No newline at end of file diff --git a/data/2024/aaai/MCL-NER: Cross-Lingual Named Entity Recognition via Multi-View Contrastive Learning b/data/2024/aaai/MCL-NER: Cross-Lingual Named Entity Recognition via Multi-View Contrastive Learning new file mode 100644 index 0000000000..c7124d23e2 --- /dev/null +++ b/data/2024/aaai/MCL-NER: Cross-Lingual Named Entity Recognition via Multi-View Contrastive Learning @@ -0,0 +1,11 @@ +Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora, especially for non-English +data. While prior efforts mainly focus on data-driven transfer methods, a significant aspect that has not been fully explored is aligning both semantic and token-level representations across diverse languages. In this paper, we propose Multi-view Contrastive Learning for Cross-lingual Named +Entity Recognition (MCL-NER). Specifically, we reframe the CrossNER task into a problem of recognizing relationships between pairs of tokens. This approach taps into the +inherent contextual nuances of token-to-token connections within entities, allowing us to align representations across +different languages. A multi-view contrastive learning framework is introduced to encompass semantic contrasts between +source, codeswitched, and target sentences, as well as contrasts among token-to-token relations. By enforcing agreement within both semantic and relational spaces, we minimize the gap between source sentences and their counterparts of both codeswitched and target sentences. This alignment +extends to the relationships between diverse tokens, enhancing the projection of entities across languages. We further +augment CrossNER by combining self-training with labeled source data and unlabeled target data. Our experiments on +the XTREME benchmark, spanning 40 languages, demonstrate the superiority of MCL-NER over prior data-driven +and model-based approaches. It achieves a substantial increase of nearly +2.0 F1 scores across a broad spectrum and +establishes itself as the new state-of-the-art performer. \ No newline at end of file diff --git a/data/2024/aaai/MCSSME: Multi-Task Contrastive Learning for Semi-supervised Singing Melody Extraction from Polyphonic Music b/data/2024/aaai/MCSSME: Multi-Task Contrastive Learning for Semi-supervised Singing Melody Extraction from Polyphonic Music new file mode 100644 index 0000000000..78c9145694 --- /dev/null +++ b/data/2024/aaai/MCSSME: Multi-Task Contrastive Learning for Semi-supervised Singing Melody Extraction from Polyphonic Music @@ -0,0 +1,2 @@ +Singing melody extraction is an important task in the field of music information retrieval (MIR). The development of data-driven models for this task has achieved great success. However, the existing models have two major limitations: firstly, most of the existing singing melody extraction models have formulated this task as a pixel-level prediction task, and the lack of labeled data has limited further improvements. Secondly, the generalization of existing models is prone to be disturbed by music genre. To address the issues mentioned above, in this paper, we propose a multi-task contrastive learning framework for semi-supervised singing melody extraction, termed MCSSME. +Specifically, to deal with the data scarcity limitation, we propose a self-consistency regularization (SCR) method to train the model on the unlabeled data. 
Transformations are applied to the raw signal of polyphonic music, which drives the network to improve its representation capability by recognizing the transformations. We further propose a novel multi-task learning (MTL) approach to jointly learn singing melody extraction and classification of transformed data. To deal with the generalization limitation, we also propose contrastive embedding learning, which strengthens the intra-class compactness and inter-class separability. To improve the generalization on different music genres, we also propose a domain classification method to learn task-dependent features by mapping data from different music genres to a shared subspace. MCSSME is evaluated on a set of well-known public melody extraction datasets and achieves promising performance. The experimental results demonstrate the effectiveness of the MCSSME framework for singing melody extraction from polyphonic music in scenarios with very limited labeled data. \ No newline at end of file diff --git a/data/2024/aaai/MDFL: Multi-Domain Diffusion-Driven Feature Learning b/data/2024/aaai/MDFL: Multi-Domain Diffusion-Driven Feature Learning new file mode 100644 index 0000000000..5551b05961 --- /dev/null +++ b/data/2024/aaai/MDFL: Multi-Domain Diffusion-Driven Feature Learning @@ -0,0 +1 @@ +High-dimensional images, known for their rich semantic information, are widely applied in remote sensing and other fields. The spatial information in these images reflects the object's texture features, while the spectral information reveals the potential spectral representations across different bands. Currently, the understanding of high-dimensional images remains limited to a single-domain perspective with performance degradation. Motivated by the masking texture effect observed in the human visual system, we present a multi-domain diffusion-driven feature learning network (MDFL), a scheme to redefine the effective information domain that the model really focuses on. This method employs diffusion-based posterior sampling to explicitly consider joint information interactions between the high-dimensional manifold structures in the spectral, spatial, and frequency domains, thereby eliminating the influence of masking texture effects in visual models. Additionally, we introduce a feature reuse mechanism to gather deep and raw features of high-dimensional data. We demonstrate that MDFL significantly improves the feature extraction performance of high-dimensional data, thereby providing a powerful aid for revealing the intrinsic patterns and structures of such data. The experimental results on three multi-modal remote sensing datasets show that MDFL reaches an average overall accuracy of 98.25%, outperforming various state-of-the-art baseline schemes. Code available at https://github.com/LDXDU/MDFL-AAAI-24. 
\ No newline at end of file diff --git a/data/2024/aaai/MDGNN: Multi-Relational Dynamic Graph Neural Network for Comprehensive and Dynamic Stock Investment Prediction b/data/2024/aaai/MDGNN: Multi-Relational Dynamic Graph Neural Network for Comprehensive and Dynamic Stock Investment Prediction new file mode 100644 index 0000000000..69e7ab0e91 --- /dev/null +++ b/data/2024/aaai/MDGNN: Multi-Relational Dynamic Graph Neural Network for Comprehensive and Dynamic Stock Investment Prediction @@ -0,0 +1 @@ +The stock market is a crucial component of the financial system, but predicting the movement of stock prices is challenging due to the dynamic and intricate relations arising from various aspects such as economic indicators, financial reports, global news, and investor sentiment. Traditional sequential methods and graph-based models have been applied in stock movement prediction, but they have limitations in capturing the multifaceted and temporal influences in stock price movements. To address these challenges, the Multi-relational Dynamic Graph Neural Network (MDGNN) framework is proposed, which utilizes a discrete dynamic graph to comprehensively capture multifaceted relations among stocks and their evolution over time. The representation generated from the graph offers a complete perspective on the interrelationships among stocks and associated entities. Additionally, the power of the Transformer structure is leveraged to encode the temporal evolution of multiplex relations, providing a dynamic and effective approach to predicting stock investment. Further, our proposed MDGNN framework achieves the best performance on public datasets compared with state-of-the-art stock investment methods. \ No newline at end of file diff --git a/data/2024/aaai/MEPSI: An MDL-Based Ensemble Pruning Approach with Structural Information b/data/2024/aaai/MEPSI: An MDL-Based Ensemble Pruning Approach with Structural Information new file mode 100644 index 0000000000..dd2e8efdee --- /dev/null +++ b/data/2024/aaai/MEPSI: An MDL-Based Ensemble Pruning Approach with Structural Information @@ -0,0 +1 @@ +Ensemble pruning, which combines a subset of individual learners generated in parallel to make predictions, is an important topic in ensemble learning. Over the past decades, many pruning algorithms have been developed that focus on the external behavior of learners on samples, which may lead to over-fitting. In this paper, we conjecture that the generalization performance of an ensemble is not only related to its external behavior on samples but also dependent on the internal structure of individual learners. We propose the general MEPSI approach based on Kolmogorov complexity and the Minimum Description Length (MDL) principle, which formulates the ensemble pruning task as a two-objective optimization problem that comprises the empirical error and structural information among individual learners. We also provide a concrete implementation of MEPSI on decision trees. The theoretical results provide generalization bounds for both the general MEPSI approach and the tree-based implementation. The comparative experiments conducted on multiple real-world data sets demonstrate the effectiveness of our proposed method. 
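The two-objective view taken by MEPSI above (empirical error plus structural information of the selected learners) can be illustrated with a toy greedy pruner. In the sketch below the structural term is approximated by the total node count of the chosen trees and the trade-off weight lam is arbitrary; both are assumptions for illustration and differ from the paper's actual MDL-based measure and optimization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)

def objective(members, lam=1e-4):
    """Empirical error of the majority vote plus a crude structural (size) penalty."""
    votes = np.mean([t.predict(X) for t in members], axis=0) >= 0.5
    error = np.mean(votes.astype(int) != y)
    structure = sum(t.tree_.node_count for t in members)
    return error + lam * structure

selected, remaining, best = [], list(forest.estimators_), np.inf
while remaining:
    score, tree = min(((objective(selected + [t]), t) for t in remaining), key=lambda st: st[0])
    if score >= best:   # stop once adding any tree no longer lowers the objective
        break
    best = score
    selected.append(tree)
    remaining.remove(tree)

print(f"kept {len(selected)} of {len(forest.estimators_)} trees; objective = {best:.4f}")
```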
\ No newline at end of file diff --git a/data/2024/aaai/MERGE: Fast Private Text Generation b/data/2024/aaai/MERGE: Fast Private Text Generation new file mode 100644 index 0000000000..c047f8018f --- /dev/null +++ b/data/2024/aaai/MERGE: Fast Private Text Generation @@ -0,0 +1 @@ +The drastic increase in language models' parameters has led to a new trend of deploying models in cloud servers, raising growing concerns about private inference for Transformer-based models. Existing two-party privacy-preserving techniques, however, only take into account natural language understanding (NLU) scenarios. Private inference in natural language generation (NLG), crucial for applications like translation and code completion, remains underexplored. In addition, previous privacy-preserving techniques suffer from convergence issues during model training and exhibit poor inference speed when used with NLG models due to the neglect of time-consuming operations in auto-regressive generations. To address these issues, we propose a fast private text generation framework for Transformer-based language models, namely MERGE. MERGE reuses the output hidden state as the word embedding to bypass the embedding computation and reorganizes the linear operations in the Transformer module to accelerate the forward procedure. Extensive experiments show that MERGE achieves a 26.5x speedup over the vanilla encrypted model at sequence length 512, reduces communication cost by 80%, and offers up to a 10x speedup over state-of-the-art approximated models. \ No newline at end of file diff --git a/data/2024/aaai/MESED: A Multi-Modal Entity Set Expansion Dataset with Fine-Grained Semantic Classes and Hard Negative Entities b/data/2024/aaai/MESED: A Multi-Modal Entity Set Expansion Dataset with Fine-Grained Semantic Classes and Hard Negative Entities new file mode 100644 index 0000000000..8b8f3e342a --- /dev/null +++ b/data/2024/aaai/MESED: A Multi-Modal Entity Set Expansion Dataset with Fine-Grained Semantic Classes and Hard Negative Entities @@ -0,0 +1 @@ +The Entity Set Expansion (ESE) task aims to expand a handful of seed entities with new entities belonging to the same semantic class. Conventional ESE methods are based on mono-modality (i.e., the literal modality) and struggle to deal with complex entities in the real world, such as (1) negative entities with fine-grained semantic differences, (2) synonymous entities, (3) polysemous entities, and (4) long-tailed entities. These challenges prompt us to propose novel Multi-modal Entity Set Expansion (MESE), where models integrate information from multiple modalities to represent entities. Intuitively, the benefits of multi-modal information for ESE are threefold: (1) different modalities can provide complementary information; (2) multi-modal information provides a unified signal via common visual properties for the same semantic class or entity; (3) multi-modal information offers robust alignment signals for synonymous entities. To assess model performance in MESE, we constructed the MESED dataset, the first multi-modal dataset for ESE with large-scale and elaborate manual calibration. A powerful multi-modal model, MultiExpan, is proposed, which is pre-trained on four multimodal pre-training tasks. The extensive experiments and analyses on MESED demonstrate the high quality of the dataset and the effectiveness of our MultiExpan, as well as pointing out directions for future research. The benchmark and code are public at https://github.com/THUKElab/MESED. 
\ No newline at end of file diff --git a/data/2024/aaai/MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks b/data/2024/aaai/MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks new file mode 100644 index 0000000000..1aac76a58c --- /dev/null +++ b/data/2024/aaai/MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks @@ -0,0 +1 @@ +To better understand the output of deep neural networks (DNN), attribution-based methods have been an important approach for model interpretability, which assign a score for each input dimension to indicate its importance towards the model outcome. Notably, the attribution methods use the axioms of sensitivity and implementation invariance to ensure the validity and reliability of attribution results. Yet, the existing attribution methods present challenges for effective interpretation and efficient computation. In this work, we introduce MFABA, an attribution algorithm that adheres to axioms, as a novel method for interpreting DNN. Additionally, we provide the theoretical proof and in-depth analysis for the MFABA algorithm, and conduct a large-scale experiment. The results demonstrate its superiority by achieving over 101.5142 times faster speed than the state-of-the-art attribution algorithms. The effectiveness of MFABA is thoroughly evaluated through the statistical analysis in comparison to other methods, and the full implementation package is open-source at: https://github.com/LMBTough/MFABA. \ No newline at end of file diff --git a/data/2024/aaai/MFOS: Model-Free & One-Shot Object Pose Estimation b/data/2024/aaai/MFOS: Model-Free & One-Shot Object Pose Estimation new file mode 100644 index 0000000000..600bfeba3b --- /dev/null +++ b/data/2024/aaai/MFOS: Model-Free & One-Shot Object Pose Estimation @@ -0,0 +1 @@ +Existing learning-based methods for object pose estimation in RGB images are mostly model-specific or category based. They lack the capability to generalize to new object categories at test time, hence severely hindering their practicability and scalability. Notably, recent attempts have been made to solve this issue, but they still require accurate 3D data of the object surface at both train and test time. In this paper, we introduce a novel approach that can estimate in a single forward pass the pose of objects never seen during training, given minimum input. In contrast to existing state-of-the-art approaches, which rely on task-specific modules, our proposed model is entirely based on a transformer architecture, which can benefit from recently proposed 3D-geometry general pretraining. We conduct extensive experiments and report state-of-the-art one-shot performance on the challenging LINEMOD benchmark. Finally, extensive ablations allow us to determine good practices with this relatively new type of architecture in the field. 
\ No newline at end of file diff --git a/data/2024/aaai/MFTN: Multi-Level Feature Transfer Network Based on MRI-Transformer for MR Image Super-resolution b/data/2024/aaai/MFTN: Multi-Level Feature Transfer Network Based on MRI-Transformer for MR Image Super-resolution new file mode 100644 index 0000000000..e37f7d2453 --- /dev/null +++ b/data/2024/aaai/MFTN: Multi-Level Feature Transfer Network Based on MRI-Transformer for MR Image Super-resolution @@ -0,0 +1 @@ +Due to the unique environment and inherent properties of magnetic resonance imaging (MRI) instruments, MR images typically have lower resolution. Therefore, improving the resolution of MR images is beneficial for assisting doctors in diagnosis. Currently, the existing MR image super-resolution (SR) methods still have the problem of insufficient detail reconstruction. To overcome this issue, this paper proposes a multi-level feature transfer network (MFTN) based on MRI-Transformer to realize SR of low-resolution MRI data. MFTN consists of a multi-scale feature reconstruction network (MFRN) and a multi-level feature extraction branch (MFEB). MFRN is constructed as a pyramid structure to gradually reconstruct image features at different scales by integrating the features obtained from MFEB, and MFEB is constructed to provide detail information at different scales for low-resolution MR image SR reconstruction by constructing multiple MRI-Transformer modules. Each MRI-Transformer module is designed to learn the transfer features from the reference image by establishing feature correlations between the reference image and the low-resolution MR image. In addition, a contrastive learning constraint term is added to the loss function to enhance the texture details of the SR image. A large number of experiments show that our network can effectively reconstruct high-quality MR images and achieves better performance compared to some state-of-the-art methods. The source code of this work will be released on GitHub. \ No newline at end of file diff --git a/data/2024/aaai/MGNet: Learning Correspondences via Multiple Graphs b/data/2024/aaai/MGNet: Learning Correspondences via Multiple Graphs new file mode 100644 index 0000000000..2956645ac1 --- /dev/null +++ b/data/2024/aaai/MGNet: Learning Correspondences via Multiple Graphs @@ -0,0 +1 @@ +Learning correspondences aims to find correct correspondences (inliers) from the initial correspondence set with an uneven correspondence distribution and a low inlier rate, which can be regarded as graph data. Recent advances usually use graph neural networks (GNNs) to build a single type of graph or simply stack local graphs into the global one to complete the task. But they ignore the complementary relationship between different types of graphs, which can effectively capture potential relationships among sparse correspondences. To address this problem, we propose MGNet to effectively combine multiple complementary graphs. To obtain information integrating implicit and explicit local graphs, we construct local graphs from implicit and explicit aspects and combine them effectively, which is used to build a global graph. Moreover, we propose Graph Soft Degree Attention (GSDA) to make full use of all sparse correspondence information at once in the global graph, which can capture and amplify discriminative features. Extensive experiments demonstrate that MGNet outperforms state-of-the-art methods in different visual tasks. The code is provided at https://github.com/DAILUANYUAN/MGNet-2024AAAI. 
\ No newline at end of file diff --git a/data/2024/aaai/MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization b/data/2024/aaai/MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization new file mode 100644 index 0000000000..b7f661805f --- /dev/null +++ b/data/2024/aaai/MGQFormer: Mask-Guided Query-Based Transformer for Image Manipulation Localization @@ -0,0 +1 @@ +Deep learning-based models have made great progress in image tampering localization, which aims to distinguish between manipulated and authentic regions. However, these models suffer from inefficient training. This is because they use ground-truth mask labels mainly through the cross-entropy loss, which prioritizes per-pixel precision but disregards the spatial location and shape details of manipulated regions. To address this problem, we propose a Mask-Guided Query-based Transformer Framework (MGQFormer), which uses ground-truth masks to guide the learnable query token (LQT) in identifying the forged regions. Specifically, we extract feature embeddings of ground-truth masks as the guiding query token (GQT) and feed GQT and LQT into MGQFormer to estimate fake regions, respectively. Then we make MGQFormer learn the position and shape information in ground-truth mask labels by proposing a mask-guided loss to reduce the feature distance between GQT and LQT. We also observe that such mask-guided training strategy has a significant impact on the convergence speed of MGQFormer training. Extensive experiments on multiple benchmarks show that our method significantly improves over state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment b/data/2024/aaai/MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment new file mode 100644 index 0000000000..7558f8937d --- /dev/null +++ b/data/2024/aaai/MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment @@ -0,0 +1 @@ +Black-box deep learning approaches have showcased significant potential in the realm of medical image analysis. However, the stringent trustworthiness requirements intrinsic to the medical field have catalyzed research into the utilization of Explainable Artificial Intelligence (XAI), with a particular focus on concept-based methods. Existing concept-based methods predominantly apply concept annotations from a single perspective (e.g., global level), neglecting the nuanced semantic relationships between sub-regions and concepts embedded within medical images. This leads to underutilization of the valuable medical information and may cause models to fall short in harmoniously balancing interpretability and performance when employing inherently interpretable architectures such as Concept Bottlenecks. To mitigate these shortcomings, we propose a multi-modal explainable disease diagnosis framework that meticulously aligns medical images and clinical-related concepts semantically at multiple strata, encompassing the image level, token level, and concept level. Moreover, our method allows for model intervention and offers both textual and visual explanations in terms of human-interpretable concepts. Experimental results on three skin image datasets demonstrate that our method, while preserving model interpretability, attains high performance and label efficiency for concept detection and disease diagnosis. 
The code is available at https://github.com/Tommy-Bie/MICA. \ No newline at end of file diff --git a/data/2024/aaai/MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways b/data/2024/aaai/MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways new file mode 100644 index 0000000000..9b096229cc --- /dev/null +++ b/data/2024/aaai/MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways @@ -0,0 +1 @@ +We present MIDDAG, an intuitive, interactive system that visualizes the information propagation paths on social media triggered by COVID-19-related news articles, accompanied by comprehensive insights including user/community susceptibility levels, as well as events and popular opinions raised by the crowd while propagating the information. Besides discovering information flow patterns among users, we construct communities among users and develop the propagation forecasting capability, enabling tracing and understanding of how information is disseminated at a higher level. A demo video and more are available at https://info-pathways.github.io. \ No newline at end of file diff --git a/data/2024/aaai/MIND: Multi-Task Incremental Network Distillation b/data/2024/aaai/MIND: Multi-Task Incremental Network Distillation new file mode 100644 index 0000000000..63034b96c3 --- /dev/null +++ b/data/2024/aaai/MIND: Multi-Task Incremental Network Distillation @@ -0,0 +1 @@ +The recent surge of pervasive devices that generate dynamic data streams has underscored the necessity for learning systems to adapt continually to data distributional shifts. To tackle this challenge, the research community has put forth a spectrum of methodologies, including the demanding pursuit of class-incremental learning without replay data. In this study, we present MIND, a parameter isolation method that aims to significantly enhance the performance of replay-free solutions and achieve state-of-the-art results on several widely studied datasets. Our approach introduces two main contributions: two alternative distillation procedures that significantly improve the efficiency of MIND, increasing the accumulated knowledge of each sub-network, and the optimization of the BatchNorm layers across tasks inside the sub-networks. Overall, MIND outperforms all the state-of-the-art methods for rehearsal-free Class-Incremental learning (with an increment in classification accuracy of approx. +6% on CIFAR-100/10 and +10% on TinyImageNet/10), reaching up to approx. +40% accuracy in Domain-Incremental scenarios. Moreover, we ablated each contribution to demonstrate its impact on performance improvement. Our results showcase the superior performance of MIND, indicating its potential for addressing the challenges posed by Class-incremental and Domain-Incremental learning in resource-constrained environments. \ No newline at end of file diff --git a/data/2024/aaai/MINES: Message Intercommunication for Inductive Relation Reasoning over Neighbor-Enhanced Subgraphs b/data/2024/aaai/MINES: Message Intercommunication for Inductive Relation Reasoning over Neighbor-Enhanced Subgraphs new file mode 100644 index 0000000000..85d6723321 --- /dev/null +++ b/data/2024/aaai/MINES: Message Intercommunication for Inductive Relation Reasoning over Neighbor-Enhanced Subgraphs @@ -0,0 +1 @@ +GraIL and its variants have shown their promising capacities for inductive relation reasoning on knowledge graphs. 
However, the uni-directional message-passing mechanism hinders such models from exploiting hidden mutual relations between entities in directed graphs. Besides, the enclosing subgraph extraction in most GraIL-based models restricts the model from extracting enough discriminative information for reasoning. Consequently, the expressive ability of these models is limited. To address the problems, we propose a novel GraIL-based framework, termed MINES, by introducing a Message Intercommunication mechanism on the Neighbor-Enhanced Subgraph. Concretely, the message intercommunication mechanism is designed to capture the omitted hidden mutual information. It introduces bi-directed information interactions between connected entities by inserting an undirected/bi-directed GCN layer between uni-directed RGCN layers. Moreover, inspired by the success of involving more neighbors in other graph-based tasks, we extend the neighborhood area beyond the enclosing subgraph to enhance the information collection for inductive relation reasoning. Extensive experiments demonstrate the promising capability of the proposed MINES from various aspects, especially its superiority, effectiveness, and transferability. \ No newline at end of file diff --git a/data/2024/aaai/MKG-FENN: A Multimodal Knowledge Graph Fused End-to-End Neural Network for Accurate Drug-Drug Interaction Prediction b/data/2024/aaai/MKG-FENN: A Multimodal Knowledge Graph Fused End-to-End Neural Network for Accurate Drug-Drug Interaction Prediction new file mode 100644 index 0000000000..acb4a7cd28 --- /dev/null +++ b/data/2024/aaai/MKG-FENN: A Multimodal Knowledge Graph Fused End-to-End Neural Network for Accurate Drug-Drug Interaction Prediction @@ -0,0 +1 @@ +Taking multiple incompatible drugs together may cause adverse interactions and side effects on the body. Accurate prediction of drug-drug interaction (DDI) events is essential for avoiding this issue. Recently, various artificial intelligence-based approaches have been proposed for predicting DDI events. However, DDI events are associated with complex relationships and mechanisms among drugs, targets, enzymes, transporters, molecular structures, etc. Existing approaches either partially or loosely consider these relationships and mechanisms by a non-end-to-end learning framework, resulting in sub-optimal feature extractions and fusions for prediction. Different from them, this paper proposes a Multimodal Knowledge Graph Fused End-to-end Neural Network (MKGFENN) that consists of two main parts: multimodal knowledge graph (MKG) and fused end-to-end neural network (FENN). First, MKG is constructed by comprehensively exploiting DDI event-associated relationships and mechanisms from four knowledge graphs of drugs-chemical entities, drug-substructures, drugs-drugs, and molecular structures. Correspondingly, a four-channel graph neural network is designed to extract high-order and semantic features from MKG. Second, FENN designs a multi-layer perceptron to fuse the extracted features by end-to-end learning. With such designs, the feature extractions and fusions of DDI events are guaranteed to be comprehensive and optimal for prediction. Through extensive experiments on real drug datasets, we demonstrate that MKG-FENN exhibits high accuracy and significantly outperforms state-of-the-art models in predicting DDI events. The source code and supplementary file of this article are available at: https://github.com/wudi1989/MKG-FENN. 
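At the architectural level, the MKG-FENN abstract above describes one graph encoder per knowledge-graph view followed by an MLP that fuses drug-pair features end to end. The sketch below captures only that wiring; the single mean-aggregation layer per channel, the layer sizes, and the number of DDI event classes are illustrative assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class MeanAggChannel(nn.Module):
    """One graph 'channel': a single mean-aggregation layer over a row-normalized adjacency."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):            # x: (N, in_dim), adj_norm: (N, N)
        return torch.relu(self.lin(adj_norm @ x))

class FusedDDIPredictor(nn.Module):
    def __init__(self, in_dim=64, hid=64, n_events=65, n_channels=4):
        super().__init__()
        self.channels = nn.ModuleList(MeanAggChannel(in_dim, hid) for _ in range(n_channels))
        self.fuse = nn.Sequential(nn.Linear(2 * n_channels * hid, 128), nn.ReLU(),
                                  nn.Linear(128, n_events))

    def forward(self, feats, adjs, pairs):
        # feats/adjs: one (N, in_dim) feature matrix and one (N, N) adjacency per channel
        embs = [ch(x, a) for ch, x, a in zip(self.channels, feats, adjs)]
        drug = torch.cat(embs, dim=1)                                    # per-drug embedding
        pair = torch.cat([drug[pairs[:, 0]], drug[pairs[:, 1]]], dim=1)  # drug-pair feature
        return self.fuse(pair)                                           # logits over DDI events

model = FusedDDIPredictor()
feats = [torch.randn(10, 64) for _ in range(4)]
adjs = [torch.softmax(torch.randn(10, 10), dim=1) for _ in range(4)]    # toy row-stochastic graphs
print(model(feats, adjs, torch.tensor([[0, 1], [2, 3]])).shape)         # torch.Size([2, 65])
```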
\ No newline at end of file diff --git a/data/2024/aaai/MLNet: Mutual Learning Network with Neighborhood Invariance for Universal Domain Adaptation b/data/2024/aaai/MLNet: Mutual Learning Network with Neighborhood Invariance for Universal Domain Adaptation new file mode 100644 index 0000000000..6ad3048229 --- /dev/null +++ b/data/2024/aaai/MLNet: Mutual Learning Network with Neighborhood Invariance for Universal Domain Adaptation @@ -0,0 +1 @@ +Universal domain adaptation (UniDA) is a practical but challenging problem, in which information about the relation between the source and the target domains is not given for knowledge transfer. Existing UniDA methods may suffer from the problems of overlooking intra-domain variations in the target domain and difficulty in separating similar known and unknown classes. To address these issues, we propose a novel Mutual Learning Network (MLNet) with neighborhood invariance for UniDA. In our method, confidence-guided invariant feature learning with self-adaptive neighbor selection is designed to reduce the intra-domain variations for more generalizable feature representation. By using the cross-domain mixup scheme for better unknown-class identification, the proposed method compensates for the misidentified known-class errors by mutual learning between the closed-set and open-set classifiers. Extensive experiments on three publicly available benchmarks demonstrate that our method achieves the best results compared to state-of-the-art methods in most cases and significantly outperforms the baseline across all four settings in UniDA. Code is available at https://github.com/YanzuoLu/MLNet. \ No newline at end of file diff --git a/data/2024/aaai/MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding b/data/2024/aaai/MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding new file mode 100644 index 0000000000..1e3166a97f --- /dev/null +++ b/data/2024/aaai/MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding @@ -0,0 +1 @@ +In perception, multiple sources of sensory information are integrated to map visual information from 2D views onto 3D objects, which is beneficial for understanding in 3D environments. However, any single rendered 2D view provides only limited partial information. The richness and value of multi-view 2D information can provide superior self-supervised signals for 3D objects. In this paper, we propose a novel self-supervised point cloud representation learning method, MM-Point, which is driven by intra-modal and inter-modal similarity objectives. The core of MM-Point lies in the multi-modal interaction and transmission between 3D objects and multiple 2D views at the same time. To more effectively enforce the consistent cross-modal objective over 2D multi-view information based on contrastive learning, we further propose Multi-MLP and Multi-level Augmentation strategies. Through carefully designed transformation strategies, we further learn multi-level invariance across 2D multi-views. MM-Point demonstrates state-of-the-art (SOTA) performance in various downstream tasks. For instance, it achieves a peak accuracy of 92.4% on the synthetic dataset ModelNet40, and a top accuracy of 87.8% on the real-world dataset ScanObjectNN, comparable to fully supervised methods. 
Additionally, we demonstrate its effectiveness in tasks such as few-shot classification, 3D part segmentation, and 3D semantic segmentation. \ No newline at end of file diff --git a/data/2024/aaai/MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis b/data/2024/aaai/MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis new file mode 100644 index 0000000000..dbbb2399bf --- /dev/null +++ b/data/2024/aaai/MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis @@ -0,0 +1 @@ +The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are either based on fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal and style controllable TTS framework named MM-TTS. It can utilize any modality as the prompt in a unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech within a single system. The challenges of modeling such a multi-modal style controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space to enable the input of arbitrary modality as the style prompt in a single system, and 2) efficiently transferring the unified style representation into the given text content, thereby enabling the generation of speech in the prompted style. To address these problems, we propose an aligned multi-modal prompt encoder that embeds different modalities into a unified style space, supporting style transfer for different modalities. Additionally, we present a new adaptive style transfer method named Style Adaptive Convolutions (SAConv) to achieve a better style representation. Furthermore, we design a Rectified Flow based Refiner to solve the problem of over-smoothed Mel-spectrograms and generate audio of higher fidelity. Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking heads. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multi-modal prompts. The audio samples and constructed dataset are available at https://multimodal-tts.github.io. \ No newline at end of file diff --git a/data/2024/aaai/MRMLREC: A Two-Stage Approach for Addressing Data Sparsity in MOOC Video Recommendation (Student Abstract) b/data/2024/aaai/MRMLREC: A Two-Stage Approach for Addressing Data Sparsity in MOOC Video Recommendation (Student Abstract) new file mode 100644 index 0000000000..dfb49c73fd --- /dev/null +++ b/data/2024/aaai/MRMLREC: A Two-Stage Approach for Addressing Data Sparsity in MOOC Video Recommendation (Student Abstract) @@ -0,0 +1 @@ +With the abundance of learning resources available on massive open online course (MOOC) platforms, the issue of interactive data sparsity has emerged as a significant challenge. This paper introduces MRMLREC, an efficient MOOC video recommendation approach that consists of two main stages: multi-relational representation and multi-level recommendation, aiming to solve the problem of data sparsity. 
In the multi-relational representation stage, MRMLREC adopts a tripartite approach, constructing relational graphs based on temporal sequences, courses-videos relation, and knowledge concepts-video relation. These graphs are processed by a Graph Convolution Network (GCN) and two variant Graph Attention Networks (GAT) to derive representations. A variant of the Long Short-Term Memory Network (LSTM) then integrates these multi-dimensional data to enhance the overall representation. The multi-level recommendation stage introduces three prediction tasks at varying levels—courses, knowledge concepts, and videos—to mitigate data sparsity and improve the interpretability of video recommendations. Beam search (BS) is employed to identify top-β items at each level, refining the subsequent level's search space and enhancing recommendation efficiency. Additionally, an optional layer offers both personalization and diversification modes, ensuring variety in recommended videos and maintaining learner engagement. Comprehensive experiments demonstrate the effectiveness of MRMLREC on two real-world instances from Xuetang X. \ No newline at end of file diff --git a/data/2024/aaai/MSGNet: Learning Multi-Scale Inter-series Correlations for Multivariate Time Series Forecasting b/data/2024/aaai/MSGNet: Learning Multi-Scale Inter-series Correlations for Multivariate Time Series Forecasting new file mode 100644 index 0000000000..77de260e5d --- /dev/null +++ b/data/2024/aaai/MSGNet: Learning Multi-Scale Inter-series Correlations for Multivariate Time Series Forecasting @@ -0,0 +1 @@ +Multivariate time series forecasting poses an ongoing challenge across various disciplines. Time series data often exhibit diverse intra-series and inter-series correlations, contributing to intricate and interwoven dependencies that have been the focus of numerous studies. Nevertheless, a significant research gap remains in comprehending the varying inter-series correlations across different time scales among multiple time series, an area that has received limited attention in the literature. To bridge this gap, this paper introduces MSGNet, an advanced deep learning model designed to capture the varying inter-series correlations across multiple time scales using frequency domain analysis and adaptive graph convolution. By leveraging frequency domain analysis, MSGNet effectively extracts salient periodic patterns and decomposes the time series into distinct time scales. The model incorporates a self-attention mechanism to capture intra-series dependencies, while introducing an adaptive mixhop graph convolution layer to autonomously learn diverse inter-series correlations within each time scale. Extensive experiments are conducted on several real-world datasets to showcase the effectiveness of MSGNet. Furthermore, MSGNet possesses the ability to automatically learn explainable multi-scale inter-series correlations, exhibiting strong generalization capabilities even when applied to out-of-distribution samples. 
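The frequency-domain step that the MSGNet abstract above relies on, extracting salient periodic patterns before scale-specific modeling, can be illustrated by reading off dominant periods from the FFT amplitude spectrum. The snippet below is a generic sketch of that idea, not MSGNet's implementation; the value of k and the averaging over variables are assumptions.

```python
import numpy as np

def dominant_periods(x: np.ndarray, k: int = 3):
    """x: (T, C) multivariate series. Return the k periods with the largest FFT amplitude."""
    t = x.shape[0]
    amp = np.abs(np.fft.rfft(x, axis=0)).mean(axis=1)  # average amplitude over variables
    amp[0] = 0.0                                        # ignore the zero-frequency (DC) term
    top = np.argsort(amp)[-k:][::-1]                    # indices of the k strongest frequencies
    return [t // f for f in top]                        # convert frequency index to period length

# toy series with daily (24-step) and weekly (168-step) cycles plus noise
steps = np.arange(24 * 7 * 8)
x = np.stack([np.sin(2 * np.pi * steps / 24), np.sin(2 * np.pi * steps / 168)], axis=1)
x = x + 0.1 * np.random.randn(*x.shape)
print(dominant_periods(x))  # expected to include periods close to 24 and 168
```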
\ No newline at end of file diff --git a/data/2024/aaai/MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving b/data/2024/aaai/MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving new file mode 100644 index 0000000000..d8d11eecd9 --- /dev/null +++ b/data/2024/aaai/MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving @@ -0,0 +1 @@ +Instance segmentation is a fundamental research topic in computer vision, especially in autonomous driving. However, manual mask annotation for instance segmentation is quite time-consuming and costly. To address this problem, some prior works attempt to adopt a weakly supervised manner by exploring 2D or 3D boxes. However, no one has ever successfully segmented 2D and 3D instances simultaneously by only using 2D box annotations, which could further reduce the annotation cost by an order of magnitude. Thus, we propose a novel framework called Multimodal Weakly Supervised Instance Segmentation (MWSIS), which incorporates various fine-grained label correction modules for both 2D and 3D modalities, along with a new multimodal cross-supervision approach. In the 2D pseudo label generation branch, the Instance-based Pseudo Mask Generation (IPG) module utilizes predictions for self-supervised correction. Similarly, in the 3D pseudo label generation branch, the Spatial-based Pseudo Label Generation (SPG) module generates pseudo labels by incorporating the spatial prior information of the point cloud. To further refine the generated pseudo labels, the Point-based Voting Label Correction (PVC) module utilizes historical predictions for correction. Additionally, a Ring Segment-based Label Correction (RSC) module is proposed to refine the predictions by leveraging the depth prior information from the point cloud. Finally, the Consistency Sparse Cross-modal Supervision (CSCS) module reduces the inconsistency of multimodal predictions by response distillation. Particularly, transferring the 3D backbone to downstream tasks not only improves the performance of the 3D detectors, but also outperforms fully supervised instance segmentation with only 5% fully supervised annotations. On the Waymo dataset, the proposed framework demonstrates significant improvements over the baseline, especially achieving 2.59% mAP and 12.75% mAP increases for 2D and 3D instance segmentation tasks, respectively. The code is available at https://github.com/jiangxb98/mwsis-plugin. \ No newline at end of file diff --git a/data/2024/aaai/Machine Learning-Powered Combinatorial Clock Auction b/data/2024/aaai/Machine Learning-Powered Combinatorial Clock Auction new file mode 100644 index 0000000000..f914c054b8 --- /dev/null +++ b/data/2024/aaai/Machine Learning-Powered Combinatorial Clock Auction @@ -0,0 +1 @@ +We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most important information from bidders. However, from a practical point of view, the main shortcoming of this prior work is that those designs elicit bidders' preferences via value queries (i.e., “What is your value for the bundle {A, B}?''). 
In most real-world ICA domains, value queries are considered impractical, since they impose an unrealistically high cognitive burden on bidders, which is why they are not used in practice. In this paper, we address this shortcoming by designing an ML-powered combinatorial clock auction that elicits information from the bidders only via demand queries (i.e., “At prices p, what is your most preferred bundle of items?''). We make two key technical contributions: First, we present a novel method for training an ML model on demand queries. Second, based on those trained ML models, we introduce an efficient method for determining the demand query with the highest clearing potential, for which we also provide a theoretical foundation. We experimentally evaluate our ML-based demand query mechanism in several spectrum auction domains and compare it against the most established real-world ICA: the combinatorial clock auction (CCA). Our mechanism significantly outperforms the CCA in terms of efficiency in all domains, it achieves higher efficiency in a significantly reduced number of rounds, and, using linear prices, it exhibits vastly higher clearing potential. Thus, with this paper we bridge the gap between research and practice and propose the first practical ML-powered ICA. \ No newline at end of file diff --git a/data/2024/aaai/Machine-Created Universal Language for Cross-Lingual Transfer b/data/2024/aaai/Machine-Created Universal Language for Cross-Lingual Transfer new file mode 100644 index 0000000000..19c091adb6 --- /dev/null +++ b/data/2024/aaai/Machine-Created Universal Language for Cross-Lingual Transfer @@ -0,0 +1 @@ +There are two primary approaches to addressing cross-lingual transfer: multilingual pre-training, which implicitly aligns the hidden representations of various languages, and translate-test, which explicitly translates different languages into an intermediate language, such as English. Translate-test offers better interpretability compared to multilingual pre-training. However, it has lower performance than multilingual pre-training and struggles with word-level tasks due to translation altering word order. As a result, we propose a new Machine-created Universal Language (MUL) as an alternative intermediate language. MUL comprises a set of discrete symbols forming a universal vocabulary and a natural language to MUL translator for converting multiple natural languages to MUL. MUL unifies shared concepts from various languages into a single universal word, enhancing cross-language transfer. Additionally, MUL retains language-specific words and word order, allowing the model to be easily applied to word-level tasks. Our experiments demonstrate that translating into MUL yields improved performance compared to multilingual pre-training, and our analysis indicates that MUL possesses strong interpretability. The code is at: https://github.com/microsoft/Unicoder/tree/master/MCUL. \ No newline at end of file diff --git a/data/2024/aaai/MagiCapture: High-Resolution Multi-Concept Portrait Customization b/data/2024/aaai/MagiCapture: High-Resolution Multi-Concept Portrait Customization new file mode 100644 index 0000000000..e28ab52c9a --- /dev/null +++ b/data/2024/aaai/MagiCapture: High-Resolution Multi-Concept Portrait Customization @@ -0,0 +1 @@ +Large-scale text-to-image models including Stable Diffusion are capable of generating high-fidelity photorealistic portrait images. 
There is an active research area dedicated to personalizing these models, aiming to synthesize specific subjects or styles using provided sets of reference images. However, despite the plausible results from these personalization methods, they tend to produce images that often fall short of realism and are not yet on a commercially viable level. This is particularly noticeable in portrait image generation, where any unnatural artifact in human faces is easily discernible due to our inherent human bias. To address this, we introduce MagiCapture, a personalization method for integrating subject and style concepts to generate high-resolution portrait images using just a few subject and style references. For instance, given a handful of random selfies, our fine-tuned model can generate high-quality portrait images in specific styles, such as passport or profile photos. The main challenge with this task is the absence of ground truth for the composed concepts, leading to a reduction in the quality of the final output and an identity shift of the source subject. To address these issues, we present a novel Attention Refocusing loss coupled with auxiliary priors, both of which facilitate robust learning within this weakly supervised learning setting. Our pipeline also includes additional post-processing steps to ensure the creation of highly realistic outputs. MagiCapture outperforms other baselines in both quantitative and qualitative evaluations and can also be generalized to other non-human objects. \ No newline at end of file diff --git a/data/2024/aaai/Make Lossy Compression Meaningful for Low-Light Images b/data/2024/aaai/Make Lossy Compression Meaningful for Low-Light Images new file mode 100644 index 0000000000..df931db2f9 --- /dev/null +++ b/data/2024/aaai/Make Lossy Compression Meaningful for Low-Light Images @@ -0,0 +1 @@ +Low-light images frequently occur due to unavoidable environmental influences or technical limitations, such as insufficient lighting or limited exposure time. To achieve better visibility for visual perception, low-light image enhancement is usually adopted. Besides, lossy image compression is vital for meeting the requirements of storage and transmission in computer vision applications. To address these two practical demands, current solutions can be categorized into two sequential manners: ``Compress before Enhance (CbE)'' or ``Enhance before Compress (EbC)''. However, neither of them is suitable since: (1) Error accumulation in the individual models plagues sequential solutions. Especially, once low-light images are compressed by existing general lossy image compression approaches, useful information (e.g., texture details) would be lost, resulting in a dramatic performance decrease in low-light image enhancement. (2) Due to the intermediate process, the sequential solution introduces an additional burden, resulting in low efficiency. We propose a novel joint solution to simultaneously achieve a high compression rate and good enhancement performance for low-light images with much lower computational cost and fewer model parameters. We design an end-to-end trainable architecture, which includes the main enhancement branch and the signal-to-noise ratio (SNR) aware branch. 
Experimental results show that our proposed joint solution achieves a significant improvement over different combinations of existing state-of-the-art sequential ``Compress before Enhance'' or ``Enhance before Compress'' solutions for low-light images, which would make lossy low-light image compression more meaningful. The project is publicly available at: https://github.com/CaiShilv/Joint-IC-LL. \ No newline at end of file diff --git a/data/2024/aaai/Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior b/data/2024/aaai/Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior new file mode 100644 index 0000000000..eade4ce4ad --- /dev/null +++ b/data/2024/aaai/Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior @@ -0,0 +1 @@ +Recent vision-language pre-trained (VLP) models have become the backbone for many downstream tasks, but they are utilized as frozen models without further learning. Prompt learning is a method to improve the pre-trained VLP model by adding a learnable context vector to the inputs of the text encoder. In a few-shot learning scenario of the downstream task, MLE training can lead the context vector to over-fit dominant image features in the training data. This overfitting can potentially harm the generalization ability, especially in the presence of a distribution shift between the training and test dataset. This paper presents a Bayesian-based framework of prompt tuning, which could alleviate the over-fitting issues in few-shot learning applications and increase the adaptability of prompts on unobserved instances. Specifically, modeling a data-dependent prior enhances the adaptability of text features for both seen and unseen image features without a trade-off in performance between them. Based on the Bayesian framework, we utilize the Wasserstein gradient flow in the estimation of our target posterior distribution, which enables our prompt to be flexible in capturing the complex modes of image features. We demonstrate the effectiveness of our method in several experiments on benchmark datasets by showing statistically significant improvements in performance compared to existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Make RepVGG Greater Again: A Quantization-Aware Approach b/data/2024/aaai/Make RepVGG Greater Again: A Quantization-Aware Approach new file mode 100644 index 0000000000..c0af268392 --- /dev/null +++ b/data/2024/aaai/Make RepVGG Greater Again: A Quantization-Aware Approach @@ -0,0 +1 @@ +The tradeoff between performance and inference speed is critical for practical applications. Architecture reparameterization obtains better tradeoffs and is becoming an increasingly popular ingredient in modern convolutional neural networks. Nonetheless, its quantization performance is usually too poor to deploy (e.g. more than 20% top-1 accuracy drop on ImageNet) when INT8 inference is desired. In this paper, we dive into the underlying mechanism of this failure, where the original design inevitably enlarges quantization error. We propose a simple, robust, and effective remedy to have a quantization-friendly structure that also enjoys reparameterization benefits. Our method greatly bridges the gap between INT8 and FP32 accuracy for RepVGG. Without bells and whistles, the top-1 accuracy drop on ImageNet is reduced to within 2% by standard post-training quantization. 
Extensive experiments on detection and semantic segmentation tasks verify its generalization. \ No newline at end of file diff --git a/data/2024/aaai/Making AI Policies Transparent to Humans through Demonstrations b/data/2024/aaai/Making AI Policies Transparent to Humans through Demonstrations new file mode 100644 index 0000000000..5c31da8b79 --- /dev/null +++ b/data/2024/aaai/Making AI Policies Transparent to Humans through Demonstrations @@ -0,0 +1 @@ +Demonstrations are a powerful way of increasing the transparency of AI policies to humans. Though we can approximately model human learning from demonstrations as inverse reinforcement learning, we note that human learning can differ from algorithmic learning in key ways, e.g. humans are computationally limited and may sometimes struggle to understand all of the nuances of a demonstration. Unlike related work that provide demonstrations to humans that simply maximize information gain, I leverage concepts from the human education literature, such as the zone of proximal development and scaffolding, to show demonstrations that balance informativeness and difficulty of understanding to maximize human learning. \ No newline at end of file diff --git a/data/2024/aaai/Making Natural Language Reasoning Explainable and Faithful b/data/2024/aaai/Making Natural Language Reasoning Explainable and Faithful new file mode 100644 index 0000000000..3c47826f96 --- /dev/null +++ b/data/2024/aaai/Making Natural Language Reasoning Explainable and Faithful @@ -0,0 +1 @@ +Neural models, including large language models (LLMs), achieve superior performance on logical reasoning tasks such as question answering. To elicit reasoning capabilities from LLMs, recent works propose using the chain-of-thought (CoT) mechanism to generate both the reasoning chain and the answer, which enhances the model’s capabilities in conducting reasoning. However, due to LLM’s uninterpretable nature and the extreme flexibility of free-form explanations, several challenges remain: such as struggling with inaccurate reasoning, hallucinations, and not aligning with human preferences. In this talk, we will focus on (1) our design of leveraging structured information (that is grounded to the context), for the explainable complex question answering and reasoning; (2) our multi-module interpretable framework for inductive reasoning, which conducts step-wise faithful reasoning with iterative feedback. \ No newline at end of file diff --git a/data/2024/aaai/Manifold Constraints for Imperceptible Adversarial Attacks on Point Clouds b/data/2024/aaai/Manifold Constraints for Imperceptible Adversarial Attacks on Point Clouds new file mode 100644 index 0000000000..500e6b8ffe --- /dev/null +++ b/data/2024/aaai/Manifold Constraints for Imperceptible Adversarial Attacks on Point Clouds @@ -0,0 +1 @@ +Adversarial attacks on 3D point clouds often exhibit unsatisfactory imperceptibility, which primarily stems from the disregard for manifold-aware distortion, i.e., distortion of the underlying 2-manifold surfaces. In this paper, we develop novel manifold constraints to reduce such distortion, aiming to enhance the imperceptibility of adversarial attacks on 3D point clouds. Specifically, we construct a bijective manifold mapping between point clouds and a simple parameter shape using an invertible auto-encoder. Consequently, manifold-aware distortion during attacks can be captured within the parameter space. 
By enforcing manifold constraints that preserve local properties of the parameter shape, manifold-aware distortion is effectively mitigated, ultimately leading to enhanced imperceptibility. Extensive experiments demonstrate that integrating manifold constraints into conventional adversarial attack solutions yields superior imperceptibility, outperforming the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Manifold-Based Verbalizer Space Re-embedding for Tuning-Free Prompt-Based Classification b/data/2024/aaai/Manifold-Based Verbalizer Space Re-embedding for Tuning-Free Prompt-Based Classification new file mode 100644 index 0000000000..8e0586cf24 --- /dev/null +++ b/data/2024/aaai/Manifold-Based Verbalizer Space Re-embedding for Tuning-Free Prompt-Based Classification @@ -0,0 +1 @@ +Prompt-based classification adapts tasks to a cloze question format utilizing the [MASK] token and the filled tokens are then mapped to labels through pre-defined verbalizers. Recent studies have explored the use of verbalizer embeddings to reduce labor in this process. However, all existing studies require a tuning process for either the pre-trained models or additional trainable embeddings. Meanwhile, the distance between high-dimensional verbalizer embeddings should not be measured by Euclidean distance due to the potential for non-linear manifolds in the representation space. In this study, we propose a tuning-free manifold-based space re-embedding method called Locally Linear Embedding with Intra-class Neighborhood Constraint (LLE-INC) for verbalizer embeddings, which preserves local properties within the same class as guidance for classification. Experimental results indicate that even without tuning any parameters, our LLE-INC is on par with automated verbalizers with parameter tuning. And with the parameter updating, our approach further enhances prompt-based tuning by up to 3.2%. Furthermore, experiments with the LLaMA-7B&13B indicate that LLE-INC is an efficient tuning-free classification approach for the hyper-scale language models. \ No newline at end of file diff --git a/data/2024/aaai/Manipulation-Robust Selection of Citizens' Assemblies b/data/2024/aaai/Manipulation-Robust Selection of Citizens' Assemblies new file mode 100644 index 0000000000..38040f6bc6 --- /dev/null +++ b/data/2024/aaai/Manipulation-Robust Selection of Citizens' Assemblies @@ -0,0 +1 @@ +Among the recent work on designing algorithms for selecting citizens' assembly participants, one key property of these algorithms has not yet been studied: their manipulability. Strategic manipulation is a concern because these algorithms must satisfy representation constraints according to volunteers' self-reported features; misreporting these features could thereby increase a volunteer's chance of being selected, decrease someone else's chance, and/or increase the expected number of seats given to their group. Strikingly, we show that Leximin — an algorithm that is widely used for its fairness — is highly manipulable in this way. We then introduce a new class of selection algorithms that use Lp norms as objective functions. We show that the manipulability of the Lp-based algorithm decreases in O(1/n^(1-1/p)) as the number of volunteers n grows, approaching the optimal rate of O(1/n) as p approaches infinity. These theoretical results are confirmed via experiments in eight real-world datasets. 
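The following short Python sketch (illustrative only; it is not taken from the paper and simply evaluates the asymptotic bound quoted above) shows how the manipulability rate n^-(1 - 1/p) of the Lp-based algorithm approaches the optimal n^-1 rate as p grows:

def manipulability_bound(n: int, p: float) -> float:
    # Rate quoted in the abstract: manipulability decreases as n^-(1 - 1/p).
    return n ** -(1.0 - 1.0 / p)

for p in (2.0, 5.0, 10.0, float("inf")):  # p = inf recovers the optimal 1/n rate
    print(p, [round(manipulability_bound(n, p), 6) for n in (100, 1_000, 10_000)])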
\ No newline at end of file diff --git a/data/2024/aaai/MapLE: Matching Molecular Analogues Promptly with Low Computational Resources by Multi-Metrics Evaluation (Student Abstract) b/data/2024/aaai/MapLE: Matching Molecular Analogues Promptly with Low Computational Resources by Multi-Metrics Evaluation (Student Abstract) new file mode 100644 index 0000000000..296847b024 --- /dev/null +++ b/data/2024/aaai/MapLE: Matching Molecular Analogues Promptly with Low Computational Resources by Multi-Metrics Evaluation (Student Abstract) @@ -0,0 +1 @@ +Matching molecular analogues is a computational chemistry and bioinformatics research issue which is used to identify molecules that are structurally or functionally similar to a target molecule. Recent studies on matching analogous molecules have predominantly concentrated on enhancing effectiveness, often sidelining computational efficiency, particularly in contexts of low computational resources. This oversight poses challenges in many real applications (e.g., drug discovery, catalyst generation and so forth). To tackle this issue, we propose a general strategy named MapLE, aiming to promptly match analogous molecules with low computational resources by multi-metrics evaluation. Experimental evaluation conducted on a public biomolecular dataset validates the excellent and efficient performance of the proposed strategy. \ No newline at end of file diff --git a/data/2024/aaai/Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation b/data/2024/aaai/Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation new file mode 100644 index 0000000000..18bbe9c66b --- /dev/null +++ b/data/2024/aaai/Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation @@ -0,0 +1,3 @@ +Homography estimation is a fundamental problem in computer vision. Previous works mainly focus on estimating either a single homography, or multiple homographies based on mesh grid division of the image. In practical scenarios, single homography is inadequate and often leads to a compromised result for multiple planes; while mesh grid multi-homography damages the plane distribution of the scene, and does not fully address the restriction to use homography. + +In this work, we propose a novel semantics guided multi-homography estimation framework, Mask-Homo, to provide an explicit solution to the multi-plane depth disparity problem. First, a pseudo plane mask generation module is designed to obtain multiple correlated regions that follow the plane distribution of the scene. Then, multiple local homography transformations, each of which aligns a correlated region precisely, are predicted and corresponding warped images are fused to obtain the final result. Furthermore, a new metric, Mask-PSNR, is proposed for more comprehensive evaluation of alignment. Extensive experiments are conducted to verify the effectiveness of the proposed method. Our code is available at https://github.com/SAITPublic/MaskHomo. 
\ No newline at end of file diff --git a/data/2024/aaai/MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation b/data/2024/aaai/MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation new file mode 100644 index 0000000000..a5242c3f28 --- /dev/null +++ b/data/2024/aaai/MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation @@ -0,0 +1 @@ +Few-shot instance segmentation extends the few-shot learning paradigm to the instance segmentation task, which tries to segment instance objects from a query image with a few annotated examples of novel categories. Conventional approaches have attempted to address the task via prototype learning, known as point estimation. However, this mechanism depends on prototypes (e.g. mean of K-shot) for prediction, leading to performance instability. To overcome the disadvantage of the point estimation mechanism, we propose a novel approach, dubbed MaskDiff, which models the underlying conditional distribution of a binary mask, which is conditioned on an object region and K-shot information. Inspired by augmentation approaches that perturb data with Gaussian noise for populating low data density regions, we model the mask distribution with a diffusion probabilistic model. We also propose to utilize classifier-free guided mask sampling to integrate category information into the binary mask generation process. Without bells and whistles, our proposed method consistently outperforms state-of-the-art methods on both base and novel classes of the COCO dataset while simultaneously being more stable than existing methods. The source code is available at: https://github.com/minhquanlecs/MaskDiff. \ No newline at end of file diff --git a/data/2024/aaai/Mastering Context-to-Label Representation Transformation for Event Causality Identification with Diffusion Models b/data/2024/aaai/Mastering Context-to-Label Representation Transformation for Event Causality Identification with Diffusion Models new file mode 100644 index 0000000000..1d7a788e82 --- /dev/null +++ b/data/2024/aaai/Mastering Context-to-Label Representation Transformation for Event Causality Identification with Diffusion Models @@ -0,0 +1 @@ +To understand event structures of documents, event causality identification (ECI) emerges as a crucial task, aiming to discern causal relationships among event mentions. The latest approach for ECI has introduced advanced deep learning models where transformer-based encoding models, complemented by enriching components, are typically leveraged to learn effective event context representations for causality prediction. As such, an important step for ECI models is to transform the event context representations into causal label representations to perform logits score computation for training and inference purposes. Within this framework, event context representations might encapsulate numerous complicated and noisy structures due to the potential long context between the input events while causal label representations are intended to capture pure information about the causal relations to facilitate score estimation. Nonetheless, a notable drawback of existing ECI models stems from their reliance on simple feed-forward networks to handle the complex context-to-label representation transformation process, which might require drastic changes in the representations to hinder the learning process. 
To overcome this issue, our work introduces a novel method for ECI where, instead of abrupt transformations, event context representations are gradually updated to achieve effective label representations. This process will be done incrementally to allow filtering of irrelevant structures at varying levels of granularity for causal relations. To realize this, we present a diffusion model to learn gradual representation transition processes between context and causal labels. It operates through a forward pass for causal label representation noising and a reverse pass for reconstructing label representations from random noise. Our experiments on different datasets across multiple languages demonstrate the advantages of the diffusion model with state-of-the-art performance for ECI. \ No newline at end of file diff --git a/data/2024/aaai/MatchDet: A Collaborative Framework for Image Matching and Object Detection b/data/2024/aaai/MatchDet: A Collaborative Framework for Image Matching and Object Detection new file mode 100644 index 0000000000..c5577434fe --- /dev/null +++ b/data/2024/aaai/MatchDet: A Collaborative Framework for Image Matching and Object Detection @@ -0,0 +1 @@ +Image matching and object detection are two fundamental and challenging tasks, while many related applications consider them as two individual tasks (i.e. task-individual). In this paper, a collaborative framework called MatchDet (i.e. task-collaborative) is proposed for image matching and object detection to obtain mutual improvements. To achieve the collaborative learning of the two tasks, we propose three novel modules, including a Weighted Spatial Attention Module (WSAM) for the Detector, and a Weighted Attention Module (WAM) and Box Filter for the Matcher. Specifically, the WSAM highlights the foreground regions of the target image to benefit the subsequent detector, the WAM enhances the connection between the foreground regions of the paired images to ensure high-quality matches, and the Box Filter mitigates the impact of false matches. We evaluate the approaches on a new benchmark with two datasets called Warp-COCO and miniScanNet. Experimental results show our approaches are effective and achieve competitive improvements. \ No newline at end of file diff --git a/data/2024/aaai/MathAttack: Attacking Large Language Models towards Math Solving Ability b/data/2024/aaai/MathAttack: Attacking Large Language Models towards Math Solving Ability new file mode 100644 index 0000000000..ce6db335b0 --- /dev/null +++ b/data/2024/aaai/MathAttack: Attacking Large Language Models towards Math Solving Ability @@ -0,0 +1 @@ +With the boom of Large Language Models (LLMs), research on solving Math Word Problems (MWPs) has recently made great progress. However, there are few studies examining the robustness of LLMs in math solving ability. Instead of attacking prompts in the use of LLMs, we propose MathAttack, a model that attacks MWP samples, which is closer to the essence of robustness in solving math problems. Compared to traditional text adversarial attacks, it is essential to preserve the mathematical logic of the original MWPs during the attack. To this end, we propose logical entity recognition to identify logical entries, which are then frozen. Subsequently, the remaining text is attacked by adopting a word-level attacker. Furthermore, we propose a new dataset, RobustMath, to evaluate the robustness of LLMs in math solving ability.
Extensive experiments on our RobustMath and two other math benchmark datasets, GSM8K and MultiArith, show that MathAttack can effectively attack the math solving ability of LLMs. In the experiments, we observe that (1) our adversarial samples from higher-accuracy LLMs are also effective for attacking LLMs with lower accuracy (e.g., transfer from larger to smaller-size LLMs, or from few-shot to zero-shot prompts); (2) complex MWPs (such as more solving steps, longer text, more numbers) are more vulnerable to attack; (3) we can improve the robustness of LLMs by using our adversarial samples in few-shot prompts. Finally, we hope our practice and observations can serve as an important attempt towards enhancing the robustness of LLMs in math solving ability. The code and dataset are available at: https://github.com/zhouzihao501/MathAttack. \ No newline at end of file diff --git a/data/2024/aaai/MaxEnt Loss: Calibrating Graph Neural Networks under Out-of-Distribution Shift (Student Abstract) b/data/2024/aaai/MaxEnt Loss: Calibrating Graph Neural Networks under Out-of-Distribution Shift (Student Abstract) new file mode 100644 index 0000000000..e31c01feb1 --- /dev/null +++ b/data/2024/aaai/MaxEnt Loss: Calibrating Graph Neural Networks under Out-of-Distribution Shift (Student Abstract) @@ -0,0 +1 @@ +We present a new, simple and effective loss function for calibrating graph neural networks (GNNs). Miscalibration is the problem whereby a model's probabilities do not reflect its correctness, making it difficult and possibly dangerous for real-world deployment. We compare our method against other baselines on a novel ID and OOD graph form of the Celeb-A faces dataset. Our findings show that our method improves calibration for GNNs, which are not immune to miscalibration in-distribution (ID) and out-of-distribution (OOD). Our code is available for review at https://github.com/dexterdley/CS6208/tree/main/Project. \ No newline at end of file diff --git a/data/2024/aaai/MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift b/data/2024/aaai/MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift new file mode 100644 index 0000000000..56670af44f --- /dev/null +++ b/data/2024/aaai/MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift @@ -0,0 +1 @@ +We present a new loss function that addresses the out-of-distribution (OOD) network calibration problem. While many objective functions have been proposed to effectively calibrate models in-distribution, our findings show that they do not always fare well OOD. Based on the Principle of Maximum Entropy, we incorporate helpful statistical constraints observed during training, delivering better model calibration without sacrificing accuracy. We provide theoretical analysis and show empirically that our method works well in practice, achieving state-of-the-art calibration on both synthetic and real-world benchmarks. Our code is available at https://github.com/dexterdley/MaxEnt-Loss. \ No newline at end of file diff --git a/data/2024/aaai/Maxileximin Envy Allocations and Connected Goods b/data/2024/aaai/Maxileximin Envy Allocations and Connected Goods new file mode 100644 index 0000000000..36a3360c71 --- /dev/null +++ b/data/2024/aaai/Maxileximin Envy Allocations and Connected Goods @@ -0,0 +1,3 @@ +Fair allocation of indivisible goods presents intriguing challenges from both a social choice perspective and an algorithmic standpoint.
Due to the indivisibility of goods, it is common for one agent to envy the bundle of goods assigned to another agent and, indeed, envy-free solutions do not exist in general. In line with the classical game-theoretic concept of Nucleolus in coalitional games, we propose that a fair allocation should minimize the agents’ dissatisfaction profile in a lexicographic manner, where the dissatisfaction of an agent is defined as her maximum envy towards other agents. Therefore, we seek allocations that minimize the maximum envy. In cases where multiple solutions have an equal maximum value, we minimize the second-worst value, and so on. Additionally, as is customary in fair division problems, we also consider an efficiency requirement: among the allocations with the best agents’ dissatisfaction profile, we prioritize those that maximize the sum of agents’ utilities, known as maximum social welfare. Such allocations, referred to as maxileximin allocations, always exist. +In this study, we analyze the computational properties of maxileximin allocations in the context of fair allocation problems with constraints. Specifically, we focus on the Connected Fair Division problem, where goods correspond to the nodes of a graph, and a bundle of goods is allowed if the subgraph formed by those goods is connected. We demonstrate that the problem is F∆P2 -complete, even for instances with simple graphical structures such as path and star graphs. +However, we identify islands of tractability for instances with more intricate graphs, such as those having bounded treewidth, provided that the number of agents is bounded by a fixed number and utility functions use small values. \ No newline at end of file diff --git a/data/2024/aaai/Maximizing the Success Probability of Policy Allocations in Online Systems b/data/2024/aaai/Maximizing the Success Probability of Policy Allocations in Online Systems new file mode 100644 index 0000000000..cc988e2a8a --- /dev/null +++ b/data/2024/aaai/Maximizing the Success Probability of Policy Allocations in Online Systems @@ -0,0 +1 @@ +The effectiveness of advertising in e-commerce largely depends on the ability of merchants to bid on and win impressions for their targeted users. The bidding procedure is highly complex due to various factors such as market competition, user behavior, and the diverse objectives of advertisers. In this paper we consider the problem at the level of user timelines instead of individual bid requests, manipulating full policies (i.e. pre-defined bidding strategies) and not bid values. In order to optimally allocate policies to users, typical multiple treatments allocation methods solve knapsack-like problems which aim at maximizing an expected value under constraints. In the specific context of online advertising, we argue that optimizing for the probability of success is a more suited objective than expected value maximization, and we introduce the SuccessProbaMax algorithm that aims at finding the policy allocation which is the most likely to outperform a fixed reference policy. Finally, we conduct comprehensive experiments both on synthetic and real-world data to evaluate its performance. The results demonstrate that our proposed algorithm outperforms conventional expected-value maximization algorithms in terms of success rate. 
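To make the contrast between the two objectives concrete, here is a small Python sketch (a toy illustration under assumed Gaussian outcomes, not the paper's SuccessProbaMax algorithm): with noisy per-user outcomes, the candidate policy with the highest expected value is not necessarily the one most likely to beat a fixed reference policy.

import random

random.seed(0)
reference = [random.gauss(1.0, 0.2) for _ in range(10_000)]  # outcomes under the reference policy

candidates = {
    "risky":  [random.gauss(1.10, 2.0) for _ in range(10_000)],  # higher mean, high variance
    "steady": [random.gauss(1.05, 0.1) for _ in range(10_000)],  # lower mean, low variance
}

for name, outcomes in candidates.items():
    expected_value = sum(outcomes) / len(outcomes)
    success_prob = sum(o > r for o, r in zip(outcomes, reference)) / len(outcomes)
    print(f"{name}: expected value = {expected_value:.3f}, P(beat reference) = {success_prob:.3f}")
# With this setup, "risky" has the higher expected value while "steady" is more likely to outperform the reference.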
\ No newline at end of file diff --git a/data/2024/aaai/MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance b/data/2024/aaai/MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance new file mode 100644 index 0000000000..8ab881c030 --- /dev/null +++ b/data/2024/aaai/MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance @@ -0,0 +1 @@ +This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observation-space scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning or test-time optimization of the Diffusion Models. Through extensive qualitative, quantitative, and subjective experiments on various benchmarks, the study demonstrates the effectiveness and superiority of the proposed approach. Our project page can be found at https://medm2023.github.io \ No newline at end of file diff --git a/data/2024/aaai/Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework b/data/2024/aaai/Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework new file mode 100644 index 0000000000..9947886204 --- /dev/null +++ b/data/2024/aaai/Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework @@ -0,0 +1 @@ +Unsupervised domain adaptation object detection (UDAOD) research on the Detection Transformer (DETR) mainly focuses on feature alignment, and existing methods can be divided into two kinds, each of which has its unresolved issues. One-stage feature alignment methods can easily lead to performance fluctuation and training stagnation. Two-stage feature alignment methods based on mean teacher comprise a pretraining stage followed by a self-training stage, each facing problems in obtaining a reliable pretrained model and achieving consistent performance gains. The methods mentioned above have not yet explored how to utilize a third related domain, such as a target-like domain, to assist adaptation. To address these issues, we propose a two-stage framework named MTM, i.e. Mean Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we utilize labeled target-like images produced by image style transfer to avoid performance fluctuation. In the self-training stage, we leverage unlabeled target images via pseudo labels based on the mean teacher and propose a module called Object Queries Knowledge Transfer (OQKT) to ensure consistent performance gains of the student model.
Most importantly, we propose masked feature alignment methods, including Masked Domain Query-based Feature Alignment (MDQFA) and Masked Token-wise Feature Alignment (MTWFA), to alleviate domain shift in a more robust way, which not only prevent training stagnation and lead to a robust pretrained model in the pretraining stage, but also enhance the model's target performance in the self-training stage. Experiments on three challenging scenarios and a theoretical analysis verify the effectiveness of MTM. \ No newline at end of file diff --git a/data/2024/aaai/Measuring Self-Supervised Representation Quality for Downstream Classification Using Discriminative Features b/data/2024/aaai/Measuring Self-Supervised Representation Quality for Downstream Classification Using Discriminative Features new file mode 100644 index 0000000000..a5ac41477b --- /dev/null +++ b/data/2024/aaai/Measuring Self-Supervised Representation Quality for Downstream Classification Using Discriminative Features @@ -0,0 +1 @@ +Self-supervised learning (SSL) has shown impressive results in downstream classification tasks. However, there is limited work in understanding their failure modes and interpreting their learned representations. In this paper, we study the representation space of state-of-the-art self-supervised models including SimCLR, SwaV, MoCo, BYOL, DINO, SimSiam, VICReg and Barlow Twins. Without the use of class label information, we discover discriminative features that correspond to unique physical attributes in images, present mostly in correctly-classified representations. Using these features, we can compress the representation space by up to 40% without significantly affecting linear classification performance. We then propose the Self-Supervised Representation Quality Score (or Q-Score), an unsupervised score that can reliably predict if a given sample is likely to be mis-classified during linear evaluation, achieving an AUPRC of 91.45 on ImageNet-100 and 78.78 on ImageNet-1K. Q-Score can also be used as a regularization term on pre-trained encoders to remedy low-quality representations. Fine-tuning with Q-Score regularization can boost the linear probing accuracy of SSL models by up to 5.8% on ImageNet-100 and 3.7% on ImageNet-1K compared to their baselines. Finally, using gradient heatmaps and Salient ImageNet masks, we define a metric to quantify the interpretability of each representation. We show that discriminative features are strongly correlated to core attributes, and enhancing these features through Q-Score regularization makes SSL representations more interpretable.
In light of this, we introduce the measure task consistency to quantify the similarity between graph pre-training and downstream tasks. This measure assesses the extent to which downstream tasks can benefit from specific pre-training tasks. Moreover, a novel fine-tuning strategy, Bridge-Tune, is proposed to further diminish the impact of the difference between pre-training and downstream tasks. The key innovation in Bridge-Tune is an intermediate step that bridges pre-training and downstream tasks. This step takes into account the task differences and further refines the pre-trained model. The superiority of the presented fine-tuning strategy is validated via numerous experiments with different pre-trained models and downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records b/data/2024/aaai/MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records new file mode 100644 index 0000000000..c73f404e50 --- /dev/null +++ b/data/2024/aaai/MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records @@ -0,0 +1 @@ +The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences. \ No newline at end of file diff --git a/data/2024/aaai/MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models b/data/2024/aaai/MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models new file mode 100644 index 0000000000..53b3673e7d --- /dev/null +++ b/data/2024/aaai/MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models @@ -0,0 +1 @@ +The emergence of various medical large language models (LLMs) in the medical domain has highlighted the need for unified evaluation standards, as manual evaluation of LLMs proves to be time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain, comprising 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. 
In particular, this benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experiences of doctors in Mainland China, thereby establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities in medical language learning models. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. These findings elucidate both the capabilities and limitations of LLMs within the context of MedBench, with the ultimate goal of aiding the medical research community. \ No newline at end of file diff --git a/data/2024/aaai/MedSegDiff-V2: Diffusion-Based Medical Image Segmentation with Transformer b/data/2024/aaai/MedSegDiff-V2: Diffusion-Based Medical Image Segmentation with Transformer new file mode 100644 index 0000000000..91b654927a --- /dev/null +++ b/data/2024/aaai/MedSegDiff-V2: Diffusion-Based Medical Image Segmentation with Transformer @@ -0,0 +1 @@ +The Diffusion Probabilistic Model (DPM) has recently gained popularity in the field of computer vision, thanks to its image generation applications, such as Imagen, Latent Diffusion Models, and Stable Diffusion, which have demonstrated impressive capabilities and sparked much discussion within the community. Recent investigations have further unveiled the utility of DPM in the domain of medical image analysis, as underscored by the commendable performance exhibited by medical image segmentation models across various tasks. Although these models were originally underpinned by a UNet architecture, there exists a potential avenue for enhancing their performance through the integration of vision transformer mechanisms. However, we discovered that simply combining these two models resulted in subpar performance. To effectively integrate these two cutting-edge techniques for medical image segmentation, we propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2. We verify its effectiveness on 20 medical image segmentation tasks with different image modalities. Through comprehensive evaluation, our approach demonstrates superiority over prior state-of-the-art (SOTA) methodologies. Code is released at https://github.com/KidsWithTokens/MedSegDiff. \ No newline at end of file diff --git a/data/2024/aaai/Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games b/data/2024/aaai/Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games new file mode 100644 index 0000000000..efb9cf45d8 --- /dev/null +++ b/data/2024/aaai/Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games @@ -0,0 +1 @@ +Learning in games considers how multiple agents maximize their own rewards through repeated games.
Memory, the ability of an agent to change its action depending on the history of actions in previous games, is often introduced into learning to explore more clever strategies and discuss the decision-making of real agents like humans. However, such games with memory are hard to analyze because they exhibit complex phenomena like chaotic dynamics or divergence from Nash equilibrium. In particular, how asymmetry in memory capacities between agents affects learning in games is still unclear. In response, this study formulates a gradient ascent algorithm in games with asymmetric memory capacities. To obtain theoretical insights into learning dynamics, we first consider a simple case of zero-sum games. We observe complex behavior, where learning dynamics draw a heteroclinic connection from unstable fixed points to stable ones. Despite this complexity, we analyze the learning dynamics and prove local convergence to these stable fixed points, i.e., the Nash equilibria. We identify the mechanism driving this convergence: an agent with a longer memory learns to exploit the other, which in turn endows the other's utility function with strict concavity. We further numerically observe such convergence for various initial strategies, numbers of actions, and memory lengths. This study reveals a novel phenomenon due to memory asymmetry, providing fundamental strides in learning in games and new insights into computing equilibria. \ No newline at end of file diff --git a/data/2024/aaai/Memory-Augmenting Decoder-Only Language Models through Encoders (Student Abstract) b/data/2024/aaai/Memory-Augmenting Decoder-Only Language Models through Encoders (Student Abstract) new file mode 100644 index 0000000000..9009c7df45 --- /dev/null +++ b/data/2024/aaai/Memory-Augmenting Decoder-Only Language Models through Encoders (Student Abstract) @@ -0,0 +1 @@ +The Transformer architecture has attracted a lot of attention in recent years, thanks in part to its ability to scale well and allow massive parallelism during training. This has made possible the development of Language Models (LMs) of increasing size and the discovery of latent abilities that completely outclass traditional methods, e.g. rule-based systems. However, these models also introduced new issues, like their inability to retain the history of previous interactions due to their stateless nature or the difficulty in controlling their generation. Different attempts have been made to address these issues; e.g., a `brute force' approach to solving the memory issue is to include the full conversation history in the context window, a solution that is limited by the quadratic complexity of Transformers. In this work, we explore computationally practical solutions to the memory problem. We propose to augment the decoder-only architecture of (most) Large LMs with a (relatively small) memory encoder. Its output is prepended to the decoder's input in a similar fashion to recent works on Adapters and the original Transformer architecture. Initial experiments show promising results; however, future work is needed to compare with State-of-the-Art methods.
\ No newline at end of file diff --git a/data/2024/aaai/Memory-Efficient Prompt Tuning for Incremental Histopathology Classification b/data/2024/aaai/Memory-Efficient Prompt Tuning for Incremental Histopathology Classification new file mode 100644 index 0000000000..98819f1d4f --- /dev/null +++ b/data/2024/aaai/Memory-Efficient Prompt Tuning for Incremental Histopathology Classification @@ -0,0 +1 @@ +Recent studies have made remarkable progress in histopathology classification. Based on current successes, contemporary works have proposed to further upgrade the model towards a more generalizable and robust direction by incrementally learning from sequentially delivered domains. Unlike previous parameter isolation based approaches that usually demand massive computation resources during model updating, we present a memory-efficient prompt tuning framework to cultivate model generalization potential at economical memory cost. For each incoming domain, we reuse the existing parameters of the initial classification model and attach lightweight trainable prompts into it for customized tuning. Considering the domain heterogeneity, we perform decoupled prompt tuning, where we adopt a domain-specific prompt for each domain to independently investigate its distinctive characteristics, and one domain-invariant prompt shared across all domains to continually explore the common content embedding throughout time. All domain-specific prompts will be appended to the prompt bank and isolated from further changes to prevent forgetting the distinctive features of early-seen domains. Meanwhile, the domain-invariant prompt will be passed on and iteratively evolved via style-augmented prompt refining to improve model generalization capability over time. Specifically, we construct a graph with existing prompts and build a style-augmented graph attention network to guide the domain-invariant prompt in exploring the overlapped latent embedding among all delivered domains for more domain-generic representations. We have extensively evaluated our framework with two histopathology tasks, i.e., breast cancer metastasis classification and epithelium-stroma tissue classification, where our approach yielded superior performance and memory efficiency over the competing methods. \ No newline at end of file diff --git a/data/2024/aaai/Memory-Efficient Reversible Spiking Neural Networks b/data/2024/aaai/Memory-Efficient Reversible Spiking Neural Networks new file mode 100644 index 0000000000..675e9d53b6 --- /dev/null +++ b/data/2024/aaai/Memory-Efficient Reversible Spiking Neural Networks @@ -0,0 +1 @@ +Spiking neural networks (SNNs) are potential competitors to artificial neural networks (ANNs) due to their high energy-efficiency on neuromorphic hardware. However, SNNs are unfolded over simulation time steps during the training process. Thus, SNNs require much more memory than ANNs, which impedes the training of deeper SNN models. In this paper, we propose the reversible spiking neural network to reduce the memory cost of intermediate activations and membrane potentials during training. Firstly, we extend the reversible architecture along the temporal dimension and propose the reversible spiking block, which can reconstruct the computational graph and recompute all intermediate variables in the forward pass with a reverse process. On this basis, we adapt state-of-the-art SNN models into their reversible variants, namely the reversible spiking ResNet (RevSResNet) and the reversible spiking transformer (RevSFormer).
Through experiments on static and neuromorphic datasets, we demonstrate that the memory cost per image of our reversible SNNs does not increase with the network depth. On the CIFAR10 and CIFAR100 datasets, our RevSResNet37 and RevSFormer-4-384 achieve comparable accuracies and consume 3.79x and 3.00x less GPU memory per image than their counterparts with roughly identical model complexity and parameters. We believe that this work can lift the memory constraints in SNN training and pave the way for training extremely large and deep SNNs. \ No newline at end of file diff --git a/data/2024/aaai/MemoryBank: Enhancing Large Language Models with Long-Term Memory b/data/2024/aaai/MemoryBank: Enhancing Large Language Models with Long-Term Memory new file mode 100644 index 0000000000..5fb68d0508 --- /dev/null +++ b/data/2024/aaai/MemoryBank: Enhancing Large Language Models with Long-Term Memory @@ -0,0 +1 @@ +Large Language Models (LLMs) have drastically reshaped our interactions with artificial intelligence (AI) systems, showcasing impressive performance across an extensive array of tasks. Despite this, a notable hindrance remains: the deficiency of a long-term memory mechanism within these models. This shortfall becomes increasingly evident in situations demanding sustained interaction, such as personal companion systems, psychological counseling, and secretarial assistance. Recognizing the necessity for long-term memory, we propose MemoryBank, a novel memory mechanism tailored for LLMs. MemoryBank enables the models to summon relevant memories, continually evolve through continuous memory updates, and comprehend and adapt to a user's personality over time by synthesizing information from previous interactions. To mimic anthropomorphic behaviors and selectively preserve memory, MemoryBank incorporates a memory updating mechanism, inspired by the Ebbinghaus Forgetting Curve theory. This mechanism permits the AI to forget and reinforce memory based on time elapsed and the relative significance of the memory, thereby offering a more human-like memory mechanism and an enriched user experience. MemoryBank is versatile in accommodating both closed-source models like ChatGPT and open-source models such as ChatGLM. To validate MemoryBank's effectiveness, we exemplify its application through the creation of an LLM-based chatbot named SiliconFriend in a long-term AI Companion scenario. Further tuned with psychological dialog data, SiliconFriend displays heightened empathy and discernment in its interactions. Our experiments involve both qualitative analysis with real-world user dialogs and quantitative analysis with simulated dialogs. In the latter, ChatGPT acts as multiple users with diverse characteristics and generates long-term dialog contexts covering a wide array of topics. The results of our analysis reveal that SiliconFriend, equipped with MemoryBank, exhibits a strong capability for long-term companionship as it can provide empathetic responses, recall relevant memories, and understand user personality.
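As a side note on the Ebbinghaus-style updating that the MemoryBank abstract references, the Python sketch below is a minimal, hypothetical illustration of the underlying forgetting-curve idea (retention R = exp(-t / S)); it is not the paper's actual update rule, and the names and threshold are assumptions for illustration only.

import math

def retention(hours_elapsed: float, strength: float) -> float:
    # Ebbinghaus forgetting curve: R = exp(-t / S), where S is the memory strength.
    return math.exp(-hours_elapsed / strength)

def update_memory(memory: dict, hours_elapsed: float, recalled: bool) -> dict:
    # Reinforce a memory when it is recalled; otherwise let it decay and
    # drop it once retention falls below a (hypothetical) threshold.
    if recalled:
        memory["strength"] *= 2.0
        memory["retention"] = 1.0
    else:
        memory["retention"] = retention(hours_elapsed, memory["strength"])
    memory["keep"] = memory["retention"] >= 0.05
    return memory

memory = {"text": "User enjoys hiking on weekends", "strength": 24.0, "retention": 1.0}
print(update_memory(memory, hours_elapsed=72.0, recalled=False))  # retention ~= 0.0498 < 0.05, so keep is False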
\ No newline at end of file diff --git a/data/2024/aaai/Merging AI Incidents Research with Political Misinformation Research: Introducing the Political Deepfakes Incidents Database b/data/2024/aaai/Merging AI Incidents Research with Political Misinformation Research: Introducing the Political Deepfakes Incidents Database new file mode 100644 index 0000000000..8a6de66c06 --- /dev/null +++ b/data/2024/aaai/Merging AI Incidents Research with Political Misinformation Research: Introducing the Political Deepfakes Incidents Database @@ -0,0 +1 @@ +This article presents the Political Deepfakes Incidents Database (PDID), a collection of politically-salient deepfakes, encompassing synthetically-created videos, images, and less-sophisticated `cheapfakes.' The project is driven by the rise of generative AI in politics, ongoing policy efforts to address harms, and the need to connect AI incidents and political communication research. The database contains political deepfake content, metadata, and researcher-coded descriptors drawn from political science, public policy, communication, and misinformation studies. It aims to help reveal the prevalence, trends, and impact of political deepfakes, such as those featuring major political figures or events. The PDID can benefit policymakers, researchers, journalists, fact-checkers, and the public by providing insights into deepfake usage, aiding in regulation, enabling in-depth analyses, supporting fact-checking and trust-building efforts, and raising awareness of political deepfakes. It is suitable for research and application on media effects, political discourse, AI ethics, technology governance, media literacy, and countermeasures. \ No newline at end of file diff --git a/data/2024/aaai/Meta-Crafting: Improved Detection of Out-of-Distributed Texts via Crafting Metadata Space (Student Abstract) b/data/2024/aaai/Meta-Crafting: Improved Detection of Out-of-Distributed Texts via Crafting Metadata Space (Student Abstract) new file mode 100644 index 0000000000..0853fa1c0d --- /dev/null +++ b/data/2024/aaai/Meta-Crafting: Improved Detection of Out-of-Distributed Texts via Crafting Metadata Space (Student Abstract) @@ -0,0 +1 @@ +Detecting out-of-distribution (OOD) samples is crucial for robust NLP models. Recent works observe two OOD types: background shifts (style change) and semantic shifts (content change), but existing detection methods vary in effectiveness for each type. To this end, we propose Meta-Crafting, a unified OOD detection method by constructing a new discriminative feature space utilizing 7 model-driven metadata chosen empirically that well detects both types of shifts. Our experimental results demonstrate state-of-the-art robustness to both shifts and significantly improved detection on stress datasets. \ No newline at end of file diff --git a/data/2024/aaai/Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables b/data/2024/aaai/Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables new file mode 100644 index 0000000000..2cb5994be3 --- /dev/null +++ b/data/2024/aaai/Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables @@ -0,0 +1 @@ +Designing suitable reward functions for numerous interacting intelligent agents is challenging in real-world applications. Inverse reinforcement learning (IRL) in mean field games (MFGs) offers a practical framework to infer reward functions from expert demonstrations. 
While promising, the assumption of agent homogeneity limits the capability of existing methods to handle demonstrations with heterogeneous and unknown objectives, which are common in practice. To this end, we propose a deep latent variable MFG model and an associated IRL method. Critically, our method can infer rewards from different yet structurally similar tasks without prior knowledge about underlying contexts or modifying the MFG model itself. Our experiments, conducted on simulated scenarios and a real-world spatial taxi-ride pricing problem, demonstrate the superiority of our approach over state-of-the-art IRL methods in MFGs. \ No newline at end of file diff --git a/data/2024/aaai/Meta-Learning-Based Adaptive Stability Certificates for Dynamical Systems b/data/2024/aaai/Meta-Learning-Based Adaptive Stability Certificates for Dynamical Systems new file mode 100644 index 0000000000..d037e496af --- /dev/null +++ b/data/2024/aaai/Meta-Learning-Based Adaptive Stability Certificates for Dynamical Systems @@ -0,0 +1 @@ +This paper addresses the problem of Neural Network (NN) based adaptive stability certification in a dynamical system. The state-of-the-art methods, such as Neural Lyapunov Functions (NLFs), use NN-based formulations to assess the stability of a non-linear dynamical system and compute a Region of Attraction (ROA) in the state space. However, under parametric uncertainty, if the values of system parameters vary over time, the NLF methods fail to adapt to such changes and may lead to conservative stability assessment performance. We circumvent this issue by integrating Model Agnostic Meta-learning (MAML) with NLFs and propose meta-NLFs. In this process, we train a meta-function that adapts to any parametric shifts and updates into an NLF for the system with new test-time parameter values. We demonstrate the stability assessment performance of meta-NLFs on some standard benchmark autonomous dynamical systems. \ No newline at end of file diff --git a/data/2024/aaai/Meta-Reinforcement Learning via Exploratory Task Clustering b/data/2024/aaai/Meta-Reinforcement Learning via Exploratory Task Clustering new file mode 100644 index 0000000000..4033bb77e5 --- /dev/null +++ b/data/2024/aaai/Meta-Reinforcement Learning via Exploratory Task Clustering @@ -0,0 +1 @@ +Meta-reinforcement learning (meta-RL) aims to quickly solve new RL tasks by leveraging knowledge from prior tasks. Previous studies often assume a single-mode homogeneous task distribution, ignoring possible structured heterogeneity among tasks. Such an oversight can hamper effective exploration and adaptation, especially with limited samples. In this work, we harness the structured heterogeneity among tasks via clustering to improve meta-RL, which facilitates knowledge sharing at the cluster level. To facilitate exploration, we also develop a dedicated cluster-level exploratory policy to discover task clusters via divide-and-conquer. The knowledge from the discovered clusters helps to narrow the search space of task-specific policy learning, leading to more sample-efficient policy adaptation. We evaluate the proposed method on environments with parametric clusters (e.g., rewards and state dynamics in the MuJoCo suite) and non-parametric clusters (e.g., control skills in the Meta-World suite). The results demonstrate strong advantages of our solution against a set of representative meta-RL methods. 
\ No newline at end of file diff --git a/data/2024/aaai/MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components b/data/2024/aaai/MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components new file mode 100644 index 0000000000..dcbc9ce15e --- /dev/null +++ b/data/2024/aaai/MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components @@ -0,0 +1 @@ +Meta-Reinforcement Learning (Meta-RL) aims to reveal shared characteristics in dynamics and reward functions across diverse training tasks. This objective is achieved by meta-learning a policy that is conditioned on task representations with encoded trajectory data or context, thus allowing rapid adaptation to new tasks from a known task distribution. However, since the trajectory data generated by the policy may be biased, the task inference module tends to form spurious correlations between trajectory data and specific tasks, thereby leading to poor adaptation to new tasks. To address this issue, we propose Meta-RL with task unCertAinty feedback through decoupled context-aware Reward and Dynamics components (MetaCARD). MetaCARD distinctly decouples the dynamics and rewards when inferring tasks and integrates task uncertainty feedback from policy evaluation into the task inference module. This design effectively reduces uncertainty in tasks with changes in dynamics and/or reward functions, thereby enabling accurate task identification and adaptation. The experimental results on both Meta-World and classical MuJoCo benchmarks show that MetaCARD significantly outperforms prevailing Meta-RL baselines, demonstrating its remarkable adaptation ability in sophisticated environments that involve changes in both reward functions and dynamics. \ No newline at end of file diff --git a/data/2024/aaai/MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning b/data/2024/aaai/MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning new file mode 100644 index 0000000000..eff4d882bf --- /dev/null +++ b/data/2024/aaai/MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning @@ -0,0 +1 @@ +Equipping a deep model with the ability of few-shot learning (FSL) is a core challenge for artificial intelligence. Gradient-based meta-learning effectively addresses the challenge by learning how to learn novel tasks. Its key idea is learning a deep model in a bi-level optimization manner, where the outer-loop process learns a shared gradient descent algorithm (called the meta-optimizer), while the inner-loop process leverages it to optimize a task-specific base learner with few examples. Although these methods have shown superior performance on FSL, the outer-loop process requires calculating second-order derivatives along the inner-loop path, which imposes considerable memory burdens and the risk of vanishing gradients. This degrades meta-learning performance. Inspired by recent diffusion models, we find that the inner-loop gradient descent process can be viewed as a reverse process (i.e., denoising) of diffusion where the target of denoising is the weight of the base learner rather than the original data.
Based on this fact, we propose to model the gradient descent algorithm as a diffusion model and then present a novel conditional diffusion-based meta-learning, called MetaDiff, that effectively models the optimization process of base learner weights from Gaussian initialization to target weights in a denoising manner. Thanks to the training efficiency of diffusion models, our MetaDiff does not need to differentiate through the inner-loop path, such that the memory burdens and the risk of vanishing gradients can be effectively alleviated for improving FSL. Experimental results show that our MetaDiff outperforms state-of-the-art gradient-based meta-learning methods on FSL tasks. \ No newline at end of file diff --git a/data/2024/aaai/MetaMix: Meta-State Precision Searcher for Mixed-Precision Activation Quantization b/data/2024/aaai/MetaMix: Meta-State Precision Searcher for Mixed-Precision Activation Quantization new file mode 100644 index 0000000000..1b93d61835 --- /dev/null +++ b/data/2024/aaai/MetaMix: Meta-State Precision Searcher for Mixed-Precision Activation Quantization @@ -0,0 +1 @@ +Mixed-precision quantization of efficient networks often suffers from activation instability encountered in the exploration of bit selections. To address this problem, we propose a novel method called MetaMix, which consists of bit selection and weight training phases. The bit selection phase iterates two steps, (1) the mixed-precision-aware weight update, and (2) the bit-search training with the fixed mixed-precision-aware weights, both of which combined reduce activation instability in mixed-precision quantization and contribute to fast and high-quality bit selection. The weight training phase exploits the weights and step sizes trained in the bit selection phase and fine-tunes them, thereby offering fast training. Our experiments with efficient and hard-to-quantize networks, i.e., MobileNet v2 and v3, and ResNet-18 on ImageNet show that our proposed method pushes the boundary of mixed-precision quantization, in terms of accuracy vs. operations, by outperforming both mixed- and single-precision SOTA methods. \ No newline at end of file diff --git a/data/2024/aaai/MetaRLEC: Meta-Reinforcement Learning for Discovery of Brain Effective Connectivity b/data/2024/aaai/MetaRLEC: Meta-Reinforcement Learning for Discovery of Brain Effective Connectivity new file mode 100644 index 0000000000..61bf46d0e3 --- /dev/null +++ b/data/2024/aaai/MetaRLEC: Meta-Reinforcement Learning for Discovery of Brain Effective Connectivity @@ -0,0 +1 @@ +In recent years, the discovery of brain effective connectivity (EC) networks through computational analysis of functional magnetic resonance imaging (fMRI) data has gained prominence in neuroscience and neuroimaging. However, owing to the influence of diverse factors during data collection and processing, fMRI data typically exhibits high noise and limited sample characteristics, consequently leading to suboptimal performance of current methods. In this paper, we propose a novel brain effective connectivity discovery method based on meta-reinforcement learning, called MetaRLEC. The method mainly consists of three modules: actor, critic, and meta-critic. MetaRLEC first employs an encoder-decoder framework: the encoder, utilizing a Transformer, converts noisy fMRI data into a state embedding; the decoder, employing a bidirectional LSTM, discovers brain region dependencies from the state and generates actions (EC networks).
Then a critic network evaluates these actions, incentivizing the actor to learn higher-reward actions amidst the high-noise setting. Finally, a meta-critic framework facilitates online learning of historical state-action pairs, integrating an action-value neural network and supplementary training losses to enhance the model's adaptability to small-sample fMRI data. We conduct comprehensive experiments on both simulated and real-world data to demonstrate the efficacy of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation b/data/2024/aaai/Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation new file mode 100644 index 0000000000..88f810f176 --- /dev/null +++ b/data/2024/aaai/Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation @@ -0,0 +1 @@ +Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. However, existing works primarily focus on achieving precise lip synchronization while neglecting to model the subject-specific speaking style, often resulting in unrealistic facial animations. To the best of our knowledge, this work makes the first attempt to explore the coupled information between the speaking style and the semantic content in facial motions. Specifically, we introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding and leads to a more realistic synthesis of speech-driven facial animations. Subsequently, we propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions by building two latent spaces for style and content, respectively. Moreover, to facilitate disentangled representation learning, we introduce four well-designed constraints: an auxiliary style classifier, an auxiliary inverse classifier, a content contrastive loss, and a pair of latent cycle losses, which can effectively contribute to the construction of the identity-related style space and semantic-related content space. Extensive qualitative and quantitative experiments conducted on three publicly available datasets demonstrate that our approach outperforms state-of-the-art methods and is capable of capturing diverse speaking styles for speech-driven 3D facial animation. The source code and supplementary video are publicly available at: https://zeqing-wang.github.io/Mimic/ \ No newline at end of file diff --git a/data/2024/aaai/Mimicking the Maestro: Exploring the Efficacy of a Virtual AI Teacher in Fine Motor Skill Acquisition b/data/2024/aaai/Mimicking the Maestro: Exploring the Efficacy of a Virtual AI Teacher in Fine Motor Skill Acquisition new file mode 100644 index 0000000000..0d1ea94b72 --- /dev/null +++ b/data/2024/aaai/Mimicking the Maestro: Exploring the Efficacy of a Virtual AI Teacher in Fine Motor Skill Acquisition @@ -0,0 +1 @@ +Motor skills, especially fine motor skills like handwriting, play an essential role in academic pursuits and everyday life. Traditional methods to teach these skills, although effective, can be time-consuming and inconsistent. With the rise of advanced technologies like robotics and artificial intelligence, there is increasing interest in automating such teaching processes. 
In this study, we examine the potential of a virtual AI teacher in emulating the techniques of human educators for motor skill acquisition. We introduce an AI teacher model that captures the distinct characteristics of human instructors. Using a reinforcement learning environment tailored to mimic teacher-learner interactions, we tested our AI model against four guiding hypotheses, emphasizing improved learner performance, enhanced rate of skill acquisition, and reduced variability in learning outcomes. Our findings, validated on synthetic learners, revealed significant improvements across all tested hypotheses. Notably, our model showcased robustness across different learners and settings and demonstrated adaptability to handwriting. This research underscores the potential of integrating Imitation and Reinforcement Learning models with robotics in revolutionizing the teaching of critical motor skills. \ No newline at end of file diff --git a/data/2024/aaai/MindMap: Constructing Evidence Chains for Multi-Step Reasoning in Large Language Models b/data/2024/aaai/MindMap: Constructing Evidence Chains for Multi-Step Reasoning in Large Language Models new file mode 100644 index 0000000000..1cc0100ad7 --- /dev/null +++ b/data/2024/aaai/MindMap: Constructing Evidence Chains for Multi-Step Reasoning in Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, they still face significant challenges in automated reasoning, particularly in scenarios involving multi-step reasoning. In this paper, we focus on the logical reasoning problem. The main task is to answer a question based on a set of available facts and rules. Much prior work has focused on guiding LLMs to think logically by generating reasoning paths, ignoring the structure among the available facts. In this paper, we propose a simple approach, MindMap, which supports reasoning by introducing evidence chains. An evidence chain refers to a set of facts that involve the same subject. In this way, we can organize related facts together to avoid missing important information. MindMap can be integrated with existing reasoning frameworks, such as Chain-of-Thought (CoT) and Selection-Inference (SI), by letting the model select relevant evidence chains instead of independent facts. The experimental results on the bAbI and ProofWriterOWA datasets demonstrate the effectiveness of MindMap. It can significantly improve CoT and SI, especially in multi-step reasoning tasks. \ No newline at end of file diff --git a/data/2024/aaai/MineObserver 2.0: A Deep Learning & In-Game Framework for Assessing Natural Language Descriptions of Minecraft Imagery b/data/2024/aaai/MineObserver 2.0: A Deep Learning & In-Game Framework for Assessing Natural Language Descriptions of Minecraft Imagery new file mode 100644 index 0000000000..d64e0396fb --- /dev/null +++ b/data/2024/aaai/MineObserver 2.0: A Deep Learning & In-Game Framework for Assessing Natural Language Descriptions of Minecraft Imagery @@ -0,0 +1 @@ +MineObserver 2.0 is an AI framework that uses Computer Vision and Natural Language Processing for assessing the accuracy of learner-generated descriptions of Minecraft images that include some scientifically relevant content. The system automatically assesses the accuracy of participant observations, written in natural language, made during science learning activities that take place in Minecraft.
We demonstrate our system working in real time and describe a teacher dashboard to showcase observations, both of which advance our previous work. We present the results of a study showing that MineObserver 2.0 improves over its predecessor both in the perceived accuracy of the system's generated descriptions and in the usefulness of the system's feedback. In future work, we intend to improve system-generated descriptions to give teachers more control and to shift the system toward continuous learning so that it can respond more rapidly to novel observations made by learners. \ No newline at end of file diff --git a/data/2024/aaai/Minibatch Stochastic Three Points Method for Unconstrained Smooth Minimization b/data/2024/aaai/Minibatch Stochastic Three Points Method for Unconstrained Smooth Minimization new file mode 100644 index 0000000000..4fcdf84d1c --- /dev/null +++ b/data/2024/aaai/Minibatch Stochastic Three Points Method for Unconstrained Smooth Minimization @@ -0,0 +1 @@ +We present a new zero-order optimization method called Minibatch Stochastic Three Points (MiSTP), specifically designed to solve stochastic unconstrained minimization problems when only an approximate evaluation of the objective function is possible. MiSTP is an extension of the Stochastic Three Point Method (STP). The key innovation of MiSTP is that it selects the next point solely based on the objective function approximation, without relying on its exact evaluation. At each iteration, MiSTP generates a random search direction and compares the approximations of the objective function at the current point, at the point along the randomly generated direction, and at the point along its opposite. The best of these three points is chosen as the next iterate. We analyze the worst-case complexity of MiSTP in the convex and non-convex cases and demonstrate that it matches the most accurate complexity bounds known in the literature for zero-order optimization methods. We perform extensive numerical evaluations to assess the computational efficiency of MiSTP and compare its performance to other state-of-the-art methods by testing it on several machine learning tasks. The results show that MiSTP outperforms or matches state-of-the-art methods, indicating its potential for a wide range of practical applications. \ No newline at end of file diff --git a/data/2024/aaai/Minimal Macro-Based Rewritings of Formal Languages: Theory and Applications in Ontology Engineering (and Beyond) b/data/2024/aaai/Minimal Macro-Based Rewritings of Formal Languages: Theory and Applications in Ontology Engineering (and Beyond) new file mode 100644 index 0000000000..4dbd332a24 --- /dev/null +++ b/data/2024/aaai/Minimal Macro-Based Rewritings of Formal Languages: Theory and Applications in Ontology Engineering (and Beyond) @@ -0,0 +1 @@ +In this paper, we introduce the problem of rewriting finite formal languages using syntactic macros such that the rewriting is minimal in size. We present polynomial-time algorithms to solve variants of this problem and show their correctness. To demonstrate the practical relevance of the proposed problems and the feasibility and effectiveness of our algorithms in practice, we apply these to biomedical ontologies authored in OWL. We find that such rewritings can significantly reduce the size of ontologies by capturing repeated expressions with macros.
This approach not only offers valuable assistance in enhancing ontology quality and comprehension but can also be seen as a general methodology for evaluating features of rewriting systems (including syntactic macros, templates, or other forms of rewriting rules), which can be analyzed in terms of their influence on computational problems. \ No newline at end of file diff --git a/data/2024/aaai/Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents b/data/2024/aaai/Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents new file mode 100644 index 0000000000..c30da6d043 --- /dev/null +++ b/data/2024/aaai/Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents @@ -0,0 +1 @@ +Robustly cooperating with unseen agents and human partners presents significant challenges due to the diverse cooperative conventions these partners may adopt. Existing Ad Hoc Teamwork (AHT) methods address this challenge by training an agent with a population of diverse teammate policies obtained through maximizing specific diversity metrics. However, prior heuristic-based diversity metrics do not always maximize the agent's robustness in all cooperative problems. In this work, we first propose that maximizing an AHT agent's robustness requires it to emulate policies in the minimum coverage set (MCS), the set of best-response policies to any partner policies in the environment. We then introduce the L-BRDiv algorithm that generates a set of teammate policies that, when used for AHT training, encourage agents to emulate policies from the MCS. L-BRDiv works by solving a constrained optimization problem to jointly train teammate policies for AHT training and approximating AHT agent policies that are members of the MCS. We empirically demonstrate that L-BRDiv produces more robust AHT agents than state-of-the-art methods in a broader range of two-player cooperative problems without the need for extensive hyperparameter tuning for its objectives. Our study shows that L-BRDiv outperforms the baseline methods by prioritizing discovering distinct members of the MCS instead of repeatedly finding redundant policies. \ No newline at end of file diff --git a/data/2024/aaai/Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training b/data/2024/aaai/Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training new file mode 100644 index 0000000000..e44c0f29e7 --- /dev/null +++ b/data/2024/aaai/Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training @@ -0,0 +1 @@ +Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis on the CLIP latent space which leads to two findings. Firstly, we observe that the CLIP's visual feature of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired image-text can be empirically modeled as a zero-mean Gaussian distribution. 
Motivated by the findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap. \ No newline at end of file diff --git a/data/2024/aaai/Mining Gaze for Contrastive Learning toward Computer-Assisted Diagnosis b/data/2024/aaai/Mining Gaze for Contrastive Learning toward Computer-Assisted Diagnosis new file mode 100644 index 0000000000..f00308af1f --- /dev/null +++ b/data/2024/aaai/Mining Gaze for Contrastive Learning toward Computer-Assisted Diagnosis @@ -0,0 +1 @@ +Obtaining large-scale radiology reports can be difficult for medical images due to ethical concerns, limiting the effectiveness of contrastive pre-training in the medical image domain and underscoring the need for alternative methods. In this paper, we propose eye-tracking as an alternative to text reports, as it allows for the passive collection of gaze signals without ethical issues. By tracking the gaze of radiologists as they read and diagnose medical images, we can understand their visual attention and clinical reasoning. When a radiologist has similar gazes for two medical images, it may indicate semantic similarity for diagnosis, and these images should be treated as positive pairs when pre-training a computer-assisted diagnosis (CAD) network through contrastive learning. Accordingly, we introduce the Medical contrastive Gaze Image Pre-training (McGIP) as a plug-and-play module for contrastive learning frameworks. McGIP uses radiologist gaze to guide contrastive pre-training. We evaluate our method using two representative types of medical images and two common types of gaze data. The experimental results demonstrate the practicality of McGIP, indicating its high potential for various clinical scenarios and applications. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating Idiom Inconsistency: A Multi-Semantic Contrastive Learning Method for Chinese Idiom Reading Comprehension b/data/2024/aaai/Mitigating Idiom Inconsistency: A Multi-Semantic Contrastive Learning Method for Chinese Idiom Reading Comprehension new file mode 100644 index 0000000000..4a1fdda2ad --- /dev/null +++ b/data/2024/aaai/Mitigating Idiom Inconsistency: A Multi-Semantic Contrastive Learning Method for Chinese Idiom Reading Comprehension @@ -0,0 +1 @@ +Chinese idioms pose a significant challenge for machine reading comprehension due to their metaphorical meanings often diverging from their literal counterparts, leading to metaphorical inconsistency. Furthermore, the same idiom can have different meanings in different contexts, resulting in contextual inconsistency. Although deep learning-based methods have achieved some success in idioms reading comprehension, existing approaches still struggle to accurately capture idiom representations due to metaphorical inconsistency and contextual inconsistency of idioms. 
To address these challenges, we propose a novel model, Multi-Semantic Contrastive Learning Method (MSCLM), which simultaneously addresses metaphorical inconsistency and contextual inconsistency of idioms. To mitigate metaphorical inconsistency, we propose a metaphor contrastive learning module based on the prompt method, bridging the semantic gap between literal and metaphorical meanings of idioms. To mitigate contextual inconsistency, we propose a multi-semantic cross-attention module to explore semantic features between different metaphors of the same idiom in various contexts. Our model has been compared with multiple state-of-the-art models (including GPT-3.5) on multiple Chinese idiom reading comprehension datasets, and the experimental results demonstrate that MSCLM outperforms state-of-the-art models. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating Label Bias in Machine Learning: Fairness through Confident Learning b/data/2024/aaai/Mitigating Label Bias in Machine Learning: Fairness through Confident Learning new file mode 100644 index 0000000000..20bd59d581 --- /dev/null +++ b/data/2024/aaai/Mitigating Label Bias in Machine Learning: Fairness through Confident Learning @@ -0,0 +1 @@ +Discrimination can occur when the underlying unbiased labels are overwritten by an agent with potential bias, resulting in biased datasets that unfairly harm specific groups and cause classifiers to inherit these biases. In this paper, we demonstrate that despite only having access to the biased labels, it is possible to eliminate bias by filtering the fairest instances within the framework of confident learning. In the context of confident learning, low self-confidence usually indicates potential label errors; however, this is not always the case. Instances, particularly those from underrepresented groups, might exhibit low confidence scores for reasons other than labeling errors. To address this limitation, our approach employs truncation of the confidence score and extends the confidence interval of the probabilistic threshold. Additionally, we incorporate the co-teaching paradigm to provide a more robust and reliable selection of fair instances and to effectively mitigate the adverse effects of biased labels. Through extensive experimentation and evaluation on various datasets, we demonstrate the efficacy of our approach in promoting fairness and reducing the impact of label bias in machine learning models. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating Label Noise through Data Ambiguation b/data/2024/aaai/Mitigating Label Noise through Data Ambiguation new file mode 100644 index 0000000000..a4396bf735 --- /dev/null +++ b/data/2024/aaai/Mitigating Label Noise through Data Ambiguation @@ -0,0 +1 @@ +Label noise poses an important challenge in machine learning, especially in deep learning, in which large models with high expressive power dominate the field. Models of that kind are prone to memorizing incorrect labels, thereby harming generalization performance. Many methods have been proposed to address this problem, including robust loss functions and more complex label correction approaches. Robust loss functions are appealing due to their simplicity, but typically lack flexibility, while label correction usually adds substantial complexity to the training setup.
In this paper, we suggest addressing the shortcomings of both methodologies by "ambiguating" the target information, adding additional, complementary candidate labels in case the learner is not sufficiently convinced of the observed training label. More precisely, we leverage the framework of so-called superset learning to construct set-valued targets based on a confidence threshold, which deliver imprecise yet more reliable beliefs about the ground truth, effectively helping the learner to suppress the memorization effect. In an extensive empirical evaluation, our method demonstrates favorable learning behavior on synthetic and real-world noise, confirming its effectiveness in detecting and correcting erroneous training labels. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting b/data/2024/aaai/Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting new file mode 100644 index 0000000000..519fb3f5d8 --- /dev/null +++ b/data/2024/aaai/Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting @@ -0,0 +1 @@ +Incorporating factual knowledge from knowledge graphs is regarded as a promising approach for mitigating the hallucination of large language models (LLMs). Existing methods usually only use the user's input to query the knowledge graph, thus failing to address the factual hallucination generated by LLMs during their reasoning process. To address this problem, this paper proposes Knowledge Graph-based Retrofitting (KGR), a new framework that incorporates LLMs with KGs to mitigate factual hallucination during the reasoning process by retrofitting the initial draft responses of LLMs based on the factual knowledge stored in KGs. Specifically, KGR leverages LLMs to extract, select, validate, and retrofit factual statements within the model-generated responses, which enables an autonomous knowledge verifying and refining procedure without any additional manual effort. Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks, especially when complex reasoning processes are involved, which demonstrates the necessity and effectiveness of KGR in mitigating hallucination and enhancing the reliability of LLMs. \ No newline at end of file diff --git a/data/2024/aaai/Mitigating the Impact of False Negative in Dense Retrieval with Contrastive Confidence Regularization b/data/2024/aaai/Mitigating the Impact of False Negative in Dense Retrieval with Contrastive Confidence Regularization new file mode 100644 index 0000000000..0f71831e89 --- /dev/null +++ b/data/2024/aaai/Mitigating the Impact of False Negative in Dense Retrieval with Contrastive Confidence Regularization @@ -0,0 +1 @@ +In open-domain Question Answering (QA), dense text retrieval is crucial for finding relevant passages to generate answers. Typically, contrastive learning is used to train a retrieval model, which maps passages and queries to the same semantic space, making similar ones closer and dissimilar ones further apart. However, training such a system is challenging due to the false negative problem, where relevant passages may be missed during data annotation. Hard negative sampling, commonly used to improve contrastive learning, can introduce more noise in training. This is because hard negatives are those close to a given query, and thus more likely to be false negatives.
To address this, we propose a novel contrastive confidence regularizer for the Noise Contrastive Estimation (NCE) loss, a commonly used contrastive loss. Our analysis shows that the regularizer helps make the dense retrieval model more robust against false negatives with a theoretical guarantee. Additionally, we propose a model-agnostic method to filter out noisy negative passages in the dataset, improving any downstream dense retrieval models. Through experiments on three datasets, we demonstrate that our method achieves better retrieval performance in comparison to existing state-of-the-art dense retrieval systems. \ No newline at end of file diff --git a/data/2024/aaai/Mixed Geometry Message and Trainable Convolutional Attention Network for Knowledge Graph Completion b/data/2024/aaai/Mixed Geometry Message and Trainable Convolutional Attention Network for Knowledge Graph Completion new file mode 100644 index 0000000000..e23e127cc0 --- /dev/null +++ b/data/2024/aaai/Mixed Geometry Message and Trainable Convolutional Attention Network for Knowledge Graph Completion @@ -0,0 +1 @@ +Knowledge graph completion (KGC) aims to study the embedding representation to solve the incompleteness of knowledge graphs (KGs). Recently, graph convolutional networks (GCNs) and graph attention networks (GATs) have been widely used in KGC tasks by capturing neighbor information of entities. However, both GCN-based and GAT-based KGC models have their limitations, and choosing the better of the two requires analyzing the neighbors of each entity (pre-validating), a process that is prohibitively expensive. Furthermore, the representation quality of the embeddings can affect the aggregation of neighbor information (message passing). To address the above limitations, we propose a novel knowledge graph completion model with mixed geometry message and trainable convolutional attention network, named MGTCA. Concretely, the mixed geometry message function generates rich neighbor messages by jointly integrating spatial information from the hyperbolic space, hypersphere space and Euclidean space. To complete the autonomous switching of graph neural networks (GNNs) and eliminate the necessity of pre-validating the local structure of KGs, a trainable convolutional attention network is proposed, which combines three types of GNNs in one trainable formulation. Furthermore, a mixed geometry scoring function is proposed, which calculates scores of triples by a novel prediction function and a similarity function based on different geometric spaces. Extensive experiments on three standard datasets confirm the effectiveness of our innovations, and the performance of MGTCA is significantly improved compared to the state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Mixed-Effects Contextual Bandits b/data/2024/aaai/Mixed-Effects Contextual Bandits new file mode 100644 index 0000000000..1e0e24a4e5 --- /dev/null +++ b/data/2024/aaai/Mixed-Effects Contextual Bandits @@ -0,0 +1 @@ +We study a novel variant of a contextual bandit problem with multi-dimensional reward feedback formulated as a mixed-effects model, where the correlations between multiple feedback signals are induced by sharing stochastic coefficients called random effects. We propose a novel algorithm, Mixed-Effects Contextual UCB (ME-CUCB), achieving an $\tilde{O}(d\sqrt{mT})$ regret bound after T rounds, where d is the dimension of contexts and m is the dimension of outcomes, with either known or unknown covariance structure.
This is a tighter regret bound than that of the naive canonical linear bandit algorithm that ignores the correlations among rewards. We prove a lower bound of $\Omega(d\sqrt{mT})$ matching the upper bound up to logarithmic factors. To our knowledge, this is the first work providing a regret analysis for mixed-effects models and algorithms involving weighted least-squares estimators. Our theoretical analysis faces a significant technical challenge in that the error terms do not constitute martingales since the weights depend on the rewards. We overcome this challenge by using covering numbers, which is of theoretical interest in its own right. We provide numerical experiments demonstrating the advantage of our proposed algorithm, supporting the theoretical claims. \ No newline at end of file diff --git a/data/2024/aaai/Mixup-Induced Domain Extrapolation for Domain Generalization b/data/2024/aaai/Mixup-Induced Domain Extrapolation for Domain Generalization new file mode 100644 index 0000000000..6576cd6833 --- /dev/null +++ b/data/2024/aaai/Mixup-Induced Domain Extrapolation for Domain Generalization @@ -0,0 +1 @@ +Domain generalization aims to learn a well-performing classifier on multiple source domains for unseen target domains under domain shift. Domain-invariant representation (DIR) is an intuitive approach and has attracted great interest. In practice, since the targets are varied and unknown, a few sources are not sufficient to reflect the entire domain population, leading to biased DIR. Derived from the PAC-Bayes framework, we provide a novel generalization bound involving the number of domains sampled from the environment (N) and the radius of the Wasserstein ball centred on the target (r), which have rarely been considered before. Herein, we can obtain two natural and significant findings: when N increases, 1) the gap between the source and target sampling environments can be gradually mitigated; 2) the target can be better approximated within the Wasserstein ball. These findings prompt us to collect adequate domains against domain shift. To this end, we design a novel yet simple Extrapolation Domain strategy induced by the Mixup scheme, namely EDM. Through a reverse Mixup scheme to generate the extrapolated domains, combined with the interpolated domains, we expand the interpolation space spanned by the sources, providing more abundant domains to increase sampling intersections and thereby shorten r. Moreover, EDM is easy to implement and can be used in a plug-and-play manner. In experiments, EDM has been plugged into several methods in both closed and open set settings, achieving up to 5.73% improvement. \ No newline at end of file diff --git a/data/2024/aaai/MobileInst: Video Instance Segmentation on the Mobile b/data/2024/aaai/MobileInst: Video Instance Segmentation on the Mobile new file mode 100644 index 0000000000..58c2292eb0 --- /dev/null +++ b/data/2024/aaai/MobileInst: Video Instance Segmentation on the Mobile @@ -0,0 +1 @@ +Video instance segmentation on mobile devices is an important yet very challenging edge AI problem. It mainly suffers from (1) heavy computation and memory costs for frame-by-frame pixel-level instance perception and (2) complicated heuristics for tracking objects. To address these issues, we present MobileInst, a lightweight and mobile-friendly framework for video instance segmentation on mobile devices.
Firstly, MobileInst adopts a mobile vision transformer to extract multi-level semantic features and presents an efficient query-based dual-transformer instance decoder for mask kernels and a semantic-enhanced mask decoder to generate instance segmentation per frame. Secondly, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects for video instance segmentation. Further, we propose temporal query passing to enhance the tracking ability for kernels. We conduct experiments on COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst and evaluate the inference latency on a single CPU core of the Snapdragon 778G Mobile Platform, without other acceleration methods. On the COCO dataset, MobileInst achieves 31.2 mask AP and 433 ms on the mobile CPU, which reduces the latency by 50% compared to the previous SOTA. For video instance segmentation, MobileInst achieves 35.0 AP and 30.1 AP on YouTube-VIS 2019 & 2021. \ No newline at end of file diff --git a/data/2024/aaai/ModWaveMLP: MLP-Based Mode Decomposition and Wavelet Denoising Model to Defeat Complex Structures in Traffic Forecasting b/data/2024/aaai/ModWaveMLP: MLP-Based Mode Decomposition and Wavelet Denoising Model to Defeat Complex Structures in Traffic Forecasting new file mode 100644 index 0000000000..b8bf78c54d --- /dev/null +++ b/data/2024/aaai/ModWaveMLP: MLP-Based Mode Decomposition and Wavelet Denoising Model to Defeat Complex Structures in Traffic Forecasting @@ -0,0 +1 @@ +Traffic prediction is the core issue of Intelligent Transportation Systems. Recently, researchers have tended to use complex structures, such as transformer-based structures, for tasks such as traffic prediction. Notably, traffic data is simpler to process compared to text and images, which raises questions about the necessity of these structures. Additionally, when handling traffic data, researchers tend to manually design the model structure based on the data features, which makes the structure of traffic prediction redundant and the model generalizability limited. To address the above, we introduce ‘ModWaveMLP’, a multilayer perceptron (MLP) based model designed according to mode decomposition and wavelet noise reduction information learning concepts. The model is based on a simple MLP structure, which achieves the separation and prediction of different traffic modes and does not depend on additionally introduced features such as the topology of the traffic network. By performing experiments on the real-world datasets METR-LA and PEMS-BAY, our model achieves SOTA, outperforms GNN and transformer-based models, and outperforms those that introduce additional feature data while offering better generalizability, and we further demonstrate the effectiveness of the various parts of the model through ablation experiments. This offers new insights to subsequent researchers involved in traffic model design. The code is available at: https://github.com/Kqingzheng/ModWaveMLP. \ No newline at end of file diff --git a/data/2024/aaai/Model AI Assignments 2024 b/data/2024/aaai/Model AI Assignments 2024 new file mode 100644 index 0000000000..6b19d0034e --- /dev/null +++ b/data/2024/aaai/Model AI Assignments 2024 @@ -0,0 +1 @@ +The Model AI Assignments session seeks to gather and disseminate the best assignment designs of the Artificial Intelligence (AI) Education community.
Recognizing that assignments form the core of student learning experience, we here present abstracts of five AI assignments from the 2024 session that are easily adoptable, playfully engaging, and flexible for a variety of instructor needs. Assignment specifications and supporting resources may be found at http://modelai.gettysburg.edu. \ No newline at end of file diff --git a/data/2024/aaai/Model Counting and Sampling via Semiring Extensions b/data/2024/aaai/Model Counting and Sampling via Semiring Extensions new file mode 100644 index 0000000000..74c2310906 --- /dev/null +++ b/data/2024/aaai/Model Counting and Sampling via Semiring Extensions @@ -0,0 +1 @@ +Many decision and optimization problems have natural extensions as counting problems. The best known example is the Boolean satisfiability problem (SAT), where we want to count the satisfying assignments of truth values to the variables, which is known as the #SAT problem. Likewise, for discrete optimization problems, we want to count the states on which the objective function attains the optimal value. Both SAT and discrete optimization can be formulated as selective marginalize a product function (MPF) queries. Here, we show how general selective MPF queries can be extended for model counting. MPF queries are encoded as tensor hypernetworks over suitable semirings that can be solved by generic tensor hypernetwork contraction algorithms. Our model counting extension is again an MPF query, on an extended semiring, that can be solved by the same contraction algorithms. Model counting is required for uniform model sampling. We show how the counting extension can be further extended for model sampling by constructing yet another semiring. We have implemented the model counting and sampling extensions. Experiments show that our generic approach is competitive with the state of the art in model counting and model sampling. \ No newline at end of file diff --git a/data/2024/aaai/Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning b/data/2024/aaai/Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning new file mode 100644 index 0000000000..65c584939d --- /dev/null +++ b/data/2024/aaai/Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning @@ -0,0 +1 @@ +In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance task-specific models and can even learn general task-agnostic representations for efficient finetuning to downstream tasks. However, deep learning in resource-limited domains still faces multiple challenges including (i) limited data, (ii) constrained model development cost, and (iii) lack of adequate pre-trained models for effective finetuning. This paper provides an overview of model reprogramming to bridge this gap. Model reprogramming enables resource-efficient cross-domain machine learning by repurposing and reusing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning, where the source and target domains can be vastly different. In many applications, model reprogramming outperforms transfer learning and training from scratch. This paper elucidates the methodology of model reprogramming, summarizes existing use cases, provides a theoretical explanation of the success of model reprogramming, and concludes with a discussion on open-ended research questions and opportunities.
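The reprogramming recipe summarized above is often realized as a trainable transformation wrapped around a frozen source model plus an output label mapping. The sketch below is a rough illustration under that assumption; the padding scheme, sizes, and class names are ours, not code from the paper or a specific library.

```python
# Minimal sketch of input-level model reprogramming around a frozen source classifier.
# Assumptions: the target input (e.g., 3x64x64) is embedded inside a larger source-sized
# canvas, a trainable additive "program" perturbs the input, and a linear layer maps
# source labels to target labels. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reprogrammer(nn.Module):
    def __init__(self, frozen_model, target_size=64, source_size=224,
                 source_classes=1000, target_classes=10):
        super().__init__()
        self.frozen_model = frozen_model.eval()
        for p in self.frozen_model.parameters():
            p.requires_grad_(False)            # the source model is never finetuned
        self.pad = (source_size - target_size) // 2
        # Trainable input "program": an additive perturbation on the padded canvas.
        self.delta = nn.Parameter(torch.zeros(1, 3, source_size, source_size))
        # Many-to-one output mapping from source labels to target labels.
        self.label_map = nn.Linear(source_classes, target_classes, bias=False)

    def forward(self, x_target):               # x_target: (B, 3, 64, 64)
        x = F.pad(x_target, [self.pad] * 4)    # place the target input in the center
        source_logits = self.frozen_model(x + self.delta)
        return self.label_map(source_logits)   # repurposed target-domain logits
```

Only `delta` and `label_map` are trained, which is what makes this style of approach attractive when target-domain data and compute are limited.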
\ No newline at end of file diff --git a/data/2024/aaai/Modeling Adaptive Inter-Task Feature Interactions via Sentiment-Aware Contrastive Learning for Joint Aspect-Sentiment Prediction b/data/2024/aaai/Modeling Adaptive Inter-Task Feature Interactions via Sentiment-Aware Contrastive Learning for Joint Aspect-Sentiment Prediction new file mode 100644 index 0000000000..e65ff3da7b --- /dev/null +++ b/data/2024/aaai/Modeling Adaptive Inter-Task Feature Interactions via Sentiment-Aware Contrastive Learning for Joint Aspect-Sentiment Prediction @@ -0,0 +1 @@ +Aspect prediction (AP) and sentiment prediction (SP) are representative applications in fine-grained sentiment anal- ysis. They can be considered as sequential tasks, where AP identifies mentioned aspects in a sentence, and SP infers fine-grained sentiments for these aspects. Recent models perform the aspect-sentiment prediction in a joint man-ner, but heavily rely on the feature interactions of aspect and sentiment. One drawback is that they ignore correlation strength varies between aspect features and sentiment fea- tures across different sentences, and employ a fixed feature interaction strategy may limit effective knowledge transfer across tasks. To tackle this issue, in this paper, we propose an Adaptive Inter-task Feature Interaction framework, AIFI, for joint aspect-sentiment prediction. Specifically, we introduce a novel contrast-based alignment method based on contrastive learning. Our approach considers the AP-specific and SP-specific representations of a given sentence as a positive pair, while representation of another random sentence serves as a negative example. Moreover, we propose an inter-task feature correlation network to predict the contrast strength, which is determined by the temperature coefficient in the InfoNCE loss. This dynamic correlation adjustment enhances model’s ability to capture proper feature interactions more efficiently. Experimental results on three datasets validate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Modeling Knowledge Graphs with Composite Reasoning b/data/2024/aaai/Modeling Knowledge Graphs with Composite Reasoning new file mode 100644 index 0000000000..c39231331e --- /dev/null +++ b/data/2024/aaai/Modeling Knowledge Graphs with Composite Reasoning @@ -0,0 +1,3 @@ +The ability to combine multiple pieces of existing knowledge to infer new knowledge is both crucial and challenging. In this paper, we explore how facts of various entities are combined in the context of knowledge graph completion (KGC). We use composite reasoning to unify the views from different KGC models, including translational models, tensor factorization (TF)-based models, instance-based learning models, and KGC regularizers. + +Moreover, our comprehensive examination of composite reasoning revealed an unexpected phenomenon: certain TF-based models learn embeddings with erroneous composite reasoning, which ultimately violates their fundamental collaborative filtering assumption and reduces their effects. This motivates us to reduce their composition error. Empirical evaluations demonstrate that mitigating the composition risk not only enhances the performance of TF-based models across all tested settings, but also surpass or is competitive with the state-of-the-art performance on two out of four benchmarks. 
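To make the idea of combining facts concrete, the short numpy sketch below shows two-hop composition under a TransE-style translational model, where holding (h, r1, m) and (m, r2, t) implies that the composed relation r1 + r2 should link h directly to t. This is one common reading of relation composition in KGC and is only meant as an illustration; it is not the paper's exact formulation of composite reasoning.

```python
# Two-hop composition under a TransE-style model: t should be reachable from h
# by the composed relation r1 + r2 if the two individual facts hold. Illustrative only.
import numpy as np

h  = np.array([0.1, 0.2, 0.0, 0.3])    # head entity embedding
r1 = np.array([0.2, -0.1, 0.1, 0.0])   # relation 1
r2 = np.array([0.0, 0.3, -0.2, 0.1])   # relation 2

m = h + r1          # (h, r1, m) holds under TransE when m is approximately h + r1
t = m + r2          # (m, r2, t) holds, so t is approximately h + (r1 + r2)

# A model whose learned embeddings violate this kind of consistency exhibits what
# the abstract calls composition error (in our reading of the term).
composed_distance = np.linalg.norm((h + (r1 + r2)) - t)
print(composed_distance)   # ~0 here, i.e. the composed relation scores highly
```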
\ No newline at end of file diff --git a/data/2024/aaai/Modeling Stereo-Confidence out of the End-to-End Stereo-Matching Network via Disparity Plane Sweep b/data/2024/aaai/Modeling Stereo-Confidence out of the End-to-End Stereo-Matching Network via Disparity Plane Sweep new file mode 100644 index 0000000000..2b54a3f687 --- /dev/null +++ b/data/2024/aaai/Modeling Stereo-Confidence out of the End-to-End Stereo-Matching Network via Disparity Plane Sweep @@ -0,0 +1,7 @@ +We propose a novel stereo-confidence that can be measured externally to various stereo-matching networks, offering an alternative input modality to the cost volume for learning-based approaches, especially in safety-critical systems. +Grounded in the foundational concepts of disparity definition and the disparity plane sweep, the proposed stereo-confidence method is built upon the idea that any shift applied to a stereo-image pair should translate into a corresponding shift in the disparity map. +Based on this idea, the proposed stereo-confidence method can be summarized in three steps. +1) Using the disparity plane sweep, multiple disparity maps can be obtained and treated as a 3-D volume (predicted disparity volume), in the same way the cost volume is constructed. +2) One of these disparity maps serves as an anchor, allowing us to define a desirable (or ideal) disparity profile at every spatial point. +3) By comparing the desirable and predicted disparity profiles, we can quantify the level of matching ambiguity between the left and right images for confidence measurement. +Extensive experimental results using various stereo-matching networks and datasets demonstrate that the proposed stereo-confidence method not only shows competitive performance on its own but also yields consistent performance improvements when it is used as an input modality for learning-based stereo-confidence methods. \ No newline at end of file diff --git a/data/2024/aaai/Moderate Message Passing Improves Calibration: A Universal Way to Mitigate Confidence Bias in Graph Neural Networks b/data/2024/aaai/Moderate Message Passing Improves Calibration: A Universal Way to Mitigate Confidence Bias in Graph Neural Networks new file mode 100644 index 0000000000..6ba2c53000 --- /dev/null +++ b/data/2024/aaai/Moderate Message Passing Improves Calibration: A Universal Way to Mitigate Confidence Bias in Graph Neural Networks @@ -0,0 +1 @@ +Confidence calibration in Graph Neural Networks (GNNs) aims to align a model's predicted confidence with its actual accuracy. Recent studies have indicated that GNNs exhibit an under-confidence bias, which contrasts with the over-confidence bias commonly observed in deep neural networks. However, our deeper investigation into this topic reveals that not all GNNs exhibit this behavior. Upon closer examination of message passing in GNNs, we found a clear link between message aggregation and confidence levels. Specifically, GNNs with extensive message aggregation, often seen in deep architectures or when leveraging large amounts of labeled data, tend to exhibit overconfidence. This overconfidence can be attributed to factors like over-learning and over-smoothing. Conversely, GNNs with fewer layers, known for their balanced message passing and superior node representation, may exhibit under-confidence. To counter these confidence biases, we introduce the Adaptive Unified Label Smoothing (AU-LS) technique. Our experiments show that AU-LS outperforms existing methods, addressing both over- and under-confidence in various GNN scenarios.
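One way to picture an adaptive label-smoothing remedy for the confidence biases described above is to smooth targets more when a batch looks over-confident and less when it looks under-confident. The specific rule, names, and hyperparameters below are illustrative assumptions, not the AU-LS formulation.

```python
# Confidence-adaptive label smoothing for node classification (illustrative sketch).
import torch
import torch.nn.functional as F

def adaptive_label_smoothing_loss(logits, labels, num_classes, base_eps=0.1):
    probs = F.softmax(logits, dim=-1)
    conf = probs.max(dim=-1).values                        # predicted confidence
    acc = (probs.argmax(dim=-1) == labels).float().mean()  # rough accuracy proxy
    # Over-confident batch (confidence above accuracy): smooth more.
    # Under-confident batch (confidence below accuracy): smooth less.
    eps = float(torch.clamp(base_eps + (conf.mean() - acc), 0.0, 0.5))
    smooth = torch.full_like(probs, eps / (num_classes - 1))
    smooth.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return -(smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Usage with any GNN's output logits, e.g.:
# loss = adaptive_label_smoothing_loss(gnn(x, edge_index), y, num_classes)
```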
\ No newline at end of file diff --git a/data/2024/aaai/MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts b/data/2024/aaai/MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts new file mode 100644 index 0000000000..5bbc2eea99 --- /dev/null +++ b/data/2024/aaai/MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts @@ -0,0 +1 @@ +Deep learning is now widely used in drug discovery, providing significant acceleration and cost reduction. As the most fundamental building block, molecular representation is essential for predicting molecular properties to enable various downstream applications. Most existing methods attempt to incorporate more information to learn better representations. However, not all features are equally important for a specific task. Ignoring this would potentially compromise the training efficiency and predictive accuracy. To address this issue, we propose a novel approach, which treats language models as an agent and molecular pretraining models as a knowledge base. The agent accentuates task-relevant features in the molecular representation by understanding the natural language description of the task, just as a tailor customizes clothes for clients. Thus, we call this approach MolTailor. Evaluations demonstrate MolTailor's superior performance over baselines, validating the efficacy of enhancing relevance for molecular representation learning. This illustrates the potential of language model guided optimization to better exploit and unleash the capabilities of existing powerful molecular representation methods. Our code and appendix are available at https://github.com/SCIR-HI/MolTailor. \ No newline at end of file diff --git a/data/2024/aaai/Molecular Optimization Model with Patentability Constraint b/data/2024/aaai/Molecular Optimization Model with Patentability Constraint new file mode 100644 index 0000000000..27f8c27827 --- /dev/null +++ b/data/2024/aaai/Molecular Optimization Model with Patentability Constraint @@ -0,0 +1,5 @@ +In drug development, molecular optimization is a crucial challenge that involves generating novel molecules given a lead molecule as input. The task requires maintaining molecular similarity to the original molecule while simultaneously optimizing multiple chemical attributes. To aid in this process, numerous generative models have been proposed. +However, in practical applications, it is crucial for these models not only to generate novel molecules with the above constraints but also to generate molecules that significantly differ from any existing patented compounds. +In this work, we present a multi-optimization molecular framework to address this challenge. +Our framework trains a model to prioritize both enhanced properties and substantial dissimilarity from patented compounds. By jointly learning continuous representations of optimized and patentable molecules, we ensure that the generated molecules are significantly distant from any patented compounds while improving chemical properties. +Through empirical evaluation, we demonstrate the superior performance of our approach compared to state-of-the-art molecular optimization methods both in chemical property optimization and patentability. 
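The patentability requirement in the last abstract can be approximated at generation time by a simple dissimilarity filter: reject candidates whose maximum Tanimoto similarity to any patented compound exceeds a threshold. The RDKit-based sketch below shows only this filter; the paper enforces dissimilarity during training, and the threshold and SMILES strings here are illustrative placeholders.

```python
# Filter generated molecules by maximum Tanimoto similarity to a patented set.
# Illustrative sketch; threshold and example SMILES are placeholders.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

patented_fps = [morgan_fp(s) for s in ["CCO", "c1ccccc1O"]]   # toy patent set

def max_patent_similarity(candidate_smiles):
    fp = morgan_fp(candidate_smiles)
    return max(DataStructs.TanimotoSimilarity(fp, p) for p in patented_fps)

def is_patentable(candidate_smiles, threshold=0.4):
    # Keep candidates that stay sufficiently far from every patented compound.
    return max_patent_similarity(candidate_smiles) < threshold

print(is_patentable("CCN"), is_patentable("CCO"))  # second is False: exact patent match
```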
\ No newline at end of file diff --git a/data/2024/aaai/Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-Based Fault Detection and Identification (Abstract Reprint) b/data/2024/aaai/Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-Based Fault Detection and Identification (Abstract Reprint) new file mode 100644 index 0000000000..63b15ebc4c --- /dev/null +++ b/data/2024/aaai/Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-Based Fault Detection and Identification (Abstract Reprint) @@ -0,0 +1 @@ +This paper investigates runtime monitoring of perception systems. Perception is a critical component of high-integrity applications of robotics and autonomous systems, such as self-driving cars. In these applications, failure of perception systems may put human life at risk, and a broad adoption of these technologies requires the development of methodologies to guarantee and monitor safe operation. Despite the paramount importance of perception, currently there is no formal approach for system-level perception monitoring. In this paper, we formalize the problem of runtime fault detection and identification in perception systems and present a framework to model diagnostic information using a diagnostic graph. We then provide a set of deterministic, probabilistic, and learning-based algorithms that use diagnostic graphs to perform fault detection and identification. Moreover, we investigate fundamental limits and provide deterministic and probabilistic guarantees on the fault detection and identification results. We conclude the paper with an extensive experimental evaluation, which recreates several realistic failure modes in the LGSVL open-source autonomous driving simulator, and applies the proposed system monitors to a state-of-the-art autonomous driving software stack (Baidu's Apollo Auto). The results show that the proposed system monitors outperform baselines, have the potential of preventing accidents in realistic autonomous driving scenarios, and incur a negligible computational overhead. \ No newline at end of file diff --git a/data/2024/aaai/Mono3DVG: 3D Visual Grounding in Monocular Images b/data/2024/aaai/Mono3DVG: 3D Visual Grounding in Monocular Images new file mode 100644 index 0000000000..d3c2f5bd39 --- /dev/null +++ b/data/2024/aaai/Mono3DVG: 3D Visual Grounding in Monocular Images @@ -0,0 +1 @@ +We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. Depth predictor is designed to explicitly learn geometry features. The dual text-guided adapter is proposed to refine multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and some insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be released. 
\ No newline at end of file diff --git a/data/2024/aaai/Monocular 3D Hand Mesh Recovery via Dual Noise Estimation b/data/2024/aaai/Monocular 3D Hand Mesh Recovery via Dual Noise Estimation new file mode 100644 index 0000000000..9e28917c4b --- /dev/null +++ b/data/2024/aaai/Monocular 3D Hand Mesh Recovery via Dual Noise Estimation @@ -0,0 +1 @@ +Current parametric models have made notable progress in 3D hand pose and shape estimation. However, due to the fixed hand topology and complex hand poses, it is hard for current models to generate meshes that align well with the image. To tackle this issue, we introduce a dual noise estimation method in this paper. Given a single-view image as input, we first adopt a baseline parametric regressor to obtain coarse hand meshes. We assume the mesh vertices and their image-plane projections are noisy, and can be associated in a unified probabilistic model. We then learn the distributions of noise to refine mesh vertices and their projections. The refined vertices are further utilized to refine camera parameters in a closed-form manner. Consequently, our method obtains well-aligned and high-quality 3D hand meshes. Extensive experiments on the large-scale Interhand2.6M dataset demonstrate that the proposed method not only improves the performance of its baseline by more than 10% but also achieves state-of-the-art performance. Project page: https://github.com/hanhuili/DNE4Hand. \ No newline at end of file diff --git a/data/2024/aaai/Monte Carlo Tree Search in the Presence of Transition Uncertainty b/data/2024/aaai/Monte Carlo Tree Search in the Presence of Transition Uncertainty new file mode 100644 index 0000000000..0ceb7dbbf3 --- /dev/null +++ b/data/2024/aaai/Monte Carlo Tree Search in the Presence of Transition Uncertainty @@ -0,0 +1 @@ +Monte Carlo Tree Search (MCTS) is an immensely popular search-based framework used for decision making. It is traditionally applied to domains where a perfect simulation model of the environment is available. We study and improve MCTS in the context where the environment model is given but imperfect. We show that the discrepancy between the model and the actual environment can lead to significant performance degradation with standard MCTS. We therefore develop Uncertainty Adapted MCTS (UA-MCTS), a more robust algorithm within the MCTS framework. We estimate the transition uncertainty in the given model, and direct the search towards more certain transitions in the state space. We modify all four MCTS phases to improve the search behavior by considering these estimates. We prove, in the corrupted bandit case, that adding uncertainty information to adapt UCB leads to a tighter regret bound than standard UCB. Empirically, we evaluate UA-MCTS and its individual components on the deterministic domains from the MinAtar test suite. Our results demonstrate that UA-MCTS strongly improves MCTS in the presence of model transition errors. \ No newline at end of file diff --git a/data/2024/aaai/Moral Uncertainty and the Problem of Fanaticism b/data/2024/aaai/Moral Uncertainty and the Problem of Fanaticism new file mode 100644 index 0000000000..0d3d76bbe4 --- /dev/null +++ b/data/2024/aaai/Moral Uncertainty and the Problem of Fanaticism @@ -0,0 +1 @@ +While there is universal agreement that agents ought to act ethically, there is no agreement as to what constitutes ethical behaviour.
To address this problem, recent philosophical approaches to `moral uncertainty' propose aggregation of multiple ethical theories to guide agent behaviour. However, one of the foundational proposals for aggregation - Maximising Expected Choiceworthiness (MEC) - has been criticised as being vulnerable to fanaticism; the problem of an ethical theory dominating agent behaviour despite low credence (confidence) in said theory. Fanaticism thus undermines the `democratic' motivation for accommodating multiple ethical perspectives. The problem of fanaticism has not yet been mathematically defined. Representing moral uncertainty as an instance of social welfare aggregation, this paper contributes to the field of moral uncertainty by 1) formalising the problem of fanaticism as a property of social welfare functionals and 2) providing non-fanatical alternatives to MEC, i.e. Highest k-trimmed Mean and Highest Median. \ No newline at end of file diff --git a/data/2024/aaai/MorphVAE: Advancing Morphological Design of Voxel-Based Soft Robots with Variational Autoencoders b/data/2024/aaai/MorphVAE: Advancing Morphological Design of Voxel-Based Soft Robots with Variational Autoencoders new file mode 100644 index 0000000000..97181b85b3 --- /dev/null +++ b/data/2024/aaai/MorphVAE: Advancing Morphological Design of Voxel-Based Soft Robots with Variational Autoencoders @@ -0,0 +1 @@ +Soft robot design is an intricate field with unique challenges due to its complex and vast search space. In the past literature, evolutionary computation algorithms, including novel probabilistic generative models (PGMs), have shown potential in this realm. However, these methods are sample inefficient and predominantly focus on rigid robots in locomotion tasks, which limit their performance and application in robot design automation. In this work, we propose MorphVAE, an innovative PGM that incorporates a multi-task training scheme and a meticulously crafted sampling technique termed ``continuous natural selection'', aimed at bolstering sample efficiency. This method empowers us to gain insights from assessed samples across diverse tasks and temporal evolutionary stages, while simultaneously maintaining a delicate balance between optimization efficiency and biodiversity. Through extensive experiments in various locomotion and manipulation tasks, we substantiate the efficiency of MorphVAE in generating high-performing and diverse designs, surpassing the performance of competitive baselines. \ No newline at end of file diff --git a/data/2024/aaai/Motion Deblurring via Spatial-Temporal Collaboration of Frames and Events b/data/2024/aaai/Motion Deblurring via Spatial-Temporal Collaboration of Frames and Events new file mode 100644 index 0000000000..4e7ed10997 --- /dev/null +++ b/data/2024/aaai/Motion Deblurring via Spatial-Temporal Collaboration of Frames and Events @@ -0,0 +1 @@ +Motion deblurring can be advanced by exploiting informative features from supplementary sensors such as event cameras, which can capture rich motion information asynchronously with high temporal resolution. Existing event-based motion deblurring methods neither consider the modality redundancy in spatial fusion nor temporal cooperation between events and frames. To tackle these limitations, a novel spatial-temporal collaboration network (STCNet) is proposed for event-based motion deblurring. 
Firstly, we propose a differential-modality based cross-modal calibration strategy to suppress redundancy for complementarity enhancement, and then bimodal spatial fusion is achieved with an elaborate cross-modal co-attention mechanism to weight the contributions of them for importance balance. Besides, we present a frame-event mutual spatio-temporal attention scheme to alleviate the errors of relying only on frames to compute cross-temporal similarities when the motion blur is significant, and then the spatio-temporal features from both frames and events are aggregated with the custom cross-temporal coordinate attention. Extensive experiments on both synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance. Project website: https://github.com/wyang-vis/STCNet. \ No newline at end of file diff --git a/data/2024/aaai/MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators b/data/2024/aaai/MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators new file mode 100644 index 0000000000..d1b56cc611 --- /dev/null +++ b/data/2024/aaai/MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators @@ -0,0 +1 @@ +Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Visit our webpage at https://qiqiapink.github.io/MotionGPT/. \ No newline at end of file diff --git a/data/2024/aaai/MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation b/data/2024/aaai/MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation new file mode 100644 index 0000000000..9a342b440e --- /dev/null +++ b/data/2024/aaai/MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation @@ -0,0 +1 @@ +Controllable generation of 3D human motions becomes an important topic as the world embraces digital transformation. Existing works, though making promising progress with the advent of diffusion models, heavily rely on meticulously captured and annotated (e.g., text) high-quality motion corpus, a resource-intensive endeavor in the real world. This motivates our proposed MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences. 
Specifically, we separate the denoising objectives of a diffusion model into two stages: obtaining conditional rough motion approximations in the initial T-T* steps by learning the noisy annotated motions, followed by the unconditional refinement of these preliminary motions during the last T* steps using unannotated motions. Notably, though learning from two sources of imperfect data, our model does not compromise motion generation quality compared to fully supervised approaches that access gold data. Extensive experiments on several benchmarks demonstrate that our MotionMix, as a versatile framework, consistently achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks. \ No newline at end of file diff --git a/data/2024/aaai/MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling b/data/2024/aaai/MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling new file mode 100644 index 0000000000..5bb05232e8 --- /dev/null +++ b/data/2024/aaai/MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling @@ -0,0 +1,2 @@ +Video-and-language understanding has a variety of applications in the industry, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which incur high computational costs. In particular, they have difficulty dealing with dense video frames or long text prevalent in industrial applications. +This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and rapid adaptation to downstream tasks. Specifically, we design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules to sample long sequences and fuse multi-modal features, which reduces the computational costs and addresses performance degradation caused by previous samplers. Therefore, MuLTI can handle longer sequences with limited computational costs. Then, to further enhance the model's performance and address the lack of pretraining tasks for video question answering, we propose a new pretraining task named Multiple Choice Modeling. This task bridges the gap between pretraining and downstream tasks and improves the model's ability to align video and text features. Benefiting from the efficient feature fusion module and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released. \ No newline at end of file diff --git a/data/2024/aaai/MuST: Robust Image Watermarking for Multi-Source Tracing b/data/2024/aaai/MuST: Robust Image Watermarking for Multi-Source Tracing new file mode 100644 index 0000000000..d6abeab222 --- /dev/null +++ b/data/2024/aaai/MuST: Robust Image Watermarking for Multi-Source Tracing @@ -0,0 +1,2 @@ +In recent years, with the popularity of social media applications, massive numbers of digital images are available online, which brings great convenience to image recreation. However, the use of unauthorized image materials in multi-source composite images is still inadequately regulated, which may cause significant loss and discouragement to the copyright owners of the source image materials.
Ideally, deep watermarking techniques could provide a solution for protecting these copyrights based on their encoder-noise-decoder training strategy. Yet existing image watermarking schemes, which are mostly designed for single images, cannot well address the copyright protection requirements in this scenario, since the multi-source image composing process commonly includes distortions that are not well investigated in previous methods, e.g., the extreme downsizing. +To meet such demands, we propose MuST, a multi-source tracing robust watermarking scheme, whose architecture includes a multi-source image detector and minimum external rectangle operation for multiple watermark resynchronization and extraction. Furthermore, we constructed an image material dataset covering common image categories and designed the simulation model of the multi-source image composing process as the noise layer. Experiments demonstrate the excellent performance of MuST in tracing sources of image materials from the composite images compared with SOTA watermarking methods, which could maintain the extraction accuracy above 98% to trace the sources of at least 3 different image materials while keeping the average PSNR of watermarked image materials higher than 42.51 dB. We released our code on https://github.com/MrCrims/MuST \ No newline at end of file diff --git a/data/2024/aaai/Multi-Architecture Multi-Expert Diffusion Models b/data/2024/aaai/Multi-Architecture Multi-Expert Diffusion Models new file mode 100644 index 0000000000..64ccd9c10c --- /dev/null +++ b/data/2024/aaai/Multi-Architecture Multi-Expert Diffusion Models @@ -0,0 +1 @@ +In this paper, we address the performance degradation of efficient diffusion models by introducing Multi-architecturE Multi-Expert diffusion models (MEME). We identify the need for tailored operations at different time-steps in diffusion processes and leverage this insight to create compact yet high-performing models. MEME assigns distinct architectures to different time-step intervals, balancing convolution and self-attention operations based on observed frequency characteristics. We also introduce a soft interval assignment strategy for comprehensive training. Empirically, MEME operates 3.3 times faster than baselines while improving image generation quality (FID scores) by 0.62 (FFHQ) and 0.37 (CelebA). Though we validate the effectiveness of assigning more optimal architecture per time-step, where efficient models outperform the larger models, we argue that MEME opens a new design choice for diffusion models that can be easily applied in other scenarios, such as large multi-expert models. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Class Support Vector Machine with Maximizing Minimum Margin b/data/2024/aaai/Multi-Class Support Vector Machine with Maximizing Minimum Margin new file mode 100644 index 0000000000..020f210c1c --- /dev/null +++ b/data/2024/aaai/Multi-Class Support Vector Machine with Maximizing Minimum Margin @@ -0,0 +1,3 @@ +Support Vector Machine (SVM) stands out as a prominent machine learning technique widely applied in practical pattern recognition tasks. It achieves binary classification by maximizing the "margin", which represents the minimum distance between instances and the decision boundary. Although many efforts have been dedicated to expanding SVM for multi-class case through strategies such as one versus one and one versus the rest, satisfactory solutions remain to be developed. 
In this paper, we propose a novel method for multi-class SVM that incorporates pairwise class loss considerations and maximizes the minimum margin. Adhering to this concept, we embrace a new formulation that imparts heightened flexibility to multi-class SVM. +Furthermore, the correlations between the proposed method and multiple forms of multi-class SVM are analyzed. The proposed regularizer, akin to the concept of "margin", can serve as a seamless enhancement over the softmax in deep learning, providing guidance for network parameter learning. Empirical evaluations demonstrate the effectiveness and superiority of our proposed +method over existing multi-classification methods. The complete version is available at https://arxiv.org/pdf/2312.06578.pdf. Code is available at https://github.com/zz-haooo/M3SVM. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Constellation-Inspired Single-Shot Global LiDAR Localization b/data/2024/aaai/Multi-Constellation-Inspired Single-Shot Global LiDAR Localization new file mode 100644 index 0000000000..2423f57ce4 --- /dev/null +++ b/data/2024/aaai/Multi-Constellation-Inspired Single-Shot Global LiDAR Localization @@ -0,0 +1 @@ +Global localization is a challenging task for intelligent robots, as its accuracy directly contributes to the performance of downstream navigation and planning tasks. However, the existing literature focuses more on place retrieval and the success rate of localization, with limited attention given to the metrics of position estimation. In this paper, a single-shot global LiDAR localization method is proposed with the ultimate goal of achieving high position accuracy, inspired by the positioning approach of multi-constellation localization systems. Initially, we perform coarse localization using global descriptors and select observation points along with their corresponding coordinates based on the obtained coarse localization results. Coordinates can be acquired from a pre-built map, GNSS, or other devices. Then, a lightweight LiDAR odometry method is designed to estimate the distance between the retrieved data and the observation points. Ultimately, the localization problem is transformed into an optimization problem of solving a system of multiple sphere equations. The experimental results on the KITTI dataset and the self-collected dataset demonstrate that our method achieves an average localization error (including errors in the z-axis) of 0.89 meters. In addition, it achieves a retrieval efficiency of 0.357 s per frame on the former dataset and 0.214 s per frame on the latter one. Code and data are available at https://github.com/jlurobot/multi-constellation-localization. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Cross Sampling and Frequency-Division Reconstruction for Image Compressed Sensing b/data/2024/aaai/Multi-Cross Sampling and Frequency-Division Reconstruction for Image Compressed Sensing new file mode 100644 index 0000000000..43791c9286 --- /dev/null +++ b/data/2024/aaai/Multi-Cross Sampling and Frequency-Division Reconstruction for Image Compressed Sensing @@ -0,0 +1 @@ +Deep Compressed Sensing (DCS) has attracted considerable interest due to its superior quality and speed compared to traditional CS algorithms. However, current approaches employ simplistic convolutional downsampling to acquire measurements, making it difficult to retain high-level features of the original signal for better image reconstruction.
Furthermore, these approaches often overlook the presence of both high- and low-frequency information within the network, despite their critical role in achieving high-quality reconstruction. To address these challenges, we propose a novel Multi-Cross Sampling and Frequency Division Network (MCFD-Net) for image CS. The Dynamic Multi-Cross Sampling (DMCS) module, a sampling network of MCFD-Net, incorporates pyramid cross convolution and dual-branch sampling with multi-level pooling. Additionally, it introduces an attention mechanism between perception blocks to enhance adaptive learning effects. In the second deep reconstruction stage, we design a Frequency Division Reconstruction Module (FDRM). This module employs a discrete wavelet transform to extract high- and low-frequency information from images. It then applies multi-scale convolution and self-similarity attention compensation separately to both types of information before merging the output reconstruction results. The MCFD-Net integrates the DMCS and FDRM to construct an end-to-end learning network. Extensive CS experiments conducted on multiple benchmark datasets demonstrate that our MCFD-Net outperforms state-of-the-art approaches, while also exhibiting superior noise robustness. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Dimensional Fair Federated Learning b/data/2024/aaai/Multi-Dimensional Fair Federated Learning new file mode 100644 index 0000000000..5f2900771a --- /dev/null +++ b/data/2024/aaai/Multi-Dimensional Fair Federated Learning @@ -0,0 +1 @@ +Federated learning (FL) has emerged as a promising collaborative and secure paradigm for training a model from decentralized data without compromising privacy. Group fairness and client fairness are two dimensions of fairness that are important for FL. Standard FL can result in disproportionate disadvantages for certain clients, and it still faces the challenge of treating different groups equitably in a population. The problem of privately training fair FL models without compromising the generalization capability of disadvantaged clients remains open. In this paper, we propose a method, called mFairFL, to address this problem and achieve group fairness and client fairness simultaneously. mFairFL leverages differential multipliers to construct an optimization objective for empirical risk minimization with fairness constraints. Before aggregating locally trained models, it first detects conflicts among their gradients, and then iteratively curates the direction and magnitude of gradients to mitigate these conflicts. Theoretical analysis proves mFairFL facilitates the fairness in model development. The experimental evaluations based on three benchmark datasets show significant advantages of mFairFL compared to seven state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Domain Deep Learning from a Multi-View Perspective for Cross-Border E-commerce Search b/data/2024/aaai/Multi-Domain Deep Learning from a Multi-View Perspective for Cross-Border E-commerce Search new file mode 100644 index 0000000000..da5421fec5 --- /dev/null +++ b/data/2024/aaai/Multi-Domain Deep Learning from a Multi-View Perspective for Cross-Border E-commerce Search @@ -0,0 +1 @@ +Building click-through rate (CTR) and conversion rate (CVR) prediction models for cross-border e-commerce search requires modeling the correlations among multi-domains. 
Existing multi-domain methods suffer severely from poor scalability and low efficiency when the number of domains increases. To this end, we propose a Domain-Aware Multi-view mOdel (DAMO), which is domain-number-invariant, to effectively leverage cross-domain relations from a multi-view perspective. Specifically, instead of working in the original feature space defined by different domains, DAMO maps everything to a new low-rank multi-view space. To achieve this, DAMO first extracts multi-domain features in an explicit feature-interactive manner. These features are passed to a multi-view extractor to obtain view-invariant and view-specific features. Then a multi-view predictor inputs these two sets of features and outputs view-based predictions. To enforce view-awareness in the predictor, we further propose a lightweight view-attention estimator to dynamically learn the optimal view-specific weights w.r.t. a view-guided loss. Extensive experiments on public and industrial datasets show that compared with state-of-the-art models, our DAMO achieves better performance with lower storage and computational costs. In addition, deploying DAMO to a large-scale cross-border e-commerce platform leads to 1.21%, 1.76%, and 1.66% improvements over the existing CGC-based model in the online AB-testing experiment in terms of CTR, CVR, and Gross Merchandise Value, respectively. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Domain Incremental Learning for Face Presentation Attack Detection b/data/2024/aaai/Multi-Domain Incremental Learning for Face Presentation Attack Detection new file mode 100644 index 0000000000..0a44dd82ad --- /dev/null +++ b/data/2024/aaai/Multi-Domain Incremental Learning for Face Presentation Attack Detection @@ -0,0 +1 @@ +Previous face Presentation Attack Detection (PAD) methods aim to improve the effectiveness of cross-domain tasks. However, in real-world scenarios, the original training data of the pre-trained model is not available due to data privacy or other reasons. Under these constraints, general methods for fine-tuning single-target domain data may lose previously learned knowledge, leading to a catastrophic forgetting problem. To address these issues, we propose a multi-domain incremental learning (MDIL) method for PAD, which not only learns knowledge well from the new domain but also maintains the performance of previous domains stably. Specifically, we propose an adaptive domain-specific experts (ADE) framework based on the vision transformer to preserve the discriminability of previous domains. Furthermore, an asymmetric classifier is designed to keep the output distribution of different classifiers consistent, thereby improving the generalization ability. Extensive experiments show that our proposed method achieves state-of-the-art performance compared to prior methods of incremental learning. Excitingly, under more stringent setting conditions, our method approximates or even outperforms the DA/DG-based methods. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Domain Multi-Scale Diffusion Model for Low-Light Image Enhancement b/data/2024/aaai/Multi-Domain Multi-Scale Diffusion Model for Low-Light Image Enhancement new file mode 100644 index 0000000000..0e545a8d81 --- /dev/null +++ b/data/2024/aaai/Multi-Domain Multi-Scale Diffusion Model for Low-Light Image Enhancement @@ -0,0 +1 @@ +Diffusion models have achieved remarkable progress in low-light image enhancement.
However, there remain two practical limitations: (1) existing methods mainly focus on the spatial domain for the diffusion process, while neglecting the essential features in the frequency domain; (2) the conventional patch-based sampling strategy inevitably leads to severe checkerboard artifacts due to uneven overlapping. To address these limitations in one go, we propose a Multi-Domain Multi-Scale (MDMS) diffusion model for low-light image enhancement. In particular, we introduce a spatial-frequency fusion module to seamlessly integrate spatial and frequency information. By leveraging the Multi-Domain Learning (MDL) paradigm, our proposed model is endowed with the capability to adaptively facilitate noise distribution learning, thereby enhancing the quality of the generated images. Meanwhile, we propose a Multi-Scale Sampling (MSS) strategy that follows a divide-ensemble manner by merging the restored patches under different resolutions. Such a multi-scale learning paradigm explicitly derives patch information from different granularities, thus leading to smoother boundaries. Furthermore, we empirically adopt the Bright Channel Prior (BCP), which indicates natural statistical regularity, as additional restoration guidance. Experimental results on the LOL and LOLv2 datasets demonstrate that our method achieves state-of-the-art performance for the low-light image enhancement task. Codes are available at https://github.com/Oliiveralien/MDMS. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Domain Recommendation to Attract Users via Domain Preference Modeling b/data/2024/aaai/Multi-Domain Recommendation to Attract Users via Domain Preference Modeling new file mode 100644 index 0000000000..4e64b85795 --- /dev/null +++ b/data/2024/aaai/Multi-Domain Recommendation to Attract Users via Domain Preference Modeling @@ -0,0 +1 @@ +Recently, web platforms have been operating various service domains simultaneously. Targeting a platform that operates multiple service domains, we introduce a new task, Multi-Domain Recommendation to Attract Users (MDRAU), which recommends items from multiple ``unseen'' domains with which each user has not interacted yet, by using knowledge from the user's ``seen'' domains. In this paper, we point out two challenges of the MDRAU task. First, there are numerous possible combinations of mappings from seen to unseen domains because users have usually interacted with a different subset of service domains. Second, a user might have different preferences for each of the target unseen domains, which requires recommendations to reflect users' preferences on domains as well as items. To tackle these challenges, we propose the DRIP framework that models users' preferences at two levels (i.e., domain and item) and learns various seen-unseen domain mappings in a unified way with masked domain modeling. Our extensive experiments demonstrate the effectiveness of DRIP in the MDRAU task and its ability to capture users' domain-level preferences.
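A minimal sketch of what masked domain modeling could look like (the encoder choice, masking scheme, and loss below are assumptions for illustration, not the DRIP implementation): randomly hide one of a user's seen-domain representations and train the model to reconstruct it from the remaining domains.

import torch
import torch.nn as nn

class MaskedDomainEncoder(nn.Module):
    """Hypothetical sketch of masked domain modeling: mask one seen-domain
    representation per user and predict it from the other domains."""
    def __init__(self, num_domains, dim=64):
        super().__init__()
        self.domain_emb = nn.Embedding(num_domains, dim)      # domain-id embeddings
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, user_domain_feats):
        # user_domain_feats: (batch, num_domains, dim), one vector per seen domain
        b, d, _ = user_domain_feats.shape
        masked = user_domain_feats.clone()
        target_idx = torch.randint(0, d, (b,))                  # one masked domain per user
        masked[torch.arange(b), target_idx] = self.mask_token
        hidden = self.encoder(masked + self.domain_emb.weight)  # add domain-id embeddings
        pred = hidden[torch.arange(b), target_idx]               # prediction at the masked slot
        target = user_domain_feats[torch.arange(b), target_idx]
        return nn.functional.mse_loss(pred, target)              # reconstruction loss

Training over many random maskings forces the encoder to express every domain's preference in terms of the others, which is the property a seen-to-unseen mapping needs.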
\ No newline at end of file diff --git a/data/2024/aaai/Multi-Energy Guided Image Translation with Stochastic Differential Equations for Near-Infrared Facial Expression Recognition b/data/2024/aaai/Multi-Energy Guided Image Translation with Stochastic Differential Equations for Near-Infrared Facial Expression Recognition new file mode 100644 index 0000000000..07930d179e --- /dev/null +++ b/data/2024/aaai/Multi-Energy Guided Image Translation with Stochastic Differential Equations for Near-Infrared Facial Expression Recognition @@ -0,0 +1 @@ +Illumination variation has been a long-term challenge in real-world facial expression recognition (FER). Under uncontrolled or non-visible light conditions, near-infrared (NIR) can provide a simple alternative solution to obtain high-quality images and supplement the geometric and texture details that are missing in the visible (VIS) domain. Due to the lack of large-scale NIR facial expression datasets, directly extending VIS FER methods to the NIR spectrum may be ineffective. Additionally, previous heterogeneous image synthesis methods are restricted by low controllability without prior task knowledge. To tackle these issues, we present the first approach, called NIR-FER Stochastic Differential Equations (NFER-SDE), which transforms facial expression appearance between heterogeneous modalities to tackle the overfitting problem on small-scale NIR data. NFER-SDE can take the whole VIS source image as input and, together with domain-specific knowledge, guide the preservation of modality-invariant information in the high-frequency content of the image. Extensive experiments and ablation studies show that NFER-SDE significantly improves the performance of NIR FER and achieves state-of-the-art results on the only two available NIR FER datasets, Oulu-CASIA and Large-HFE. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Expert Distillation for Few-Shot Coordination (Student Abstract) b/data/2024/aaai/Multi-Expert Distillation for Few-Shot Coordination (Student Abstract) new file mode 100644 index 0000000000..2a0b96e1d4 --- /dev/null +++ b/data/2024/aaai/Multi-Expert Distillation for Few-Shot Coordination (Student Abstract) @@ -0,0 +1 @@ +Ad hoc teamwork is a crucial challenge that aims to design an agent capable of effective collaboration with teammates employing diverse strategies without prior coordination. However, current Population-Based Training (PBT) approaches train the ad hoc agent through interaction with diverse teammates from scratch, which suffer from low efficiency. We introduce Multi-Expert Distillation (MED), a novel approach that directly distills diverse strategies through modeling across-episodic sequences. Experiments show that our algorithm achieves more efficient and stable training and has the ability to improve its behavior using historical contexts. Our code is available at https://github.com/LAMDA-RL/MED. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Label Supervised Contrastive Learning b/data/2024/aaai/Multi-Label Supervised Contrastive Learning new file mode 100644 index 0000000000..2ce0f04433 --- /dev/null +++ b/data/2024/aaai/Multi-Label Supervised Contrastive Learning @@ -0,0 +1,7 @@ +Multi-label classification is an arduous problem given the complication in label correlation. Whilst sharing a common goal with contrastive learning in utilizing correlations for representation learning, how to better leverage label information remains challenging.
+Previous endeavors include extracting label-level representations or mapping labels to an embedding space, overlooking the correlation between multiple labels. +There is great ambiguity in determining positive samples when samples share different extents of label overlap, and in integrating such relations into loss functions. +In our work, we propose Multi-Label Supervised Contrastive learning (MulSupCon) with a novel contrastive loss function to adjust weights based on how much overlap one sample shares with the anchor. +By analyzing gradients, we explain why our method performs better under multi-label circumstances. +To evaluate, we conduct direct classification and transfer learning on several multi-label datasets, including widely-used image datasets such as MS-COCO and NUS-WIDE. +Validation indicates that our method outperforms the traditional multi-label classification method and shows competitive performance when compared to other existing approaches. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Level Cross-Modal Alignment for Image Clustering b/data/2024/aaai/Multi-Level Cross-Modal Alignment for Image Clustering new file mode 100644 index 0000000000..265f7aafd6 --- /dev/null +++ b/data/2024/aaai/Multi-Level Cross-Modal Alignment for Image Clustering @@ -0,0 +1 @@ +Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pretraining model could produce poor-quality pseudo labels and degrade clustering performance. To solve the aforementioned issue, we propose a novel Multi-level Cross-modal Alignment method to improve the alignments in a cross-modal pretraining model for downstream tasks, by building a smaller but better semantic space and aligning the images and texts at three levels, i.e., instance-level, prototype-level, and semantic-level. Theoretical results show that our proposed method converges, and suggest effective means to reduce the expected clustering risk of our method. Experimental results on five benchmark datasets clearly show the superiority of our new method. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Modal Disordered Representation Learning Network for Description-Based Person Search b/data/2024/aaai/Multi-Modal Disordered Representation Learning Network for Description-Based Person Search new file mode 100644 index 0000000000..ac04babe38 --- /dev/null +++ b/data/2024/aaai/Multi-Modal Disordered Representation Learning Network for Description-Based Person Search @@ -0,0 +1 @@ +Description-based person search aims to retrieve images of the target identity via textual descriptions. One of the challenges for this task is to extract discriminative representations from images and descriptions. Most existing methods apply the part-based split method or external models to explore the fine-grained details of local features, which ignore the global relationship between partial information and cause network instability. To overcome these issues, we propose a Multi-modal Disordered Representation Learning Network (MDRL) for description-based person search to fully extract the visual and textual representations. Specifically, we design a Cross-modality Global Feature Learning Architecture to learn the global features from the two modalities and meet the demand of the task.
Based on our global network, we introduce a Disorder Local Learning Module to explore local features by a disordered reorganization strategy from both visual and textual aspects and enhance the robustness of the whole network. Besides, we introduce a Cross-modality Interaction Module to guide the two streams to extract visual or textual representations considering the correlation between modalities. Extensive experiments are conducted on two public datasets, and the results show that our method outperforms state-of-the-art methods on the CUHK-PEDES and ICFG-PEDES datasets. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models b/data/2024/aaai/Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models new file mode 100644 index 0000000000..8873c59463 --- /dev/null +++ b/data/2024/aaai/Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models @@ -0,0 +1 @@ +Chain-of-thought (CoT) reasoning has exhibited impressive performance in language models for solving complex tasks and answering questions. However, many real-world questions require multi-modal information, such as text and images. Previous research on multi-modal CoT has primarily focused on extracting fixed image features from off-the-shelf vision models and then fusing them with text using attention mechanisms. This approach has limitations because these vision models were not designed for complex reasoning tasks and do not align well with language thoughts. To overcome this limitation, we introduce a novel approach for multi-modal CoT reasoning that utilizes latent space learning via diffusion processes to generate effective image features that align with language thoughts. Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT. We demonstrate the efficacy of our proposed method on multi-modal ScienceQA and machine translation benchmarks, achieving state-of-the-art performance on ScienceQA. Overall, our approach offers a more robust and effective solution for multi-modal reasoning in language models, enhancing their ability to tackle complex real-world problems. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection b/data/2024/aaai/Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection new file mode 100644 index 0000000000..445f453c87 --- /dev/null +++ b/data/2024/aaai/Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection @@ -0,0 +1 @@ +Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progress in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models like CLIP, has shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address this challenge, we propose a multi-modal prompting method that adapts CLIP well to open-vocabulary video visual relationship detection by prompt-tuning on both visual representation and language input.
Specifically, we enhance the image encoder of CLIP by using spatio-temporal visual prompting to capture spatio-temporal contexts, thereby making it suitable for object-level relationship representation in videos. Furthermore, we propose visual-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating recognizing novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, especially achieving a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Modality Affinity Inference for Weakly Supervised 3D Semantic Segmentation b/data/2024/aaai/Multi-Modality Affinity Inference for Weakly Supervised 3D Semantic Segmentation new file mode 100644 index 0000000000..7cbdd243db --- /dev/null +++ b/data/2024/aaai/Multi-Modality Affinity Inference for Weakly Supervised 3D Semantic Segmentation @@ -0,0 +1 @@ +3D point cloud semantic segmentation has a wide range of applications. Recently, weakly supervised point cloud segmentation methods have been proposed, aiming to alleviate the expensive and laborious manual annotation process by leveraging scene-level labels. However, these methods have not effectively exploited the rich geometric information (such as shape and scale) and appearance information (such as color and texture) present in RGB-D scans. Furthermore, current approaches fail to fully leverage the point affinity that can be inferred from the feature extraction network, which is crucial for learning from weak scene-level labels. Additionally, previous work overlooks the detrimental effects of the long-tailed distribution of point cloud data in weakly supervised 3D semantic segmentation. To this end, this paper proposes a simple yet effective scene-level weakly supervised point cloud segmentation method with a newly introduced multi-modality point affinity inference module. The point affinity proposed in this paper is characterized by features from multiple modalities (e.g., point cloud and RGB), and is further refined by normalizing the classifier weights to alleviate the detrimental effects of long-tailed distribution without the need of the prior of category distribution. Extensive experiments on the ScanNet and S3DIS benchmarks verify the effectiveness of our proposed method, which outperforms the state-of-the-art by ~4% to ~ 6% mIoU. Codes are released at https://github.com/Sunny599/AAAI24-3DWSSG-MMA. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Objective Bayesian Optimization with Active Preference Learning b/data/2024/aaai/Multi-Objective Bayesian Optimization with Active Preference Learning new file mode 100644 index 0000000000..7effa94a4a --- /dev/null +++ b/data/2024/aaai/Multi-Objective Bayesian Optimization with Active Preference Learning @@ -0,0 +1 @@ +There are a lot of real-world black-box optimization problems that need to optimize multiple criteria simultaneously. However, in a multi-objective optimization (MOO) problem, identifying the whole Pareto front requires the prohibitive search cost, while in many practical scenarios, the decision maker (DM) only needs a specific solution among the set of the Pareto optimal solutions. 
We propose a Bayesian optimization (BO) approach to identifying the most preferred solution in the MOO with expensive objective functions, in which a Bayesian preference model of the DM is adaptively estimated in an interactive manner based on two types of supervision, called pairwise preference and improvement request. To explore the most preferred solution, we define an acquisition function in which the uncertainty in both the objective function and the DM preference is incorporated. Further, to minimize the interaction cost with the DM, we also propose an active learning strategy for the preference estimation. We empirically demonstrate the effectiveness of our proposed method through benchmark function optimization and hyper-parameter optimization problems for machine learning models. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification b/data/2024/aaai/Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification new file mode 100644 index 0000000000..5a404e997e --- /dev/null +++ b/data/2024/aaai/Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification @@ -0,0 +1,2 @@ +Fine-grained attribute descriptions can significantly supplement valuable semantic information for person images, which is vital to the success of the person re-identification (ReID) +task. However, current ReID algorithms typically fail to effectively leverage the rich contextual information available, primarily due to their reliance on simplistic and coarse utilization of image attributes. Recent advances in artificial intelligence generated content have made it possible to automatically generate plentiful fine-grained attribute descriptions and make full use of them. Therefore, this paper explores the potential of using the generated multiple person attributes as prompts in ReID tasks with off-the-shelf (large) models for more accurate retrieval results. To this end, we present a new framework called Multi-Prompts ReID (MP-ReID), based on prompt learning and language models, to fully exploit fine-grained attributes to assist the ReID task. Specifically, MP-ReID first learns to hallucinate diverse, informative, and promptable sentences for describing the query images. This procedure includes (i) explicit prompts of which attributes a person has and furthermore (ii) implicit learnable prompts for adjusting/conditioning the criteria used for this person identity matching. Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models. Moreover, an alignment module is designed to fuse multi-prompts (i.e., explicit and implicit ones) progressively and mitigate the cross-modal gap. Extensive experiments on the existing attribute-involved ReID datasets, namely, Market1501 and DukeMTMC-reID, demonstrate the effectiveness and rationality of the proposed MP-ReID solution.
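A small sketch of how explicit and implicit prompts might be combined (the module names, fusion choice, and similarity score are illustrative assumptions, not the MP-ReID code): embeddings of generated attribute sentences are concatenated with a bank of learnable prompt vectors and fused before matching against the image feature.

import torch
import torch.nn as nn

class MultiPromptHead(nn.Module):
    """Hypothetical sketch: fuse explicit prompt embeddings (e.g., from generated
    attribute sentences) with learnable implicit prompts, then score against images."""
    def __init__(self, dim=512, num_implicit=4):
        super().__init__()
        self.implicit_prompts = nn.Parameter(torch.randn(num_implicit, dim) * 0.02)
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, explicit_prompt_emb, image_emb):
        # explicit_prompt_emb: (batch, num_prompts, dim); image_emb: (batch, dim)
        b = explicit_prompt_emb.size(0)
        implicit = self.implicit_prompts.unsqueeze(0).expand(b, -1, -1)
        prompts = torch.cat([explicit_prompt_emb, implicit], dim=1)
        fused, _ = self.fuse(prompts, prompts, prompts)   # simplified one-shot fusion
        text_emb = fused.mean(dim=1)
        return nn.functional.cosine_similarity(text_emb, image_emb, dim=-1)

Here the sentence embeddings carry the explicit attribute content, while the learnable vectors play the role of the implicit prompts that condition the matching criteria.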
\ No newline at end of file diff --git a/data/2024/aaai/Multi-Prototype Space Learning for Commonsense-Based Scene Graph Generation b/data/2024/aaai/Multi-Prototype Space Learning for Commonsense-Based Scene Graph Generation new file mode 100644 index 0000000000..5f04832288 --- /dev/null +++ b/data/2024/aaai/Multi-Prototype Space Learning for Commonsense-Based Scene Graph Generation @@ -0,0 +1 @@ +In the domain of scene graph generation, modeling commonsense as a single-prototype representation has been typically employed to facilitate the recognition of infrequent predicates. However, a fundamental challenge lies in the large intra-class variations of the visual appearance of predicates, resulting in subclasses within a predicate class. Such a challenge typically leads to the problem of misclassifying diverse predicates due to the rough predicate space clustering. In this paper, inspired by cognitive science, we maintain multi-prototype representations for each predicate class, which can accurately find the multiple class centers of the predicate space. Technically, we propose a novel multi-prototype learning framework consisting of three main steps: prototype-predicate matching, prototype updating, and prototype space optimization. We first design a triple-level optimal transport to match each predicate feature within the same class to a specific prototype. In addition, the prototypes are updated using momentum updating to find the class centers according to the matching results. Finally, we enhance the inter-class separability of the prototype space through iterations of the inter-class separability loss and intra-class compactness loss. Extensive evaluations demonstrate that our approach significantly outperforms state-of-the-art methods on the Visual Genome dataset. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Region Text-Driven Manipulation of Diffusion Imagery b/data/2024/aaai/Multi-Region Text-Driven Manipulation of Diffusion Imagery new file mode 100644 index 0000000000..5632f6e162 --- /dev/null +++ b/data/2024/aaai/Multi-Region Text-Driven Manipulation of Diffusion Imagery @@ -0,0 +1 @@ +Text-guided image manipulation has attracted significant attention recently. Prevailing techniques concentrate on image attribute editing for individual objects, however, encountering challenges when it comes to multi-object editing. The main reason is the lack of consistency constraints on the spatial layout. This work presents a multi-region guided image manipulation framework, enabling manipulation through region-level textual prompts. With MultiDiffusion as a baseline, we are dedicated to the automatic generation of a rational multi-object spatial distribution, where disparate regions are fused as a unified entity. To mitigate interference from regional fusion, we employ an off-the-shelf model (CLIP) to impose region-aware spatial guidance on multi-object manipulation. Moreover, when applied to the StableDiffusion, the presence of quality-related yet object-agnostic lengthy words hampers the manipulation. To ensure focus on meaningful object-specific words for efficient guidance and generation, we introduce a keyword selection method. Furthermore, we demonstrate a downstream application of our method for multi-region inversion, which is tailored for manipulating multiple objects in real images. 
Our approach, compatible with variants of Stable Diffusion models, is readily applicable for manipulating diverse objects in extensive images with high-quality generation, showing superb image control capabilities. Code is available at https://github.com/liyiming09/multi-region-guided-diffusion. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Scale Dynamic Graph Learning for Time Series Anomaly Detection (Student Abstract) b/data/2024/aaai/Multi-Scale Dynamic Graph Learning for Time Series Anomaly Detection (Student Abstract) new file mode 100644 index 0000000000..4639fd41eb --- /dev/null +++ b/data/2024/aaai/Multi-Scale Dynamic Graph Learning for Time Series Anomaly Detection (Student Abstract) @@ -0,0 +1 @@ +The success of graph neural networks (GNNs) has spurred numerous new works leveraging GNNs for multivariate time series anomaly detection. Despite their achieved performance improvements, most of them only consider a static graph to describe the spatial-temporal dependencies between time series. Moreover, existing works neglect the time- and scale-changing structures of time series. In this work, we propose MDGAD, a novel multi-scale dynamic graph structure learning approach for time series anomaly detection. We design a multi-scale graph structure learning module that captures the complex correlations among time series, constructing an evolving graph at each scale. Meanwhile, an anomaly detector is used to combine bilateral prediction errors to detect abnormal data. Experiments conducted on two time series datasets demonstrate the effectiveness of MDGAD. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Scene Generalized Trajectory Global Graph Solver with Composite Nodes for Multiple Object Tracking b/data/2024/aaai/Multi-Scene Generalized Trajectory Global Graph Solver with Composite Nodes for Multiple Object Tracking new file mode 100644 index 0000000000..e4e9094928 --- /dev/null +++ b/data/2024/aaai/Multi-Scene Generalized Trajectory Global Graph Solver with Composite Nodes for Multiple Object Tracking @@ -0,0 +1 @@ +The global multi-object tracking (MOT) system can consider interaction, occlusion, and other ``visual blur'' scenarios to ensure effective object tracking in long videos. Among them, graph-based tracking-by-detection paradigms achieve surprising performance. However, their fully-connected nature poses storage space requirements that challenge algorithms handling long videos. Currently, commonly used methods still generate trajectories by building one-forward associations across frames. Such matches produced under the guidance of first-order similarity information may not be optimal from a longer-time perspective. Moreover, they often lack an end-to-end scheme for correcting mismatches. This paper proposes the Composite Node Message Passing Network (CoNo-Link), a multi-scene generalized framework for modeling ultra-long frame information for association. CoNo-Link's solution is a low-storage-overhead method for building constrained connected graphs. In addition to the previous method of treating objects as nodes, the network innovatively treats object trajectories as nodes for information interaction, improving the graph neural network's feature representation capability. Specifically, we formulate the graph-building problem as a top-k selection task for some reliable objects or trajectories. Our model can learn better predictions on longer-time scales by adding composite nodes.
As a result, our method outperforms the state-of-the-art in several commonly used datasets. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Source Collaborative Gradient Discrepancy Minimization for Federated Domain Generalization b/data/2024/aaai/Multi-Source Collaborative Gradient Discrepancy Minimization for Federated Domain Generalization new file mode 100644 index 0000000000..555ecfc6f0 --- /dev/null +++ b/data/2024/aaai/Multi-Source Collaborative Gradient Discrepancy Minimization for Federated Domain Generalization @@ -0,0 +1 @@ +Federated Domain Generalization aims to learn a domain-invariant model from multiple decentralized source domains for deployment on unseen target domain. Due to privacy concerns, the data from different source domains are kept isolated, which poses challenges in bridging the domain gap. To address this issue, we propose a Multi-source Collaborative Gradient Discrepancy Minimization (MCGDM) method for federated domain generalization. Specifically, we propose intra-domain gradient matching between the original images and augmented images to avoid overfitting the domain-specific information within isolated domains. Additionally, we propose inter-domain gradient matching with the collaboration of other domains, which can further reduce the domain shift across decentralized domains. Combining intra-domain and inter-domain gradient matching, our method enables the learned model to generalize well on unseen domains. Furthermore, our method can be extended to the federated domain adaptation task by fine-tuning the target model on the pseudo-labeled target domain. The extensive experiments on federated domain generalization and adaptation indicate that our method outperforms the state-of-the-art methods significantly. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Stage Prompting for Next Best Agent Recommendations in Adaptive Workflows b/data/2024/aaai/Multi-Stage Prompting for Next Best Agent Recommendations in Adaptive Workflows new file mode 100644 index 0000000000..03556a9b10 --- /dev/null +++ b/data/2024/aaai/Multi-Stage Prompting for Next Best Agent Recommendations in Adaptive Workflows @@ -0,0 +1 @@ +Traditional business processes such as loan processing, order processing, or procurement have a series of steps that are pre-defined at design and executed by enterprise systems. Recent advancements in new-age businesses, however, focus on having adaptive and ad-hoc processes by stitching together a set of functions or steps enabled through autonomous agents. Further, to enable business users to execute a flexible set of steps, there have been works on providing a conversational interface to interact and execute automation. Often, it is necessary to guide the user through the set of possible steps in the process (or workflow). Existing work on recommending the next agent to run relies on historical data. However, with changing workflows and new automation constantly getting added, it is important to provide recommendations without historical data. Additionally, hand-crafted recommendation rules do not scale. The adaptive workflow being a combination of structured and unstructured information, makes it harder to mine. Hence, in this work, we leverage Large Language Models (LLMs) to combine process knowledge with the meta-data of agents to discover NBAs specifically at cold-start. 
We propose a multi-stage approach that uses existing process knowledge and agent meta-data information to prompt an LLM and recommend a meaningful next best agent (NBA) based on user utterances. \ No newline at end of file diff --git a/data/2024/aaai/Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models b/data/2024/aaai/Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models new file mode 100644 index 0000000000..762182e3bd --- /dev/null +++ b/data/2024/aaai/Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models @@ -0,0 +1 @@ +Denoising Diffusion Probabilistic Models (DDPMs) have achieved significant success in generation tasks. Nevertheless, the exposure bias issue, i.e., the natural discrepancy between training (where the output of each step is computed from a given input) and inference (where the output of each step is computed from input obtained iteratively from the model), harms the performance of DDPMs. To our knowledge, few works have tried to tackle this issue by modifying the training process for DDPMs, but they still perform unsatisfactorily due to 1) partially modeling the discrepancy and 2) ignoring the prediction error accumulation. To address the above issues, in this paper, we propose a multi-step denoising scheduled sampling (MDSS) strategy to alleviate the exposure bias for DDPMs. Analyzing the formulations of the training and inference of DDPMs, MDSS 1) comprehensively considers the discrepancy influence of prediction errors on the output of the model (the Gaussian noise) and the output of the step (the calculated input signal of the next step), and 2) efficiently models the prediction error accumulation by using multiple iterations of a mathematical formulation initialized from the one-step prediction error obtained from the model. The experimental results, compared with previous works, demonstrate that our approach is more effective in mitigating exposure bias in DDPM, DDIM, and DPM-solver. In particular, MDSS achieves an FID score of 3.86 with 100 sampling steps of DDIM on the CIFAR-10 dataset, whereas the second best obtains 4.78. The code will be available on GitHub. \ No newline at end of file diff --git a/data/2024/aaai/Multi-View Dynamic Reflection Prior for Video Glass Surface Detection b/data/2024/aaai/Multi-View Dynamic Reflection Prior for Video Glass Surface Detection new file mode 100644 index 0000000000..7de060c1d5 --- /dev/null +++ b/data/2024/aaai/Multi-View Dynamic Reflection Prior for Video Glass Surface Detection @@ -0,0 +1 @@ +Recent research has shown significant interest in image-based glass surface detection (GSD). However, detecting glass surfaces in dynamic scenes remains largely unexplored due to the lack of a high-quality dataset and an effective video glass surface detection (VGSD) method. In this paper, we propose the first VGSD approach. Our key observation is that reflections frequently appear on glass surfaces, but they change dynamically as the camera moves. Based on this observation, we propose to offset the excessive dependence on a single uncertain reflection via joint modeling of temporal and spatial reflection cues. To this end, we propose the VGSD-Net with two novel modules: a Location-aware Reflection Extraction (LRE) module and a Context-enhanced Reflection Integration (CRI) module, for position-aware reflection feature extraction and spatial-temporal reflection cue integration, respectively.
We have also created the first large-scale video glass surface dataset (VGSD-D), consisting of 19,166 image frames with accurately-annotated glass masks extracted from 297 videos. Extensive experiments demonstrate that VGSD-Net outperforms state-of-the-art approaches adapted from related fields. Code and dataset will be available at https://github.com/fawnliu/VGSD. \ No newline at end of file diff --git a/data/2024/aaai/Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting b/data/2024/aaai/Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting new file mode 100644 index 0000000000..5bbe5c1faf --- /dev/null +++ b/data/2024/aaai/Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting @@ -0,0 +1 @@ +Recent deep learning-based multi-view people detection (MVD) methods have shown promising results on existing datasets. However, current methods are mainly trained and evaluated on small, single scenes with a limited number of multi-view frames and fixed camera views. As a result, these methods may not be practical for detecting people in larger, more complex scenes with severe occlusions and camera calibration errors. This paper focuses on improving multi-view people detection by developing a supervised view-wise contribution weighting approach that better fuses multi-camera information under large scenes. Besides, a large synthetic dataset is adopted to enhance the model's generalization ability and enable more practical evaluation and comparison. The model's performance on new testing scenes is further improved with a simple domain adaptation technique. Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance. \ No newline at end of file diff --git a/data/2024/aaai/Multi-View Randomized Kernel Classification via Nonconvex Optimization b/data/2024/aaai/Multi-View Randomized Kernel Classification via Nonconvex Optimization new file mode 100644 index 0000000000..90bebfe769 --- /dev/null +++ b/data/2024/aaai/Multi-View Randomized Kernel Classification via Nonconvex Optimization @@ -0,0 +1,8 @@ +Multi kernel learning (MKL) is a representative supervised multi-view learning method widely applied in multi-modal and multi-view applications. +MKL aims to classify data by integrating complementary information from predefined kernels. +Although existing MKL methods achieve promising performance, they fail to consider the tradeoff between diversity and classification accuracy of kernels, preventing further improvement of classification performance. +In this paper, we tackle this problem by generating a number of high-quality base learning kernels and selecting a kernel subset with maximum pairwise diversity and minimum generalization errors. +We first formulate this idea as a nonconvex quadratic integer programming problem. +Then we transform this nonconvex problem into a convex optimization problem and prove it is equivalent to a semidefinite relaxation problem, which a semidefinite-based branch-and-bound algorithm can quickly solve. +Experimental results on the real-world datasets demonstrate the superiority of the proposed method. +The results also show that our method works for the support vector machine (SVM) classifier and other state-of-the-art kernel classifiers. 
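To make the diversity-versus-error tradeoff above concrete, here is a minimal sketch, under assumptions, of the kernel-subset selection idea: it is not the paper's semidefinite-relaxation branch-and-bound solver, and the names D (pairwise diversity matrix), err (per-kernel error estimates), k, and lam are illustrative.

import itertools
import numpy as np

def select_kernel_subset(D, err, k, lam=1.0):
    """Pick k base kernels maximizing summed pairwise diversity minus lam * summed error.
    D[i, j] is the diversity between kernels i and j; err[i] is an error estimate for kernel i."""
    n = len(err)
    best_score, best_subset = -np.inf, None
    for subset in itertools.combinations(range(n), k):  # exhaustive search; fine for small n
        idx = np.array(subset)
        diversity = D[np.ix_(idx, idx)].sum() / 2.0      # each unordered pair counted once
        score = diversity - lam * err[idx].sum()
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score

# Toy usage with random diversities and error estimates.
rng = np.random.default_rng(0)
n = 8
A = rng.random((n, n))
D = (A + A.T) / 2.0
np.fill_diagonal(D, 0.0)
err = rng.random(n) * 0.3
print(select_kernel_subset(D, err, k=3))

The exhaustive loop makes the objective explicit but scales combinatorially; the paper instead solves a convex relaxation of this selection problem, which is what makes the approach practical for larger kernel pools.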
\ No newline at end of file diff --git a/data/2024/aaai/Multi-world Model in Continual Reinforcement Learning b/data/2024/aaai/Multi-world Model in Continual Reinforcement Learning new file mode 100644 index 0000000000..ca422ba088 --- /dev/null +++ b/data/2024/aaai/Multi-world Model in Continual Reinforcement Learning @@ -0,0 +1 @@ +World Models are made of generative networks that can predict future states of the single environment they were trained on. This research proposes a Multi-world Model, a foundational model built from World Models for the field of continual reinforcement learning that is trained on many different environments, enabling it to generalize state sequence predictions even for unseen settings. \ No newline at end of file diff --git a/data/2024/aaai/MultiSum: A Multi-Facet Approach for Extractive Social Summarization Utilizing Semantic and Sociological Relationships b/data/2024/aaai/MultiSum: A Multi-Facet Approach for Extractive Social Summarization Utilizing Semantic and Sociological Relationships new file mode 100644 index 0000000000..676146beb7 --- /dev/null +++ b/data/2024/aaai/MultiSum: A Multi-Facet Approach for Extractive Social Summarization Utilizing Semantic and Sociological Relationships @@ -0,0 +1 @@ +Social summarization aims to provide summaries for a large number of social texts (called posts) about a single topic. To extract a summary, both the post representation and the summary selection method are crucial. Previous methods introduce social relations to enhance post embeddings, mitigating the sparse representations caused by posts' brief and informal expression. However, they ignore that there are multiple relations between posts. Besides, existing graph-based centrality calculation approaches tend to select posts from one aspect. This leads to facet bias, especially when there are multiple viewpoints. In this paper, we propose a model named MultiSum to improve social summarization. Specifically, 1) We use graph convolutional networks to fuse text content with social and semantic relations to improve post representation; 2) The similarity between the summary and all aspects is incorporated into the centrality score during the selection phase, encouraging the model to pay attention to different facets. Experimental results on English and Chinese corpora support the effectiveness of this model. Furthermore, external evaluations by human experts and large language models demonstrate the validity of MultiSum in facet coverage and redundancy reduction. \ No newline at end of file diff --git a/data/2024/aaai/Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces b/data/2024/aaai/Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces new file mode 100644 index 0000000000..729e659c17 --- /dev/null +++ b/data/2024/aaai/Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces @@ -0,0 +1 @@ +AlphaZero and MuZero have achieved state-of-the-art (SOTA) performance in a wide range of domains, including board games and robotics, with discrete and continuous action spaces. However, to obtain an improved policy, they often require an excessively large number of simulations, especially for domains with large action spaces. As the simulation budget decreases, their performance drops significantly. In addition, many important real-world applications have combinatorial (or exponential) action spaces, making it infeasible to search directly over all possible actions.
In this paper, we extend AlphaZero and MuZero to learn and plan in more complex multiagent (MA) Markov decision processes, where the action spaces increase exponentially with the number of agents. Our new algorithms, MA Gumbel AlphaZero and MA Gumbel MuZero, respectively without and with model learning, achieve superior performance on cooperative multiagent control problems, while reducing the number of environmental interactions by up to an order of magnitude compared to model-free approaches. In particular, we significantly improve prior performance when planning with much fewer simulation budgets. The code and appendix are available at https://github.com/tjuHaoXiaotian/MA-MuZero. \ No newline at end of file diff --git a/data/2024/aaai/Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation b/data/2024/aaai/Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation new file mode 100644 index 0000000000..e2ece75f0f --- /dev/null +++ b/data/2024/aaai/Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation @@ -0,0 +1 @@ +Self-supervised speech pre-training methods have developed rapidly in recent years, which show to be very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing is suffering from the scarcity of labeled multichannel data and complex ambient noises. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose the multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. First, we propose a multi-path structure to process multi-channel audio streams and a visual stream in parallel, with intra-, and inter-channel contrastive as training targets to fully exploit the rich information in multi-channel speech data. Second, based on contrastive learning, we use additional single-channel audio data, which is trained jointly to improve the performance of multichannel multi-modal representation. Finally, we use a Chinese multichannel multi-modal dataset in real scenarios to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks. \ No newline at end of file diff --git a/data/2024/aaai/Multilevel Attention Network with Semi-supervised Domain Adaptation for Drug-Target Prediction b/data/2024/aaai/Multilevel Attention Network with Semi-supervised Domain Adaptation for Drug-Target Prediction new file mode 100644 index 0000000000..0d352567fd --- /dev/null +++ b/data/2024/aaai/Multilevel Attention Network with Semi-supervised Domain Adaptation for Drug-Target Prediction @@ -0,0 +1 @@ +Prediction of drug-target interactions (DTIs) is a crucial step in drug discovery, and deep learning methods have shown great promise on various DTI datasets. However, existing approaches still face several challenges, including limited labeled data, hidden bias issue, and a lack of generalization ability to out-of-domain data. These challenges hinder the model's capacity to learn truly informative interaction features, leading to shortcut learning and inferior predictive performance on novel drug-target pairs. 
To address these issues, we propose MlanDTI, a semi-supervised domain adaptive multilevel attention network (Mlan) for DTI prediction. We utilize two pre-trained BERT models to acquire bidirectional representations enriched with information from unlabeled data. Then, we introduce a multilevel attention mechanism, enabling the model to learn domain-invariant DTIs at different hierarchical levels. Moreover, we present a simple yet effective semi-supervised pseudo-labeling method to further enhance our model's predictive ability in cross-domain scenarios. Experiments on four datasets show that MlanDTI achieves state-of-the-art performances over other methods under intra-domain settings and outperforms all other approaches under cross-domain settings. The source code is available at https://github.com/CMACH508/MlanDTI. \ No newline at end of file diff --git a/data/2024/aaai/Multilingual Medical Language Models: A Path to Improving Lay Health Worker Effectiveness (Student Abstract) b/data/2024/aaai/Multilingual Medical Language Models: A Path to Improving Lay Health Worker Effectiveness (Student Abstract) new file mode 100644 index 0000000000..820ade4b2b --- /dev/null +++ b/data/2024/aaai/Multilingual Medical Language Models: A Path to Improving Lay Health Worker Effectiveness (Student Abstract) @@ -0,0 +1,3 @@ +The COVID-19 pandemic has exacerbated the challenges faced by healthcare delivery in developing nations, placing additional strain on already fragile infrastructure and healthcare systems. This has prompted an increased reliance on lay healthcare workers (LHWs) to meet the surging demand for services. Due to limited formal training, many LHWs have resorted to using unreliable sources, such as internet searches, to access medical information. + +Large language models (LLMs) offer a promising opportunity to support LHWs by providing accurate, context-sensitive information for improving healthcare delivery, provided they are appropriately fine-tuned on domain-specific multilingual data. This paper delves into critical issues and presents potential solutions for developing LLM-powered virtual assistants tailored to LHWs serving Telugu and Hindi-speaking populations. Key focal points include the customization of language and content to suit local contexts, the integration of feedback mechanisms to continuously enhance assistance quality, and the delicate balance between automation and human oversight. \ No newline at end of file diff --git a/data/2024/aaai/Multimodal Ensembling for Zero-Shot Image Classification b/data/2024/aaai/Multimodal Ensembling for Zero-Shot Image Classification new file mode 100644 index 0000000000..3d8c144af1 --- /dev/null +++ b/data/2024/aaai/Multimodal Ensembling for Zero-Shot Image Classification @@ -0,0 +1 @@ +Artificial intelligence has made significant progress in image classification, an essential task for machine perception to achieve human-level image understanding. Despite recent advances in vision-language fields, multimodal image classification is still challenging, particularly for the following two reasons. First, models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Second, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. Here, we utilize ensemble learning to reduce the impact of these issues on pre-trained models. 
We aim to create a meta-model that combines the predictions of multiple open-vocabulary multimodal models trained on different data to create more robust and accurate predictions. By utilizing ensemble learning and multimodal machine learning, we will achieve higher prediction accuracies without any additional training or fine-tuning, meaning that this method is completely zero-shot. \ No newline at end of file diff --git a/data/2024/aaai/Multimodal Event Causality Reasoning with Scene Graph Enhanced Interaction Network b/data/2024/aaai/Multimodal Event Causality Reasoning with Scene Graph Enhanced Interaction Network new file mode 100644 index 0000000000..a6afdb7011 --- /dev/null +++ b/data/2024/aaai/Multimodal Event Causality Reasoning with Scene Graph Enhanced Interaction Network @@ -0,0 +1 @@ +Multimodal event causality reasoning aims to recognize the causal relations based on the given events and accompanying image pairs, requiring the model to have a comprehensive grasp of visual and textual information. However, existing studies fail to effectively model the relations of the objects within the image and capture the object interactions across the image pair, resulting in an insufficient understanding of visual information by the model. To address these issues, we propose a Scene Graph Enhanced Interaction Network (SEIN) in this paper, which can leverage the interactions of the generated scene graph for multimodal event causality reasoning. Specifically, the proposed method adopts a graph convolutional network to model the objects and their relations derived from the scene graph structure, empowering the model to exploit the rich structural and semantic information in the image adequately. To capture the object interactions between the two images, we design an optimal transport-based alignment strategy to match the objects across the images, which could help the model recognize changes in visual information and facilitate causality reasoning. In addition, we introduce a cross-modal fusion module to combine textual and visual features for causality prediction. Experimental results indicate that the proposed SEIN outperforms state-of-the-art methods on the Vis-Causal dataset. \ No newline at end of file diff --git a/data/2024/aaai/Multimodal Graph Neural Architecture Search under Distribution Shifts b/data/2024/aaai/Multimodal Graph Neural Architecture Search under Distribution Shifts new file mode 100644 index 0000000000..d13cc85672 --- /dev/null +++ b/data/2024/aaai/Multimodal Graph Neural Architecture Search under Distribution Shifts @@ -0,0 +1 @@ +Multimodal graph neural architecture search (MGNAS) has shown great success for automatically designing the optimal multimodal graph neural network (MGNN) architecture by leveraging multimodal representation, crossmodal information and graph structure in one unified framework. However, existing MGNAS fails to handle distribution shifts that naturally exist in multimodal graph data, since the searched architectures inevitably capture spurious statistical correlations under distribution shifts. To solve this problem, we propose a novel Out-of-distribution Generalized Multimodal Graph Neural Architecture Search (OMG-NAS) method which optimizes the MGNN architecture with respect to its performance on decorrelated OOD data. 
Specifically, we propose a multimodal graph representation decorrelation strategy, which encourages the searched MGNN model to output representations that eliminate spurious correlations through iteratively optimizing the feature weights and controller. In addition, we propose a global sample weight estimator that facilitates the sharing of optimal sample weights learned from existing architectures. This design promotes the effective estimation of the sample weights for candidate MGNN architectures to generate decorrelated multimodal graph representations, concentrating more on the truly predictive relations between invariant features and ground-truth labels. Extensive experiments on real-world multimodal graph datasets demonstrate the superiority of our proposed method over SOTA baselines. \ No newline at end of file diff --git a/data/2024/aaai/Multiobjective Lipschitz Bandits under Lexicographic Ordering b/data/2024/aaai/Multiobjective Lipschitz Bandits under Lexicographic Ordering new file mode 100644 index 0000000000..9005fcba22 --- /dev/null +++ b/data/2024/aaai/Multiobjective Lipschitz Bandits under Lexicographic Ordering @@ -0,0 +1 @@ +This paper studies the multiobjective bandit problem under lexicographic ordering, wherein the learner aims to simultaneously maximize m objectives hierarchically. The only existing algorithm for this problem considers the multi-armed bandit model, and its regret bound is O((KT)^(2/3)) under a metric called priority-based regret. However, this bound is suboptimal, as the lower bound for single objective multi-armed bandits is Omega(K log T). Moreover, this bound becomes vacuous when the arm number K is infinite. To address these limitations, we investigate the multiobjective Lipschitz bandit model, which allows for an infinite arm set. Utilizing a newly designed multi-stage decision-making strategy, we develop an improved algorithm that achieves a general regret bound of O(T^((d_z^i+1)/(d_z^i+2))) for the i-th objective, where d_z^i is the zooming dimension for the i-th objective, with i in {1,2,...,m}. This bound matches the lower bound of the single objective Lipschitz bandit problem in terms of T, indicating that our algorithm is almost optimal. Numerical experiments confirm the effectiveness of our algorithm. \ No newline at end of file diff --git a/data/2024/aaai/Multipartite Entity Resolution: Motivating a K-Tuple Perspective (Student Abstract) b/data/2024/aaai/Multipartite Entity Resolution: Motivating a K-Tuple Perspective (Student Abstract) new file mode 100644 index 0000000000..9af6f951e9 --- /dev/null +++ b/data/2024/aaai/Multipartite Entity Resolution: Motivating a K-Tuple Perspective (Student Abstract) @@ -0,0 +1 @@ +Entity Resolution (ER) is the problem of algorithmically matching records, mentions, or entries that refer to the same underlying real-world entity. Traditionally, the problem assumes (at most) two datasets, between which records need to be matched. There is considerably less research in ER when k > 2 datasets are involved. The evaluation of such multipartite ER (M-ER) is especially complex, since the usual ER metrics assume (whether implicitly or explicitly) k < 3. This paper takes the first step towards motivating a k-tuple approach for evaluating M-ER.
Using standard algorithms and k-tuple versions of metrics like precision and recall, our preliminary results suggest a significant difference compared to aggregated pairwise evaluation, which would first decompose the M-ER problem into independent bipartite problems and then aggregate their metrics. Hence, M-ER may be more challenging and warrant more novel approaches than current decomposition-based pairwise approaches would suggest. \ No newline at end of file diff --git a/data/2024/aaai/Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions b/data/2024/aaai/Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions new file mode 100644 index 0000000000..e356e4d6bf --- /dev/null +++ b/data/2024/aaai/Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions @@ -0,0 +1,3 @@ +In many real-world applications, from robotics to pedestrian trajectory prediction, there is a need to predict multiple real-valued outputs to represent several potential scenarios. Current deep learning techniques to address multiple-output problems are based on two main methodologies: (1) mixture density networks, which suffer from poor stability at high dimensions, or (2) multiple choice learning (MCL), an approach that uses M single-output functions, each only producing a point estimate hypothesis. This paper presents a Mixture of Multiple-Output functions (MoM) approach using a novel variant of dropout, Multiple Hypothesis Dropout. Unlike traditional MCL-based approaches, each multiple-output function not only estimates the mean but also the variance for its hypothesis. This is achieved through a novel stochastic winner-take-all loss which allows each multiple-output function to estimate variance through the spread of its subnetwork predictions. +Experiments on supervised learning problems illustrate that our approach outperforms existing solutions for reconstructing multimodal output distributions. +Additional studies on unsupervised learning problems show that estimating the parameters of latent posterior distributions within a discrete autoencoder significantly improves codebook efficiency, sample quality, precision and recall. \ No newline at end of file diff --git a/data/2024/aaai/Multiple-Source Localization from a Single-Snapshot Observation Using Graph Bayesian Optimization b/data/2024/aaai/Multiple-Source Localization from a Single-Snapshot Observation Using Graph Bayesian Optimization new file mode 100644 index 0000000000..f7de1ed3eb --- /dev/null +++ b/data/2024/aaai/Multiple-Source Localization from a Single-Snapshot Observation Using Graph Bayesian Optimization @@ -0,0 +1,2 @@ +Due to the significance of its various applications, source localization has garnered considerable attention as one of the most important means to confront diffusion hazards. Multi-source localization from a single-snapshot observation is especially relevant due to its prevalence. However, the inherent complexities of this problem, such as limited information, interactions among sources, and dependence on diffusion models, pose challenges to resolution. Current methods typically utilize heuristics and greedy selection, and they are usually bonded with one diffusion model. Consequently, their effectiveness is constrained. +To address these limitations, we propose a simulation-based method termed BOSouL. Bayesian optimization (BO) is adopted to approximate the results for its sample efficiency. 
A surrogate function models uncertainty from the limited information. It takes sets of nodes as the input instead of individual nodes. BOSouL can incorporate any diffusion model in the data acquisition process through simulations. Empirical studies demonstrate that its performance is robust across graph structures and diffusion models. The code is available at https://github.com/XGraph-Team/BOSouL. \ No newline at end of file diff --git a/data/2024/aaai/Multiscale Attention Wavelet Neural Operator for Capturing Steep Trajectories in Biochemical Systems b/data/2024/aaai/Multiscale Attention Wavelet Neural Operator for Capturing Steep Trajectories in Biochemical Systems new file mode 100644 index 0000000000..cbb9b66e29 --- /dev/null +++ b/data/2024/aaai/Multiscale Attention Wavelet Neural Operator for Capturing Steep Trajectories in Biochemical Systems @@ -0,0 +1 @@ +In biochemical modeling, some foundational systems can exhibit sudden and profound behavioral shifts, such as the cellular signaling pathway models, in which the physiological responses promptly react to environmental changes, resulting in steep changes in their dynamic model trajectories. These steep changes are one of the major challenges in biochemical modeling governed by nonlinear differential equations. One promising way to tackle this challenge is converting the input data from the time domain to the frequency domain through Fourier Neural Operators, which enhances the ability to analyze data periodicity and regularity. However, the effectiveness of these Fourier-based methods diminishes in scenarios with complex abrupt switches. To address this limitation, an innovative Multiscale Attention Wavelet Neural Operator (MAWNO) method is proposed in this paper, which comprehensively combines the attention mechanism with the versatile wavelet transforms to effectively capture these abrupt switches. Specifically, the wavelet transform scrutinizes data across multiple scales to extract the characteristics of abrupt signals into wavelet coefficients, while the self-attention mechanism is adeptly introduced to enhance the wavelet coefficients in high-frequency signals that can better characterize the abrupt switches. Experimental results substantiate MAWNO’s supremacy in terms of accuracy on three classical biochemical models featuring periodic and steep trajectories. https://github.com/SUDERS/MAWNO. \ No newline at end of file diff --git a/data/2024/aaai/Multiscale Low-Frequency Memory Network for Improved Feature Extraction in Convolutional Neural Networks b/data/2024/aaai/Multiscale Low-Frequency Memory Network for Improved Feature Extraction in Convolutional Neural Networks new file mode 100644 index 0000000000..011248eed7 --- /dev/null +++ b/data/2024/aaai/Multiscale Low-Frequency Memory Network for Improved Feature Extraction in Convolutional Neural Networks @@ -0,0 +1 @@ +Deep learning and Convolutional Neural Networks (CNNs) have driven major transformations in diverse research areas. However, their limitations in handling low-frequency information present obstacles in certain tasks like interpreting global structures or managing smooth transition images. Despite the promising performance of transformer structures in numerous tasks, their intricate optimization complexities highlight the persistent need for refined CNN enhancements using limited resources.
Responding to these complexities, we introduce a novel framework, the Multiscale Low-Frequency Memory (MLFM) Network, with the goal to harness the full potential of CNNs while keeping their complexity unchanged. The MLFM efficiently preserves low-frequency information, enhancing performance in targeted computer vision tasks. Central to our MLFM is the Low-Frequency Memory Unit (LFMU), which stores various low-frequency data and forms a parallel channel to the core network. A key advantage of MLFM is its seamless compatibility with various prevalent networks, requiring no alterations to their original core structure. Testing on ImageNet demonstrated substantial accuracy improvements in multiple 2D CNNs, including ResNet, MobileNet, EfficientNet, and ConvNeXt. Furthermore, we showcase MLFM's versatility beyond traditional image classification by successfully integrating it into image-to-image translation tasks, specifically in semantic segmentation networks like FCN and U-Net. In conclusion, our work signifies a pivotal stride in the journey of optimizing the efficacy and efficiency of CNNs with limited resources. This research builds upon the existing CNN foundations and paves the way for future advancements in computer vision. Our codes are available at https://github.com/AlphaWuSeu/MLFM. \ No newline at end of file diff --git a/data/2024/aaai/Multivariate Time-Series Imagification with Time Embedding in Constrained Environments (Student Abstract) b/data/2024/aaai/Multivariate Time-Series Imagification with Time Embedding in Constrained Environments (Student Abstract) new file mode 100644 index 0000000000..37f0938605 --- /dev/null +++ b/data/2024/aaai/Multivariate Time-Series Imagification with Time Embedding in Constrained Environments (Student Abstract) @@ -0,0 +1 @@ +We present an imagification approach for multivariate time-series data tailored to constrained NN-based forecasting model training environments. Our imagification process consists of two key steps: Re-stacking and time embedding. In the Re-stacking stage, time-series data are arranged based on high correlation, forming the first image channel using a sliding window technique. The time embedding stage adds two additional image channels by incorporating real-time information. We evaluate our method by comparing it with three benchmark imagification techniques using a simple CNN-based model. Additionally, we conduct a comparison with LSTM, a conventional time-series forecasting model. Experimental results demonstrate that our proposed approach achieves three times faster model training termination while maintaining forecasting accuracy. \ No newline at end of file diff --git a/data/2024/aaai/MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion b/data/2024/aaai/MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion new file mode 100644 index 0000000000..7d14db1b13 --- /dev/null +++ b/data/2024/aaai/MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion @@ -0,0 +1 @@ +Generating music with emotion is an important task in automatic music generation, in which emotion is evoked through a variety of musical elements (such as pitch and duration) that change over time and collaborate with each other.
However, prior research on deep learning-based emotional music generation has rarely explored the contribution of different musical elements to emotions, let alone the deliberate manipulation of these elements to alter the emotion of music, which is not conducive to fine-grained element-level control over emotions. To address this gap, we present a novel approach employing musical element-based regularization in the latent space to disentangle distinct elements, investigate their roles in distinguishing emotions, and further manipulate elements to alter musical emotions. Specifically, we propose a novel VQ-VAE-based model named MusER. MusER incorporates a regularization loss to enforce the correspondence between the musical element sequences and the specific dimensions of latent variable sequences, providing a new solution for disentangling discrete sequences. Taking advantage of the disentangled latent vectors, a two-level decoding strategy that includes multiple decoders attending to latent vectors with different semantics is devised to better predict the elements. By visualizing latent space, we conclude that MusER yields a disentangled and interpretable latent space and gain insights into the contribution of distinct elements to the emotional dimensions (i.e., arousal and valence). Experimental results demonstrate that MusER outperforms the state-of-the-art models for generating emotional music in both objective and subjective evaluation. Besides, we rearrange music through element transfer and attempt to alter the emotion of music by transferring emotion-distinguishable elements. \ No newline at end of file diff --git a/data/2024/aaai/Music Style Transfer with Time-Varying Inversion of Diffusion Models b/data/2024/aaai/Music Style Transfer with Time-Varying Inversion of Diffusion Models new file mode 100644 index 0000000000..f0de97cb4a --- /dev/null +++ b/data/2024/aaai/Music Style Transfer with Time-Varying Inversion of Diffusion Models @@ -0,0 +1 @@ +With the development of diffusion models, text-guided image style transfer has demonstrated great controllable and high-quality results. However, the utilization of text for diverse music style transfer poses significant challenges, primarily due to the limited availability of matched audio-text datasets. Music, being an abstract and complex art form, exhibits variations and intricacies even within the same genre, thereby making accurate textual descriptions challenging. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we utilize a bias-reduced stylization technique to get stable results. Experimental results demonstrate that our method can transfer the style of specific instruments, as well as incorporate natural sounds to compose melodies. Samples and code are available at https://lsfhuihuiff.github.io/MusicTI/. \ No newline at end of file diff --git a/data/2024/aaai/Mutual-Modality Adversarial Attack with Semantic Perturbation b/data/2024/aaai/Mutual-Modality Adversarial Attack with Semantic Perturbation new file mode 100644 index 0000000000..03a4bb60f8 --- /dev/null +++ b/data/2024/aaai/Mutual-Modality Adversarial Attack with Semantic Perturbation @@ -0,0 +1,6 @@ +Adversarial attacks constitute a notable threat to machine learning systems, given their potential to induce erroneous predictions and classifications. 
However, within real-world contexts, the essential specifics of the deployed model are frequently treated as a black box, consequently mitigating the vulnerability to such attacks. +Thus, enhancing the transferability of the adversarial samples has become a crucial area of research, which heavily relies on selecting appropriate surrogate models. +To address this challenge, we propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme. Our approach is accomplished by leveraging the pre-trained CLIP model. Firstly, we conduct a visual attack on the clean image that causes semantic perturbations in the embedding space aligned with the textual modality. +Then, we apply the corresponding defense on the textual modality by updating the prompts, which forces re-matching in the perturbed embedding space. +Finally, to enhance the attack transferability, we utilize an iterative training strategy for the visual attack and the textual defense, where each process is optimized based on the other. +We evaluate our approach on several benchmark datasets and demonstrate that our mutual-modal attack strategy can effectively produce highly transferable attacks, which are stable regardless of the target networks. Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution. \ No newline at end of file diff --git a/data/2024/aaai/N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding b/data/2024/aaai/N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding new file mode 100644 index 0000000000..168b3253a7 --- /dev/null +++ b/data/2024/aaai/N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding @@ -0,0 +1,5 @@ +The first step in applying deep learning techniques to symbolic music understanding is to transform musical pieces (mainly in MIDI format) into sequences of predefined tokens like note pitch, note velocity, and chords. Subsequently, the sequences are fed into a neural sequence model to accomplish specific tasks. +Music sequences exhibit strong correlations between adjacent elements, making them prime candidates for N-gram techniques from Natural Language Processing (NLP). Consider classical piano music: specific melodies might recur throughout a piece, with subtle variations each time. +In this paper, we propose a novel method, NG-Midiformer, for understanding symbolic music sequences that leverages the N-gram approach. Our method involves first processing music pieces into word-like sequences with our proposed unsupervised compoundation, followed by using our N-gram Transformer encoder, which can effectively incorporate N-gram information to enhance the primary encoder part for better understanding of music sequences. +The pre-training process on large-scale music datasets enables the model to thoroughly learn the N-gram information contained within music sequences, and subsequently apply this information for making inferences during the fine-tuning stage. +Experiments on various datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance on a series of music understanding downstream tasks. The code and model weights will be released at https://github.com/CinqueOrigin/NG-Midiformer.
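As a rough sketch of the "unsupervised compoundation" idea above (the abstract does not spell out the exact NG-Midiformer procedure, so the merge rule and the token strings below are assumptions), one can repeatedly merge the most frequent adjacent token pair in the tokenized music sequences into a compound word-like unit, in the spirit of byte-pair encoding:

from collections import Counter

def compound_sequences(sequences, num_merges=10, min_count=2):
    """Greedily merge the most frequent adjacent token pair into a compound token."""
    seqs = [list(s) for s in sequences]
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < min_count:
            break
        merged = a + "+" + b  # compound token, e.g. "pitch_60+dur_8"
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return seqs

# Toy usage on a short tokenized fragment.
toks = [["pitch_60", "dur_8", "pitch_62", "dur_8", "pitch_60", "dur_8"]]
print(compound_sequences(toks, num_merges=3, min_count=2))

The resulting compound units are what a downstream N-gram-aware encoder could consume; the actual paper pairs such word-like sequences with an N-gram Transformer encoder rather than a plain tokenizer.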
\ No newline at end of file diff --git a/data/2024/aaai/ND-MRM: Neuronal Diversity Inspired Multisensory Recognition Model b/data/2024/aaai/ND-MRM: Neuronal Diversity Inspired Multisensory Recognition Model new file mode 100644 index 0000000000..6a1c313572 --- /dev/null +++ b/data/2024/aaai/ND-MRM: Neuronal Diversity Inspired Multisensory Recognition Model @@ -0,0 +1 @@ +Cross-sensory interaction is a key aspect for multisensory recognition. Without cross-sensory interaction, artificial neural networks show inferior performance in multisensory recognition. On the contrary, the human brain has an inherently remarkable ability in multisensory recognition, which stems from the diverse neurons that exhibit distinct responses to sensory inputs, especially the multisensory neurons with multisensory responses hence enabling cross-sensory interaction. Based on this neuronal diversity, we propose a Neuronal Diversity inspired Multisensory Recognition Model (ND-MRM), which, similar to the brain, comprises unisensory neurons and multisensory neurons. To reflect the different responses characteristics of diverse neurons in the brain, special connection constraints are innovatively designed to regulate the features transmission in the ND-MRM. Leveraging this novel concept of neuronal diversity, our model is biologically plausible, enabling more effective recognition of multisensory information. To validate the performance of the proposed ND-MRM, we employ a multisensory emotion recognition task as a case study. The results demonstrate that our model surpasses state-of-the-art brain-inspired baselines on two datasets, proving the potential of brain-inspired methods for advancing multisensory interaction and recognition. \ No newline at end of file diff --git a/data/2024/aaai/NESTER: An Adaptive Neurosymbolic Method for Causal Effect Estimation b/data/2024/aaai/NESTER: An Adaptive Neurosymbolic Method for Causal Effect Estimation new file mode 100644 index 0000000000..36033c4238 --- /dev/null +++ b/data/2024/aaai/NESTER: An Adaptive Neurosymbolic Method for Causal Effect Estimation @@ -0,0 +1 @@ +Causal effect estimation from observational data is a central problem in causal inference. Methods based on potential outcomes framework solve this problem by exploiting inductive biases and heuristics from causal inference. Each of these methods addresses a specific aspect of causal effect estimation, such as controlling propensity score, enforcing randomization, etc., by designing neural network (NN) architectures and regularizers. In this paper, we propose an adaptive method called Neurosymbolic Causal Effect Estimator (NESTER), a generalized method for causal effect estimation. NESTER integrates the ideas used in existing methods based on multi-head NNs for causal effect estimation into one framework. We design a Domain Specific Language (DSL) tailored for causal effect estimation based on causal inductive biases used in literature. We conduct a theoretical analysis to investigate NESTER's efficacy in estimating causal effects. Our comprehensive empirical results show that NESTER performs better than state-of-the-art methods on benchmark datasets. 
\ No newline at end of file diff --git a/data/2024/aaai/NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement b/data/2024/aaai/NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement new file mode 100644 index 0000000000..f7bbf184df --- /dev/null +++ b/data/2024/aaai/NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement @@ -0,0 +1 @@ +3D lookup tables (3D LUTs) are a key component for image enhancement. Modern image signal processors (ISPs) have dedicated support for these as part of the camera rendering pipeline. Cameras typically provide multiple options for picture styles, where each style is usually obtained by applying a unique handcrafted 3D LUT. Current approaches for learning and applying 3D LUTs are notably fast, yet not so memory-efficient, as storing multiple 3D LUTs is required. For this reason and other implementation limitations, their use on mobile devices is less popular. In this work, we propose a Neural Implicit LUT (NILUT), an implicitly defined continuous 3D color transformation parameterized by a neural network. We show that NILUTs are capable of accurately emulating real 3D LUTs. Moreover, a NILUT can be extended to incorporate multiple styles into a single network with the ability to blend styles implicitly. Our novel approach is memory-efficient, controllable and can complement previous methods, including learned ISPs. Code at https://github.com/mv-lab/nilut \ No newline at end of file diff --git a/data/2024/aaai/NN-Steiner: A Mixed Neural-Algorithmic Approach for the Rectilinear Steiner Minimum Tree Problem b/data/2024/aaai/NN-Steiner: A Mixed Neural-Algorithmic Approach for the Rectilinear Steiner Minimum Tree Problem new file mode 100644 index 0000000000..8e4b57c006 --- /dev/null +++ b/data/2024/aaai/NN-Steiner: A Mixed Neural-Algorithmic Approach for the Rectilinear Steiner Minimum Tree Problem @@ -0,0 +1 @@ +Recent years have witnessed rapid advances in the use of neural networks to solve combinatorial optimization problems. Nevertheless, designing the "right" neural model that can effectively handle a given optimization problem can be challenging, and often there is no theoretical understanding or justification of the resulting neural model. In this paper, we focus on the rectilinear Steiner minimum tree (RSMT) problem, which is of critical importance in IC layout design and as a result has attracted numerous heuristic approaches in the VLSI literature. Our contributions are two-fold. On the methodology front, we propose NN-Steiner which is a novel mixed neural-algorithmic framework for computing RSMTs that leverages the celebrated PTAS algorithmic framework of Arora to solve this problem (and other geometric optimization problems). Our NN-Steiner replaces key algorithmic components within Arora's PTAS by suitable neural components. In particular, NN-Steiner only needs four neural network (NN) components that are called repeatedly within an algorithmic framework. Crucially, each of the four NN components is only of bounded size independent of input size, and thus easy to train. Furthermore, as the NN component is learning a generic algorithmic step, once learned, the resulting mixed neural-algorithmic framework generalizes to much larger instances not seen in training. Our NN-Steiner, to our best knowledge, is the first neural architecture of bounded size that has capacity to approximately solve RSMT (and variants). 
On the empirical front, we show how NN-Steiner can be implemented and demonstrate the effectiveness of our resulting approach, especially in terms of generalization, by comparing with state-of-the-art methods (both neural and non-neural based). \ No newline at end of file diff --git a/data/2024/aaai/NaMa: Neighbor-Aware Multi-Modal Adaptive Learning for Prostate Tumor Segmentation on Anisotropic MR Images b/data/2024/aaai/NaMa: Neighbor-Aware Multi-Modal Adaptive Learning for Prostate Tumor Segmentation on Anisotropic MR Images new file mode 100644 index 0000000000..cfa3dd1c74 --- /dev/null +++ b/data/2024/aaai/NaMa: Neighbor-Aware Multi-Modal Adaptive Learning for Prostate Tumor Segmentation on Anisotropic MR Images @@ -0,0 +1 @@ +Accurate segmentation of prostate tumors from multi-modal magnetic resonance (MR) images is crucial for diagnosis and treatment of prostate cancer. However, the robustness of existing segmentation methods is limited, mainly because these methods 1) fail to adaptively assess subject-specific information of each MR modality for accurate tumor delineation, and 2) lack effective utilization of inter-slice information across thick slices in MR images to segment tumor as a whole 3D volume. In this work, we propose a two-stage neighbor-aware multi-modal adaptive learning network (NaMa) for accurate prostate tumor segmentation from multi-modal anisotropic MR images. In particular, in the first stage, we apply subject-specific multi-modal fusion in each slice by developing a novel modality-informativeness adaptive learning (MIAL) module for selecting and adaptively fusing informative representation of each modality based on inter-modality correlations. In the second stage, we exploit inter-slice feature correlations to derive volumetric tumor segmentation. Specifically, we first use a Unet variant with sequence layers to coarsely capture slice relationship at a global scale, and further generate an activation map for each slice. Then, we introduce an activation mapping guidance (AMG) module to refine slice-wise representation (via information from adjacent slices) for consistent tumor segmentation across neighboring slices. Besides, during the network training, we further apply a random mask strategy to each MR modality to improve feature representation efficiency. Experiments on both in-house and public (PICAI) multi-modal prostate tumor datasets show that our proposed NaMa performs better than state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/NaRuto: Automatically Acquiring Planning Models from Narrative Texts b/data/2024/aaai/NaRuto: Automatically Acquiring Planning Models from Narrative Texts new file mode 100644 index 0000000000..530c20c3ce --- /dev/null +++ b/data/2024/aaai/NaRuto: Automatically Acquiring Planning Models from Narrative Texts @@ -0,0 +1 @@ +Domain model acquisition has been identified as a bottleneck in the application of planning technology, especially within narrative planning. Learning action models from narrative texts in an automated way is essential to overcome this barrier, but challenging because of the inherent complexities of such texts. We present an evaluation of planning domain models derived from narrative texts using our fully automated, unsupervised system, NaRuto. Our system combines structured event extraction, predictions of commonsense event relations, and textual contradictions and similarities. 
Evaluation results show that NaRuto generates domain models of significantly better quality than existing fully automated methods, and sometimes even on par with those created by semi-automated methods with human assistance. \ No newline at end of file diff --git a/data/2024/aaai/NarrativePlay: An Automated System for Crafting Visual Worlds in Novels for Role-Playing b/data/2024/aaai/NarrativePlay: An Automated System for Crafting Visual Worlds in Novels for Role-Playing new file mode 100644 index 0000000000..364c900ff4 --- /dev/null +++ b/data/2024/aaai/NarrativePlay: An Automated System for Crafting Visual Worlds in Novels for Role-Playing @@ -0,0 +1 @@ +In this demo, we present NarrativePlay -- an innovative system enabling users to role-play a fictional character and interact with dynamically generated narrative environments. Unlike existing predefined sandbox approaches, NarrativePlay centres around the main storyline events extracted from the narrative, allowing users to experience the story from the perspective of a character they choose. To design versatile AI agents for diverse scenarios, we employ a framework built on Large Language Models (LLMs) to extract detailed character traits from text. We also incorporate automatically generated visual displays of narrative settings, character portraits, and character speech, greatly enhancing the overall user experience. \ No newline at end of file diff --git a/data/2024/aaai/Natural Strategic Ability in Stochastic Multi-Agent Systems b/data/2024/aaai/Natural Strategic Ability in Stochastic Multi-Agent Systems new file mode 100644 index 0000000000..3f43fb8e93 --- /dev/null +++ b/data/2024/aaai/Natural Strategic Ability in Stochastic Multi-Agent Systems @@ -0,0 +1 @@ +Strategies synthesized using formal methods can be complex and often require infinite memory, which does not correspond to the expected behavior when trying to model Multi-Agent Systems (MAS). To capture such behaviors, natural strategies are a recently proposed framework striking a balance between the ability of agents to strategize with memory and the complexity of the model-checking problem, which until now has been restricted to fully deterministic settings. For the first time, we consider the probabilistic temporal logics PATL and PATL∗ under natural strategies (NatPATL and NatPATL∗). As our main result, we show that, in stochastic MAS, NatPATL model-checking is NP-complete when the active coalition is restricted to deterministic strategies. We also give a 2NEXPTIME complexity result for NatPATL∗ with the same restriction. In the unrestricted case, we give an EXPSPACE complexity result for NatPATL and a 3EXPSPACE complexity result for NatPATL*. \ No newline at end of file diff --git a/data/2024/aaai/NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models b/data/2024/aaai/NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models new file mode 100644 index 0000000000..0493d148d3 --- /dev/null +++ b/data/2024/aaai/NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models @@ -0,0 +1 @@ +Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. Such a trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent.
In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status, and makes the decision to approach the target. Through comprehensive experiments, we demonstrate NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing accurate top-down metric trajectories given the agent's navigation history. Although the performance of NavGPT on zero-shot R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models. Code is available at: https://github.com/GengzeZhou/NavGPT. \ No newline at end of file diff --git a/data/2024/aaai/Navigating Open Set Scenarios for Skeleton-Based Action Recognition b/data/2024/aaai/Navigating Open Set Scenarios for Skeleton-Based Action Recognition new file mode 100644 index 0000000000..1f638da71e --- /dev/null +++ b/data/2024/aaai/Navigating Open Set Scenarios for Skeleton-Based Action Recognition @@ -0,0 +1 @@ +In real-world scenarios, human actions often fall outside the distribution of training data, making it crucial for models to recognize known actions and reject unknown ones. However, using pure skeleton data in such open-set conditions poses challenges due to the lack of visual background cues and the distinct sparse structure of body pose sequences. In this paper, we tackle the unexplored Open-Set Skeleton-based Action Recognition (OS-SAR) task and formalize the benchmark on three skeleton-based datasets. We assess the performance of seven established open-set approaches on our task and identify their limits and critical generalization issues when dealing with skeleton information. To address these challenges, we propose a distance-based cross-modality ensemble method that leverages the cross-modal alignment of skeleton joints, bones, and velocities to achieve superior open-set recognition performance. We refer to the key idea as CrossMax - an approach that utilizes a novel cross-modality mean max discrepancy suppression mechanism to align latent spaces during training and a cross-modality distance-based logits refinement method during testing. CrossMax outperforms existing approaches and consistently yields state-of-the-art results across all datasets and backbones. We will release the benchmark, code, and models to the community.
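To convey the flavor of a distance-based cross-modality ensemble for open-set rejection, here is a simplified sketch: it is not the exact CrossMax discrepancy suppression or logits refinement, and the class-prototype representation, modality names, and threshold are assumptions made purely for illustration. A test sample is scored by its distance to the nearest class prototype in each skeleton modality and rejected as an unknown action when the averaged distance is too large.

import numpy as np

def open_set_predict(embeddings, prototypes, threshold):
    """embeddings: dict modality -> (d,) test embedding.
    prototypes: dict modality -> (num_classes, d) class prototypes.
    Returns (predicted class index, or -1 for unknown; averaged min distance)."""
    per_modality_dists, votes = [], []
    for m, z in embeddings.items():
        d = np.linalg.norm(prototypes[m] - z, axis=1)  # distance to every class prototype
        per_modality_dists.append(d.min())
        votes.append(int(d.argmin()))
    avg_min_dist = float(np.mean(per_modality_dists))
    if avg_min_dist > threshold:
        return -1, avg_min_dist                        # reject as an unknown action
    pred = max(set(votes), key=votes.count)            # majority vote across modalities
    return pred, avg_min_dist

# Toy usage: a sample close to the prototypes of class 2 in all three modalities.
rng = np.random.default_rng(1)
prototypes = {m: rng.normal(size=(5, 16)) for m in ("joint", "bone", "velocity")}
test = {m: prototypes[m][2] + 0.05 * rng.normal(size=16) for m in prototypes}
print(open_set_predict(test, prototypes, threshold=1.0))

The point of averaging across joints, bones, and velocities is that a sample must look familiar to all three skeleton views before it is accepted as a known action, which is the same intuition the CrossMax ensemble formalizes.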
\ No newline at end of file diff --git a/data/2024/aaai/Navigating Real-World Partial Label Learning: Unveiling Fine-Grained Images with Attributes b/data/2024/aaai/Navigating Real-World Partial Label Learning: Unveiling Fine-Grained Images with Attributes new file mode 100644 index 0000000000..bc1e206d66 --- /dev/null +++ b/data/2024/aaai/Navigating Real-World Partial Label Learning: Unveiling Fine-Grained Images with Attributes @@ -0,0 +1 @@ +Partial label learning (PLL), a significant research area, addresses the challenge of annotating each sample with a candidate label set containing the true label when obtaining accurate labels is infeasible. However, existing PLL methods often rely on generic datasets like CIFAR, where annotators can readily differentiate candidate labels and are unlikely to confuse them, making such datasets less realistic for real-world partial label applications. In response, our research focuses on a rarely studied problem, PLL on fine-grained images with attributes, and we propose a novel framework called Shared to Learn, Distinct to Disambiguate (SoDisam). Within the candidate label set, the categories may exhibit numerous shared attribute features, posing a challenge in accurately distinguishing them. Rather than perceiving it as an impediment, we capitalize on these shared attributes as definitive sources of supervision. This insight guides us to learn attribute space visual representation to focus on the information from these shared attributes. Moreover, we introduce an attribute attention mechanism tailored to harness the remaining distinct attributes. This mechanism directs the originally holistic feature towards specific regions, capturing corresponding discriminative features. In addition, a dynamic disambiguation module is introduced, continuously adjusting the two aforementioned mechanisms to achieve the final disambiguation. Extensive experiments demonstrate the effectiveness of our approach on fine-grained partial label datasets. The proposed SoDisam framework not only addresses the challenges associated with fine-grained partial label learning but also provides a more realistic representation of real-world partial label scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Navigating Uncertainty in Epidemic Contexts with Reinforcement Learning b/data/2024/aaai/Navigating Uncertainty in Epidemic Contexts with Reinforcement Learning new file mode 100644 index 0000000000..6a5c31767a --- /dev/null +++ b/data/2024/aaai/Navigating Uncertainty in Epidemic Contexts with Reinforcement Learning @@ -0,0 +1 @@ +My research integrates stochastic epidemic models with reinforcement learning to develop effective strategies or policies to inform operational decisions. The objective is to refine policies that are attuned to diverse outbreak dynamics and to offer a tool for informed planning in real-world settings. \ No newline at end of file diff --git a/data/2024/aaai/NeBLa: Neural Beer-Lambert for 3D Reconstruction of Oral Structures from Panoramic Radiographs b/data/2024/aaai/NeBLa: Neural Beer-Lambert for 3D Reconstruction of Oral Structures from Panoramic Radiographs new file mode 100644 index 0000000000..a6a8d54324 --- /dev/null +++ b/data/2024/aaai/NeBLa: Neural Beer-Lambert for 3D Reconstruction of Oral Structures from Panoramic Radiographs @@ -0,0 +1 @@ +Panoramic radiography (Panoramic X-ray, PX) is a widely used imaging modality for dental examination. However, PX only provides a flattened 2D image, lacking a 3D view of the oral structure.
In this paper, we propose NeBLa (Neural Beer-Lambert) to estimate 3D oral structures from real-world PX. NeBLa tackles full 3D reconstruction for varying subjects (patients) where each reconstruction is based only on a single panoramic image. We create an intermediate representation called simulated PX (SimPX) from 3D Cone-beam computed tomography (CBCT) data based on the Beer-Lambert law of X-ray rendering and rotational principles of PX imaging. SimPX aims not only to faithfully simulate PX but also to facilitate the reverting process back to 3D data. We propose a novel neural model based on ray tracing which exploits both global and local input features to convert SimPX to 3D output. At inference, a real PX image is translated to a SimPX-style image with semantic regularization, and the translated image is processed by a generation module to produce high-quality outputs. Experiments show that NeBLa outperforms prior state-of-the-art methods in reconstruction tasks both quantitatively and qualitatively. Unlike prior methods, NeBLa does not require any prior information such as the shape of dental arches, nor a matched PX-CBCT dataset for training, which is difficult to obtain in clinical practice. Our code is available at https://github.com/sihwa-park/nebla. \ No newline at end of file diff --git a/data/2024/aaai/NeRF-LiDAR: Generating Realistic LiDAR Point Clouds with Neural Radiance Fields b/data/2024/aaai/NeRF-LiDAR: Generating Realistic LiDAR Point Clouds with Neural Radiance Fields new file mode 100644 index 0000000000..678879b9f0 --- /dev/null +++ b/data/2024/aaai/NeRF-LiDAR: Generating Realistic LiDAR Point Clouds with Neural Radiance Fields @@ -0,0 +1,2 @@ +Labelling LiDAR point clouds for training autonomous driving systems is extremely expensive and difficult. LiDAR simulation aims at generating realistic LiDAR data with labels for training and verifying self-driving algorithms more efficiently. Recently, Neural Radiance Fields (NeRF) have been proposed for novel view synthesis using implicit reconstruction of 3D scenes. Inspired by this, we present NeRF-LiDAR, a novel LiDAR simulation method that leverages real-world information to generate realistic LiDAR point clouds. Different from existing LiDAR simulators, we use real images and point cloud data collected by self-driving cars to learn the 3D scene representation, point cloud generation and label rendering. We verify the effectiveness of our NeRF-LiDAR by training different 3D segmentation models on the generated LiDAR point clouds. +It reveals that the trained models are able to achieve similar accuracy when compared with the same model trained on the real LiDAR data. Besides, the generated data is capable of boosting the accuracy through pre-training, which helps reduce the requirements of the real labeled data. Code is available at https://github.com/fudan-zvg/NeRF-LiDAR \ No newline at end of file diff --git a/data/2024/aaai/NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning b/data/2024/aaai/NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning new file mode 100644 index 0000000000..21389a881d --- /dev/null +++ b/data/2024/aaai/NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have garnered remarkable success in novel view synthesis. Nonetheless, the task of generating high-quality images for novel views persists as a critical challenge.
While the existing efforts have exhibited commendable progress, capturing intricate details, enhancing textures, and achieving superior Peak Signal-to-Noise Ratio (PSNR) metrics warrant further focused attention and advancement. In this work, we propose NeRF-VPT, an innovative method for novel view synthesis to address these challenges. Our proposed NeRF-VPT employs a cascading view prompt tuning paradigm, wherein RGB information gained from preceding rendering outcomes serves as instructive visual prompts for subsequent rendering stages, with the aspiration that the prior knowledge embedded in the prompts can facilitate the gradual enhancement of rendered image quality. NeRF-VPT only requires sampling RGB data from previous stage renderings as priors at each training stage, without relying on extra guidance or complex techniques. Thus, our NeRF-VPT is plug-and-play and can be readily integrated into existing methods. By conducting comparative analyses of our NeRF-VPT against several NeRF-based approaches on demanding real-scene benchmarks, such as Realistic Synthetic 360, Real Forward-Facing, the Replica dataset, and a user-captured dataset, we substantiate that our NeRF-VPT significantly elevates baseline performance and proficiently generates more high-quality novel view images than all the compared state-of-the-art methods. Furthermore, the cascading learning of NeRF-VPT introduces adaptability to scenarios with sparse inputs, resulting in a significant enhancement of accuracy for sparse-view novel view synthesis. The source code and dataset are available at https://github.com/Freedomcls/NeRF-VPT. \ No newline at end of file diff --git a/data/2024/aaai/NeRFail: Neural Radiance Fields-Based Multiview Adversarial Attack b/data/2024/aaai/NeRFail: Neural Radiance Fields-Based Multiview Adversarial Attack new file mode 100644 index 0000000000..3136aa62ef --- /dev/null +++ b/data/2024/aaai/NeRFail: Neural Radiance Fields-Based Multiview Adversarial Attack @@ -0,0 +1 @@ +Adversarial attacks, i.e., generating adversarial perturbations with a small magnitude to deceive deep neural networks, are important for investigating and improving model trustworthiness. Traditionally, the topic was scoped within 2D images without considering 3D multiview information. Benefiting from Neural Radiance Fields (NeRF), one can easily reconstruct a 3D scene with a Multi-Layer Perceptron (MLP) from given 2D views and synthesize photo-realistic renderings of novel vantage points. This opens up a door to attacking a multiview NeRF network and its downstream tasks from different rendering angles, which we denote Neural Radiance Fields-based multiview adversarial Attack (NeRFail). The goal is, given one scene and a subset of views, to deceive the recognition results at unknown view angles as well as at the given views. To do so, we propose a transformation mapping from pixels to 3D points such that our attack generates multiview adversarial perturbations by attacking a subset of images with different views, intending to prevent the downstream classifier from correctly predicting images rendered by NeRF from other views. Experiments show that our multiview adversarial perturbations successfully obfuscate the downstream classifier at both known and unknown views. Notably, when retraining another NeRF on the perturbed training data, we show that the perturbation can be inherited and reproduced. The code can be found at https://github.com/jiang-wenxiang/NeRFail.
\ No newline at end of file diff --git a/data/2024/aaai/NeSyFOLD: A Framework for Interpretable Image Classification b/data/2024/aaai/NeSyFOLD: A Framework for Interpretable Image Classification new file mode 100644 index 0000000000..7a2361e097 --- /dev/null +++ b/data/2024/aaai/NeSyFOLD: A Framework for Interpretable Image Classification @@ -0,0 +1,26 @@ +Deep learning models such as CNNs have surpassed human performance in computer vision tasks such as image classification. However, despite their sophistication, these models lack interpretability, which can lead to biased outcomes reflecting existing prejudices in the data. We aim to make predictions made by a CNN interpretable. Hence, we present a novel framework called NeSyFOLD to create a neurosymbolic (NeSy) model for image classification tasks. The model is a CNN with all layers following the last convolutional layer replaced by a stratified answer set program (ASP) derived from the last-layer kernels. The answer set program can be viewed as a rule-set, wherein the truth value of each predicate depends on the activation of the corresponding kernel in the CNN. The rule-set serves as a global explanation for the model and is interpretable. We also use our NeSyFOLD framework with a CNN that is trained using a sparse kernel learning technique called Elite BackProp (EBP). This leads to a significant reduction in rule-set size without compromising accuracy or fidelity, thus improving the scalability of the NeSy model and the interpretability of its rule-set. Evaluation is done on datasets with varied complexity and sizes. We also propose a novel algorithm for labelling the predicates in the rule-set with meaningful semantic concept(s) learnt by the CNN. We evaluate the performance of our “semantic labelling algorithm” to quantify the efficacy of the semantic labelling for both the NeSy model and the NeSy-EBP model. \ No newline at end of file diff --git a/data/2024/aaai/Near-Optimal Resilient Aggregation Rules for Distributed Learning Using 1-Center and 1-Mean Clustering with Outliers b/data/2024/aaai/Near-Optimal Resilient Aggregation Rules for Distributed Learning Using 1-Center and 1-Mean Clustering with Outliers new file mode 100644 index 0000000000..580db2cca1 --- /dev/null +++ b/data/2024/aaai/Near-Optimal Resilient Aggregation Rules for Distributed Learning Using 1-Center and 1-Mean Clustering with Outliers @@ -0,0 +1 @@ +Byzantine machine learning has garnered considerable attention in light of the unpredictable faults that can occur in large-scale distributed learning systems. The key to securing resilience against Byzantine machines in distributed learning is resilient aggregation mechanisms. Although abundant resilient aggregation rules have been proposed, they are designed in ad-hoc manners, imposing extra barriers on comparing, analyzing, and improving the rules across performance criteria. This paper studies near-optimal aggregation rules using clustering in the presence of outliers. Our outlier-robust clustering approach utilizes geometric properties of the update vectors provided by workers. Our analysis shows that constant approximations to the 1-center and 1-mean clustering problems with outliers provide near-optimal resilient aggregators for metric-based criteria, which have been proven to be crucial in the homogeneous and heterogeneous cases respectively.
In addition, we discuss two conflicting types of attacks under which no single aggregation rule is guaranteed to improve upon the naive average. Based on the discussion, we propose a two-phase resilient aggregation framework. We run experiments for image classification using a non-convex loss function. The proposed algorithms outperform previously known aggregation rules by a large margin with both homogeneous and heterogeneous data distributions among non-faulty workers. Code and appendix are available at https://github.com/jerry907/AAAI24-RASHB. \ No newline at end of file diff --git a/data/2024/aaai/Nearly Equitable Allocations beyond Additivity and Monotonicity b/data/2024/aaai/Nearly Equitable Allocations beyond Additivity and Monotonicity new file mode 100644 index 0000000000..e525f8420d --- /dev/null +++ b/data/2024/aaai/Nearly Equitable Allocations beyond Additivity and Monotonicity @@ -0,0 +1,3 @@ +Equitability (EQ) in fair division requires that items be allocated such that all agents value the bundle they receive equally. With indivisible items, an equitable allocation may not exist, and hence we instead consider a meaningful analog, EQx, that requires equitability up to any item. EQx allocations exist for monotone, additive valuations. However, if (1) the agents' valuations are not additive or (2) the set of indivisible items includes both goods and chores (positively and negatively valued items), then prior to the current work it was not known whether EQx allocations exist or not. + +We study both the existence and efficient computation of EQx allocations. (1) For monotone valuations (not necessarily additive), we show that EQx allocations always exist. Also, for the large class of weakly well-layered valuations, EQx allocations can be found in polynomial time. Further, we prove that approximately EQx allocations can be computed efficiently under general monotone valuations. (2) For non-monotone valuations, we show that an EQx allocation may not exist, even for two agents with additive valuations. Under some special cases, however, we show existence and efficient computability of EQx allocations. This includes the case of two agents with additive valuations where each item is either a good or a chore, and there are no mixed items. \ No newline at end of file diff --git a/data/2024/aaai/NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-world Video Super-Resolution b/data/2024/aaai/NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-world Video Super-Resolution new file mode 100644 index 0000000000..5b2a9b7b75 --- /dev/null +++ b/data/2024/aaai/NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-world Video Super-Resolution @@ -0,0 +1 @@ +The capability of video super-resolution (VSR) to synthesize high-resolution (HR) video from ideal datasets has been demonstrated in many works. However, applying the VSR model to real-world video with unknown and complex degradation remains a challenging task. First, existing degradation metrics in most VSR methods are not able to effectively simulate real-world noise and blur. Instead, simple combinations of classical degradations are used for real-world noise modeling, which often leaves the VSR model vulnerable to out-of-distribution noise. Second, many SR models focus on noise simulation and transfer. Nevertheless, the sampled noise is monotonous and limited.
To address the aforementioned problems, we propose a Negatives augmentation strategy for generalized noise modeling in the Video Super-Resolution (NegVSR) task. Specifically, we first propose sequential noise generation toward real-world data to extract practical noise sequences. Then, the degeneration domain is widely expanded by negative augmentation to build up various yet challenging real-world noise sets. We further propose the augmented negative guidance loss to learn robust features among augmented negatives effectively. Extensive experiments on real-world datasets (e.g., VideoLQ and FLIR) show that our method outperforms state-of-the-art methods with clear margins, especially in visual quality. Project page is available at: https://negvsr.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/Negative Pre-aware for Noisy Cross-Modal Matching b/data/2024/aaai/Negative Pre-aware for Noisy Cross-Modal Matching new file mode 100644 index 0000000000..f90d1bb22e --- /dev/null +++ b/data/2024/aaai/Negative Pre-aware for Noisy Cross-Modal Matching @@ -0,0 +1 @@ +Cross-modal noise-robust learning is a challenging task since noisy correspondence is hard to recognize and rectify. Due to the cumulative and unavoidable negative impact of unresolved noise, existing methods cannot maintain a stable performance when the noise increases. In this paper, we present a novel Negative Pre-aware Cross-modal (NPC) matching solution for large visual-language model fine-tuning on noisy downstream tasks. It features two key aspects: (1) For noise recognition and resistance, previous methods usually directly filter out a noise subset, whereas we propose to estimate the negative impact of each sample. It does not need additional correction mechanisms that may predict unreliable correction results, leading to self-reinforcing errors. We assign a confidence weight to each sample according to its negative impact in the training process. This adaptively adjusts the contribution of each sample to avoid noise accumulation. (2) For maintaining stable performance with increasing noise, we utilize the memorization effect of DNNs by maintaining a memory bank. Specifically, we apply GMM to select high-confidence clean samples as memory entries, where the memory entries are used to estimate the negative impact of each sample. Since clean samples are more easily distinguished by GMM with increasing noise, the memory bank can still maintain high quality at a high noise ratio. Compared to correction mechanisms focusing on noise samples, memory bank-based estimation is more robust, which makes the model performance stable on noisy datasets. Extensive experiments demonstrate that our method significantly improves matching accuracy and performance stability at increasing noise ratios. Our approach also surpasses the state-of-the-art methods by a large margin. The code is available at: https://github.com/ZhangXu0963/NPC.
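As a rough illustration of the GMM-based memory-bank construction described in the NPC abstract above, the sketch below fits a two-component Gaussian mixture to per-sample matching losses and keeps high-confidence clean samples; the loss input and the threshold are assumptions for illustration, not the authors' exact procedure.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def select_clean_samples(per_sample_loss, threshold=0.9):
        """per_sample_loss: 1-D array of matching losses, one per image-text pair."""
        losses = np.asarray(per_sample_loss, dtype=np.float64).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
        clean_component = int(np.argmin(gmm.means_.ravel()))      # low-loss mode = clean
        p_clean = gmm.predict_proba(losses)[:, clean_component]
        memory_bank_idx = np.where(p_clean > threshold)[0]        # high-confidence clean entries
        return memory_bank_idx, p_clean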
\ No newline at end of file diff --git a/data/2024/aaai/Neighborhood-Enhanced 3D Human Pose Estimation with Monocular LiDAR in Long-Range Outdoor Scenes b/data/2024/aaai/Neighborhood-Enhanced 3D Human Pose Estimation with Monocular LiDAR in Long-Range Outdoor Scenes new file mode 100644 index 0000000000..0a7e4084d1 --- /dev/null +++ b/data/2024/aaai/Neighborhood-Enhanced 3D Human Pose Estimation with Monocular LiDAR in Long-Range Outdoor Scenes @@ -0,0 +1 @@ +3D human pose estimation (3HPE) in large-scale outdoor scenes using commercial LiDAR has attracted significant attention due to its potential for real-life applications. However, existing LiDAR-based methods for 3HPE primarily rely on recovering 3D human poses from individual point clouds, and the coherence cues present in the neighborhood are not sufficiently harnessed. In this work, we explore spatial and contextual coherence cues contained in the neighborhood that lead to great performance improvements in 3HPE. Specifically, firstly, we deeply investigate the 3D neighbor in the background (3BN), which serves as a spatial coherence cue for inferring reliable motion, since it provides physical laws to limit motion targets. Secondly, we introduce a novel 3D scanning neighbor (3SN) generated during the data collection, and 3SN implies structural edge coherence cues. We use 3SN to overcome the degradation of performance and data quality caused by the sparsity-varying properties of LiDAR point clouds. In order to effectively model the complementarity between these distinct cues and build consistent temporal relationships across human motions, we propose a new transformer-based module called the CoherenceFuse module. Extensive experiments conducted on publicly available datasets, namely LidarHuman26M, CIMI4D, SLOPER4D and Waymo Open Dataset v2.0, showcase the superiority and effectiveness of our proposed method. In particular, when compared with LidarCap on the LidarHuman26M dataset, our method demonstrates a reduction of 7.08mm in the average MPJPE metric, along with a decrease of 16.55mm in the MPJPE metric for distances exceeding 25 meters. The code and models are available at https://github.com/jingyi-zhang/Neighborhood-enhanced-LidarCap. \ No newline at end of file diff --git a/data/2024/aaai/NeuSurf: On-Surface Priors for Neural Surface Reconstruction from Sparse Input Views b/data/2024/aaai/NeuSurf: On-Surface Priors for Neural Surface Reconstruction from Sparse Input Views new file mode 100644 index 0000000000..6584b934f5 --- /dev/null +++ b/data/2024/aaai/NeuSurf: On-Surface Priors for Neural Surface Reconstruction from Sparse Input Views @@ -0,0 +1 @@ +Recently, neural implicit functions have demonstrated remarkable results in the field of multi-view reconstruction. However, most existing methods are tailored for dense views and exhibit unsatisfactory performance when dealing with sparse views. Several recent methods have been proposed for generalizing implicit reconstruction to address the sparse view reconstruction task, but they still suffer from high training costs and are merely valid under carefully selected perspectives. In this paper, we propose a novel sparse view reconstruction framework that leverages on-surface priors to achieve highly faithful surface reconstruction. Specifically, we design several constraints on global geometry alignment and local geometry refinement for jointly optimizing coarse shapes and fine details.
To achieve this, we train a neural network to learn a global implicit field from the on-surface points obtained from SfM and then leverage it as a coarse geometric constraint. To exploit local geometric consistency, we project on-surface points onto seen and unseen views, treating the consistency loss of projected features as a fine geometric constraint. The experimental results on the DTU and BlendedMVS datasets in two prevalent sparse settings demonstrate significant improvements over the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Neural Amortized Inference for Nested Multi-Agent Reasoning b/data/2024/aaai/Neural Amortized Inference for Nested Multi-Agent Reasoning new file mode 100644 index 0000000000..e12b158494 --- /dev/null +++ b/data/2024/aaai/Neural Amortized Inference for Nested Multi-Agent Reasoning @@ -0,0 +1 @@ +Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. In contrast, humans effortlessly perform complex social inferences as part of their daily lives. To bridge the gap between human-like inference capabilities and computational limitations, we propose a novel approach: leveraging neural networks to amortize high-order social inference, thereby expediting nested multi-agent reasoning. We evaluate our method in two challenging multi-agent interaction domains. The experimental results demonstrate that our method is computationally efficient while exhibiting minimal degradation in accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Neural Bookmarks: Information Retrieval with Deep Learning and EEG Data b/data/2024/aaai/Neural Bookmarks: Information Retrieval with Deep Learning and EEG Data new file mode 100644 index 0000000000..9543d54ef9 --- /dev/null +++ b/data/2024/aaai/Neural Bookmarks: Information Retrieval with Deep Learning and EEG Data @@ -0,0 +1 @@ +In neural memory decoding, a concept being mentally recalled is identified using brain data. Recently, the feasibility of neural memory decoding with EEG data has been demonstrated. Here we propose a new application – neural information retrieval – that uses neural memory decoding to allow a document to be retrieved merely by thinking about it. In this paper we describe neural memory decoding, define the application of neural information retrieval, present experimental results related to the practicality of the application, and discuss issues of deployment and data privacy. \ No newline at end of file diff --git a/data/2024/aaai/Neural Causal Abstractions b/data/2024/aaai/Neural Causal Abstractions new file mode 100644 index 0000000000..d418ef7474 --- /dev/null +++ b/data/2024/aaai/Neural Causal Abstractions @@ -0,0 +1 @@ +The ability of humans to understand the world in terms of cause and effect relationships, as well as their ability to compress information into abstract concepts, are two hallmark features of human intelligence. These two topics have been studied in tandem under the theory of causal abstractions, but it is an open problem how to best leverage abstraction theory in real-world causal inference tasks, where the true model is not known, and limited data is available in most practical settings.
In this paper, we focus on a family of causal abstractions constructed by clustering variables and their domains, redefining abstractions to be amenable to individual causal distributions. We show that such abstractions can be learned in practice using Neural Causal Models, allowing us to utilize the deep learning toolkit to solve causal tasks (identification, estimation, sampling) at different levels of abstraction granularity. Finally, we show how representation learning can be used to learn abstractions, which we apply in our experiments to scale causal inferences to high-dimensional settings such as image data. \ No newline at end of file diff --git a/data/2024/aaai/Neural Closure Certificates b/data/2024/aaai/Neural Closure Certificates new file mode 100644 index 0000000000..08a1e92483 --- /dev/null +++ b/data/2024/aaai/Neural Closure Certificates @@ -0,0 +1,10 @@ +Notions of transition invariants and closure certificates have seen recent use in the formal verification of controlled dynamical systems against \omega-regular properties. +Unfortunately, existing approaches face limitations in two directions. +First, they require a closed-form mathematical expression representing the model of the system. +Such an expression may be difficult to find, too complex to be of any use, or unavailable due to security or privacy constraints. +Second, finding such invariants typically relies on optimization techniques such as sum-of-squares (SOS) or satisfiability modulo theory (SMT) solvers. +This restricts the classes of systems that can be formally verified. +To address these drawbacks, we introduce a notion of neural closure certificates. +We present a data-driven algorithm that trains a neural network to represent a closure certificate. +Our approach is formally correct under some mild assumptions, i.e., one is able to formally show that the unknown system satisfies the \omega-regular property of interest if a neural closure certificate can be computed. +Finally, we demonstrate the efficacy of our approach with relevant case studies. \ No newline at end of file diff --git a/data/2024/aaai/Neural Gaussian Similarity Modeling for Differential Graph Structure Learning b/data/2024/aaai/Neural Gaussian Similarity Modeling for Differential Graph Structure Learning new file mode 100644 index 0000000000..60de54faa5 --- /dev/null +++ b/data/2024/aaai/Neural Gaussian Similarity Modeling for Differential Graph Structure Learning @@ -0,0 +1 @@ +Graph Structure Learning (GSL) has demonstrated considerable potential in the analysis of graph-unknown non-Euclidean data across a wide range of domains. However, constructing an end-to-end graph structure learning model poses a challenge due to the impediment of gradient flow caused by the nearest neighbor sampling strategy. In this paper, we construct a differential graph structure learning model by replacing the non-differentiable nearest neighbor sampling with differentiable sampling using the reparameterization trick. Under this framework, we argue that the act of sampling nearest neighbors may not invariably be essential, particularly in instances where node features exhibit a significant degree of similarity. To alleviate this issue, the bell-shaped Gaussian Similarity (GauSim) modeling is proposed to sample non-nearest neighbors. To adaptively model the similarity, we further propose Neural Gaussian Similarity (NeuralGauSim) with learnable parameters, featuring flexible sampling behaviors.
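The following sketch illustrates the kind of bell-shaped, learnable Gaussian similarity with reparameterized (Gumbel-Softmax) edge sampling that the NeuralGauSim description above refers to; the parameterization and hyperparameters are assumptions for illustration, not the paper's exact model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GaussianSimilaritySampler(nn.Module):
        """Bell-shaped similarity over pairwise distances with a learnable centre and
        width, followed by relaxed Bernoulli edge sampling (illustrative only)."""
        def __init__(self, init_mu=1.0, init_sigma=1.0, tau=0.5):
            super().__init__()
            self.mu = nn.Parameter(torch.tensor(float(init_mu)))
            self.log_sigma = nn.Parameter(torch.tensor(float(init_sigma)).log())
            self.tau = tau

        def forward(self, x):                        # x: (n, d) node features
            dist = torch.cdist(x, x)                 # pairwise Euclidean distances
            sigma = self.log_sigma.exp()
            sim = torch.exp(-((dist - self.mu) ** 2) / (2 * sigma ** 2))  # bell-shaped similarity
            sim = sim.clamp(1e-6, 1 - 1e-6)
            logits = torch.stack([sim.log(), (1 - sim).log()], dim=-1)    # Bernoulli logits per edge
            edges = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 0]
            return edges                             # (n, n) differentiable, near-binary adjacency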
In addition, we develop a scalable method by transferring the large-scale graph to the transition graph to significantly reduce the complexity. Experimental results demonstrate the effectiveness of the proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/Neural Network Approximation for Pessimistic Offline Reinforcement Learning b/data/2024/aaai/Neural Network Approximation for Pessimistic Offline Reinforcement Learning new file mode 100644 index 0000000000..b259377094 --- /dev/null +++ b/data/2024/aaai/Neural Network Approximation for Pessimistic Offline Reinforcement Learning @@ -0,0 +1 @@ +Deep reinforcement learning (RL) has shown remarkable success in specific offline decision-making scenarios, yet its theoretical guarantees are still under development. Existing works on offline RL theory primarily emphasize a few trivial settings, such as linear MDP or general function approximation with strong assumptions and independent data, which lack guidance for practical use. The coupling of deep learning and Bellman residuals makes this problem challenging, in addition to the difficulty of data dependence. In this paper, we establish a non-asymptotic estimation error of pessimistic offline RL using general neural network approximation with C-mixing data regarding the structure of networks, the dimension of datasets, and the concentrability of data coverage, under mild assumptions. Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate on the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight. This result demonstrates the explicit efficiency of deep adversarial offline RL frameworks. We utilize the empirical process tool for C-mixing sequences and the neural network approximation theory for the Holder class to achieve this. We also develop methods to bound the Bellman estimation error caused by function approximation with empirical Bellman constraint perturbations. Additionally, we present a result that lessens the curse of dimensionality using data with low intrinsic dimensionality and function classes with low complexity. Our estimation provides valuable insights into the development of deep offline RL and guidance for algorithm model design. \ No newline at end of file diff --git a/data/2024/aaai/Neural Network Approximators for Marginal MAP in Probabilistic Circuits b/data/2024/aaai/Neural Network Approximators for Marginal MAP in Probabilistic Circuits new file mode 100644 index 0000000000..8bc3601182 --- /dev/null +++ b/data/2024/aaai/Neural Network Approximators for Marginal MAP in Probabilistic Circuits @@ -0,0 +1 @@ +Probabilistic circuits (PCs) such as sum-product networks efficiently represent large multi-variate probability distributions. They are preferred in practice over other probabilistic representations, such as Bayesian and Markov networks, because PCs can solve marginal inference (MAR) tasks in time that scales linearly in the size of the network. Unfortunately, the most probable explanation (MPE) task and its generalization, the marginal maximum-a-posteriori (MMAP) inference task remain NP-hard in these models. Inspired by the recent work on using neural networks for generating near-optimal solutions to optimization problems such as integer linear programming, we propose an approach that uses neural networks to approximate MMAP inference in PCs. 
The key idea in our approach is to approximate the cost of an assignment to the query variables using a continuous multilinear function and then use the latter as a loss function. The two main benefits of our new method are that it is self-supervised, and after the neural network is learned, it requires only linear time to output a solution. We evaluate our new approach on several benchmark datasets and show that it outperforms three competing linear time approximations: max-product inference, max-marginal inference, and sequential estimation, which are used in practice to solve MMAP tasks in PCs. \ No newline at end of file diff --git a/data/2024/aaai/Neural Oscillators for Generalization of Physics-Informed Machine Learning b/data/2024/aaai/Neural Oscillators for Generalization of Physics-Informed Machine Learning new file mode 100644 index 0000000000..c9770eca92 --- /dev/null +++ b/data/2024/aaai/Neural Oscillators for Generalization of Physics-Informed Machine Learning @@ -0,0 +1 @@ +A primary challenge of physics-informed machine learning (PIML) is its generalization beyond the training domain, especially when dealing with complex physical problems represented by partial differential equations (PDEs). This paper aims to enhance the generalization capabilities of PIML, facilitating practical, real-world applications where accurate predictions in unexplored regions are crucial. We leverage the inherent causality and temporal sequential characteristics of PDE solutions to fuse PIML models with recurrent neural architectures based on systems of ordinary differential equations, referred to as neural oscillators. Through effectively capturing long-time dependencies and mitigating the exploding and vanishing gradient problem, neural oscillators foster improved generalization in PIML tasks. Extensive experimentation involving time-dependent nonlinear PDEs and biharmonic beam equations demonstrates the efficacy of the proposed approach. Incorporating neural oscillators outperforms existing state-of-the-art methods on benchmark problems across various metrics. Consequently, the proposed method improves the generalization capabilities of PIML, providing accurate solutions for extrapolation and prediction beyond the training data. \ No newline at end of file diff --git a/data/2024/aaai/Neural Physical Simulation with Multi-Resolution Hash Grid Encoding b/data/2024/aaai/Neural Physical Simulation with Multi-Resolution Hash Grid Encoding new file mode 100644 index 0000000000..7f08e1118f --- /dev/null +++ b/data/2024/aaai/Neural Physical Simulation with Multi-Resolution Hash Grid Encoding @@ -0,0 +1 @@ +We explore the generalization of the implicit representation in the physical simulation task. Traditional solvers for time-dependent partial differential equations (PDEs) in physical simulation often adopt a grid or mesh for spatial discretization, which is memory-consuming at high resolution and lacks adaptivity. Many implicit representations, such as local extreme learning machines or SIREN, have been proposed, but they are still too compact, suffering from limited accuracy in handling local details and slow convergence. We contribute a neural simulation framework based on a multi-resolution hash grid representation to introduce hierarchical consideration of global and local information simultaneously.
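For readers unfamiliar with the multi-resolution hash grid representation mentioned just above, here is a heavily simplified sketch in the spirit of Instant-NGP-style encodings; it omits trilinear interpolation and all simulation-specific details, and every name and constant is an illustrative assumption.

    import torch
    import torch.nn as nn

    class HashGridEncoding(nn.Module):
        """Simplified multi-resolution hash grid encoding (nearest-vertex lookup only)."""
        def __init__(self, levels=4, table_size=2**14, feat_dim=2, base_res=16, growth=2.0):
            super().__init__()
            self.table_size = table_size
            self.resolutions = [int(base_res * growth ** l) for l in range(levels)]
            self.tables = nn.Parameter(torch.randn(levels, table_size, feat_dim) * 1e-3)

        def forward(self, x):                               # x: (n, 3) coordinates in [0, 1]
            feats = []
            for level, res in enumerate(self.resolutions):
                idx = (x.clamp(0.0, 1.0) * res).long()      # nearest grid vertex at this level
                h = (idx[:, 0] * 73856093) ^ (idx[:, 1] * 19349663) ^ (idx[:, 2] * 83492791)
                feats.append(self.tables[level][h % self.table_size])
            return torch.cat(feats, dim=-1)                 # concatenated coarse-to-fine features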
Furthermore, we propose two key strategies: 1) a numerical gradient method for computing high-order derivatives with boundary conditions; 2) a range analysis sample method for fast neural geometry boundary sampling with dynamic topologies. Our method shows much higher accuracy and strong flexibility for various simulation problems: e.g., large elastic deformations, complex fluid dynamics, and multi-scale phenomena which remain challenging for existing neural physical solvers. \ No newline at end of file diff --git a/data/2024/aaai/Neural Reasoning about Agents' Goals, Preferences, and Actions b/data/2024/aaai/Neural Reasoning about Agents' Goals, Preferences, and Actions new file mode 100644 index 0000000000..bebf62a873 --- /dev/null +++ b/data/2024/aaai/Neural Reasoning about Agents' Goals, Preferences, and Actions @@ -0,0 +1 @@ +We propose the Intuitive Reasoning Network (IRENE) - a novel neural model for intuitive psychological reasoning about agents' goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks - with up to 48.9% improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks. \ No newline at end of file diff --git a/data/2024/aaai/Neural Time-Reversed Generalized Riccati Equation b/data/2024/aaai/Neural Time-Reversed Generalized Riccati Equation new file mode 100644 index 0000000000..cdf7fd8937 --- /dev/null +++ b/data/2024/aaai/Neural Time-Reversed Generalized Riccati Equation @@ -0,0 +1 @@ +Optimal control deals with optimization problems in which variables steer a dynamical system, and its outcome contributes to the objective function. Two classical approaches to solving these problems are Dynamic Programming and the Pontryagin Maximum Principle. In both approaches, Hamiltonian equations offer an interpretation of optimality through auxiliary variables known as costates. However, Hamiltonian equations are rarely used due to their reliance on forward-backward algorithms across the entire temporal domain. This paper introduces a novel neural-based approach to optimal control. Neural networks are employed not only for implementing state dynamics but also for estimating costate variables. The parameters of the latter network are determined at each time step using a newly introduced local policy referred to as the time-reversed generalized Riccati equation. This policy is inspired by a result discussed in the Linear Quadratic (LQ) problem, which we conjecture stabilizes state dynamics. We support this conjecture by discussing experimental results from a range of optimal control case studies. 
\ No newline at end of file diff --git a/data/2024/aaai/Neuro-Symbolic Integration for Reasoning and Learning on Knowledge Graphs b/data/2024/aaai/Neuro-Symbolic Integration for Reasoning and Learning on Knowledge Graphs new file mode 100644 index 0000000000..b653c8363a --- /dev/null +++ b/data/2024/aaai/Neuro-Symbolic Integration for Reasoning and Learning on Knowledge Graphs @@ -0,0 +1 @@ +The goal of this thesis is to address knowledge graph completion tasks using neuro-symbolic methods. Neuro-symbolic methods allow the joint utilization of symbolic information defined as meta-rules in ontologies and knowledge graph embedding methods that represent entities and relations of the graph in a low-dimensional vector space. This approach has the potential to improve the resolution of knowledge graph completion tasks in terms of reliability, interpretability, data-efficiency and robustness. \ No newline at end of file diff --git a/data/2024/aaai/Neuroevolution of a Multi-Generator GAN (Student Abstract) b/data/2024/aaai/Neuroevolution of a Multi-Generator GAN (Student Abstract) new file mode 100644 index 0000000000..b78d81259a --- /dev/null +++ b/data/2024/aaai/Neuroevolution of a Multi-Generator GAN (Student Abstract) @@ -0,0 +1 @@ +Evolutionary Algorithms (EA) have been leveraged to tackle the challenges faced while using GANs such as mode collapse, vanishing gradient, latent space search, etc. However, the existing techniques of using EA with GANs operate backpropagation and EA in isolation from each other, leaving ample room for further exploration. This paper creates a collaborative bridge between EA and GANs by exploring a neuroevolution method for utilising both EA and backpropagation-based optimisation, simultaneously, for a multi-generator GAN architecture. Experiments conducted using a standard dataset with variants of the proposed method highlight the towering impact of each of the components involved in the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Neuromorphic Event Signal-Driven Network for Video De-raining b/data/2024/aaai/Neuromorphic Event Signal-Driven Network for Video De-raining new file mode 100644 index 0000000000..65e60e9b01 --- /dev/null +++ b/data/2024/aaai/Neuromorphic Event Signal-Driven Network for Video De-raining @@ -0,0 +1 @@ +Convolutional neural networks-based video de-raining methods commonly rely on dense intensity frames captured by CMOS sensors. However, the limited temporal resolution of these sensors hinders the capture of dynamic rainfall information, limiting further improvement in de-raining performance. This study aims to overcome this issue by incorporating the neuromorphic event signal into the video de-raining to enhance the dynamic information perception. Specifically, we first utilize the dynamic information from the event signal as prior knowledge, and integrate it into existing de-raining objectives to better constrain the solution space. We then design an optimization algorithm to solve the objective, and construct a de-raining network with CNNs as the backbone architecture using a modular strategy to mimic the optimization process. To further explore the temporal correlation of the event signal, we incorporate a spiking self-attention module into our network. 
By leveraging the low latency and high temporal resolution of the event signal, along with the spatial and temporal representation capabilities of convolutional and spiking neural networks, our model captures more accurate dynamic information and significantly improves de-raining performance. For example, our network achieves a 1.24dB improvement on the SynHeavy25 dataset compared to the previous state-of-the-art method, while utilizing only 39% of the parameters. \ No newline at end of file diff --git a/data/2024/aaai/New Classes of the Greedy-Applicable Arm Feature Distributions in the Sparse Linear Bandit Problem b/data/2024/aaai/New Classes of the Greedy-Applicable Arm Feature Distributions in the Sparse Linear Bandit Problem new file mode 100644 index 0000000000..62e2e67431 --- /dev/null +++ b/data/2024/aaai/New Classes of the Greedy-Applicable Arm Feature Distributions in the Sparse Linear Bandit Problem @@ -0,0 +1 @@ +We consider the sparse contextual bandit problem where the arm feature affects the reward through the inner product with sparse parameters. Recent studies have developed sparsity-agnostic algorithms based on the greedy arm selection policy. However, the analysis of these algorithms requires strong assumptions on the arm feature distribution to ensure that the greedily selected samples are sufficiently diverse; one of the most common assumptions, relaxed symmetry, imposes approximate origin-symmetry on the distribution, which excludes distributions with origin-asymmetric support. In this paper, we show that the greedy algorithm is applicable to a wider range of arm feature distributions from two aspects. First, we show that a mixture distribution that has a greedy-applicable component is also greedy-applicable. Second, we propose new distribution classes, related to Gaussian mixture, discrete, and radial distributions, for which sample diversity is guaranteed. The proposed classes can describe distributions with origin-asymmetric support and, in conjunction with the first claim, provide theoretical guarantees of the greedy policy for a very wide range of arm feature distributions. \ No newline at end of file diff --git a/data/2024/aaai/NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction b/data/2024/aaai/NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction new file mode 100644 index 0000000000..58ccfe0161 --- /dev/null +++ b/data/2024/aaai/NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction @@ -0,0 +1 @@ +Existing deep-learning-based methods for nighttime video deraining rely on synthetic data due to the absence of real-world paired data. However, the intricacies of the real world, particularly with the presence of light effects and low-light regions affected by noise, create significant domain gaps, hampering synthetic-trained models in removing rain streaks properly and leading to over-saturation and color shifts. Motivated by this, we introduce NightRain, a novel nighttime video deraining method with adaptive-rain-removal and adaptive-correction. Our adaptive-rain-removal uses unlabeled rain videos to enable our model to derain real-world rain videos, particularly in regions affected by complex light effects. The idea is to allow our model to obtain rain-free regions based on the confidence scores. Once rain-free regions and the corresponding regions from our input are obtained, we can have region-based paired real data.
These paired data are used to train our model using a teacher-student framework, allowing the model to iteratively learn from less challenging regions to more challenging regions. Our adaptive-correction aims to rectify errors in our model's predictions, such as over-saturation and color shifts. The idea is to learn from clear night input training videos based on the differences or distance between those input videos and their corresponding predictions. Our model learns from these differences, compelling our model to correct the errors. From extensive experiments, our method demonstrates state-of-the-art performance. It achieves a PSNR of 26.73dB, surpassing existing nighttime video deraining methods by a substantial margin of 13.7%. \ No newline at end of file diff --git a/data/2024/aaai/No Head Left Behind - Multi-Head Alignment Distillation for Transformers b/data/2024/aaai/No Head Left Behind - Multi-Head Alignment Distillation for Transformers new file mode 100644 index 0000000000..98c9934837 --- /dev/null +++ b/data/2024/aaai/No Head Left Behind - Multi-Head Alignment Distillation for Transformers @@ -0,0 +1 @@ +Knowledge distillation aims at reducing model size without compromising much performance. Recent work has applied it to large vision-language (VL) Transformers, and has shown that attention maps in the multi-head attention modules of vision-language Transformers contain extensive intra-modal and cross-modal co-reference relations to be distilled. The standard approach is to apply a one-to-one attention map distillation loss, i.e. the Teacher's first attention head instructs the Student's first head, the second teaches the second, and so forth, but this only works when the numbers of attention heads in the Teacher and Student are the same. To remove this constraint, we propose a new Attention Map Alignment Distillation (AMAD) method for Transformers with multi-head attention, which works for a Teacher and a Student with different numbers of attention heads. Specifically, we soft-align different heads in Teacher and Student attention maps using a cosine similarity weighting. The Teacher head contributes more to the Student heads for which it has a higher similarity weight. Each Teacher head contributes to all the Student heads by minimizing the divergence between the attention activation distributions for the soft-aligned heads. No head is left behind. This distillation approach operates like cross-attention. We experiment on distilling VL-T5 and BLIP, and apply AMAD loss on their T5, BERT, and ViT sub-modules. We show, under vision-language setting, that AMAD outperforms conventional distillation methods on VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned by ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training for BLIP models. 
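To sketch the cosine-weighted, many-to-many head alignment that the AMAD abstract above describes, the snippet below computes a similarity-weighted KL distillation loss between teacher and student attention maps; flattening each map into a single distribution is a simplification, and the exact weighting used in the paper may differ.

    import torch
    import torch.nn.functional as F

    def amad_loss(teacher_attn, student_attn, eps=1e-8):
        """teacher_attn: (Ht, L, L) attention maps; student_attn: (Hs, L, L)."""
        t = teacher_attn.flatten(1)                       # (Ht, L*L)
        s = student_attn.flatten(1)                       # (Hs, L*L)
        w = F.cosine_similarity(t.unsqueeze(1), s.unsqueeze(0), dim=-1)   # (Ht, Hs) head similarity
        w = torch.softmax(w, dim=1)                       # soft alignment of each teacher head over student heads
        log_s = (s + eps).log().unsqueeze(0).expand(t.size(0), -1, -1)    # (Ht, Hs, L*L)
        tgt = (t + eps).unsqueeze(1).expand(-1, s.size(0), -1)            # (Ht, Hs, L*L)
        kl = F.kl_div(log_s, tgt, reduction='none').sum(-1)               # KL(teacher || student) per head pair
        return (w * kl).sum() / t.size(0)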
\ No newline at end of file diff --git a/data/2024/aaai/No Internal Regret with Non-convex Loss Functions b/data/2024/aaai/No Internal Regret with Non-convex Loss Functions new file mode 100644 index 0000000000..f0d4cea18b --- /dev/null +++ b/data/2024/aaai/No Internal Regret with Non-convex Loss Functions @@ -0,0 +1 @@ +Internal regret is a measure of performance of an online learning algorithm, which measures the change in performance by substituting every occurrence of a given action i by an alternative action j. Algorithms for minimizing internal regret are known for the finite experts setting, including a general reduction to the problem of minimizing external regret for this case. The reduction however crucially depends on the finiteness of the action space. In this work we approach the problem of minimizing internal regret for a continuous action space. For the full information setting, we show how to obtain O(sqrt(T)) internal regret for the class of Lipschitz functions, as well as non-Lipschitz dispersed functions, i.e. the non-Lipschitzness may not concentrate in a small region of the action space. We also consider extensions to partial feedback settings, and again obtain sublinear internal regret. Finally we discuss applications of internal regret minimization over continuous spaces to correlated equilibria in pricing problems and auction design, as well as to data-driven hyperparameter tuning. \ No newline at end of file diff --git a/data/2024/aaai/No More Shortcuts: Realizing the Potential of Temporal Self-Supervision b/data/2024/aaai/No More Shortcuts: Realizing the Potential of Temporal Self-Supervision new file mode 100644 index 0000000000..29777ffdc4 --- /dev/null +++ b/data/2024/aaai/No More Shortcuts: Realizing the Potential of Temporal Self-Supervision @@ -0,0 +1 @@ +Self-supervised approaches for video have shown impressive results in video understanding tasks. However, unlike early works that leverage temporal self-supervision, current state-of-the-art methods primarily rely on tasks from the image domain (e.g., contrastive learning) that do not explicitly promote the learning of temporal features. We identify two factors that limit existing temporal self-supervision: 1) tasks are too simple, resulting in saturated training performance, and 2) we uncover shortcuts based on local appearance statistics that hinder the learning of high-level features. To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts. Our model extends a representation of single video frames, pre-trained through contrastive learning, with a transformer that we train through temporal self-supervision. We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision. Our extensive experiments show state-of-the-art performance across 10 video understanding datasets, illustrating the generalization ability and robustness of our learned video representations. Project Page: https://daveishan.github.io/nms-webpage. \ No newline at end of file diff --git a/data/2024/aaai/No Prejudice! Fair Federated Graph Neural Networks for Personalized Recommendation b/data/2024/aaai/No Prejudice! 
Fair Federated Graph Neural Networks for Personalized Recommendation new file mode 100644 index 0000000000..64222c9144 --- /dev/null +++ b/data/2024/aaai/No Prejudice! Fair Federated Graph Neural Networks for Personalized Recommendation @@ -0,0 +1 @@ +Ensuring fairness in Recommendation Systems (RSs) across demographic groups is critical due to the increased integration of RSs in applications such as personalized healthcare, finance, and e-commerce. Graph-based RSs play a crucial role in capturing intricate higher-order interactions among entities. However, integrating these graph models into the Federated Learning (FL) paradigm with fairness constraints poses formidable challenges as this requires access to the entire interaction graph and sensitive user information (such as gender, age, etc.) at the central server. This paper addresses the pervasive issue of inherent bias within RSs for different demographic groups without compromising the privacy of sensitive user attributes in FL environment with the graph-based model. To address the group bias, we propose F2PGNN (Fair Federated Personalized Graph Neural Network), a novel framework that leverages the power of Personalized Graph Neural Network (GNN) coupled with fairness considerations. Additionally, we use differential privacy techniques to fortify privacy protection. Experimental evaluation on three publicly available datasets showcases the efficacy of F2PGNN in mitigating group unfairness by 47% ∼ 99% compared to the state-of-the-art while preserving privacy and maintaining the utility. The results validate the significance of our framework in achieving equitable and personalized recommendations using GNN within the FL landscape. Source code is at: https://github.com/nimeshagrawal/F2PGNN-AAAI24 \ No newline at end of file diff --git a/data/2024/aaai/No Prior Mask: Eliminate Redundant Action for Deep Reinforcement Learning b/data/2024/aaai/No Prior Mask: Eliminate Redundant Action for Deep Reinforcement Learning new file mode 100644 index 0000000000..7ee8eff6aa --- /dev/null +++ b/data/2024/aaai/No Prior Mask: Eliminate Redundant Action for Deep Reinforcement Learning @@ -0,0 +1 @@ +The large action space is one fundamental obstacle to deploying Reinforcement Learning methods in the real world. The numerous redundant actions will cause the agents to make repeated or invalid attempts, even leading to task failure. Although current algorithms conduct some initial explorations for this issue, they either suffer from rule-based systems or depend on expert demonstrations, which significantly limits their applicability in many real-world settings. In this work, we examine the theoretical analysis of what action can be eliminated in policy optimization and propose a novel redundant action filtering mechanism. Unlike other works, our method constructs the similarity factor by estimating the distance between the state distributions, which requires no prior knowledge. In addition, we combine the modified inverse model to avoid extensive computation in high-dimensional state space. We reveal the underlying structure of action spaces and propose a simple yet efficient redundant action filtering mechanism named No Prior Mask (NPM) based on the above techniques. We show the superior performance of our method by conducting extensive experiments on high-dimensional, pixel-input, and stochastic problems with various action redundancy tasks. Our code is public online at https://github.com/zhongdy15/npm. 
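As a toy illustration of the idea in the No Prior Mask abstract above, namely masking actions whose induced state distributions lie close to one another, the sketch below compares mean next-state features between actions; it is a naive stand-in under assumed inputs, not the NPM algorithm itself.

    import numpy as np

    def build_action_mask(next_state_feats, eps=0.1):
        """next_state_feats: dict action_id -> (n_i, d) array of sampled next-state features.
        Marks actions whose mean next-state features lie within eps of an already-kept
        action as redundant (illustrative heuristic only)."""
        actions = sorted(next_state_feats)
        means = {a: np.asarray(next_state_feats[a]).mean(axis=0) for a in actions}
        kept, mask = [], {}
        for a in actions:
            redundant = any(np.linalg.norm(means[a] - means[k]) < eps for k in kept)
            mask[a] = redundant            # True = filtered out as redundant
            if not redundant:
                kept.append(a)
        return mask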
\ No newline at end of file diff --git a/data/2024/aaai/Noise-Aware Image Captioning with Progressively Exploring Mismatched Words b/data/2024/aaai/Noise-Aware Image Captioning with Progressively Exploring Mismatched Words new file mode 100644 index 0000000000..add3d40ca0 --- /dev/null +++ b/data/2024/aaai/Noise-Aware Image Captioning with Progressively Exploring Mismatched Words @@ -0,0 +1 @@ +Image captioning aims to automatically generate captions for images by learning a cross-modal generator from vision to language. The large number of image-text pairs required for training is usually sourced from the internet due to the cost of manual annotation, which introduces noise in the form of mismatched relevance that affects the learning process. Unlike traditional noisy label learning, the key challenge in processing noisy image-text pairs is to finely identify the mismatched words so as to make the most use of trustworthy information in the text, rather than coarsely weighting entire examples. To tackle this challenge, we propose a Noise-aware Image Captioning method (NIC) to adaptively mitigate the erroneous guidance from noise by progressively exploring mismatched words. Specifically, NIC first identifies mismatched words by quantifying word-label reliability from two aspects: 1) inter-modal representativeness, which measures the significance of the current word by assessing cross-modal correlation via prediction certainty; 2) intra-modal informativeness, which amplifies the effect of the current prediction by combining the quality of subsequent word generation. During optimization, NIC constructs pseudo-word-labels considering the reliability of the original word-labels and model convergence to periodically coordinate mismatched words. As a result, NIC can effectively exploit both clean and noisy image-text pairs to learn a more robust mapping function. Extensive experiments conducted on the MS-COCO and Conceptual Caption datasets validate the effectiveness of our method in various noisy scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Noise-Free Optimization in Early Training Steps for Image Super-resolution b/data/2024/aaai/Noise-Free Optimization in Early Training Steps for Image Super-resolution new file mode 100644 index 0000000000..b4e55921f5 --- /dev/null +++ b/data/2024/aaai/Noise-Free Optimization in Early Training Steps for Image Super-resolution @@ -0,0 +1 @@ +Recent deep-learning-based single image super-resolution (SISR) methods have shown impressive performance, with typical methods training their networks by minimizing the pixel-wise distance with respect to a given high-resolution (HR) image. However, despite the basic training scheme being the predominant choice, its use in the context of ill-posed inverse problems has not been thoroughly investigated. In this work, we aim to provide a better comprehension of the underlying constituents by decomposing target HR images into two subcomponents: (1) the optimal centroid which is the expectation over multiple potential HR images, and (2) the inherent noise defined as the residual between the HR image and the centroid. Our findings show that the current training scheme cannot capture the ill-posed nature of SISR and becomes vulnerable to the inherent noise term, especially during early training steps. To tackle this issue, we propose a novel optimization method that can effectively remove the inherent noise term in the early steps of vanilla training by estimating the optimal centroid and directly optimizing toward the estimation.
Experimental results show that the proposed method can effectively enhance the stability of vanilla training, leading to overall performance gain. Codes are available at github.com/2minkyulee/ECO. \ No newline at end of file diff --git a/data/2024/aaai/Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation b/data/2024/aaai/Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation new file mode 100644 index 0000000000..7e8fff3ff5 --- /dev/null +++ b/data/2024/aaai/Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation @@ -0,0 +1,3 @@ +Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet are automatically harvested for training. +However, it inevitably includes mismatched pairs, i.e., noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage deep neural networks' memorization effect to address noisy correspondences, which overconfidently focus on similarity-guided training with hard negatives and suffer from self-reinforcing errors. In light of above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). +Specifically, by viewing sample matching as classification tasks within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and estimate model's sensitivity of selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage negative matches overlooked in hard-negative training, further improving model optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM. \ No newline at end of file diff --git a/data/2024/aaai/Non-exemplar Domain Incremental Object Detection via Learning Domain Bias b/data/2024/aaai/Non-exemplar Domain Incremental Object Detection via Learning Domain Bias new file mode 100644 index 0000000000..db270b8afe --- /dev/null +++ b/data/2024/aaai/Non-exemplar Domain Incremental Object Detection via Learning Domain Bias @@ -0,0 +1 @@ +Domain incremental object detection (DIOD) aims to gradually learn a unified object detection model from a dataset stream composed of different domains, achieving good performance in all encountered domains. The most critical obstacle to this goal is the catastrophic forgetting problem, where the performance of the model improves rapidly in new domains but deteriorates sharply in old ones after a few sessions. To address this problem, we propose a non-exemplar DIOD method named learning domain bias (LDB), which learns domain bias independently at each new session, avoiding saving examples from old domains. Concretely, a base model is first obtained through training during session 1. Then, LDB freezes the weights of the base model and trains individual domain bias for each new incoming domain, adapting the base model to the distribution of new domains. At test time, since the domain ID is unknown, we propose a domain selector based on nearest mean classifier (NMC), which selects the most appropriate domain bias for a test image. 
Extensive experimental evaluations on two series of datasets demonstrate the effectiveness of the proposed LDB method in achieving high accuracy on new and old domain datasets. The code is available at https://github.com/SONGX1997/LDB. \ No newline at end of file diff --git a/data/2024/aaai/Non-exemplar Online Class-Incremental Continual Learning via Dual-Prototype Self-Augment and Refinement b/data/2024/aaai/Non-exemplar Online Class-Incremental Continual Learning via Dual-Prototype Self-Augment and Refinement new file mode 100644 index 0000000000..469a21d14a --- /dev/null +++ b/data/2024/aaai/Non-exemplar Online Class-Incremental Continual Learning via Dual-Prototype Self-Augment and Refinement @@ -0,0 +1 @@ +This paper investigates a new, practical, but challenging problem named Non-exemplar Online Class-incremental continual Learning (NO-CL), which aims to preserve the discernibility of base classes without buffering data examples and efficiently learn novel classes continuously in a single-pass (i.e., online) data stream. The challenges of this task are mainly two-fold: (1) Both base and novel classes suffer from severe catastrophic forgetting as no previous samples are available for replay. (2) As the online data can only be observed once, there is no way to fully re-train the whole model, e.g., re-calibrate the decision boundaries via prototype alignment or feature distillation. In this paper, we propose a novel Dual-prototype Self-augment and Refinement method (DSR) for NO-CL problem, which consists of two strategies: 1) Dual class prototypes: vanilla and high-dimensional prototypes are exploited to utilize the pre-trained information and obtain robust quasi-orthogonal representations rather than example buffers for both privacy preservation and memory reduction. 2) Self-augment and refinement: Instead of updating the whole network, we optimize high-dimensional prototypes alternatively with the extra projection module based on self-augment vanilla prototypes, through a bi-level optimization problem. Extensive experiments demonstrate the effectiveness and superiority of the proposed DSR in NO-CL. \ No newline at end of file diff --git a/data/2024/aaai/Non-flat ABA Is an Instance of Bipolar Argumentation b/data/2024/aaai/Non-flat ABA Is an Instance of Bipolar Argumentation new file mode 100644 index 0000000000..97a5483941 --- /dev/null +++ b/data/2024/aaai/Non-flat ABA Is an Instance of Bipolar Argumentation @@ -0,0 +1,3 @@ +Assumption-based Argumentation (ABA) is a well-known structured argumentation formalism, whereby arguments and attacks between them are drawn from rules, defeasible assumptions and their contraries. +A common restriction imposed on ABA frameworks (ABAFs) is that they are flat, i.e. each of the defeasible assumptions can only be assumed, but not derived. While it is known that flat ABAFs can be translated into abstract argumentation frameworks (AFs) as proposed by Dung, no translation exists from general, possibly non-flat ABAFs into any kind of abstract argumentation formalism. +In this paper, we close this gap and show that bipolar AFs (BAFs) can instantiate general ABAFs. To this end we develop suitable, novel BAF semantics which borrow from the notion of deductive support. We investigate basic properties of our BAFs, including computational complexity, and prove the desired relation to ABAFs under several semantics. 
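Background note for the Non-flat ABA entry above: a bipolar argumentation framework (BAF) can be stored simply as a set of arguments together with attack and support relations. The sketch below is a generic illustration of that data structure and the standard conflict-freeness check with respect to attacks; it is not the deductive-support semantics developed in the paper, and all names are illustrative.

```python
from itertools import product

# A bipolar AF: arguments plus directed attack and support relations.
arguments = {"a", "b", "c"}
attacks = {("a", "b")}    # a attacks b
supports = {("c", "a")}   # c supports a

def is_conflict_free(S: set, attacks: set) -> bool:
    """A set S is conflict-free if no argument in S attacks another member of S."""
    return not any((x, y) in attacks for x, y in product(S, S))

print(is_conflict_free({"a", "c"}, attacks))  # True
print(is_conflict_free({"a", "b"}, attacks))  # False
```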
\ No newline at end of file diff --git a/data/2024/aaai/Non-monotone Sequential Submodular Maximization b/data/2024/aaai/Non-monotone Sequential Submodular Maximization new file mode 100644 index 0000000000..def9462eb9 --- /dev/null +++ b/data/2024/aaai/Non-monotone Sequential Submodular Maximization @@ -0,0 +1,2 @@ +In this paper, we study a fundamental problem in submodular optimization known as sequential submodular maximization. The primary objective of this problem is to select and rank a sequence of items to optimize a group of submodular functions. +The existing research on this problem has predominantly concentrated on the monotone setting, assuming that the submodular functions are non-decreasing. However, in various real-world scenarios, like diversity-aware recommendation systems, adding items to an existing set might negatively impact the overall utility. In response, we propose to study this problem with non-monotone submodular functions and develop approximation algorithms for both flexible and fixed length constraints, as well as a special case with identical utility functions. The empirical evaluations further validate the effectiveness of our proposed algorithms in the domain of video recommendations. \ No newline at end of file diff --git a/data/2024/aaai/Non-parametric Representation Learning with Kernels b/data/2024/aaai/Non-parametric Representation Learning with Kernels new file mode 100644 index 0000000000..b1e6dfdd3f --- /dev/null +++ b/data/2024/aaai/Non-parametric Representation Learning with Kernels @@ -0,0 +1 @@ +Unsupervised and self-supervised representation learning has become popular in recent years for learning useful features from unlabelled data. Representation learning has been mostly developed in the neural network literature, and other models for representation learning are surprisingly unexplored. In this work, we introduce and analyze several kernel-based representation learning approaches: Firstly, we define two kernel Self-Supervised Learning (SSL) models using contrastive loss functions and secondly, a Kernel Autoencoder (AE) model based on the idea of embedding and reconstructing data. We argue that the classical representer theorems for supervised kernel machines are not always applicable for (self-supervised) representation learning, and present new representer theorems, which show that the representations learned by our kernel models can be expressed in terms of kernel matrices. We further derive generalisation error bounds for representation learning with kernel SSL and AE, and empirically evaluate the performance of these methods in both small data regimes as well as in comparison with neural network based models. \ No newline at end of file diff --git a/data/2024/aaai/Non-stationary Projection-Free Online Learning with Dynamic and Adaptive Regret Guarantees b/data/2024/aaai/Non-stationary Projection-Free Online Learning with Dynamic and Adaptive Regret Guarantees new file mode 100644 index 0000000000..2d86096d24 --- /dev/null +++ b/data/2024/aaai/Non-stationary Projection-Free Online Learning with Dynamic and Adaptive Regret Guarantees @@ -0,0 +1 @@ +Projection-free online learning has drawn increasing interest due to its efficiency in solving high-dimensional problems with complicated constraints. However, most existing projection-free online methods focus on minimizing the static regret, which unfortunately fails to capture the challenge of changing environments. 
In this paper, we investigate non-stationary projection-free online learning, and choose dynamic regret and adaptive regret to measure the performance. Specifically, we first provide a novel dynamic regret analysis for an existing projection-free method named BOGD_IP, and establish an O(T^¾ (1+P_T)) dynamic regret bound, where P_T denotes the path-length of the comparator sequence. Then, we improve the upper bound to O(T^¾ (1+P_T)^¼) by running multiple BOGD_IP algorithms with different step sizes in parallel, and tracking the best one on the fly. Our results are the first general-case dynamic regret bounds for projection-free online learning, and can recover the existing O(T^¾) static regret by setting P_T = 0. Furthermore, we propose a projection-free method to attain an O(τ^¾) adaptive regret bound for any interval with length τ, which nearly matches the static regret over that interval. The essential idea is to maintain a set of BOGD_IP algorithms dynamically, and combine them by a meta algorithm. Moreover, we demonstrate that it is also equipped with an O(T^¾ (1+P_T)^¼) dynamic regret bound. Finally, empirical studies verify our theoretical findings. \ No newline at end of file diff --git a/data/2024/aaai/NondBREM: Nondeterministic Offline Reinforcement Learning for Large-Scale Order Dispatching b/data/2024/aaai/NondBREM: Nondeterministic Offline Reinforcement Learning for Large-Scale Order Dispatching new file mode 100644 index 0000000000..dc5cd8591b --- /dev/null +++ b/data/2024/aaai/NondBREM: Nondeterministic Offline Reinforcement Learning for Large-Scale Order Dispatching @@ -0,0 +1 @@ +One of the most important tasks in ride-hailing is order dispatching, i.e., assigning unserved orders to available drivers. Order dispatching has recently achieved significant improvement due to advances in reinforcement learning, which has been shown to effectively address sequential decision-making problems like order dispatching. However, most existing reinforcement learning methods require agents to learn the optimal policy by interacting with environments online, which is challenging or impractical for real-world deployment due to high costs or safety concerns. For example, due to the spatiotemporally unbalanced supply and demand, online reinforcement learning-based order dispatching may significantly impact the revenue of the ride-hailing platform and passenger experience during the policy learning period. Hence, in this work, we develop an offline deep reinforcement learning framework called NondBREM for large-scale order dispatching, which learns the policy from only the accumulated logged data to avoid costly and unsafe interactions with the environment. In NondBREM, a Nondeterministic Batch-Constrained Q-learning (NondBCQ) module is developed to reduce the algorithm's extrapolation error, and a Random Ensemble Mixture (REM) module that integrates multiple value networks with multi-head networks is utilized to improve the model's generalization and robustness. Extensive experiments on large-scale real-world ride-hailing datasets show the superiority of our design.
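Background note for the non-stationary projection-free entry above: dynamic regret and the path-length P_T are standard quantities in online learning; a generic (not paper-specific) statement is:

```latex
\[
\mathrm{D\text{-}Regret}(T) \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \sum_{t=1}^{T} f_t(u_t),
\qquad
P_T \;=\; \sum_{t=2}^{T} \lVert u_t - u_{t-1} \rVert,
\]
```

where x_t is the learner's decision, u_1, ..., u_T is an arbitrary comparator sequence, and choosing a fixed comparator u_1 = ... = u_T (so that P_T = 0) recovers the usual static regret.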
\ No newline at end of file diff --git a/data/2024/aaai/Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models b/data/2024/aaai/Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models new file mode 100644 index 0000000000..930bc84e44 --- /dev/null +++ b/data/2024/aaai/Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models @@ -0,0 +1 @@ +As the size of large language models (LLMs) continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision while being cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and channel-wise distance constraint to update the weights of normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-sourced LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the same level of accuracy at 2-bit quantization as their float ones. Our simple and effective approach makes it more practical for real-world applications. \ No newline at end of file diff --git a/data/2024/aaai/Novax or Novak? Estimating Social Media Stance towards Celebrity Vaccine Hesitancy (Student Abstract) b/data/2024/aaai/Novax or Novak? Estimating Social Media Stance towards Celebrity Vaccine Hesitancy (Student Abstract) new file mode 100644 index 0000000000..c90b763137 --- /dev/null +++ b/data/2024/aaai/Novax or Novak? Estimating Social Media Stance towards Celebrity Vaccine Hesitancy (Student Abstract) @@ -0,0 +1 @@ +On 15 January 2022, noted tennis player Novak Djokovic was deported from Australia due to his unvaccinated status for the COVID-19 vaccine. This paper presents a stance classifier and evaluates public reaction to this episode and the impact of this behavior on social media discourse on YouTube. We observed a significant spike of individuals who supported and opposed his behavior at the time of the episode. Supporters outnumbered those who opposed this behavior by over 4x. Our study reports a disturbing trend that following every major Djokovic win, even now, vaccine skeptics often conflate his tennis success as a fitting reply to vaccine mandates. 
\ No newline at end of file diff --git a/data/2024/aaai/Novel Class Discovery for Representation of Real-World Heritage Data as Neural Radiance Fields (Student Abstract) b/data/2024/aaai/Novel Class Discovery for Representation of Real-World Heritage Data as Neural Radiance Fields (Student Abstract) new file mode 100644 index 0000000000..c5781ef75d --- /dev/null +++ b/data/2024/aaai/Novel Class Discovery for Representation of Real-World Heritage Data as Neural Radiance Fields (Student Abstract) @@ -0,0 +1 @@ +Neural Radiance Fields (NeRF) have been extensively explored as a leading approach for modeling and representing 3D data across various domains. Their ability to capture arbitrary-scale point clouds and generate novel views makes them particularly valuable for digitizing cultural heritage sites. However, despite their impressive rendering capabilities, prior methods have often overlooked a significant real-world challenge: handling open-world scenarios characterized by unstructured data containing multiple classes in a single set of unlabeled images. To address this challenge, we propose a novel method, NCD-NeRF, that leverages Novel-Class Discovery to effectively tackle the complexities inherent in real-world data with unlabeled classes while excelling in producing high-quality NeRF representations. To validate our approach, we conducted a benchmarking analysis using a custom-collected dataset featuring UNESCO World Heritage sites in India. We observe that our proposed NCD-NeRF can discover novel classes and render high-quality 3D volumes in parallel. \ No newline at end of file diff --git a/data/2024/aaai/Novel Class Discovery in Chest X-rays via Paired Images and Text b/data/2024/aaai/Novel Class Discovery in Chest X-rays via Paired Images and Text new file mode 100644 index 0000000000..c17a337b09 --- /dev/null +++ b/data/2024/aaai/Novel Class Discovery in Chest X-rays via Paired Images and Text @@ -0,0 +1 @@ +Novel class discovery (NCD) aims to identify new classes undefined during the model training phase with the help of knowledge of known classes. Many methods have been proposed and have notably boosted the performance of NCD on natural images. However, there has been no work on discovering new classes based on medical images and disease categories, which is crucial for understanding and diagnosing specific diseases. Moreover, most of the existing methods only utilize information from the image modality and use labels as the only supervisory information. In this paper, we propose a multi-modal novel class discovery method based on paired images and text, inspired by the low classification accuracy of chest X-ray images and the relatively higher accuracy of the paired text. Specifically, we first pretrain the image encoder and text encoder with multi-modal contrastive learning on the entire dataset, and then we generate pseudo-labels separately on the image branch and text branch. We utilize intra-modal consistency to assess the quality of pseudo-labels and adjust the weights of the pseudo-labels from both branches to generate the ultimate pseudo-labels for training. Experiments on eight subset splits of the MIMIC-CXR-JPG dataset show that our method improves the clustering performance of unlabeled classes by about 10% on average compared to state-of-the-art methods. Code is available at: https://github.com/zzzzzzzzjy/MMNCD-main. \ No newline at end of file diff --git a/data/2024/aaai/Novelty vs.
Potential Heuristics: A Comparison of Hardness Measures for Satisficing Planning b/data/2024/aaai/Novelty vs. Potential Heuristics: A Comparison of Hardness Measures for Satisficing Planning new file mode 100644 index 0000000000..336c95c452 --- /dev/null +++ b/data/2024/aaai/Novelty vs. Potential Heuristics: A Comparison of Hardness Measures for Satisficing Planning @@ -0,0 +1,4 @@ +Classical planning considers a given task and searches for a plan to solve it. Some tasks are harder to solve than others. We can measure the 'hardness' of a task with the novelty width and the correlation complexity. In this work, we compare these measures. +Additionally, we introduce the river measure, a new measure that is based on potential heuristics and therefore similar to the correlation complexity but also comparable to the novelty width. +We show that the river measure is upper bounded by the correlation complexity and by the novelty width +1. +Furthermore, we show that we can convert a planning task with a polynomial blowup of the task size to ensure that a heuristic of dimension 2 exists that gives rise to backtrack-free search. \ No newline at end of file diff --git a/data/2024/aaai/Nowcasting Temporal Trends Using Indirect Surveys b/data/2024/aaai/Nowcasting Temporal Trends Using Indirect Surveys new file mode 100644 index 0000000000..0e70fee98e --- /dev/null +++ b/data/2024/aaai/Nowcasting Temporal Trends Using Indirect Surveys @@ -0,0 +1 @@ +Indirect surveys, in which respondents provide information about other people they know, have been proposed for estimating (nowcasting) the size of a hidden population where privacy is important or the hidden population is hard to reach. Examples include estimating casualties in an earthquake, conditions among female sex workers, and the prevalence of drug use and infectious diseases. The Network Scale-up Method (NSUM) is the classical approach to developing estimates from indirect surveys, but it was designed for one-shot surveys. Further, it requires certain assumptions and asking for or estimating the number of individuals in each respondent's network. In recent years, surveys have been increasingly deployed online and can collect data continuously (e.g., COVID-19 surveys on Facebook during much of the pandemic). Conventional NSUM can be applied to these scenarios by analyzing the data independently at each point in time, but this misses the opportunity of leveraging the temporal dimension. We propose to use the responses from indirect surveys collected over time and develop analytical tools (i) to prove that indirect surveys can provide better estimates for the trends of the hidden population over time, as compared to direct surveys and (ii) to identify appropriate temporal aggregations to improve the estimates. We demonstrate through extensive simulations that our approach outperforms traditional NSUM and direct surveying methods. We also empirically demonstrate the superiority of our approach on a real indirect survey dataset of COVID-19 cases. 
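Background note for the indirect-survey entry above: the classical NSUM scale-up estimator mentioned there takes a simple ratio form. The sketch below is a generic, illustrative version (not the temporal estimator proposed in the paper); all names are hypothetical.

```python
def nsum_estimate(hidden_counts, network_sizes, total_population):
    """Classical network scale-up estimate of a hidden population's size.

    hidden_counts:    per-respondent number of known people in the hidden group.
    network_sizes:    per-respondent estimated personal network size (degree).
    total_population: size of the overall population the respondents belong to.
    """
    return total_population * sum(hidden_counts) / sum(network_sizes)

# Toy usage: 3 respondents.
print(nsum_estimate([2, 0, 1], [150, 200, 120], total_population=10_000))  # ~63.8
```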
\ No newline at end of file diff --git a/data/2024/aaai/NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario b/data/2024/aaai/NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario new file mode 100644 index 0000000000..d1cafbe75e --- /dev/null +++ b/data/2024/aaai/NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario @@ -0,0 +1 @@ +We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics prove that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at https://github.com/qiantianwen/NuScenes-QA. \ No newline at end of file diff --git a/data/2024/aaai/Null Space Matters: Range-Null Decomposition for Consistent Multi-Contrast MRI Reconstruction b/data/2024/aaai/Null Space Matters: Range-Null Decomposition for Consistent Multi-Contrast MRI Reconstruction new file mode 100644 index 0000000000..a8df0254a1 --- /dev/null +++ b/data/2024/aaai/Null Space Matters: Range-Null Decomposition for Consistent Multi-Contrast MRI Reconstruction @@ -0,0 +1 @@ +Consistency and interpretability have long been the critical issues in MRI reconstruction. While interpretability has been dramatically improved with the employment of deep unfolding networks (DUNs), current methods still suffer from inconsistencies and generate inferior anatomical structure. Especially in multi-contrast scenes, different imaging protocols often exacerbate the concerned issue. In this paper, we propose a range-null decomposition-assisted DUN architecture to ensure consistency while still providing desirable interpretability. Given the input decomposed, we argue that the inconsistency could be analytically relieved by feeding solely the null-space component into proximal mapping, while leaving the range-space counterpart fixed. More importantly, a correlation decoupling scheme is further proposed to narrow the information gap for multi-contrast fusion, which dynamically borrows isotropic features from the opponent while maintaining the modality-specific ones. Specifically, the two features are attached to different frequencies and learned individually by the newly designed isotropy encoder and anisotropy encoder. 
The former strives for the contrast-shared information, while the latter serves to capture the contrast-specific features. The quantitative and qualitative results show that our proposal outperforms most cutting-edge methods by a large margin. Codes will be released on https://github.com/chenjiachengzzz/RNU. \ No newline at end of file diff --git a/data/2024/aaai/OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning b/data/2024/aaai/OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning new file mode 100644 index 0000000000..ca751a8783 --- /dev/null +++ b/data/2024/aaai/OCEAN-MBRL: Offline Conservative Exploration for Model-Based Offline Reinforcement Learning @@ -0,0 +1,7 @@ +Model-based offline reinforcement learning (RL) algorithms have emerged as a promising paradigm for offline RL. +These algorithms usually learn a dynamics model from a static dataset of transitions, use the model to generate synthetic trajectories, and perform conservative policy optimization within these trajectories. +However, our observations indicate that policy optimization methods used in these model-based offline RL algorithms are not effective at exploring the learned model and induce biased exploration, which ultimately impairs the performance of the algorithm. +To address this issue, we propose Offline Conservative ExplorAtioN (OCEAN), a novel rollout approach to model-based offline RL. +In our method, we incorporate additional exploration techniques and introduce three conservative constraints based on uncertainty estimation to mitigate the potential impact of significant dynamic errors resulting from exploratory transitions. +Our work is a plug-in method and can be combined with classical model-based RL algorithms, such as MOPO, COMBO, and RAMBO. +Experiment results of our method on the D4RL MuJoCo benchmark show that OCEAN significantly improves the performance of existing algorithms. \ No newline at end of file diff --git a/data/2024/aaai/ODTrack: Online Dense Temporal Token Learning for Visual Tracking b/data/2024/aaai/ODTrack: Online Dense Temporal Token Learning for Visual Tracking new file mode 100644 index 0000000000..d3046832cf --- /dev/null +++ b/data/2024/aaai/ODTrack: Online Dense Temporal Token Learning for Visual Tracking @@ -0,0 +1 @@ +Online contextual reasoning and association across consecutive video frames are critical to perceive instances in visual tracking. However, most current top-performing trackers persistently lean on sparse temporal relationships between reference and search frames via an offline mode. Consequently, they can only interact independently within each image-pair and establish limited temporal correlations. To alleviate the above problem, we propose a simple, flexible and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames in an online token propagation manner. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discrimination features (localization information) of a target into a token sequence to achieve frame-to-frame association. 
This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for the inference in the next video frame, whereby past information is leveraged to guide future inference; 2) the complex online update strategies are effectively avoided by the iterative propagation of token sequences, and thus we can achieve more efficient model representation and computation. ODTrack achieves a new SOTA performance on seven benchmarks, while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack. \ No newline at end of file diff --git a/data/2024/aaai/ORES: Open-Vocabulary Responsible Visual Synthesis b/data/2024/aaai/ORES: Open-Vocabulary Responsible Visual Synthesis new file mode 100644 index 0000000000..df753354f4 --- /dev/null +++ b/data/2024/aaai/ORES: Open-Vocabulary Responsible Visual Synthesis @@ -0,0 +1 @@ +Avoiding synthesizing specific visual concepts is an essential challenge in responsible visual synthesis. However, the visual concepts that need to be avoided for responsible visual synthesis tend to be diverse, depending on the region, context, and usage scenarios. In this work, we formalize a new task, Open-vocabulary Responsible Visual Synthesis (ORES), where the synthesis model is able to avoid forbidden visual concepts while allowing users to input any desired content. To address this problem, we present a Two-stage Intervention (TIN) framework. By introducing 1) rewriting with learnable instruction through a large-scale language model (LLM) and 2) synthesizing with prompt intervention on a diffusion synthesis model, it can effectively synthesize images that avoid the forbidden concepts while following the user's query as much as possible. To evaluate on ORES, we provide a publicly available dataset, baseline models, and benchmark. Experimental results demonstrate the effectiveness of our method in reducing the risks of image generation. Our work highlights the potential of LLMs in responsible visual synthesis. Our code and dataset are publicly available at https://github.com/kodenii/ORES. \ No newline at end of file diff --git a/data/2024/aaai/OSFFNet: Omni-Stage Feature Fusion Network for Lightweight Image Super-Resolution b/data/2024/aaai/OSFFNet: Omni-Stage Feature Fusion Network for Lightweight Image Super-Resolution new file mode 100644 index 0000000000..436e04976b --- /dev/null +++ b/data/2024/aaai/OSFFNet: Omni-Stage Feature Fusion Network for Lightweight Image Super-Resolution @@ -0,0 +1 @@ +Recently, several lightweight methods have been proposed to implement single-image super-resolution (SISR) on resource-constrained devices. However, these methods primarily focus on simplifying network structures without fully utilizing shallow features. The fact remains that shallow features encompass crucial details for the super-resolution task, including edges, textures, and colors. Therefore, developing a novel architecture that can effectively integrate features from different levels and capitalize on their mutual complementarity is necessary. We first analyze the relationship between multi-stage features and the restoration tasks in a classic lightweight SR method. Based on these observations, we propose an Omni-Stage Feature Fusion (OSFF) architecture, which incorporates Original Image Stacked Initialisation, Shallow Feature Global Connection, and Multi-Receptive Field Dynamic Fusion. An Attention-Enhanced Feature Distillation module is also designed to enhance the model performance.
Finally, leveraging these contributions, we construct an Omni-Stage Feature Fusion Network (OSFFNet). Through extensive experiments on various benchmark datasets, the proposed model outperforms state-of-the-art methods. Notably, it achieves a 0.26dB PSNR improvement over the second-best method for x2 SR on the Urban100 dataset. \ No newline at end of file diff --git a/data/2024/aaai/OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples b/data/2024/aaai/OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples new file mode 100644 index 0000000000..34731baf14 --- /dev/null +++ b/data/2024/aaai/OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples @@ -0,0 +1 @@ +Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors lack robustness against attacks: they degrade detection accuracy by simply paraphrasing LLM-generated texts. Furthermore, a malicious user might attempt to deliberately evade the detectors based on detection results, but this has not been assumed in previous studies. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points F1-score. Furthermore, the proposed detector shows a state-of-the-art detection performance: up to 96.9 points F1-score, beating existing detectors on non-attacked texts. Finally, the proposed attacker drastically degrades the performance of detectors by up to -57.0 points F1-score, massively outperforming the baseline paraphrasing method for evading detection. \ No newline at end of file diff --git a/data/2024/aaai/OVD-Explorer: Optimism Should Not Be the Sole Pursuit of Exploration in Noisy Environments b/data/2024/aaai/OVD-Explorer: Optimism Should Not Be the Sole Pursuit of Exploration in Noisy Environments new file mode 100644 index 0000000000..1456f65db8 --- /dev/null +++ b/data/2024/aaai/OVD-Explorer: Optimism Should Not Be the Sole Pursuit of Exploration in Noisy Environments @@ -0,0 +1 @@ +In reinforcement learning, the optimism in the face of uncertainty (OFU) is a mainstream principle for directing exploration towards less explored areas, characterized by higher uncertainty. However, in the presence of environmental stochasticity (noise), purely optimistic exploration may lead to excessive probing of high-noise areas, consequently impeding exploration efficiency. Hence, in exploring noisy environments, while optimism-driven exploration serves as a foundation, prudent attention to alleviating unnecessary over-exploration in high-noise areas becomes beneficial. 
In this work, we propose Optimistic Value Distribution Explorer (OVD-Explorer) to achieve a noise-aware optimistic exploration for continuous control. OVD-Explorer proposes a new measurement of the policy's exploration ability considering noise in optimistic perspectives, and leverages gradient ascent to drive exploration. Practically, OVD-Explorer can be easily integrated with continuous control RL algorithms. Extensive evaluations on the MuJoCo and GridChaos tasks demonstrate the superiority of OVD-Explorer in achieving noise-aware optimistic exploration. \ No newline at end of file diff --git a/data/2024/aaai/OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models b/data/2024/aaai/OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models new file mode 100644 index 0000000000..18380cccc4 --- /dev/null +++ b/data/2024/aaai/OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) with hundreds of billions of parameters require powerful server-grade GPUs for inference, limiting their practical deployment. To address this challenge, we introduce the outlier-aware weight quantization (OWQ) method, which aims to minimize LLM's footprint through low-precision representation. OWQ prioritizes a small subset of structured weights sensitive to quantization, storing them in high-precision, while applying highly tuned quantization to the remaining dense weights. This sensitivity-aware mixed-precision scheme reduces the quantization error notably, and extensive experiments demonstrate that 3.1-bit models using OWQ perform comparably to 4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a parameter-efficient fine-tuning for task-specific adaptation, called weak column tuning (WCT), enabling accurate task-specific LLM adaptation with minimal memory overhead in the optimized format. OWQ represents a notable advancement in the flexibility, efficiency, and practicality of LLM optimization literature. The source code is available at https://github.com/xvyaward/owq. \ No newline at end of file diff --git a/data/2024/aaai/Object Attribute Matters in Visual Question Answering b/data/2024/aaai/Object Attribute Matters in Visual Question Answering new file mode 100644 index 0000000000..989938e294 --- /dev/null +++ b/data/2024/aaai/Object Attribute Matters in Visual Question Answering @@ -0,0 +1 @@ +Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify them, which has been overlooked in previous research. In this paper, we propose a novel VQA approach from the perspective of utilizing object attribute, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features contribute to solving fine-grained problem like counting-question. 
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness. Furthermore, to augment scene understanding and the out-of-distribution performance, the contrastive knowledge distillation module introduces a series of implicit knowledge. We distill knowledge into attributes through contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-linguistic alignment. Intensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, show the superiority of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering b/data/2024/aaai/Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering new file mode 100644 index 0000000000..542d5f7be1 --- /dev/null +++ b/data/2024/aaai/Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering @@ -0,0 +1 @@ +This paper focuses on the Audio-Visual Question Answering (AVQA) task that aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and explore the multi-modal relations (i.e., the object, audio, and question) in terms of feature interaction and model optimization. For the former, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate audio/visual modalities on respective keywords of the question and designs a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the highly semantic-matched multi-modal pair as positivity. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selecting process is adaptive as the positivity pairs selected in each video frame may be different. These two object-aware objectives help the model understand which objects are exactly relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance. The code is available at https://github.com/zhangbin-ai/APL. \ No newline at end of file diff --git a/data/2024/aaai/Object-Aware Domain Generalization for Object Detection b/data/2024/aaai/Object-Aware Domain Generalization for Object Detection new file mode 100644 index 0000000000..ffd7505d41 --- /dev/null +++ b/data/2024/aaai/Object-Aware Domain Generalization for Object Detection @@ -0,0 +1 @@ +Single-domain generalization (S-DG) aims to generalize a model to unseen environments with a single-source domain. However, most S-DG approaches have been conducted in the field of classification.
When these approaches are applied to object detection, the semantic features of some objects can be damaged, which can lead to imprecise object localization and misclassification. To address these problems, we propose an object-aware domain generalization (OA-DG) method for single-domain generalization in object detection. Our method consists of data augmentation and training strategy, which are called OA-Mix and OA-Loss, respectively. OA-Mix generates multi-domain data with multi-level transformation and object-aware mixing strategy. OA-Loss enables models to learn domain-invariant representations for objects and backgrounds from the original and OA-Mixed images. Our proposed method outperforms state-of-the-art works on standard benchmarks. Our code is available at https://github.com/WoojuLee24/OA-DG. \ No newline at end of file diff --git a/data/2024/aaai/Occluded Person Re-identification via Saliency-Guided Patch Transfer b/data/2024/aaai/Occluded Person Re-identification via Saliency-Guided Patch Transfer new file mode 100644 index 0000000000..3e0714f43e --- /dev/null +++ b/data/2024/aaai/Occluded Person Re-identification via Saliency-Guided Patch Transfer @@ -0,0 +1 @@ +While generic person re-identification has made remarkable improvement in recent years, these methods are designed under the assumption that the entire body of the person is available. This assumption brings about a significant performance degradation when suffering from occlusion caused by various obstacles in real-world applications. To address this issue, data-driven strategies have emerged to enhance the model's robustness to occlusion. Following the random erasing paradigm, these strategies typically employ randomly generated noise to supersede randomly selected image regions to simulate obstacles. However, the random strategy is not sensitive to location and content, meaning they cannot mimic real-world occlusion cases in application scenarios. To overcome this limitation and fully exploit the real scene information in datasets, this paper proposes a more intuitive and effective data-driven strategy named Saliency-Guided Patch Transfer (SPT). Combined with the vision transformer, SPT divides person instances and background obstacles using salient patch selection. By transferring person instances to different background obstacles, SPT can easily generate photo-realistic occluded samples. Furthermore, we propose an occlusion-aware Intersection over Union (OIoU) with mask-rolling to filter the more suitable combination and a class-ignoring strategy to achieve more stable processing. Extensive experimental evaluations conducted on occluded and holistic person re-identification benchmarks demonstrate that SPT provides a significant performance gain among different ViT-based ReID algorithms on occluded ReID. \ No newline at end of file diff --git a/data/2024/aaai/OctOcc: High-Resolution 3D Occupancy Prediction with Octree b/data/2024/aaai/OctOcc: High-Resolution 3D Occupancy Prediction with Octree new file mode 100644 index 0000000000..5ebff3a2ce --- /dev/null +++ b/data/2024/aaai/OctOcc: High-Resolution 3D Occupancy Prediction with Octree @@ -0,0 +1,9 @@ +3D semantic occupancy has garnered considerable attention due to its abundant structural information encompassing the entire scene in autonomous driving. +However, existing 3D occupancy prediction methods contend with the constraint of low-resolution 3D voxel features arising from the limitation of computational memory. 
+To address this limitation and achieve a more fine-grained representation of 3D scenes, we propose OctOcc, a novel octree-based approach for 3D semantic occupancy prediction. +OctOcc is conceptually rooted in the observation that the vast majority of 3D space is left unoccupied. +Capitalizing on this insight, we endeavor to cultivate memory-efficient high-resolution 3D occupancy predictions by mitigating superfluous cross-attentions. +Specifically, we devise a hierarchical octree structure that selectively generates finer-grained cross-attentions solely in potentially occupied regions. +Extending our inquiry beyond 3D space, we identify analogous redundancies within another side of cross attentions, 2D images. +Consequently, a 2D image feature filtering network is conceived to expunge extraneous regions. +Experimental results demonstrate that the proposed OctOcc significantly outperforms existing methods on nuScenes and SemanticKITTI datasets with limited memory consumption. \ No newline at end of file diff --git a/data/2024/aaai/Offline Model-Based Optimization via Policy-Guided Gradient Search b/data/2024/aaai/Offline Model-Based Optimization via Policy-Guided Gradient Search new file mode 100644 index 0000000000..b265f61200 --- /dev/null +++ b/data/2024/aaai/Offline Model-Based Optimization via Policy-Guided Gradient Search @@ -0,0 +1 @@ +Offline optimization is an emerging problem in many experimental engineering domains including protein, drug or aircraft design, where online experimentation to collect evaluation data is too expensive or dangerous. To avoid that, one has to optimize an unknown function given only its offline evaluation at a fixed set of inputs. A naive solution to this problem is to learn a surrogate model of the unknown function and optimize this surrogate instead. However, such a naive optimizer is prone to erroneous overestimation of the surrogate (possibly due to over-fitting on a biased sample of function evaluation) on inputs outside the offline dataset. Prior approaches addressing this challenge have primarily focused on learning robust surrogate models. However, their search strategies are derived from the surrogate model rather than the actual offline data. To fill this important gap, we introduce a new learning-to-search perspective for offline optimization by reformulating it as an offline reinforcement learning problem. Our proposed policy-guided gradient search approach explicitly learns the best policy for a given surrogate model created from the offline data. Our empirical results on multiple benchmarks demonstrate that the learned optimization policy can be combined with existing offline surrogates to significantly improve the optimization performance. \ No newline at end of file diff --git a/data/2024/aaai/Offline and Online Optical Flow Enhancement for Deep Video Compression b/data/2024/aaai/Offline and Online Optical Flow Enhancement for Deep Video Compression new file mode 100644 index 0000000000..acb3efc04c --- /dev/null +++ b/data/2024/aaai/Offline and Online Optical Flow Enhancement for Deep Video Compression @@ -0,0 +1 @@ +Video compression relies heavily on exploiting the temporal redundancy between video frames, which is usually achieved by estimating and using the motion information. The motion information is represented as optical flows in most of the existing deep video compression networks. Indeed, these networks often adopt pre-trained optical flow estimation networks for motion estimation. 
The optical flows, however, may be less suitable for video compression due to the following two factors. First, the optical flow estimation networks were trained to perform inter-frame prediction as accurately as possible, but the optical flows themselves may cost too many bits to encode. Second, the optical flow estimation networks were trained on synthetic data, and may not generalize well enough to real-world videos. We address the twofold limitations by enhancing the optical flows in two stages: offline and online. In the offline stage, we fine-tune a trained optical flow estimation network with the motion information provided by a traditional (non-deep) video compression scheme, e.g. H.266/VVC, as we believe the motion information of H.266/VVC achieves a better rate-distortion trade-off. In the online stage, we further optimize the latent features of the optical flows with a gradient descent-based algorithm for the video to be compressed, so as to enhance the adaptivity of the optical flows. We conduct experiments on two state-of-the-art deep video compression schemes, DCVC and DCVC-DC. Experimental results demonstrate that the proposed offline and online enhancement together achieves on average 13.4% bitrate saving for DCVC and 4.1% bitrate saving for DCVC-DC on the tested videos, without increasing the model or computational complexity of the decoder side. \ No newline at end of file diff --git a/data/2024/aaai/Omega-Regular Decision Processes b/data/2024/aaai/Omega-Regular Decision Processes new file mode 100644 index 0000000000..9b62b7e1ea --- /dev/null +++ b/data/2024/aaai/Omega-Regular Decision Processes @@ -0,0 +1 @@ +Regular decision processes (RDPs) are a subclass of non-Markovian decision processes where the transition and reward functions are guarded by some regular property of the past (a lookback). While RDPs enable intuitive and succinct representation of non-Markovian decision processes, their expressive power coincides with finite-state Markov decision processes (MDPs). We introduce omega-regular decision processes (ODPs) where the non-Markovian aspect of the transition and reward functions are extended to an omega-regular lookahead over the system evolution. Semantically, these lookaheads can be considered as promises made by the decision maker or the learning agent about her future behavior. In particular, we assume that, if the promised lookaheads are not met, then the payoff to the decision maker is falsum (least desirable payoff), overriding any rewards collected by the decision maker. We enable optimization and learning for ODPs under the discounted-reward objective by reducing them to lexicographic optimization and learning over finite MDPs. We present experimental results demonstrating the effectiveness of the proposed reduction. \ No newline at end of file diff --git a/data/2024/aaai/Omni-Kernel Network for Image Restoration b/data/2024/aaai/Omni-Kernel Network for Image Restoration new file mode 100644 index 0000000000..2664f78612 --- /dev/null +++ b/data/2024/aaai/Omni-Kernel Network for Image Restoration @@ -0,0 +1 @@ +Image restoration aims to reconstruct a high-quality image from a degraded low-quality observation. Recently, Transformer models have achieved promising performance on image restoration tasks due to their powerful ability to model long-range dependencies. However, the quadratically growing complexity with respect to the input size makes them inapplicable to practical applications. 
In this paper, we develop an efficient convolutional network for image restoration by enhancing multi-scale representation learning. To this end, we propose an omni-kernel module that consists of three branches, i.e., global, large, and local branches, to learn global-to-local feature representations efficiently. Specifically, the global branch achieves a global perceptive field via the dual-domain channel attention and frequency-gated mechanism. Furthermore, to provide multi-grained receptive fields, the large branch is formulated via different shapes of depth-wise convolutions with unusually large kernel sizes. Moreover, we complement local information using a point-wise depth-wise convolution. Finally, the proposed network, dubbed OKNet, is established by inserting the omni-kernel module into the bottleneck position for efficiency. Extensive experiments demonstrate that our network achieves state-of-the-art performance on 11 benchmark datasets for three representative image restoration tasks, including image dehazing, image desnowing, and image defocus deblurring. The code is available at https://github.com/c-yn/OKNet. \ No newline at end of file diff --git a/data/2024/aaai/Omnidirectional Image Super-resolution via Bi-projection Fusion b/data/2024/aaai/Omnidirectional Image Super-resolution via Bi-projection Fusion new file mode 100644 index 0000000000..2fbd121719 --- /dev/null +++ b/data/2024/aaai/Omnidirectional Image Super-resolution via Bi-projection Fusion @@ -0,0 +1 @@ +With the rapid development of virtual reality, omnidirectional images (ODIs) have attracted much attention from both the industrial community and academia. However, due to storage and transmission limitations, the resolution of current ODIs is often insufficient to provide an immersive virtual reality experience. Previous approaches address this issue using conventional 2D super-resolution techniques on equirectangular projection without exploiting the unique geometric properties of ODIs. In particular, the equirectangular projection (ERP) provides a complete field-of-view but introduces significant distortion, while the cubemap projection (CMP) can reduce distortion yet has a limited field-of-view. In this paper, we present a novel Bi-Projection Omnidirectional Image Super-Resolution (BPOSR) network to take advantage of the geometric properties of the above two projections. Then, we design two tailored attention methods for these projections: Horizontal Striped Transformer Block (HSTB) for ERP and Perspective Shift Transformer Block (PSTB) for CMP. Furthermore, we propose a fusion module to make these projections complement each other. Extensive experiments demonstrate that BPOSR achieves state-of-the-art performance on omnidirectional image super-resolution. The code is available at https://github.com/W-JG/BPOSR. 
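Background note for the BPOSR entry above: the equirectangular projection (ERP) maps image pixels to longitude/latitude on the sphere, which is the geometric fact behind its distortion pattern. The sketch below is a generic, illustrative mapping from an ERP pixel to a spherical direction (not code from the paper); all names are hypothetical.

```python
import math

def erp_pixel_to_direction(x: float, y: float, width: int, height: int):
    """Map an equirectangular (ERP) pixel to longitude/latitude and a unit 3D ray.

    x, y:   pixel coordinates (0..width-1, 0..height-1).
    width:  ERP image width  (covers 360 degrees of longitude).
    height: ERP image height (covers 180 degrees of latitude).
    """
    lon = ((x + 0.5) / width - 0.5) * 2.0 * math.pi   # in (-pi, pi)
    lat = (0.5 - (y + 0.5) / height) * math.pi        # in (-pi/2, pi/2)
    dx = math.cos(lat) * math.sin(lon)
    dy = math.sin(lat)
    dz = math.cos(lat) * math.cos(lon)
    return (lon, lat), (dx, dy, dz)

# Toy usage: the center pixel of a 2048x1024 ERP image looks straight ahead (+z).
print(erp_pixel_to_direction(1023.5, 511.5, 2048, 1024))
```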
\ No newline at end of file diff --git a/data/2024/aaai/Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency b/data/2024/aaai/Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency new file mode 100644 index 0000000000..3df1284597 --- /dev/null +++ b/data/2024/aaai/Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency @@ -0,0 +1 @@ +Natural language video localization plays a pivotal role in video understanding, and leveraging weakly-labeled data is considered a promising approach to circumvent the labor-intensive process of manual annotation. However, this approach encounters two significant challenges: 1) limited input distribution, namely that the limited writing styles of the language queries, annotated by human annotators, hinder the model's generalization to real-world scenarios with diverse vocabularies and sentence structures; 2) incomplete ground truth, which provides insufficient supervision. To overcome these challenges, we propose an omnipotent distillation algorithm with large language models (LLMs). The distribution of the input samples is enriched to obtain diverse multi-view versions, and a consistency constraint then regularizes the agreement among their results for distillation. Specifically, we first train our teacher model with the proposed intra-model agreement, where multiple sub-models are supervised by each other. Then, we leverage the LLM to paraphrase the language query and distill the teacher model into a lightweight student model by enforcing consistency between the localization results of the paraphrased sentence and the original one. In addition, to assess the generalization of the model across different dimensions of language variation, we create extensive datasets by building upon existing datasets. Our experiments demonstrate substantial performance improvements that adapt to diverse kinds of language queries. \ No newline at end of file diff --git a/data/2024/aaai/On Alternating-Time Temporal Logic, Hyperproperties, and Strategy Sharing b/data/2024/aaai/On Alternating-Time Temporal Logic, Hyperproperties, and Strategy Sharing new file mode 100644 index 0000000000..6418fdc4d2 --- /dev/null +++ b/data/2024/aaai/On Alternating-Time Temporal Logic, Hyperproperties, and Strategy Sharing @@ -0,0 +1,7 @@ +Alternating-time temporal logic (ATL*) is a well-established framework for formal reasoning about multi-agent systems. +However, while ATL* can reason about the strategic ability of agents (e.g., some coalition A can ensure that a goal is reached eventually), we cannot compare multiple strategic interactions, nor can we require multiple agents to follow the same strategy. +For example, we cannot state that coalition A can reach a goal sooner (or more often) than some other coalition A'. +In this paper, we propose HyperATL*_S, an extension of ATL* in which we can (1) compare the outcome of multiple strategic interactions w.r.t. a hyperproperty, i.e., a property that refers to multiple paths at the same time, and (2) enforce that some agents share the same strategy. +We show that HyperATL*_S is a rich specification language that captures important AI-related properties that were out of reach of existing logics. +We prove that model checking of HyperATL*_S on concurrent game structures is decidable.
+We implement our model-checking algorithm in a tool we call HyMASMC and evaluate it on a range of benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/On Computing Makespan-Optimal Solutions for Generalized Sliding-Tile Puzzles b/data/2024/aaai/On Computing Makespan-Optimal Solutions for Generalized Sliding-Tile Puzzles new file mode 100644 index 0000000000..d9255a4c7f --- /dev/null +++ b/data/2024/aaai/On Computing Makespan-Optimal Solutions for Generalized Sliding-Tile Puzzles @@ -0,0 +1 @@ +In the 15-puzzle game, 15 labeled square tiles are reconfigured on a 4 × 4 board through an escort, wherein in each (time) step, a single tile neighboring it may slide into it, leaving the space previously occupied by the tile as the new escort. We study a generalized sliding-tile puzzle (GSTP) in which (1) there are 1+ escorts and (2) multiple tiles can move synchronously in a single time step. Compared with popular discrete multi-agent/robot motion models, GSTP provides a more accurate model for a broad array of high-utility applications, including warehouse automation and autonomous garage parking, but is less studied due to the more involved tile interactions. In this work, we analyze optimal GSTP solution structures, establishing that computing makespan-optimal solutions for GSTP is NP-complete and developing polynomial time algorithms yielding makespans approximating the minimum with expected/high probability constant factors, assuming randomized start and goal configurations. \ No newline at end of file diff --git a/data/2024/aaai/On Disentanglement of Asymmetrical Knowledge Transfer for Modality-Task Agnostic Federated Learning b/data/2024/aaai/On Disentanglement of Asymmetrical Knowledge Transfer for Modality-Task Agnostic Federated Learning new file mode 100644 index 0000000000..296f24f2e6 --- /dev/null +++ b/data/2024/aaai/On Disentanglement of Asymmetrical Knowledge Transfer for Modality-Task Agnostic Federated Learning @@ -0,0 +1 @@ +There has been growing concern regarding data privacy during the development and deployment of Multimodal Foundation Models for Artificial General Intelligence (AGI), while Federated Learning (FL) allows multiple clients to collaboratively train models in a privacy-preserving manner. This paper formulates and studies Modality-task Agnostic Federated Learning (AFL) to pave the way toward privacy-preserving AGI. A unique property of AFL is the asymmetrical knowledge relationships among clients due to modality gaps, task gaps, and domain shifts between clients. This raises a challenge in learning an optimal inter-client information-sharing scheme that maximizes positive transfer and minimizes negative transfer for AFL. However, prior FL methods, mostly focusing on symmetrical knowledge transfer, tend to exhibit insufficient positive transfer and fail to fully avoid negative transfer during inter-client collaboration. To address this issue, we propose DisentAFL, which leverages a two-stage Knowledge Disentanglement and Gating mechanism to explicitly decompose the original asymmetrical inter-client information-sharing scheme into several independent symmetrical inter-client information-sharing schemes, each of which corresponds to a certain semantic knowledge type learned from the local tasks. Experimental results demonstrate the superiority of our method on AFL over baselines.
\ No newline at end of file diff --git a/data/2024/aaai/On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design b/data/2024/aaai/On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design new file mode 100644 index 0000000000..d05bb6a744 --- /dev/null +++ b/data/2024/aaai/On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design @@ -0,0 +1 @@ +Bayesian Experimental Design (BED), which aims to find the optimal experimental conditions for Bayesian inference, is usually posed as optimizing the expected information gain (EIG). The gradient information is often needed for efficient EIG optimization, and as a result the ability to estimate the gradient of EIG is essential for BED problems. The primary goal of this work is to develop methods for estimating the gradient of EIG, which, combined with stochastic gradient descent algorithms, result in efficient optimization of EIG. Specifically, we first introduce a posterior expected representation of the EIG gradient with respect to the design variables. Based on this, we propose two methods for estimating the EIG gradient, UEEG-MCMC that leverages posterior samples generated through Markov Chain Monte Carlo (MCMC) to estimate the EIG gradient, and BEEG-AP that focuses on achieving high simulation efficiency by repeatedly using parameter samples. Theoretical analysis and numerical studies illustrate that UEEG-MCMC is robust against the actual EIG value, while BEEG-AP is more efficient when the EIG value to be optimized is small. Moreover, both methods show superior performance compared to several popular benchmarks in our numerical experiments. \ No newline at end of file diff --git a/data/2024/aaai/On Inference Stability for Diffusion Models b/data/2024/aaai/On Inference Stability for Diffusion Models new file mode 100644 index 0000000000..d00fa9caa4 --- /dev/null +++ b/data/2024/aaai/On Inference Stability for Diffusion Models @@ -0,0 +1 @@ +Denoising Probabilistic Models (DPMs) represent an emerging domain of generative models that excel in generating diverse and high-quality images. However, most current training methods for DPMs often neglect the correlation between timesteps, limiting the model's performance in generating images effectively. Notably, we theoretically point out that this issue can be caused by the cumulative estimation gap between the predicted and the actual trajectory. To minimize that gap, we propose a novel sequence-aware loss that aims to reduce the estimation gap to enhance the sampling quality. Furthermore, we theoretically show that our proposed loss function is a tighter upper bound of the estimation loss in comparison with the conventional loss in DPMs. Experimental results on several benchmark datasets including CIFAR10, CelebA, and CelebA-HQ consistently show a remarkable improvement of our proposed method regarding the image generation quality measured by FID and Inception Score compared to several DPM baselines. Our code and pre-trained checkpoints are available at https://github.com/VinAIResearch/SA-DPM.
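For readers unfamiliar with the "conventional loss in DPMs" that the sequence-aware loss above is contrasted against, the following is a minimal sketch of the standard per-timestep noise-prediction (DDPM-style) objective; the function name, tensor shapes, and schedule handling are illustrative assumptions, and the paper's sequence-aware loss, which couples timesteps, is not reproduced here.

```python
import torch

def ddpm_simple_loss(model, x0, alphas_cumprod, num_timesteps=1000):
    """Conventional per-timestep DDPM objective: predict the injected noise.

    model          : eps_theta(x_t, t), a noise-prediction network (placeholder).
    x0             : clean images, shape (B, C, H, W).
    alphas_cumprod : 1-D tensor of cumulative noise-schedule products.
    """
    b = x0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion
    return torch.nn.functional.mse_loss(model(x_t, t), noise)

# Toy usage with a dummy noise predictor and a linear beta schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
dummy_model = lambda x_t, t: torch.zeros_like(x_t)
x0 = torch.randn(2, 3, 8, 8)
print(ddpm_simple_loss(dummy_model, x0, alphas_cumprod))
```

Each timestep is treated independently in this conventional objective, which is exactly the property the sequence-aware loss above aims to go beyond.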
\ No newline at end of file diff --git a/data/2024/aaai/On Optimal Tradeoffs between EFX and Nash Welfare b/data/2024/aaai/On Optimal Tradeoffs between EFX and Nash Welfare new file mode 100644 index 0000000000..3670181cb0 --- /dev/null +++ b/data/2024/aaai/On Optimal Tradeoffs between EFX and Nash Welfare @@ -0,0 +1 @@ +A major problem in fair division is how to allocate a set of indivisible resources among agents fairly and efficiently. The goal of this work is to characterize the tradeoffs between two well-studied measures of fairness and efficiency --- envy freeness up to any item (EFX) for fairness, and Nash welfare for efficiency --- by saying, for given constants α and β, whether there exists an α-EFX allocation that guarantees a β-fraction of the maximum Nash welfare (β-MNW). For additive valuations, we show that for any α ∈ [0,1], there exists a partial allocation that is α-EFX and 1/(α+1)-MNW. This tradeoff turns out to be tight (for every α) as demonstrated by an impossibility result that we give. We also show that for α ∈ [0, φ-1 ≃ 0.618] these partial allocations can be turned into complete allocations where all items are assigned. Furthermore, for any α ∈ [0, 1/2], we show that the tight tradeoff of α-EFX and 1/(α+1)-MNW with complete allocations holds for the more general setting of subadditive valuations. Our results improve upon the current state of the art, for both additive and subadditive valuations, and match the best-known approximations of EFX under complete allocations, regardless of Nash welfare guarantees. Notably, our constructions for additive valuations also provide EF1 and constant approximations for maximin share guarantees. \ No newline at end of file diff --git a/data/2024/aaai/On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and Efficient Gradient Methods b/data/2024/aaai/On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and Efficient Gradient Methods new file mode 100644 index 0000000000..2f903ae368 --- /dev/null +++ b/data/2024/aaai/On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and Efficient Gradient Methods @@ -0,0 +1 @@ +This paper studies the Partial Optimal Transport (POT) problem between two unbalanced measures with at most n supports and its applications in various AI tasks such as color transfer or domain adaptation. There is hence a need for fast approximations of POT with increasingly large problem sizes in arising applications. We first theoretically and experimentally investigate the infeasibility of the state-of-the-art Sinkhorn algorithm for POT, which consequently degrades its qualitative performance in real world applications like point-cloud registration. To this end, we propose a novel rounding algorithm for POT, and then provide a feasible Sinkhorn procedure with a revised computation complexity of O(n^2/epsilon^4). Our rounding algorithm also permits the development of two first-order methods to approximate the POT problem. The first algorithm, Adaptive Primal-Dual Accelerated Gradient Descent (APDAGD), finds an epsilon-approximate solution to the POT problem in O(n^2.5/epsilon). The second method, Dual Extrapolation, achieves the computation complexity of O(n^2/epsilon), thereby being the best in the literature. We further demonstrate the flexibility of POT compared to standard OT as well as the practicality of our algorithms on real applications where two marginal distributions are unbalanced. 
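As background for the Sinkhorn discussion above, here is a minimal sketch of the standard balanced entropic-OT Sinkhorn iterations; it is not the revised, feasibility-corrected POT procedure nor the APDAGD/Dual Extrapolation methods from the abstract, and the variable names (cost matrix C, marginals a and b, regularizer eps) are illustrative assumptions. The partial/unbalanced setting studied above relaxes exactly the equal-mass requirement this balanced version relies on.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iters=500):
    """Standard balanced entropic-OT Sinkhorn iterations (background sketch).

    a, b : marginal histograms, assumed here to have equal total mass.
    C    : n x m ground-cost matrix.
    eps  : entropic regularization strength.
    """
    K = np.exp(-C / eps)                # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                 # scale rows to match marginal a
        v = b / (K.T @ u)               # scale columns to match marginal b
    return u[:, None] * K * v[None, :]  # transport plan

# Toy usage: two 4-point distributions with equal total mass.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 1)), rng.normal(size=(4, 1))
C = (x - y.T) ** 2
P = sinkhorn(np.full(4, 0.25), np.full(4, 0.25), C)
print(P.sum())  # ~1.0: rows and columns match the prescribed marginals
```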
\ No newline at end of file diff --git a/data/2024/aaai/On Unsupervised Domain Adaptation: Pseudo Label Guided Mixup for Adversarial Prompt Tuning b/data/2024/aaai/On Unsupervised Domain Adaptation: Pseudo Label Guided Mixup for Adversarial Prompt Tuning new file mode 100644 index 0000000000..65b793cd17 --- /dev/null +++ b/data/2024/aaai/On Unsupervised Domain Adaptation: Pseudo Label Guided Mixup for Adversarial Prompt Tuning @@ -0,0 +1 @@ +To date, a backbone of methods for unsupervised domain adaptation (UDA) involves learning label-discriminative features via a label classifier and domain-invariant features through a domain discriminator in an adversarial scheme. However, these methods lack explicit control for aligning the source data and target data within the same label class, degrading the classifier's performance in the target domain. In this paper, we propose PL-Mix, a pseudo label guided Mixup method based on adversarial prompt tuning. Specifically, our PL-Mix facilitates class-dependent alignment and can alleviate the impact of noisy pseudo-labels. We then theoretically justify that PL-Mix can improve the generalization for UDA. Extensive experiments of the comparison with existing models also demonstrate the effectiveness of PL-Mix. \ No newline at end of file diff --git a/data/2024/aaai/On the Actionability of Outcome Prediction b/data/2024/aaai/On the Actionability of Outcome Prediction new file mode 100644 index 0000000000..b0cffe3345 --- /dev/null +++ b/data/2024/aaai/On the Actionability of Outcome Prediction @@ -0,0 +1,6 @@ +Predicting future outcomes is a prevalent application of machine learning in social impact domains. Examples range from predicting student success in education to predicting disease risk in healthcare. Practitioners recognize that the ultimate goal is not just to predict but to act effectively. Increasing evidence suggests that relying on outcome predictions for downstream interventions may not have desired results. + +In most domains there exists a multitude of possible interventions for each individual, making the challenge of taking effective action more acute. Even when causal mechanisms connecting the individual's latent states to outcomes are well understood, in any given instance (a specific student or patient), practitioners still need to infer---from budgeted measurements of latent states---which of many possible interventions will be most effective for this individual. With this in mind, we ask: when are accurate predictors of outcomes helpful for identifying the most suitable intervention? + +Through a simple model encompassing actions, latent states, and measurements, we demonstrate that pure outcome prediction rarely results in the most effective policy for taking actions, even when combined with other measurements. +We find that except in cases where there is a single decisive action for improving the outcome, outcome prediction never maximizes "action value", the utility of taking actions. Making measurements of actionable latent states, where specific actions lead to desired outcomes, may considerably enhance the action value compared to outcome prediction, and the degree of improvement depends on action costs and the outcome model. This analysis emphasizes the need to go beyond generic outcome prediction in interventional settings by incorporating knowledge of plausible actions and latent states. 
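The claim above, that pure outcome prediction rarely yields the most effective action policy, can be illustrated with a toy simulation; the two latent states, the additive outcome, and the fixed action effects below are our own illustrative assumptions rather than the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two latent states (e.g., two distinct deficits) and an additive outcome.
s1 = rng.uniform(0, 1, n)
s2 = rng.uniform(0, 1, n)

def improvement(action, s1, s2):
    # Action 0 boosts s1 (capped at 1), action 1 boosts s2 (capped at 1).
    return np.where(action == 0, np.minimum(0.3, 1 - s1), np.minimum(0.3, 1 - s2))

# Policy 1: outcome prediction only -- even a perfect predictor of s1 + s2
# gives no clue which action to take, so everyone receives a fixed action.
gain_outcome_policy = improvement(np.zeros(n, dtype=int), s1, s2).mean()

# Policy 2: measure the latent states and target the weaker one.
best_action = (s2 < s1).astype(int)
gain_latent_policy = improvement(best_action, s1, s2).mean()

print(f"action value, outcome-prediction policy: {gain_outcome_policy:.3f}")
print(f"action value, latent-state policy:       {gain_latent_policy:.3f}")
```

In this toy setting the latent-state policy attains a strictly higher average improvement, mirroring the abstract's point that measuring actionable latent states can dominate outcome prediction.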
\ No newline at end of file diff --git a/data/2024/aaai/On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling b/data/2024/aaai/On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling new file mode 100644 index 0000000000..29cda18590 --- /dev/null +++ b/data/2024/aaai/On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling @@ -0,0 +1 @@ +Hierarchical topic modeling aims to discover latent topics from a corpus and organize them into a hierarchy to understand documents with desirable semantic granularity. However, existing work struggles with producing topic hierarchies of low affinity, rationality, and diversity, which hampers document understanding. To overcome these challenges, we in this paper propose Transport Plan and Context-aware Hierarchical Topic Model (TraCo). Instead of early simple topic dependencies, we propose a transport plan dependency method. It constrains dependencies to ensure their sparsity and balance, and also regularizes topic hierarchy building with them. This improves affinity and diversity of hierarchies. We further propose a context-aware disentangled decoder. Rather than previously entangled decoding, it distributes different semantic granularity to topics at different levels by disentangled decoding. This facilitates the rationality of hierarchies. Experiments on benchmark datasets demonstrate that our method surpasses state-of-the-art baselines, effectively improving the affinity, rationality, and diversity of hierarchical topic modeling with better performance on downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/On the Computational Complexity of Plan Verification, (Bounded) Plan-Optimality Verification, and Bounded Plan Existence b/data/2024/aaai/On the Computational Complexity of Plan Verification, (Bounded) Plan-Optimality Verification, and Bounded Plan Existence new file mode 100644 index 0000000000..2823360653 --- /dev/null +++ b/data/2024/aaai/On the Computational Complexity of Plan Verification, (Bounded) Plan-Optimality Verification, and Bounded Plan Existence @@ -0,0 +1 @@ +In this paper we study the computational complexity of several reasoning tasks centered around the bounded plan existence problem. We do this for standard classical planning and hierarchical task network (HTN) planning and each for a grounded and a lifted representation. Whereas bounded plan existence complexity is known for classical planning, it has not yet been studied for HTN planning. For plan verification, results were available for both formalisms except for the lifted HTN planning. We will present lower and upper bounds of the complexity of plan verification in lifted HTN planning and provide novel insights into its grounded counterpart, in which we show that verification is not just NP-complete in the general case, but already for a severely restricted special case. Finally, we show the complexity concerning verifying the optimality of a given plan and discuss its connection to the bounded plan existence problem. 
\ No newline at end of file diff --git a/data/2024/aaai/On the Concept Trustworthiness in Concept Bottleneck Models b/data/2024/aaai/On the Concept Trustworthiness in Concept Bottleneck Models new file mode 100644 index 0000000000..587a6bb498 --- /dev/null +++ b/data/2024/aaai/On the Concept Trustworthiness in Concept Bottleneck Models @@ -0,0 +1 @@ +Concept Bottleneck Models (CBMs), which break down the reasoning process into the input-to-concept mapping and the concept-to-label prediction, have garnered significant attention due to their remarkable interpretability achieved by the interpretable concept bottleneck. However, despite the transparency of the concept-to-label prediction, the mapping from the input to the intermediate concept remains a black box, giving rise to concerns about the trustworthiness of the learned concepts (i.e., these concepts may be predicted based on spurious cues). The issue of concept untrustworthiness greatly hampers the interpretability of CBMs, thereby hindering their further advancement. To conduct a comprehensive analysis on this issue, in this study we establish a benchmark to assess the trustworthiness of concepts in CBMs. A pioneering metric, referred to as concept trustworthiness score, is proposed to gauge whether the concepts are derived from relevant regions. Additionally, an enhanced CBM is introduced, enabling concept predictions to be made specifically from distinct parts of the feature map, thereby facilitating the exploration of their related regions. Besides, we introduce three modules, namely the cross-layer alignment (CLA) module, the cross-image alignment (CIA) module, and the prediction alignment (PA) module, to further enhance the concept trustworthiness within the elaborated CBM. The experiments on five datasets across ten architectures demonstrate that without using any concept localization annotations during training, our model improves the concept trustworthiness by a large margin, meanwhile achieving superior accuracy to the state-of-the-arts. Our code is available at https://github.com/hqhQAQ/ProtoCBM. \ No newline at end of file diff --git a/data/2024/aaai/On the Convergence of an Adaptive Momentum Method for Adversarial Attacks b/data/2024/aaai/On the Convergence of an Adaptive Momentum Method for Adversarial Attacks new file mode 100644 index 0000000000..980cdcabe8 --- /dev/null +++ b/data/2024/aaai/On the Convergence of an Adaptive Momentum Method for Adversarial Attacks @@ -0,0 +1 @@ +Adversarial examples are commonly created by solving a constrained optimization problem, typically using sign-based methods like Fast Gradient Sign Method (FGSM). These attacks can benefit from momentum with a constant parameter, such as Momentum Iterative FGSM (MI-FGSM), to enhance black-box transferability. However, the monotonic time-varying momentum parameter is required to guarantee convergence in theory, creating a theory-practice gap. Additionally, recent work shows that sign-based methods fail to converge to the optimum in several convex settings, exacerbating the issue. To address these concerns, we propose a novel method which incorporates both an innovative adaptive momentum parameter without monotonicity assumptions and an adaptive step-size scheme that replaces the sign operation. Furthermore, we derive a regret upper bound for general convex functions. 
Experiments on multiple models demonstrate the efficacy of our method in generating adversarial examples with human-imperceptible noise while achieving high attack success rates, indicating its superiority over previous adversarial example generation methods. \ No newline at end of file diff --git a/data/2024/aaai/On the Expressivity of Recurrent Neural Cascades b/data/2024/aaai/On the Expressivity of Recurrent Neural Cascades new file mode 100644 index 0000000000..06f4830a5d --- /dev/null +++ b/data/2024/aaai/On the Expressivity of Recurrent Neural Cascades @@ -0,0 +1,2 @@ +Recurrent Neural Cascades (RNCs) are the recurrent neural networks with no cyclic dependencies among recurrent neurons. This class of recurrent networks has received a lot of attention in practice. Besides training methods for a fixed architecture such as backpropagation, the cascade architecture naturally allows for constructive learning methods, where recurrent nodes are added incrementally one at a time, often yielding smaller networks. Furthermore, acyclicity amounts to a structural prior that even for the same number of neurons yields a more favourable sample complexity compared to a fully-connected architecture. +A central question is whether the advantages of the cascade architecture come at the cost of a reduced expressivity. We provide new insights into this question. We show that the regular languages captured by RNCs with sign and tanh activation with positive recurrent weights are the star-free regular languages. In order to establish our results we developed a novel framework where capabilities of RNCs are assessed by analysing which semigroups and groups a single neuron is able to implement. A notable implication of our framework is that RNCs can achieve the expressivity of all regular languages by introducing neurons that can implement groups. \ No newline at end of file diff --git a/data/2024/aaai/On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods b/data/2024/aaai/On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods new file mode 100644 index 0000000000..7f1a7a90ea --- /dev/null +++ b/data/2024/aaai/On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods @@ -0,0 +1 @@ +Most existing evaluations of explainable machine learning (ML) methods rely on simplifying assumptions or proxies that do not reflect real-world use cases; the handful of more robust evaluations on real-world settings have shortcomings in their design, generally leading to overestimation of methods' real-world utility. In this work, we seek to address this by conducting a study that evaluates post-hoc explainable ML methods in a setting consistent with the application context and provide a template for future evaluation studies. We modify and improve a prior study on e-commerce fraud detection by relaxing the original work's simplifying assumptions that departed from the deployment context. Our study finds no evidence for the utility of the tested explainable ML methods in the context, which is a drastically different conclusion from the earlier work. This highlights how seemingly trivial experimental design choices can yield misleading conclusions about method utility. 
In addition, our work carries lessons about the necessity of not only evaluating explainable ML methods using tasks, data, users, and metrics grounded in the intended application context but also developing methods tailored to specific applications, moving beyond general-purpose explainable ML methods. \ No newline at end of file diff --git a/data/2024/aaai/On the Robustness of Neural-Enhanced Video Streaming against Adversarial Attacks b/data/2024/aaai/On the Robustness of Neural-Enhanced Video Streaming against Adversarial Attacks new file mode 100644 index 0000000000..70e92316aa --- /dev/null +++ b/data/2024/aaai/On the Robustness of Neural-Enhanced Video Streaming against Adversarial Attacks @@ -0,0 +1 @@ +The explosive growth of video traffic on today's Internet promotes the rise of Neural-enhanced Video Streaming (NeVS), which effectively improves the rate-distortion trade-off by employing a cheap neural super-resolution model for quality enhancement on the receiver side. We reveal a practical threat, missed by existing work, in which the crucial codec component (i.e., encoder for compression and decoder for restoration) can trigger adversarial attacks in a man-in-the-middle manner to significantly destroy video recovery performance and finally incur the malfunction of downstream video perception tasks. In this paper, we make the first attempt to inspect the vulnerability of NeVS and discover a novel adversarial attack, called codec hijacking, where the injected invisible perturbation conspires with the malicious encoding matrix by reorganizing the spatial-temporal bit allocation within the bitstream size budget. Such a zero-day vulnerability makes our attack hard to defend against because there is no visual distortion on the recovered videos until the attack happens. More seriously, this attack can be extended to diverse enhancement models, thus putting a wide range of video perception tasks under threat. Evaluation on a state-of-the-art video codec benchmark illustrates that our attack significantly degrades the recovery performance of NeVS over previous attack methods. The damaged video quality finally leads to obvious malfunction of downstream tasks with over 75% success rate. We hope to draw public attention to codec hijacking and its defence. \ No newline at end of file diff --git a/data/2024/aaai/On the Role of Server Momentum in Federated Learning b/data/2024/aaai/On the Role of Server Momentum in Federated Learning new file mode 100644 index 0000000000..370fb4cf0d --- /dev/null +++ b/data/2024/aaai/On the Role of Server Momentum in Federated Learning @@ -0,0 +1 @@ +Federated Averaging (FedAvg) is known to experience convergence issues when encountering significant client system heterogeneity and data heterogeneity. Server momentum has been proposed as an effective mitigation. However, existing server momentum works are restrictive in the momentum formulation, do not properly schedule hyperparameters, and focus only on system homogeneous settings, which leaves the role of server momentum an under-explored problem. In this paper, we propose a general framework for server momentum that (a) covers a large class of momentum schemes that are unexplored in federated learning (FL), (b) enables a popular stagewise hyperparameter scheduler, and (c) allows heterogeneous and asynchronous local computing. We provide rigorous convergence analysis for the proposed framework.
To the best of our knowledge, this is the first work that thoroughly analyzes the performance of server momentum with a hyperparameter scheduler and system heterogeneity. Extensive experiments validate the effectiveness of our proposed framework. Due to the page limit, we leave all proofs to the full version https://arxiv.org/abs/2312.12670. \ No newline at end of file diff --git a/data/2024/aaai/On the Structural Hardness of Answer Set Programming: Can Structure Efficiently Confine the Power of Disjunctions? b/data/2024/aaai/On the Structural Hardness of Answer Set Programming: Can Structure Efficiently Confine the Power of Disjunctions? new file mode 100644 index 0000000000..88d21c098b --- /dev/null +++ b/data/2024/aaai/On the Structural Hardness of Answer Set Programming: Can Structure Efficiently Confine the Power of Disjunctions? @@ -0,0 +1,2 @@ +Answer Set Programming (ASP) is a generic problem modeling and solving framework with a strong focus on knowledge representation and a rapid growth of industrial applications. So far, the study of complexity has resulted in characterizing hardness and determining its sources, fine-grained insights in the form of dichotomy-style results, as well as detailed parameterized complexity landscapes. Unfortunately, for the well-known parameter treewidth, disjunctive programs require double-exponential runtime under reasonable complexity assumptions. This quickly becomes out of reach. We deal with the classification of structural parameters for disjunctive ASP on the program's rule structure (incidence graph). +First, we provide a polynomial kernel to obtain single-exponential runtime in terms of vertex cover size, despite subset-minimization being not represented in the program’s structure. Then we turn our attention to strictly better structural parameters between vertex cover size and treewidth. Here, we provide double-exponential lower bounds for the most prominent parameters in that range: treedepth, feedback vertex size, and cliquewidth. Based on this, we argue that unfortunately our options beyond vertex cover size are limited. Our results provide an in-depth hardness study, relying on a novel reduction from normal to disjunctive programs, trading the increase of complexity for an exponential parameter compression. \ No newline at end of file diff --git a/data/2024/aaai/On the Unstable Convergence Regime of Gradient Descent b/data/2024/aaai/On the Unstable Convergence Regime of Gradient Descent new file mode 100644 index 0000000000..40c7e8f99a --- /dev/null +++ b/data/2024/aaai/On the Unstable Convergence Regime of Gradient Descent @@ -0,0 +1 @@ +Traditional gradient descent (GD) has been fully investigated for convex or L-smooth functions, and it is widely utilized in current neural network optimization. The classical descent lemma ensures that for a function with L-smoothness, the GD trajectory converges stably towards the minimum when the learning rate is below 2 / L. This convergence is marked by a consistent reduction in the loss function throughout the iterations. However, recent experimental studies have demonstrated that even when the L-smoothness condition is not met, or if the learning rate is increased, leading to oscillations in the loss function during iterations, the GD trajectory still exhibits convergence over the long run. This phenomenon is referred to as the unstable convergence regime of GD. In this paper, we present a theoretical perspective to offer a qualitative analysis of this phenomenon.
The unstable convergence is in fact an inherent property of GD for general twice differentiable functions. Specifically, the forward-invariance of GD is established, i.e., it ensures that any point within a local region will always remain within this region under GD iteration. Then, based on the forward-invariance, for the initialization outside an open set containing the local minimum, the loss function will oscillate during the first several iterations and then decrease monotonically after the GD trajectory has jumped into the open set. This work theoretically clarifies the unstable convergence phenomenon of GD discussed in previous experimental works. The unstable convergence of GD mainly depends on the selection of the initialization, and it is actually inevitable due to the complex nature of the loss function. \ No newline at end of file diff --git a/data/2024/aaai/Once and for All: Universal Transferable Adversarial Perturbation against Deep Hashing-Based Facial Image Retrieval b/data/2024/aaai/Once and for All: Universal Transferable Adversarial Perturbation against Deep Hashing-Based Facial Image Retrieval new file mode 100644 index 0000000000..b336d93b1e --- /dev/null +++ b/data/2024/aaai/Once and for All: Universal Transferable Adversarial Perturbation against Deep Hashing-Based Facial Image Retrieval @@ -0,0 +1 @@ +Deep Hashing (DH)-based image retrieval has been widely applied to face-matching systems due to its accuracy and efficiency. However, this convenience comes with an increased risk of privacy leakage. DH models inherit the vulnerability to adversarial attacks, which can be used to prevent the retrieval of private images. Existing adversarial attacks against DH typically target a single image or a specific class of images, lacking a universal adversarial perturbation for the entire hash dataset. In this paper, we propose the first universal transferable adversarial perturbation against DH-based facial image retrieval: a single perturbation can protect all images. Specifically, we explore the relationship between clusters learned by different DH models and define the optimization objective of the universal perturbation as moving away from the overall hash center. To mitigate the challenge of single-objective optimization, we randomly obtain sub-cluster centers and further propose sub-task-based meta-learning to aid in overall optimization. We test our method with popular facial datasets and DH models, indicating impressive cross-image, -identity, -model, and -scheme universal anti-retrieval performance. Compared to state-of-the-art methods, our performance is competitive in white-box settings and exhibits significant improvements of 10%-70% in transferability in all black-box settings. \ No newline at end of file diff --git a/data/2024/aaai/One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems b/data/2024/aaai/One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems new file mode 100644 index 0000000000..c608041928 --- /dev/null +++ b/data/2024/aaai/One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems @@ -0,0 +1 @@ +Abstract Visual Reasoning (AVR) comprises a wide selection of various problems similar to those used in human IQ tests. Recent years have brought dynamic progress in solving particular AVR tasks; however, in the contemporary literature, AVR problems are largely dealt with in isolation, leading to highly specialized task-specific methods.
With the aim of developing universal learning systems in the AVR domain, we propose the unified model for solving Single-Choice Abstract visual Reasoning tasks (SCAR), capable of solving various single-choice AVR tasks, without making any a priori assumptions about the task structure, in particular the number and location of panels. The proposed model relies on a novel Structure-Aware dynamic Layer (SAL), which adapts its weights to the structure of the considered AVR problem. Experiments conducted on Raven's Progressive Matrices, Visual Analogy Problems, and Odd One Out problems show that SCAR (SAL-based models, in general) effectively solves diverse AVR tasks, and its performance is on par with the state-of-the-art task-specific baselines. What is more, SCAR demonstrates effective knowledge reuse in multi-task and transfer learning settings. To our knowledge, this work is the first successful attempt to construct a general single-choice AVR solver relying on self-configurable architecture and unified solving method. With this work we aim to stimulate and foster progress on task-independent research paths in the AVR domain, with the long-term goal of development of a general AVR solver. \ No newline at end of file diff --git a/data/2024/aaai/One Step Closer to Unbiased Aleatoric Uncertainty Estimation b/data/2024/aaai/One Step Closer to Unbiased Aleatoric Uncertainty Estimation new file mode 100644 index 0000000000..655911b393 --- /dev/null +++ b/data/2024/aaai/One Step Closer to Unbiased Aleatoric Uncertainty Estimation @@ -0,0 +1 @@ +Neural networks are powerful tools in various applications, and quantifying their uncertainty is crucial for reliable decision-making. In the deep learning field, the uncertainties are usually categorized into aleatoric (data) and epistemic (model) uncertainty. In this paper, we point out that the existing popular variance attenuation method highly overestimates aleatoric uncertainty. To address this issue, we proposed a new estimation method by actively de-noising the observed data. By conducting a broad range of experiments, we demonstrate that our proposed approach provides a much closer approximation to the actual data uncertainty than the standard method. \ No newline at end of file diff --git a/data/2024/aaai/One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception b/data/2024/aaai/One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception new file mode 100644 index 0000000000..eb69414cd9 --- /dev/null +++ b/data/2024/aaai/One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception @@ -0,0 +1 @@ +Numerous studies have investigated the pivotal role of reliable 3D volume representation in scene perception tasks, such as multi-view stereo (MVS) and semantic scene completion (SSC). They typically construct 3D probability volumes directly with geometric correspondence, attempting to fully address the scene perception tasks in a single forward pass. However, such a single-step solution makes it hard to learn accurate and convincing volumetric probability, especially in challenging regions like unexpected occlusions and complicated light reflections. Therefore, this paper proposes to decompose the complicated 3D volume representation learning into a sequence of generative steps to facilitate fine and reliable scene perception. 
Considering the recent advances achieved by strong generative diffusion models, we introduce a multi-step learning framework, dubbed as VPD, dedicated to progressively refining the Volumetric Probability in a Diffusion process. Specifically, we first build a coarse probability volume from input images with the off-the-shelf scene perception baselines, which is then conditioned as the basic geometry prior before being fed into a 3D diffusion UNet, to progressively achieve accurate probability distribution modeling. To handle the corner cases in challenging areas, a Confidence-Aware Contextual Collaboration (CACC) module is developed to correct the uncertain regions for reliable volumetric learning based on multi-scale contextual contents. Moreover, an Online Filtering (OF) strategy is designed to maintain representation consistency for stable diffusion sampling. Extensive experiments are conducted on scene perception tasks including multi-view stereo (MVS) and semantic scene completion (SSC), to validate the efficacy of our method in learning reliable volumetric representations. Notably, for the SSC task, our work stands out as the first to surpass LiDAR-based methods on the SemanticKITTI dataset. \ No newline at end of file diff --git a/data/2024/aaai/Online Boosting Adaptive Learning under Concept Drift for Multistream Classification b/data/2024/aaai/Online Boosting Adaptive Learning under Concept Drift for Multistream Classification new file mode 100644 index 0000000000..29f5cd6586 --- /dev/null +++ b/data/2024/aaai/Online Boosting Adaptive Learning under Concept Drift for Multistream Classification @@ -0,0 +1 @@ +Multistream classification poses significant challenges due to the necessity for rapid adaptation in dynamic streaming processes with concept drift. Despite the growing research outcomes in this area, there has been a notable oversight regarding the temporal dynamic relationships between these streams, leading to the issue of negative transfer arising from irrelevant data. In this paper, we propose a novel Online Boosting Adaptive Learning (OBAL) method that effectively addresses this limitation by adaptively learning the dynamic correlation among different streams. Specifically, OBAL operates in a dual-phase mechanism, in the first of which we design an Adaptive COvariate Shift Adaptation (AdaCOSA) algorithm to construct an initialized ensemble model using archived data from various source streams, thus mitigating the covariate shift while learning the dynamic correlations via an adaptive re-weighting strategy. During the online process, we employ a Gaussian Mixture Model-based weighting mechanism, which is seamlessly integrated with the acquired correlations via AdaCOSA to effectively handle asynchronous drift. This approach significantly improves the predictive performance and stability of the target stream. We conduct comprehensive experiments on several synthetic and real-world data streams, encompassing various drifting scenarios and types. The results clearly demonstrate that OBAL achieves remarkable advancements in addressing multistream classification problems by effectively leveraging positive knowledge derived from multiple sources. 
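As a rough illustration of what a "Gaussian Mixture Model-based weighting mechanism" over source streams could look like, the sketch below weights per-source classifiers by how likely an incoming target sample is under each source's fitted GMM; the densities, weighting rule, and scikit-learn models are guesses for illustration only and not the actual OBAL/AdaCOSA formulation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two toy source streams with shifted feature distributions.
Xs1, ys1 = rng.normal(0.0, 1.0, (500, 2)), rng.integers(0, 2, 500)
Xs2, ys2 = rng.normal(1.5, 1.0, (500, 2)), rng.integers(0, 2, 500)
sources = [(Xs1, ys1), (Xs2, ys2)]

# Per-source: a classifier plus a GMM density model of that stream's features.
models, densities = [], []
for X, y in sources:
    models.append(LogisticRegression().fit(X, y))
    densities.append(GaussianMixture(n_components=2, random_state=0).fit(X))

def predict_proba_target(x):
    """Weight each source model by how likely x is under that source's GMM."""
    x = x.reshape(1, -1)
    log_liks = np.array([d.score_samples(x)[0] for d in densities])
    w = np.exp(log_liks - log_liks.max())
    w /= w.sum()                                  # normalized source weights
    probs = np.stack([m.predict_proba(x)[0] for m in models])
    return w @ probs                              # weighted ensemble prediction

print(predict_proba_target(rng.normal(0.5, 1.0, 2)))
```

Sources whose feature density better explains the incoming target sample receive larger weights, which is one plausible way to down-weight irrelevant streams and limit negative transfer.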
\ No newline at end of file diff --git a/data/2024/aaai/Online Conversion Rate Prediction via Multi-Interval Screening and Synthesizing under Delayed Feedback b/data/2024/aaai/Online Conversion Rate Prediction via Multi-Interval Screening and Synthesizing under Delayed Feedback new file mode 100644 index 0000000000..90fa10de37 --- /dev/null +++ b/data/2024/aaai/Online Conversion Rate Prediction via Multi-Interval Screening and Synthesizing under Delayed Feedback @@ -0,0 +1 @@ +Due to the widespread adoption of the cost-per-action (CPA) display strategy that demands real-time conversion rate (CVR) prediction, delayed feedback is becoming one of the major challenges in online advertising. As the true labels of a significant quantity of samples are only available after long delays, the observed training data are usually biased, harming the performance of models. Recent studies show that integrating models with varying waiting windows to observe true labels is beneficial, but the aggregation framework remains far from reaching a consensus. In this work, we propose the Multi-Interval Screening and Synthesizing model (MISS for short) for online CVR prediction. We first design a multi-interval screening model with various output heads to produce accurate and distinctive estimates. Then a light-weight synthesizing model with an assembled training pipeline is applied to thoroughly exploit the knowledge and relationship among heads, obtaining reliable predictions. Extensive experiments on two real-world advertising datasets validate the effectiveness of our model. \ No newline at end of file diff --git a/data/2024/aaai/Online Markov Decision Processes Configuration with Continuous Decision Space b/data/2024/aaai/Online Markov Decision Processes Configuration with Continuous Decision Space new file mode 100644 index 0000000000..6348c8b1cd --- /dev/null +++ b/data/2024/aaai/Online Markov Decision Processes Configuration with Continuous Decision Space @@ -0,0 +1 @@ +In this paper, we investigate the optimal online configuration of episodic Markov decision processes when the space of the possible configurations is continuous. Specifically, we study the interaction between a learner (referred to as the configurator) and an agent with a fixed, unknown policy, when the learner aims to minimize her losses by choosing transition functions in an online fashion. The losses may be unrelated to the agent's rewards. This problem applies to many real-world scenarios where the learner seeks to manipulate the Markov decision process to her advantage. We study both deterministic and stochastic settings, where the losses are either fixed or sampled from an unknown probability distribution. We design two algorithms whose key feature is to rely on occupancy measures to optimistically explore the continuous space of transition functions, achieving constant regret in deterministic settings and sublinear regret in stochastic settings, respectively. Moreover, we prove that the regret bound is tight with respect to any constant factor in deterministic settings. Finally, we compare the empirical performance of our algorithms with a baseline in synthetic experiments.
\ No newline at end of file diff --git a/data/2024/aaai/Online Reinforcement Learning-Based Pedagogical Planning for Narrative-Centered Learning Environments b/data/2024/aaai/Online Reinforcement Learning-Based Pedagogical Planning for Narrative-Centered Learning Environments new file mode 100644 index 0000000000..c2f1f463e7 --- /dev/null +++ b/data/2024/aaai/Online Reinforcement Learning-Based Pedagogical Planning for Narrative-Centered Learning Environments @@ -0,0 +1 @@ +Pedagogical planners can provide adaptive support to students in narrative-centered learning environments by dynamically scaffolding student learning and tailoring problem scenarios. Reinforcement learning (RL) is frequently used for pedagogical planning in narrative-centered learning environments. However, RL-based pedagogical planning raises significant challenges due to the scarcity of data for training RL policies. Most prior work has relied on limited-size datasets and offline RL techniques for policy learning. Unfortunately, offline RL techniques do not support on-demand exploration and evaluation, which can adversely impact the quality of induced policies. To address the limitation of data scarcity and offline RL, we propose INSIGHT, an online RL framework for training data-driven pedagogical policies that optimize student learning in narrative-centered learning environments. The INSIGHT framework consists of three components: a narrative-centered learning environment simulator, a simulated student agent, and an RL-based pedagogical planner agent, which uses a reward metric that is associated with effective student learning processes. The framework enables the generation of synthetic data for on-demand exploration and evaluation of RL-based pedagogical planning. We have implemented INSIGHT with OpenAI Gym for a narrative-centered learning environment testbed with rule-based simulated student agents and a deep Q-learning-based pedagogical planner. Our results show that online deep RL algorithms can induce near-optimal pedagogical policies in the INSIGHT framework, while offline deep RL algorithms only find suboptimal policies even with large amounts of data. \ No newline at end of file diff --git a/data/2024/aaai/Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints b/data/2024/aaai/Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints new file mode 100644 index 0000000000..17e923ae07 --- /dev/null +++ b/data/2024/aaai/Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints @@ -0,0 +1 @@ +Restless multi-armed bandits (RMAB) have been widely used to model sequential decision making problems with constraints. The decision maker (DM) aims to maximize the expected total reward over an infinite horizon under an “instantaneous activation constraint” that at most B arms can be activated at any decision epoch, where the state of each arm evolves stochastically according to a Markov decision process (MDP). However, this basic model fails to provide any fairness guarantee among arms. In this paper, we introduce RMAB-F, a new RMAB model with “long-term fairness constraints”, where the objective now is to maximize the longterm reward while a minimum long-term activation fraction for each arm must be satisfied. For the online RMAB-F setting (i.e., the underlying MDPs associated with each arm are unknown to the DM), we develop a novel reinforcement learning (RL) algorithm named Fair-UCRL. 
We prove that Fair-UCRL ensures probabilistic sublinear bounds on both the reward regret and the fairness violation regret. Compared with off-the-shelf RL methods, our Fair-UCRL is much more computationally efficient since it contains a novel exploitation that leverages a low-complexity index policy for making decisions. Experimental results further demonstrate the effectiveness of our Fair-UCRL. \ No newline at end of file diff --git a/data/2024/aaai/Online Sensitivity Optimization in Differentially Private Learning b/data/2024/aaai/Online Sensitivity Optimization in Differentially Private Learning new file mode 100644 index 0000000000..888ce7ee9c --- /dev/null +++ b/data/2024/aaai/Online Sensitivity Optimization in Differentially Private Learning @@ -0,0 +1 @@ +Training differentially private machine learning models requires constraining an individual's contribution to the optimization process. This is achieved by clipping the 2-norm of their gradient at a predetermined threshold prior to averaging and batch sanitization. This selection adversely influences optimization in two opposing ways: it either exacerbates the bias due to excessive clipping at lower values, or augments sanitization noise at higher values. The choice significantly hinges on factors such as the dataset, model architecture, and even varies within the same optimization, demanding meticulous tuning usually accomplished through a grid search. In order to circumvent the privacy expenses incurred in hyperparameter tuning, we present a novel approach to dynamically optimize the clipping threshold. We treat this threshold as an additional learnable parameter, establishing a clean relationship between the threshold and the cost function. This allows us to optimize the former with gradient descent, with minimal repercussions on the overall privacy analysis. Our method is thoroughly assessed against alternative fixed and adaptive strategies across diverse datasets, tasks, model dimensions, and privacy levels. Our results indicate that it performs comparably or better in the evaluated scenarios, given the same privacy requirements. \ No newline at end of file diff --git a/data/2024/aaai/OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs via Ontology-Driven Reinforcement Learning b/data/2024/aaai/OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs via Ontology-Driven Reinforcement Learning new file mode 100644 index 0000000000..65adab6ba5 --- /dev/null +++ b/data/2024/aaai/OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs via Ontology-Driven Reinforcement Learning @@ -0,0 +1 @@ +Large language models (LLMs) have demonstrated impressive proficiency in information retrieval, while they are prone to generating incorrect responses that conflict with reality, a phenomenon known as intrinsic hallucination. The critical challenge lies in the unclear and unreliable fact distribution within LLMs trained on vast amounts of data. The prevalent approach frames the factual detection task as a question-answering paradigm, where the LLMs are asked about factual knowledge and examined for correctness. However, existing studies primarily focused on deriving test cases only from several specific domains, such as movies and sports, limiting the comprehensive observation of missing knowledge and the analysis of unexpected hallucinations. To address this issue, we propose OntoFact, an adaptive framework for detecting unknown facts of LLMs, devoted to mining the ontology-level skeleton of the missing knowledge. 
Specifically, we argue that LLMs could expose the ontology-based similarity among missing facts and introduce five representative knowledge graphs (KGs) as benchmarks. We further devise a sophisticated ontology-driven reinforcement learning (ORL) mechanism to produce error-prone test cases with specific entities and relations automatically. The ORL mechanism rewards the KGs for navigating toward a feasible direction for unveiling factual errors. Moreover, empirical efforts demonstrate that dominant LLMs are biased towards answering Yes rather than No, regardless of whether this knowledge is included. To mitigate the overconfidence of LLMs, we leverage a hallucination-free detection (HFD) strategy to tackle unfair comparisons between baselines, thereby boosting the result robustness. Experimental results on 5 datasets, using 32 representative LLMs, reveal a general lack of fact in current LLMs. Notably, ChatGPT exhibits fact error rates of 51.6% on DBpedia and 64.7% on YAGO, respectively. Additionally, the ORL mechanism demonstrates promising error prediction scores, with F1 scores ranging from 70% to 90% across most LLMs. Compared to the exhaustive testing, ORL achieves an average recall of 80% while reducing evaluation time by 35.29% to 63.12%. \ No newline at end of file diff --git a/data/2024/aaai/Open-Set Facial Expression Recognition b/data/2024/aaai/Open-Set Facial Expression Recognition new file mode 100644 index 0000000000..3cf4bfc785 --- /dev/null +++ b/data/2024/aaai/Open-Set Facial Expression Recognition @@ -0,0 +1 @@ +Facial expression recognition (FER) models are typically trained on datasets with a fixed number of seven basic classes. However, recent research works (Cowen et al. 2021; Bryant et al. 2022; Kollias 2023) point out that there are far more expressions than the basic ones. Thus, when these models are deployed in the real world, they may encounter unknown classes, such as compound expressions that cannot be classified into existing basic classes. To address this issue, we propose the open-set FER task for the first time. Though there are many existing open-set recognition methods, we argue that they do not work well for open-set FER because FER data are all human faces with very small inter-class distances, which makes the open-set samples very similar to close-set samples. In this paper, we are the first to transform the disadvantage of small inter-class distance into an advantage by proposing a new way for open-set FER. Specifically, we find that small inter-class distance allows for sparsely distributed pseudo labels of open-set samples, which can be viewed as symmetric noisy labels. Based on this novel observation, we convert the open-set FER to a noisy label detection problem. We further propose a novel method that incorporates attention map consistency and cycle training to detect the open-set samples. Extensive experiments on various FER datasets demonstrate that our method clearly outperforms state-of-the-art open-set recognition methods by large margins. Code is available at https://github.com/zyh-uaiaaaa. 
\ No newline at end of file diff --git a/data/2024/aaai/Open-Set Graph Domain Adaptation via Separate Domain Alignment b/data/2024/aaai/Open-Set Graph Domain Adaptation via Separate Domain Alignment new file mode 100644 index 0000000000..5dd785ea8d --- /dev/null +++ b/data/2024/aaai/Open-Set Graph Domain Adaptation via Separate Domain Alignment @@ -0,0 +1 @@ +Domain adaptation has become an attractive learning paradigm, as it can leverage source domains with rich labels to deal with classification tasks in an unlabeled target domain. A few recent studies develop domain adaptation approaches for graph-structured data. In the case of node classification task, current domain adaptation methods only focus on the closed-set setting, where source and target domains share the same label space. A more practical assumption is that the target domain may contain new classes that are not included in the source domain. Therefore, in this paper, we introduce a novel and challenging problem for graphs, i.e., open-set domain adaptive node classification, and propose a new approach to solve it. Specifically, we develop an algorithm for efficient knowledge transfer from a labeled source graph to an unlabeled target graph under a separate domain alignment (SDA) strategy, in order to learn discriminative feature representations for the target graph. Our goal is to not only correctly classify target nodes into the known classes, but also classify unseen types of nodes into an unknown class. Experimental results on real-world datasets show that our method outperforms existing methods on graph domain adaptation. \ No newline at end of file diff --git a/data/2024/aaai/Open-Vocabulary Video Relation Extraction b/data/2024/aaai/Open-Vocabulary Video Relation Extraction new file mode 100644 index 0000000000..37264ef600 --- /dev/null +++ b/data/2024/aaai/Open-Vocabulary Video Relation Extraction @@ -0,0 +1,2 @@ +A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in a superficial understanding of the action. +Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on pairwise relations that take part in the action and describes these relation triplets with natural languages. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model to generate relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE. Our code and dataset are available at https://github.com/Iriya99/OVRE. 
\ No newline at end of file diff --git a/data/2024/aaai/Opening the Black Box: Unraveling the Classroom Dialogue Analysis (Student Abstract) b/data/2024/aaai/Opening the Black Box: Unraveling the Classroom Dialogue Analysis (Student Abstract) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Operationalizing Essential Characteristics of Creativity in a Computational System for Music Composition b/data/2024/aaai/Operationalizing Essential Characteristics of Creativity in a Computational System for Music Composition new file mode 100644 index 0000000000..deb1cea6ac --- /dev/null +++ b/data/2024/aaai/Operationalizing Essential Characteristics of Creativity in a Computational System for Music Composition @@ -0,0 +1 @@ +We address the problem of building and evaluating a computational system whose primary objective is creativity. We illustrate seven characteristics for computational creativity in the context of a system that autonomously composes Western lyrical music. We conduct an external evaluation of the system in which respondents rated the system with regard to each characteristic as well as with regard to overall creativity. Average scores for overall creativity exceeded the ratings for any single characteristic, suggesting that creativity may be an emergent property and that unique research opportunities exist for building CC systems whose design attempts to comprehend all known characteristics of creativity. \ No newline at end of file diff --git a/data/2024/aaai/Operator-Learning-Inspired Modeling of Neural Ordinary Differential Equations b/data/2024/aaai/Operator-Learning-Inspired Modeling of Neural Ordinary Differential Equations new file mode 100644 index 0000000000..0fdf825867 --- /dev/null +++ b/data/2024/aaai/Operator-Learning-Inspired Modeling of Neural Ordinary Differential Equations @@ -0,0 +1 @@ +Neural ordinary differential equations (NODEs), one of the most influential works of the differential equation-based deep learning, are to continuously generalize residual networks and opened a new field. They are currently utilized for various downstream tasks, e.g., image classification, time series classification, image generation, etc. Its key part is how to model the time-derivative of the hidden state, denoted dh(t)/dt. People have habitually used conventional neural network architectures, e.g., fully-connected layers followed by non-linear activations. In this paper, however, we present a neural operator-based method to define the time-derivative term. Neural operators were initially proposed to model the differential operator of partial differential equations (PDEs). Since the time-derivative of NODEs can be understood as a special type of the differential operator, our proposed method, called branched Fourier neural operator (BFNO), makes sense. In our experiments with general downstream tasks, our method significantly outperforms existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Opponent-Model Search in Games with Incomplete Information b/data/2024/aaai/Opponent-Model Search in Games with Incomplete Information new file mode 100644 index 0000000000..a42604ef78 --- /dev/null +++ b/data/2024/aaai/Opponent-Model Search in Games with Incomplete Information @@ -0,0 +1 @@ +Games with incomplete information are games that model situations where players do not have common knowledge about the game they play, e.g. card games such as poker or bridge. Opponent models can be of crucial importance for decision-making in such games. 
We propose algorithms for computing optimal and/or robust strategies in games with incomplete information, given various types of knowledge about opponent models. As an application, we describe a framework for reasoning about an opponent's reasoning in such games, where opponent models arise naturally. \ No newline at end of file diff --git a/data/2024/aaai/Optical Flow for Spike Camera with Hierarchical Spatial-Temporal Spike Fusion b/data/2024/aaai/Optical Flow for Spike Camera with Hierarchical Spatial-Temporal Spike Fusion new file mode 100644 index 0000000000..13ddf30b41 --- /dev/null +++ b/data/2024/aaai/Optical Flow for Spike Camera with Hierarchical Spatial-Temporal Spike Fusion @@ -0,0 +1 @@ +As an emerging neuromorphic camera with an asynchronous working mechanism, spike camera shows good potential for high-speed vision tasks. Each pixel in spike camera accumulates photons persistently and fires a spike whenever the accumulation exceeds a threshold. Such high-frequency fine-granularity photon recording facilitates the analysis and recovery of dynamic scenes with high-speed motion. This paper considers the optical flow estimation problem for spike cameras. Due to the Poisson nature of incoming photons, the occurrence of spikes is random and fluctuating, making conventional image matching inefficient. We propose a Hierarchical Spatial-Temporal (HiST) fusion module for spike representation to pursue reliable feature matching and develop a robust optical flow network, dubbed as HiST-SFlow. The HiST extracts features at multiple moments and hierarchically fuses the spatial-temporal information. We also propose an intra-moment filtering module to further extract the feature and suppress the influence of randomness in spikes. A scene loss is proposed to ensure that this hierarchical representation recovers the essential visual information in the scene. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared with the existing methods. The source codes are available at https://github.com/ruizhao26/HiST-SFlow. \ No newline at end of file diff --git a/data/2024/aaai/Optimal Attack and Defense for Reinforcement Learning b/data/2024/aaai/Optimal Attack and Defense for Reinforcement Learning new file mode 100644 index 0000000000..bbf2978764 --- /dev/null +++ b/data/2024/aaai/Optimal Attack and Defense for Reinforcement Learning @@ -0,0 +1 @@ +To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. 
We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Optimal Makespan in a Minute Timespan! A Scalable Multi-Robot Goal Assignment Algorithm for Minimizing Mission Time b/data/2024/aaai/Optimal Makespan in a Minute Timespan! A Scalable Multi-Robot Goal Assignment Algorithm for Minimizing Mission Time new file mode 100644 index 0000000000..3189e58e76 --- /dev/null +++ b/data/2024/aaai/Optimal Makespan in a Minute Timespan! A Scalable Multi-Robot Goal Assignment Algorithm for Minimizing Mission Time @@ -0,0 +1 @@ +We study a variant of the multi-robot goal assignment problem where a unique goal to each robot needs to be assigned while minimizing the largest cost of movement among the robots, called makespan. A significant step in solving this problem is to find the cost associated with the robot-goal pairs, which requires solving a complex path planning problem. We present OM, a scalable optimal algorithm that solves the multi-robot goal assignment problem by computing the paths for a significantly smaller number of robot-goal pairs compared to the state-of-the-art algorithms, leading to a computationally superior mechanism to solve the problem. We extensively evaluate our algorithm for hundreds of robots on randomly generated and standard workspaces. Our experimental results demonstrate that the proposed algorithm achieves a noticeable speedup over two state-of-the-art baseline algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Optimal Quasi-clique: Hardness, Equivalence with Densest-k-Subgraph, and Quasi-partitioned Community Mining b/data/2024/aaai/Optimal Quasi-clique: Hardness, Equivalence with Densest-k-Subgraph, and Quasi-partitioned Community Mining new file mode 100644 index 0000000000..79aa3e8957 --- /dev/null +++ b/data/2024/aaai/Optimal Quasi-clique: Hardness, Equivalence with Densest-k-Subgraph, and Quasi-partitioned Community Mining @@ -0,0 +1 @@ +Dense subgraph discovery (DSD) is a key primitive in graph mining that typically deals with extracting cliques and near-cliques. In this paper, we revisit the optimal quasi-clique (OQC) formulation for DSD and establish that it is NP-hard. In addition, we reveal the hitherto unknown property that OQC can be used to explore the entire spectrum of densest subgraphs of all distinct sizes by appropriately varying a single hyperparameter, thereby forging an intimate link with the classic densest-k-subgraph problem (DkS). We corroborate these findings on real-world graphs by applying the simple greedy algorithm for OQC with improved hyperparameter tuning to quickly generate high-quality approximations of the size-density frontier. Our findings indicate that OQC not only extracts high-quality (near-)cliques, but also large and loosely-connected subgraphs that exhibit well-defined local community structure. The latter discovery is particularly intriguing, since OQC is not explicitly geared towards community detection.
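For readers unfamiliar with the OQC objective, the sketch below shows the classic greedy peeling heuristic on the edges-minus-alpha-pairs surrogate f_alpha(S) = e[S] - alpha * |S|(|S|-1)/2; the exact objective, the choice of alpha, and the improved tuning used in the paper may differ, so treat this as a minimal illustration rather than the authors' algorithm.

import networkx as nx

def oqc_score(H, alpha):
    # f_alpha(S) = edges(S) - alpha * |S| * (|S| - 1) / 2
    n = H.number_of_nodes()
    return H.number_of_edges() - alpha * n * (n - 1) / 2

def greedy_quasi_clique(G, alpha=1/3):
    # Repeatedly peel the minimum-degree vertex, remembering the best-scoring subgraph seen.
    H = G.copy()
    best_nodes, best_score = set(H.nodes), oqc_score(H, alpha)
    while H.number_of_nodes() > 1:
        v = min(H.nodes, key=H.degree)
        H.remove_node(v)
        s = oqc_score(H, alpha)
        if s > best_score:
            best_nodes, best_score = set(H.nodes), s
    return best_nodes, best_score

print(greedy_quasi_clique(nx.karate_club_graph(), alpha=1/3))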
\ No newline at end of file diff --git a/data/2024/aaai/Optimal Transport with Cyclic Symmetry b/data/2024/aaai/Optimal Transport with Cyclic Symmetry new file mode 100644 index 0000000000..364164caaa --- /dev/null +++ b/data/2024/aaai/Optimal Transport with Cyclic Symmetry @@ -0,0 +1 @@ +We propose novel fast algorithms for optimal transport (OT) utilizing a cyclic symmetry structure of input data. Such OT with cyclic symmetry appears universally in various real-world examples: image processing, urban planning, and graph processing. Our main idea is to reduce OT to a small optimization problem that has significantly fewer variables by utilizing cyclic symmetry and various optimization techniques. On the basis of this reduction, our algorithms solve the small optimization problem instead of the original OT. As a result, our algorithms obtain the optimal solution and the objective function value of the original OT faster than solving the original OT directly. In this paper, our focus is on two crucial OT formulations: the linear programming OT (LOT) and the strongly convex-regularized OT, which includes the well-known entropy-regularized OT (EROT). Experiments show the effectiveness of our algorithms for LOT and EROT in synthetic/real-world data that has a strict/approximate cyclic symmetry structure. Through theoretical and experimental results, this paper successfully introduces the concept of symmetry into the OT research field for the first time. \ No newline at end of file diff --git a/data/2024/aaai/Optimal Transport with Tempered Exponential Measures b/data/2024/aaai/Optimal Transport with Tempered Exponential Measures new file mode 100644 index 0000000000..3bad061289 --- /dev/null +++ b/data/2024/aaai/Optimal Transport with Tempered Exponential Measures @@ -0,0 +1 @@ +In the field of optimal transport, two prominent subfields face each other: (i) unregularized optimal transport, ``a-la-Kantorovich'', which leads to extremely sparse plans but with algorithms that scale poorly, and (ii) entropic-regularized optimal transport, ``a-la-Sinkhorn-Cuturi'', which gets near-linear approximation algorithms but leads to maximally un-sparse plans. In this paper, we show that an extension of the latter to tempered exponential measures, a generalization of exponential families with indirect measure normalization, gets to a very convenient middle ground, with both very fast approximation algorithms and sparsity, which is under control up to sparsity patterns. In addition, our formulation fits naturally in the unbalanced optimal transport problem setting. \ No newline at end of file diff --git a/data/2024/aaai/Optimised Storage for Datalog Reasoning b/data/2024/aaai/Optimised Storage for Datalog Reasoning new file mode 100644 index 0000000000..896932fead --- /dev/null +++ b/data/2024/aaai/Optimised Storage for Datalog Reasoning @@ -0,0 +1 @@ +Materialisation facilitates Datalog reasoning by precomputing all consequences of the facts and the rules so that queries can be directly answered over the materialised facts. However, storing all materialised facts may be infeasible in practice, especially when the rules are complex and the given set of facts is large. We observe that for certain combinations of rules, there exist data structures that compactly represent the reasoning result and can be efficiently queried when necessary. In this paper, we present a general framework that allows for the integration of such optimised storage schemes with standard materialisation algorithms. 
Moreover, we devise optimised storage schemes targeting at transitive rules and union rules, two types of (combination of) rules that commonly occur in practice. Our experimental evaluation shows that our approach significantly improves memory consumption, sometimes by orders of magnitude, while remaining competitive in terms of query answering time. \ No newline at end of file diff --git a/data/2024/aaai/Optimistic Model Rollouts for Pessimistic Offline Policy Optimization b/data/2024/aaai/Optimistic Model Rollouts for Pessimistic Offline Policy Optimization new file mode 100644 index 0000000000..649524cdec --- /dev/null +++ b/data/2024/aaai/Optimistic Model Rollouts for Pessimistic Offline Policy Optimization @@ -0,0 +1 @@ +Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism for policy optimization, usually via constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages the policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We initially observed the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. Then we relabel the sampled state-action pairs with penalized rewards, and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely-used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization. \ No newline at end of file diff --git a/data/2024/aaai/Optimistic Policy Gradient in Multi-Player Markov Games with a Single Controller: Convergence beyond the Minty Property b/data/2024/aaai/Optimistic Policy Gradient in Multi-Player Markov Games with a Single Controller: Convergence beyond the Minty Property new file mode 100644 index 0000000000..b4f095cbb7 --- /dev/null +++ b/data/2024/aaai/Optimistic Policy Gradient in Multi-Player Markov Games with a Single Controller: Convergence beyond the Minty Property @@ -0,0 +1 @@ +Policy gradient methods enjoy strong practical performance in numerous tasks in reinforcement learning. Their theoretical understanding in multiagent settings, however, remains limited, especially beyond two-player competitive and potential Markov games. In this paper, we develop a new framework to characterize optimistic policy gradient methods in multi-player Markov games with a single controller. Specifically, under the further assumption that the game exhibits an equilibrium collapse, in that the marginals of coarse correlated equilibria (CCE) induce Nash equilibria (NE), we show convergence to stationary epsilon-NE in O(1/epsilon^2) iterations, where O suppresses polynomial factors in the natural parameters of the game. 
Such an equilibrium collapse is well-known to manifest itself in two-player zero-sum Markov games, but also occurs even in a class of multi-player Markov games with separable interactions, as established by recent work. As a result, we bypass known complexity barriers for computing stationary NE when either of our assumptions fails. Our approach relies on a natural generalization of the classical Minty property that we introduce, which we anticipate to have further applications beyond Markov games. \ No newline at end of file diff --git a/data/2024/aaai/Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning b/data/2024/aaai/Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..2d65829e9f --- /dev/null +++ b/data/2024/aaai/Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +In cooperative multi-agent reinforcement learning, decentralized agents hold the promise of overcoming the combinatorial explosion of the joint action space and enabling greater scalability. However, they are susceptible to a game-theoretic pathology called relative overgeneralization that shadows the optimal joint action. Although recent value-decomposition algorithms guide decentralized agents by learning a factored global action value function, the representational limitation and the inaccurate sampling of optimal joint actions during the learning process leave this problem unresolved. To address this limitation, this paper proposes a novel algorithm called Optimistic Value Instructors (OVI). The main idea behind OVI is to introduce multiple optimistic instructors into the value-decomposition paradigm, which are capable of suggesting potentially optimal joint actions and rectifying the factored global action value function to recover these optimal actions. Specifically, the instructors maintain optimistic value estimations of per-agent local actions and thus eliminate the negative effects caused by other agents' exploratory or sub-optimal non-cooperation, enabling accurate identification and suggestion of optimal joint actions. Based on the instructors' suggestions, the paper further presents two instructive constraints to rectify the factored global action value function to recover these optimal joint actions, thus overcoming the relative overgeneralization problem. Experimental evaluation of OVI on various cooperative multi-agent tasks demonstrates its superior performance against multiple baselines, highlighting its effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Optimize & Reduce: A Top-Down Approach for Image Vectorization b/data/2024/aaai/Optimize & Reduce: A Top-Down Approach for Image Vectorization new file mode 100644 index 0000000000..62903ad599 --- /dev/null +++ b/data/2024/aaai/Optimize & Reduce: A Top-Down Approach for Image Vectorization @@ -0,0 +1 @@ +Vector image representation is a popular choice when editability and flexibility in resolution are desired. However, most images are only available in raster form, making raster-to-vector image conversion (vectorization) an important task. Classical methods for vectorization are either domain-specific or yield an abundance of shapes which limits editability and interpretability.
Learning-based methods, that use differentiable rendering, have revolutionized vectorization, at the cost of poor generalization to out-of-training distribution domains, and optimization-based counterparts are either slow or produce non-editable and redundant shapes. In this work, we propose Optimize & Reduce (O&R), a top-down approach to vectorization that is both fast and domain-agnostic. O&R aims to attain a compact representation of input images by iteratively optimizing Bezier curve parameters and significantly reducing the number of shapes, using a devised importance measure. We contribute a benchmark of five datasets comprising images from a broad spectrum of image complexities - from emojis to natural-like images. Through extensive experiments on hundreds of images, we demonstrate that our method is domain agnostic and outperforms existing works in both reconstruction and perceptual quality for a fixed number of shapes. Moreover, we show that our algorithm is x10 faster than the state-of-the-art optimization-based method. Our code is publicly available: https://github.com/ajevnisek/optimize-and-reduce \ No newline at end of file diff --git a/data/2024/aaai/Optimizing ADMM and Over-Relaxed ADMM Parameters for Linear Quadratic Problems b/data/2024/aaai/Optimizing ADMM and Over-Relaxed ADMM Parameters for Linear Quadratic Problems new file mode 100644 index 0000000000..11afc3a125 --- /dev/null +++ b/data/2024/aaai/Optimizing ADMM and Over-Relaxed ADMM Parameters for Linear Quadratic Problems @@ -0,0 +1 @@ +The Alternating Direction Method of Multipliers (ADMM) has gained significant attention across a broad spectrum of machine learning applications. Incorporating the over-relaxation technique shows potential for enhancing the convergence rate of ADMM. However, determining optimal algorithmic parameters, including both the associated penalty and relaxation parameters, often relies on empirical approaches tailored to specific problem domains and contextual scenarios. Incorrect parameter selection can significantly hinder ADMM's convergence rate. To address this challenge, in this paper we first propose a general approach to optimize the value of penalty parameter, followed by a novel closed-form formula to compute the optimal relaxation parameter in the context of linear quadratic problems (LQPs). We then experimentally validate our parameter selection methods through random instantiations and diverse imaging applications, encompassing diffeomorphic image registration, image deblurring, and MRI reconstruction. \ No newline at end of file diff --git a/data/2024/aaai/Optimizing IT FinOps and Sustainability through Unsupervised Workload Characterization b/data/2024/aaai/Optimizing IT FinOps and Sustainability through Unsupervised Workload Characterization new file mode 100644 index 0000000000..eac73aeb1b --- /dev/null +++ b/data/2024/aaai/Optimizing IT FinOps and Sustainability through Unsupervised Workload Characterization @@ -0,0 +1 @@ +The widespread adoption of public and hybrid clouds, along with elastic resources and various automation tools for dynamic deployment, has accelerated the rapid provisioning of compute resources as needed. Despite these advancements, numerous resources persist unnecessarily due to factors such as poor digital hygiene, risk aversion, or the absence of effective tools, resulting in substantial costs and energy consumption. Existing threshold-based techniques prove inadequate in effectively addressing this challenge. 
To address this issue, we propose an unsupervised machine learning framework to automatically identify resources that can be de-provisioned completely or summoned on a schedule. Application of this approach to enterprise data has yielded promising initial results, facilitating the segregation of productive workloads with recurring demands from non-productive ones. \ No newline at end of file diff --git a/data/2024/aaai/Optimizing Local Satisfaction of Long-Run Average Objectives in Markov Decision Processes b/data/2024/aaai/Optimizing Local Satisfaction of Long-Run Average Objectives in Markov Decision Processes new file mode 100644 index 0000000000..4f561f6c7f --- /dev/null +++ b/data/2024/aaai/Optimizing Local Satisfaction of Long-Run Average Objectives in Markov Decision Processes @@ -0,0 +1 @@ +Long-run average optimization problems for Markov decision processes (MDPs) require constructing policies with optimal steady-state behavior, i.e., optimal limit frequency of visits to the states. However, such policies may suffer from local instability in the sense that the frequency of states visited in a bounded time horizon along a run differs significantly from the limit frequency. In this work, we propose an efficient algorithmic solution to this problem. \ No newline at end of file diff --git a/data/2024/aaai/Optimizing Recall in Deep Graph Hashing Framework for Item Retrieval (Student Abstract) b/data/2024/aaai/Optimizing Recall in Deep Graph Hashing Framework for Item Retrieval (Student Abstract) new file mode 100644 index 0000000000..b73aae8f7f --- /dev/null +++ b/data/2024/aaai/Optimizing Recall in Deep Graph Hashing Framework for Item Retrieval (Student Abstract) @@ -0,0 +1 @@ +Hashing-based recommendation (HR) methods, whose core idea is mapping users and items into hamming space, are common practice to improve item retrieval efficiency. However, existing HR fails to align optimization objective (i.e., Bayesian Personalized Ranking) and evaluation metric (i.e., Recall), leading to suboptimal performance. In this paper, we propose a smooth recall loss (termed as SRLoss), which targets Recall as the optimization objective. Due to the existence of discrete constraints, the optimization problem is NP-hard. To this end, we propose an approximation-adjustable gradient estimator to solve our problem. Experimental Results demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Optimizing the Optimization of Planning Domains by Automatic Action Schema Splitting b/data/2024/aaai/Optimizing the Optimization of Planning Domains by Automatic Action Schema Splitting new file mode 100644 index 0000000000..10242e41d4 --- /dev/null +++ b/data/2024/aaai/Optimizing the Optimization of Planning Domains by Automatic Action Schema Splitting @@ -0,0 +1,7 @@ +Most planners are based on grounding, that is, generating all instances of a parameterized action during a preprocessing phase. +For some problems the number of ground actions is too high, causing a performance bottleneck. +Building upon an existing approach, we present an enhanced method to split action schemas automatically during the grounding phase, to reduce the number of ground actions. +First, we propose to exploit the structural knowledge of the problems to have a more informative dependency graph. +Then, we suggest a better objective function to define and choose the best split. +Finally, we present a more effective search to find it. 
+We experimentally measure the impact of each of these improvements, and show that our approach significantly outperforms the state of the art. \ No newline at end of file diff --git a/data/2024/aaai/Orthogonal Dictionary Guided Shape Completion Network for Point Cloud b/data/2024/aaai/Orthogonal Dictionary Guided Shape Completion Network for Point Cloud new file mode 100644 index 0000000000..f10a59e247 --- /dev/null +++ b/data/2024/aaai/Orthogonal Dictionary Guided Shape Completion Network for Point Cloud @@ -0,0 +1 @@ +Point cloud shape completion, which aims to reconstruct the missing regions of the incomplete point clouds with plausible shapes, is an ill-posed and challenging task that benefits many downstream 3D applications. Prior approaches achieve this goal by employing a two-stage completion framework, generating a coarse yet complete seed point cloud through an encoder-decoder network, followed by refinement and upsampling. However, the encoded features suffer from information loss of the missing portion, leading to an inability of the decoder to reconstruct seed points with detailed geometric clues. To tackle this issue, we propose a novel Orthogonal Dictionary Guided Shape Completion Network (ODGNet). The proposed ODGNet consists of a Seed Generation U-Net, which leverages multi-level feature extraction and concatenation to significantly enhance the representation capability of seed points, and Orthogonal Dictionaries that can learn shape priors from training samples and thus compensate for the information loss of the missing portions during inference. Our design is simple but to the point; extensive experimental results indicate that the proposed method can reconstruct point clouds with more details and outperform previous state-of-the-art counterparts. The implementation code is available at https://github.com/corecai163/ODGNet. \ No newline at end of file diff --git a/data/2024/aaai/Out of Thin Air: Exploring Data-Free Adversarial Robustness Distillation b/data/2024/aaai/Out of Thin Air: Exploring Data-Free Adversarial Robustness Distillation new file mode 100644 index 0000000000..462164d0c4 --- /dev/null +++ b/data/2024/aaai/Out of Thin Air: Exploring Data-Free Adversarial Robustness Distillation @@ -0,0 +1 @@ +Adversarial Robustness Distillation (ARD) is a promising task to solve the issue of limited adversarial robustness of small-capacity models while optimizing the expensive computational costs of Adversarial Training (AT). Despite the good robust performance, the existing ARD methods are still impractical to deploy in natural high-security scenes because these methods rely entirely on original or publicly available data with a similar distribution. In fact, these data are almost always private, specific, and distinctive for scenes that require high robustness. To tackle these issues, we propose a challenging but significant task called Data-Free Adversarial Robustness Distillation (DFARD), which aims to train small, easily deployable, robust models without relying on data. We demonstrate that the challenge lies in the lower upper bound of knowledge transfer information, making it crucial to mine and transfer knowledge more efficiently. Inspired by human education, we design a plug-and-play Interactive Temperature Adjustment (ITA) strategy to improve the efficiency of knowledge transfer and propose an Adaptive Generator Balance (AGB) module to retain more data information.
Our method uses adaptive hyperparameters to avoid a large number of parameter tuning, which significantly outperforms the combination of existing techniques. Meanwhile, our method achieves stable and reliable performance on multiple benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Out-of-Distribution Detection in Long-Tailed Recognition with Calibrated Outlier Class Learning b/data/2024/aaai/Out-of-Distribution Detection in Long-Tailed Recognition with Calibrated Outlier Class Learning new file mode 100644 index 0000000000..fe37ceaaa2 --- /dev/null +++ b/data/2024/aaai/Out-of-Distribution Detection in Long-Tailed Recognition with Calibrated Outlier Class Learning @@ -0,0 +1 @@ +Existing out-of-distribution (OOD) methods have shown great success on balanced datasets but become ineffective in long-tailed recognition (LTR) scenarios where 1) OOD samples are often wrongly classified into head classes and/or 2) tail-class samples are treated as OOD samples. To address these issues, current studies fit a prior distribution of auxiliary/pseudo OOD data to the long-tailed in-distribution (ID) data. However, it is difficult to obtain such an accurate prior distribution given the unknowingness of real OOD samples and heavy class imbalance in LTR. A straightforward solution to avoid the requirement of this prior is to learn an outlier class to encapsulate the OOD samples. The main challenge is then to tackle the aforementioned confusion between OOD samples and head/tail-class samples when learning the outlier class. To this end, we introduce a novel calibrated outlier class learning (COCL) approach, in which 1) a debiased large margin learning method is introduced in the outlier class learning to distinguish OOD samples from both head and tail classes in the representation space and 2) an outlier-class-aware logit calibration method is defined to enhance the long-tailed classification confidence. Extensive empirical results on three popular benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that COCL substantially outperforms existing state-of-the-art OOD detection methods in LTR while being able to improve the classification accuracy on ID data. Code is available at https://github.com/mala-lab/COCL. \ No newline at end of file diff --git a/data/2024/aaai/Outlier Ranking for Large-Scale Public Health Data b/data/2024/aaai/Outlier Ranking for Large-Scale Public Health Data new file mode 100644 index 0000000000..5eb01487b7 --- /dev/null +++ b/data/2024/aaai/Outlier Ranking for Large-Scale Public Health Data @@ -0,0 +1 @@ +Disease control experts inspect public health data streams daily for outliers worth investigating, like those corresponding to data quality issues or disease outbreaks. However, they can only examine a few of the thousands of maximally-tied outliers returned by univariate outlier detection methods applied to large-scale public health data streams. To help experts distinguish the most important outliers from these thousands of tied outliers, we propose a new task for algorithms to rank the outputs of any univariate method applied to each of many streams. Our novel algorithm for this task, which leverages hierarchical networks and extreme value analysis, performed the best across traditional outlier detection metrics in a human-expert evaluation using public health data streams. 
Most importantly, experts have used our open-source Python implementation since April 2023 and report identifying outliers worth investigating 9.1x faster than their prior baseline. Other organizations can readily adapt this implementation to create rankings from the outputs of their tailored univariate methods across large-scale streams. \ No newline at end of file diff --git a/data/2024/aaai/P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL b/data/2024/aaai/P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL new file mode 100644 index 0000000000..aaf932518a --- /dev/null +++ b/data/2024/aaai/P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL @@ -0,0 +1,3 @@ +Safe Reinforcement Learning (SRL) algorithms aim to learn a policy that maximizes the reward while satisfying the safety constraints. One of the challenges in SRL is that it is often difficult to balance the two objectives of reward maximization and safety constraint satisfaction. Existing algorithms utilize constraint optimization techniques like penalty-based, barrier penalty-based, and Lagrangian-based dual or primal policy optimizations methods. However, they suffer from training oscillations and approximation errors, which impact the overall learning objectives. + +This paper proposes the Permeable Penalty Barrier-based Policy Optimization (P2BPO) algorithm that addresses this issue by allowing a small fraction of penalty beyond the penalty barrier, and a parameter is used to control this permeability. In addition, an adaptive penalty parameter is used instead of a constant one, which is initialized with a low value and increased gradually as the agent violates the safety constraints. We have also provided a theoretical proof of the proposed method's performance guarantee bound, which ensures that P2BPO can learn a policy satisfying the safety constraints with high probability while achieving a higher expected reward. Furthermore, we compare P2BPO with other SRL algorithms on various SRL tasks and demonstrate that it achieves better rewards while adhering to the constraints. \ No newline at end of file diff --git a/data/2024/aaai/PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning b/data/2024/aaai/PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning new file mode 100644 index 0000000000..cb144d249d --- /dev/null +++ b/data/2024/aaai/PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning @@ -0,0 +1 @@ +Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations to the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action space. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of Pareto policy set. The proposed method leverages Pareto ascent direction to select the scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. 
Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes. \ No newline at end of file diff --git a/data/2024/aaai/PAC-Bayes Generalisation Bounds for Dynamical Systems including Stable RNNs b/data/2024/aaai/PAC-Bayes Generalisation Bounds for Dynamical Systems including Stable RNNs new file mode 100644 index 0000000000..68c564057a --- /dev/null +++ b/data/2024/aaai/PAC-Bayes Generalisation Bounds for Dynamical Systems including Stable RNNs @@ -0,0 +1,4 @@ +In this paper, we derive a PAC-Bayes bound on the generalisation gap, in a supervised time-series setting for a special class of discrete-time non-linear dynamical systems. This class includes stable recurrent neural networks (RNN), and the motivation for this work was its application to RNNs. In order to achieve the results, we impose some stability constraints, on the allowed models. +Here, stability is understood in the sense of dynamical systems. For RNNs, these stability conditions can be expressed in terms of conditions on the weights. +We assume the processes involved are essentially bounded and the loss functions are Lipschitz. The proposed bound on the generalisation gap depends on the mixing coefficient of the data distribution, and the essential supremum of the data. Furthermore, the bound converges to zero as the dataset size increases. +In this paper, we 1) formalize the learning problem, 2) derive a PAC-Bayesian error bound for such systems, 3) discuss various consequences of this error bound, and 4) show an illustrative example, with discussions on computing the proposed bound. Unlike other available bounds the derived bound holds for non i.i.d. data (time-series) and it does not grow with the number of steps of the RNN. \ No newline at end of file diff --git a/data/2024/aaai/PARSAC: Accelerating Robust Multi-Model Fitting with Parallel Sample Consensus b/data/2024/aaai/PARSAC: Accelerating Robust Multi-Model Fitting with Parallel Sample Consensus new file mode 100644 index 0000000000..5ff08e4e88 --- /dev/null +++ b/data/2024/aaai/PARSAC: Accelerating Robust Multi-Model Fitting with Parallel Sample Consensus @@ -0,0 +1,9 @@ +We present a real-time method for robust estimation of multiple instances of geometric models from noisy data. +Geometric models such as vanishing points, planar homographies or fundamental matrices are essential for 3D scene analysis. +Previous approaches discover distinct model instances in an iterative manner, thus limiting their potential for speedup via parallel computation. +In contrast, our method detects all model instances independently and in parallel. +A neural network segments the input data into clusters representing potential model instances by predicting multiple sets of sample and inlier weights. +Using the predicted weights, we determine the model parameters for each potential instance separately in a RANSAC-like fashion. +We train the neural network via task-specific loss functions, i.e. we do not require a ground-truth segmentation of the input data. 
+As suitable training data for homography and fundamental matrix fitting is scarce, we additionally present two new synthetic datasets. +We demonstrate state-of-the-art performance on these as well as multiple established datasets, with inference times as small as five milliseconds per image. \ No newline at end of file diff --git a/data/2024/aaai/PC-Conv: Unifying Homophily and Heterophily with Two-Fold Filtering b/data/2024/aaai/PC-Conv: Unifying Homophily and Heterophily with Two-Fold Filtering new file mode 100644 index 0000000000..459bd8db2f --- /dev/null +++ b/data/2024/aaai/PC-Conv: Unifying Homophily and Heterophily with Two-Fold Filtering @@ -0,0 +1 @@ +Recently, many carefully designed graph representation learning methods have achieved impressive performance on either strong heterophilic or homophilic graphs, but not both. Therefore, they are incapable of generalizing well across real-world graphs with different levels of homophily. This is attributed to their neglect of homophily in heterophilic graphs, and vice versa. In this paper, we propose a two-fold filtering mechanism to mine homophily in heterophilic graphs, and vice versa. In particular, we extend the graph heat equation to perform heterophilic aggregation of global information from a long distance. The resultant filter can be exactly approximated by the Poisson-Charlier (PC) polynomials. To further exploit information at multiple orders, we introduce a powerful graph convolution PC-Conv and its instantiation PCNet for the node classification task. Compared to the state-of-the-art GNNs, PCNet shows competitive performance on well-known homophilic and heterophilic graphs. Our implementation is available at https://github.com/uestclbh/PC-Conv. \ No newline at end of file diff --git a/data/2024/aaai/PCE-Palm: Palm Crease Energy Based Two-Stage Realistic Pseudo-Palmprint Generation b/data/2024/aaai/PCE-Palm: Palm Crease Energy Based Two-Stage Realistic Pseudo-Palmprint Generation new file mode 100644 index 0000000000..4b924abddc --- /dev/null +++ b/data/2024/aaai/PCE-Palm: Palm Crease Energy Based Two-Stage Realistic Pseudo-Palmprint Generation @@ -0,0 +1 @@ +The lack of large-scale data seriously hinders the development of palmprint recognition. Recent approaches address this issue by generating large-scale realistic pseudo palmprints from Bézier curves. However, the significant difference between Bézier curves and real palmprints limits their effectiveness. In this paper, we divide the Bézier-Real difference into crease and texture differences, thus reducing the generation difficulty. We introduce a new palm crease energy (PCE) domain as a bridge from Bézier curves to real palmprints and propose a two-stage generation model. The first stage generates PCE images (realistic creases) from Bézier curves, and the second stage outputs realistic palmprints (realistic texture) with PCE images as input. In addition, we also design a lightweight plug-and-play line feature enhancement block to facilitate domain transfer and improve recognition performance. Extensive experimental results demonstrate that the proposed method surpasses state-of-the-art methods. Under extremely data-scarce settings such as 40 IDs (only 2.5% of the total training set), our model achieves a 29% improvement over RPG-Palm and outperforms ArcFace trained with 100% of the training set by more than 6% in terms of TAR@FAR=1e-6.
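To make the Bézier starting point of such pseudo-palmprint pipelines concrete, the sketch below samples one cubic Bézier curve and rasterizes it onto a blank canvas as a synthetic crease; the control-point ranges and canvas size are illustrative assumptions, not the settings used by PCE-Palm or RPG-Palm.

import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=200):
    # Evaluate a cubic Bezier curve at n parameter values in [0, 1].
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

rng = np.random.default_rng(0)
p0, p1, p2, p3 = (rng.uniform(0, 128, size=2) for _ in range(4))  # random control points
pts = cubic_bezier(p0, p1, p2, p3)

canvas = np.zeros((128, 128), dtype=np.uint8)
rows = np.clip(pts[:, 1].astype(int), 0, 127)
cols = np.clip(pts[:, 0].astype(int), 0, 127)
canvas[rows, cols] = 255  # one synthetic crease; real pipelines draw many such curves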
\ No newline at end of file diff --git a/data/2024/aaai/PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion b/data/2024/aaai/PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion new file mode 100644 index 0000000000..4d9f704d6c --- /dev/null +++ b/data/2024/aaai/PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion @@ -0,0 +1 @@ +The generalization of neural networks is a central challenge in machine learning, especially concerning the performance under distributions that differ from training ones. Current methods, mainly based on the data-driven paradigm such as data augmentation, adversarial training, and noise injection, may encounter limited generalization due to model non-smoothness. In this paper, we propose to investigate generalization from a Partial Differential Equation (PDE) perspective, aiming to enhance it directly through the underlying function of neural networks, rather than focusing on adjusting input data. Specifically, we first establish the connection between neural network generalization and the smoothness of the solution to a specific PDE, namely transport equation. Building upon this, we propose a general framework that introduces adaptive distributional diffusion into transport equation to enhance the smoothness of its solution, thereby improving generalization. In the context of neural networks, we put this theoretical framework into practice as PDE+ (PDE with Adaptive Distributional Diffusion) which diffuses each sample into a distribution covering semantically similar inputs. This enables better coverage of potentially unobserved distributions in training, thus improving generalization beyond merely data-driven methods. The effectiveness of PDE+ is validated through extensive experimental settings, demonstrating its superior performance compared to state-of-the-art methods. Our code is available at https://github.com/yuanyige/pde-add. \ No newline at end of file diff --git a/data/2024/aaai/PG-LBO: Enhancing High-Dimensional Bayesian Optimization with Pseudo-Label and Gaussian Process Guidance b/data/2024/aaai/PG-LBO: Enhancing High-Dimensional Bayesian Optimization with Pseudo-Label and Gaussian Process Guidance new file mode 100644 index 0000000000..a84e299adc --- /dev/null +++ b/data/2024/aaai/PG-LBO: Enhancing High-Dimensional Bayesian Optimization with Pseudo-Label and Gaussian Process Guidance @@ -0,0 +1 @@ +Variational Autoencoder based Bayesian Optimization (VAE-BO) has demonstrated its excellent performance in addressing high-dimensional structured optimization problems. However, current mainstream methods overlook the potential of utilizing a pool of unlabeled data to construct the latent space, while only concentrating on designing sophisticated models to leverage the labeled data. Despite their effective usage of labeled data, these methods often require extra network structures, additional procedure, resulting in computational inefficiency. To address this issue, we propose a novel method to effectively utilize unlabeled data with the guidance of labeled data. Specifically, we tailor the pseudo-labeling technique from semi-supervised learning to explicitly reveal the relative magnitudes of optimization objective values hidden within the unlabeled data. Based on this technique, we assign appropriate training weights to unlabeled data to enhance the construction of a discriminative latent space. 
Furthermore, we treat the VAE encoder and the Gaussian Process (GP) in Bayesian optimization as a unified deep kernel learning process, allowing the direct utilization of labeled data, which we term as Gaussian Process guidance. This directly and effectively integrates the goal of improving GP accuracy into the VAE training, thereby guiding the construction of the latent space. The extensive experiments demonstrate that our proposed method outperforms existing VAE-BO algorithms in various optimization scenarios. Our code will be published at https://github.com/TaicaiChen/PG-LBO. \ No newline at end of file diff --git a/data/2024/aaai/PHFormer: Multi-Fragment Assembly Using Proxy-Level Hybrid Transformer b/data/2024/aaai/PHFormer: Multi-Fragment Assembly Using Proxy-Level Hybrid Transformer new file mode 100644 index 0000000000..a9866af3b3 --- /dev/null +++ b/data/2024/aaai/PHFormer: Multi-Fragment Assembly Using Proxy-Level Hybrid Transformer @@ -0,0 +1 @@ +Fragment assembly involves restoring broken objects to their original geometries, and has many applications, such as archaeological restoration. Existing learning based frameworks have shown potential for solving part assembly problems with semantic decomposition, but cannot handle such geometrical decomposition problems. In this work, we propose a novel assembly framework, proxy level hybrid Transformer, with the core idea of using a hybrid graph to model and reason complex structural relationships between patches of fragments, dubbed as proxies. To this end, we propose a hybrid attention module, composed of intra and inter attention layers, enabling capturing of crucial contextual information within fragments and relative structural knowledge across fragments. Furthermore, we propose an adjacency aware hierarchical pose estimator, exploiting a decompose and integrate strategy. It progressively predicts adjacent probability and relative poses between fragments, and then implicitly infers their absolute poses by dynamic information integration. Extensive experimental results demonstrate that our method effectively reduces assembly errors while maintaining fast inference speed. The code is available at https://github.com/521piglet/PHFormer. \ No newline at end of file diff --git a/data/2024/aaai/PICNN: A Pathway towards Interpretable Convolutional Neural Networks b/data/2024/aaai/PICNN: A Pathway towards Interpretable Convolutional Neural Networks new file mode 100644 index 0000000000..274a5bdb49 --- /dev/null +++ b/data/2024/aaai/PICNN: A Pathway towards Interpretable Convolutional Neural Networks @@ -0,0 +1 @@ +Convolutional Neural Networks (CNNs) have exhibited great performance in discriminative feature learning for complex visual tasks. Besides discrimination power, interpretability is another important yet under-explored property for CNNs. One difficulty in the CNN interpretability is that filters and image classes are entangled. In this paper, we introduce a novel pathway to alleviate the entanglement between filters and image classes. The proposed pathway groups the filters in a late conv-layer of CNN into class-specific clusters. Clusters and classes are in a one-to-one relationship. Specifically, we use the Bernoulli sampling to generate the filter-cluster assignment matrix from a learnable filter-class correspondence matrix. To enable end-to-end optimization, we develop a novel reparameterization trick for handling the non-differentiable Bernoulli sampling. 
We evaluate the effectiveness of our method on ten widely used network architectures (including nine CNNs and a ViT) and five benchmark datasets. Experimental results have demonstrated that our method PICNN (the combination of standard CNNs with our proposed pathway) exhibits greater interpretability than standard CNNs while achieving higher or comparable discrimination power. \ No newline at end of file diff --git a/data/2024/aaai/PICSR: Prototype-Informed Cross-Silo Router for Federated Learning (Student Abstract) b/data/2024/aaai/PICSR: Prototype-Informed Cross-Silo Router for Federated Learning (Student Abstract) new file mode 100644 index 0000000000..0819e8fa44 --- /dev/null +++ b/data/2024/aaai/PICSR: Prototype-Informed Cross-Silo Router for Federated Learning (Student Abstract) @@ -0,0 +1 @@ +Federated Learning is an effective approach for learning from data distributed across multiple institutions. While most existing studies are aimed at improving predictive accuracy of models, little work has been done to explain knowledge differences between institutions and the benefits of collaboration. Understanding these differences is critical in cross-silo federated learning domains, e.g., in healthcare or banking, where each institution or silo has a different underlying distribution and stakeholders want to understand how their institution compares to their partners. We introduce Prototype-Informed Cross-Silo Router (PICSR) which utilizes a mixture of experts approach to combine local models derived from multiple silos. Furthermore, by computing data similarity to prototypical samples from each silo, we are able to ground the router’s predictions in the underlying dataset distributions. Experiments on a real-world heart disease prediction dataset show that PICSR retains high performance while enabling further explanations on the differences among institutions compared to a single black-box model. \ No newline at end of file diff --git a/data/2024/aaai/PM-INR: Prior-Rich Multi-Modal Implicit Large-Scale Scene Neural Representation b/data/2024/aaai/PM-INR: Prior-Rich Multi-Modal Implicit Large-Scale Scene Neural Representation new file mode 100644 index 0000000000..dbadbd1186 --- /dev/null +++ b/data/2024/aaai/PM-INR: Prior-Rich Multi-Modal Implicit Large-Scale Scene Neural Representation @@ -0,0 +1,5 @@ +Recent advancements in implicit neural representations have contributed to high-fidelity surface reconstruction and photorealistic novel view synthesis. However, with the expansion of the scene scale, such as block or city level, existing methods +will encounter challenges because traditional sampling cannot cope with the cubically growing sampling space. To alleviate the dependence on filling the sampling space, we explore using multi-modal priors to assist individual points to +obtain more global semantic information and propose a prior-rich multi-modal implicit neural representation network, PM-INR, for the outdoor unbounded large-scale scene. The core of our method is multi-modal prior extraction and cross-modal prior fusion modules. The former encodes codebooks from different modality inputs and extracts valuable priors, while the latter fuses priors to maintain view consistency and preserve unique features among multi-modal priors. Finally, feature-rich cross-modal priors are injected into the sampling +regions to allow each region to perceive global information without filling the sampling space.
Extensive experiments have demonstrated the effectiveness and robustness of our method for outdoor unbounded large-scale scene novel +view synthesis, which outperforms state-of-the-art methods in terms of PSNR, SSIM, and LPIPS. \ No newline at end of file diff --git a/data/2024/aaai/PMAC: Personalized Multi-Agent Communication b/data/2024/aaai/PMAC: Personalized Multi-Agent Communication new file mode 100644 index 0000000000..c490824aaf --- /dev/null +++ b/data/2024/aaai/PMAC: Personalized Multi-Agent Communication @@ -0,0 +1 @@ +Communication plays a crucial role in information sharing within the field of multi-agent reinforcement learning (MARL). However, how to transmit information that meets individual needs remains a long-standing challenge. Some existing works focus on using a common channel for information transfer, which limits the capability for local communication. Meanwhile, other works attempt to establish peer-to-peer communication topologies but suffer from quadratic complexity. In this paper, we propose Personalized Multi-Agent Communication (PMAC), which enables the formation of peer-to-peer communication topologies, personalized message sending, and personalized message receiving. All these modules in PMAC are performed using only multilayer perceptrons (MLPs) with linear computational complexity. Empirically, we show the strength of personalized communication in a variety of cooperative scenarios. Our approach exhibits competitive performance compared to existing methods while maintaining notable computational efficiency. \ No newline at end of file diff --git a/data/2024/aaai/PMET: Precise Model Editing in a Transformer b/data/2024/aaai/PMET: Precise Model Editing in a Transformer new file mode 100644 index 0000000000..dd27aff960 --- /dev/null +++ b/data/2024/aaai/PMET: Precise Model Editing in a Transformer @@ -0,0 +1 @@ +Model editing techniques, which modify a minor proportion of knowledge in Large Language Models (LLMs) at a relatively low cost, have demonstrated notable success. Existing methods assume Transformer Layer (TL) hidden states are values of key-value memories of the Feed-Forward Network (FFN). They usually optimize the TL hidden states to memorize target knowledge and use it to update the weights of the FFN in LLMs. However, the information flow of TL hidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN, and residual connections. Existing methods neglect the fact that the TL hidden states contain information not specifically required by the FFN. Consequently, the performance of model editing decreases. To achieve more precise model editing, we analyze hidden states of MHSA and FFN, finding that MHSA encodes certain general knowledge extraction patterns. This implies that MHSA weights do not require updating when new knowledge is introduced. Based on the above findings, we introduce PMET, which simultaneously optimizes Transformer Component (TC, namely MHSA and FFN) hidden states, while only using the optimized TC hidden states of FFN to precisely update FFN weights. Our experiments demonstrate that PMET exhibits state-of-the-art performance on both the COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the effectiveness of our enhancements, further reinforcing the finding that the MHSA encodes certain general knowledge extraction patterns and indicating its storage of a small amount of factual knowledge. Our code is available at https://github.com/xpq-tech/PMET.
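PMET builds on the view of FFN weights as a linear key-value memory. As background only (the sketch below is the generic minimum-norm rank-one insertion used by locate-and-edit methods, not PMET's actual update, and the dimensions are illustrative), it shows how a single key-value pair can be written into a linear layer so that the edited weights map key k exactly to value v.

import numpy as np

def insert_key_value(W, k, v):
    # Minimum Frobenius-norm change to W such that (W + dW) @ k == v.
    residual = v - W @ k
    return W + np.outer(residual, k) / (k @ k)

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))   # stand-in for an FFN output projection
k = rng.standard_normal(d)        # key: hidden state encoding the edited subject
v = rng.standard_normal(d)        # value: optimized hidden state carrying the new fact
W_new = insert_key_value(W, k, v)
assert np.allclose(W_new @ k, v)  # the edited layer now maps the key to the target value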
\ No newline at end of file diff --git a/data/2024/aaai/PMRC: Prompt-Based Machine Reading Comprehension for Few-Shot Named Entity Recognition b/data/2024/aaai/PMRC: Prompt-Based Machine Reading Comprehension for Few-Shot Named Entity Recognition new file mode 100644 index 0000000000..152e46693c --- /dev/null +++ b/data/2024/aaai/PMRC: Prompt-Based Machine Reading Comprehension for Few-Shot Named Entity Recognition @@ -0,0 +1 @@ +The prompt-based method has been proven effective in improving the performance of pre-trained language models (PLMs) on sentence-level few-shot tasks. However, when applying prompting to token-level tasks such as Named Entity Recognition (NER), specific templates need to be designed, and all possible segments of the input text need to be enumerated. These methods have high computational complexity in both training and inference processes, making them difficult to apply in real-world scenarios. To address these issues, we redefine the NER task as a Machine Reading Comprehension (MRC) task and incorporate prompting into the MRC framework. Specifically, we sequentially insert boundary markers for various entity types into the templates and use these markers as anchors during the inference process to differentiate entity types. In contrast to the traditional multi-turn question-answering extraction in the MRC framework, our method can extract all spans of entity types in one round. Furthermore, we propose word-based template and example-based template that enhance the MRC framework's perception of entity start and end positions while significantly reducing the manual effort required for template design. It is worth noting that in cross-domain scenarios, PMRC does not require redesigning the model architecture and can continue training by simply replacing the templates to recognize entity types in the target domain. Experimental results demonstrate that our approach outperforms state-of-the-art models in low-resource settings, achieving an average performance improvement of +5.2% in settings where access to source domain data is limited. Particularly, on the ATIS dataset with a large number of entity types and 10-shot setting, PMRC achieves a performance improvement of +15.7%. Moreover, our method achieves a decoding speed 40.56 times faster than the template-based cloze-style approach. \ No newline at end of file diff --git a/data/2024/aaai/PNeRFLoc: Visual Localization with Point-Based Neural Radiance Fields b/data/2024/aaai/PNeRFLoc: Visual Localization with Point-Based Neural Radiance Fields new file mode 100644 index 0000000000..291ad463ff --- /dev/null +++ b/data/2024/aaai/PNeRFLoc: Visual Localization with Point-Based Neural Radiance Fields @@ -0,0 +1 @@ +Due to the ability to synthesize high-quality novel views, Neural Radiance Fields (NeRF) has been recently exploited to improve visual localization in a known environment. However, the existing methods mostly utilize NeRF for data augmentation to improve the regression model training, and their performances on novel viewpoints and appearances are still limited due to the lack of geometric constraints. In this paper, we propose a novel visual localization framework, i.e., PNeRFLoc, based on a unified point-based representation. On one hand, PNeRFLoc supports the initial pose estimation by matching 2D and 3D feature points as traditional structure-based methods; on the other hand, it also enables pose refinement with novel view synthesis using rendering-based optimization. 
Specifically, we propose a novel feature adaptation module to close the gaps between the features for visual localization and neural rendering. To improve the efficacy and efficiency of neural rendering-based optimization, we also develop an efficient rendering-based framework with a warping loss function. Extensive experiments demonstrate that PNeRFLoc performs the best on the synthetic dataset when the 3D NeRF model can be well learned, and significantly outperforms all the NeRF-boosted localization methods with on-par SOTA performance on the real-world benchmark localization datasets. Project webpage: https://zju3dv.github.io/PNeRFLoc/. \ No newline at end of file diff --git a/data/2024/aaai/PNeSM: Arbitrary 3D Scene Stylization via Prompt-Based Neural Style Mapping b/data/2024/aaai/PNeSM: Arbitrary 3D Scene Stylization via Prompt-Based Neural Style Mapping new file mode 100644 index 0000000000..0e93cfdf45 --- /dev/null +++ b/data/2024/aaai/PNeSM: Arbitrary 3D Scene Stylization via Prompt-Based Neural Style Mapping @@ -0,0 +1 @@ +3D scene stylization refers to transforming the appearance of a 3D scene to match a given style image, ensuring that images rendered from different viewpoints exhibit the same style as the given style image, while maintaining the 3D consistency of the stylized scene. Several existing methods have obtained impressive results in stylizing 3D scenes. However, the models proposed by these methods need to be re-trained when applied to a new scene. In other words, their models are coupled with a specific scene and cannot adapt to arbitrary other scenes. To address this issue, we propose a novel 3D scene stylization framework to transfer an arbitrary style to an arbitrary scene, without any style-related or scene-related re-training. Concretely, we first map the appearance of the 3D scene into a 2D style pattern space, which realizes complete disentanglement of the geometry and appearance of the 3D scene and makes our model generalize to arbitrary 3D scenes. Then we stylize the appearance of the 3D scene in the 2D style pattern space via a prompt-based 2D stylization algorithm. Experimental results demonstrate that our proposed framework is superior to SOTA methods in both visual quality and generalization. \ No newline at end of file diff --git a/data/2024/aaai/PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning b/data/2024/aaai/PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning new file mode 100644 index 0000000000..0fdb59162b --- /dev/null +++ b/data/2024/aaai/PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning @@ -0,0 +1 @@ +Despite many breakthroughs in recent years, it is still hard for MultiAgent Reinforcement Learning (MARL) algorithms to directly solve complex tasks in MultiAgent Systems (MASs) from scratch. In this work, we study how to use Automatic Curriculum Learning (ACL) to reduce the number of environmental interactions required to learn a good policy. In order to solve a difficult task, ACL methods automatically select a sequence of tasks (i.e., curricula). The idea is to obtain maximum learning progress towards the final task by continuously learning on tasks that match the current capabilities of the learners. The key question is how to measure the learning progress of the learner for better curriculum selection. We propose a novel ACL framework, PrOgRessive mulTiagent Automatic curricuLum (PORTAL), for MASs.
PORTAL selects curricula according to two criteria: 1) How difficult is a task, relative to the learners’ current abilities? 2) How similar is a task, relative to the final task? By learning a shared feature space between tasks, PORTAL is able to characterize different tasks based on the distribution of features and select those that are similar to the final task. Also, the shared feature space can effectively facilitate the policy transfer between curricula. Experimental results show that PORTAL can train agents to master extremely hard cooperative tasks, which cannot be achieved with previous state-of-the-art MARL algorithms. \ No newline at end of file diff --git a/data/2024/aaai/PPEA-Depth: Progressive Parameter-Efficient Adaptation for Self-Supervised Monocular Depth Estimation b/data/2024/aaai/PPEA-Depth: Progressive Parameter-Efficient Adaptation for Self-Supervised Monocular Depth Estimation new file mode 100644 index 0000000000..4748942b41 --- /dev/null +++ b/data/2024/aaai/PPEA-Depth: Progressive Parameter-Efficient Adaptation for Self-Supervised Monocular Depth Estimation @@ -0,0 +1,2 @@ +Self-supervised monocular depth estimation is of significant importance with applications spanning across autonomous driving and robotics. However, the reliance on self-supervision introduces a strong static-scene assumption, thereby posing challenges in achieving optimal performance in dynamic scenes, which are prevalent in most real-world situations. +To address these issues, we propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to transfer a pre-trained image model for self-supervised depth estimation. The training comprises two sequential stages: an initial phase trained on a dataset primarily composed of static scenes, succeeded by an expansion to more intricate datasets involving dynamic scenes. To facilitate this process, we design compact encoder and decoder adapters to enable parameter-efficient tuning, allowing the network to adapt effectively. They not only uphold generalized patterns from pre-trained image models but also retain knowledge gained from the preceding phase into the subsequent one. Extensive experiments demonstrate that PPEA-Depth achieves state-of-the-art performance on the KITTI, CityScapes, and DDAD datasets. \ No newline at end of file diff --git a/data/2024/aaai/PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine b/data/2024/aaai/PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine new file mode 100644 index 0000000000..c5e3d9b158 --- /dev/null +++ b/data/2024/aaai/PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine @@ -0,0 +1 @@ +As an effective tool for eliciting the power of Large Language Models (LLMs), prompting has recently demonstrated unprecedented abilities across a variety of complex tasks. To further improve the performance, prompt ensemble has attracted substantial interest for tackling the hallucination and instability of LLMs. However, existing methods usually adopt a two-stage paradigm, which requires a pre-prepared set of prompts with substantial manual effort, and is unable to perform directed optimization for different weak learners. In this paper, we propose a simple, universal, and automatic method named PREFER (Prompt Ensemble learning via Feedback-Reflect-Refine) to address the stated limitations. Specifically, given the fact that weak learners are supposed to focus on hard examples during boosting, PREFER builds a feedback mechanism for reflecting on the inadequacies of existing weak learners.
Based on this, the LLM is required to automatically synthesize new prompts for iterative refinement. Moreover, to enhance stability of the prompt effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and is beneficial for both feedback and weight calculation in boosting. Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin. We have made our code publicly available. \ No newline at end of file diff --git a/data/2024/aaai/PRP Rebooted: Advancing the State of the Art in FOND Planning b/data/2024/aaai/PRP Rebooted: Advancing the State of the Art in FOND Planning new file mode 100644 index 0000000000..5af0626f7d --- /dev/null +++ b/data/2024/aaai/PRP Rebooted: Advancing the State of the Art in FOND Planning @@ -0,0 +1 @@ +Fully Observable Non-Deterministic (FOND) planning is a variant of classical symbolic planning in which actions are nondeterministic, with an action's outcome known only upon execution. It is a popular planning paradigm with applications ranging from robot planning to dialogue-agent design and reactive synthesis. Over the last 20 years, a number of approaches to FOND planning have emerged. In this work, we establish a new state of the art, following in the footsteps of some of the most powerful FOND planners to date. Our planner, PR2, decisively outperforms the four leading FOND planners, at times by a large margin, in 17 of 18 domains that represent a comprehensive benchmark suite. Ablation studies demonstrate the impact of various techniques we introduce, with the largest improvement coming from our novel FOND-aware heuristic. \ No newline at end of file diff --git a/data/2024/aaai/PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction b/data/2024/aaai/PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction new file mode 100644 index 0000000000..ffc9972df4 --- /dev/null +++ b/data/2024/aaai/PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction @@ -0,0 +1 @@ +Compound-Protein Interaction (CPI) prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery. Existing deep learning-based methods utilize only the single modality of protein sequences or structures and lack the co-modeling of the joint distribution of the two modalities, which may lead to significant performance drops in complex real-world scenarios due to various factors, e.g., modality missing and domain shifting. More importantly, these methods only model protein sequences and structures at a single fixed scale, neglecting more fine-grained multi-scale information, such as those embedded in key protein fragments. In this paper, we propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction (PSC-CPI), which captures the dependencies between protein sequences and structures through both intra-modality and cross-modality contrasting. We further apply length-variable protein augmentation to allow contrasting to be performed at different scales, from the amino acid level to the sequence level. 
Finally, in order to more fairly evaluate the model generalizability, we split the test data into four settings based on whether compounds and proteins have been observed during the training stage. Extensive experiments have shown that PSC-CPI generalizes well in all four settings, particularly in the more challenging "Unseen-Both" setting, where neither compounds nor proteins have been observed during training. Furthermore, even when encountering a situation of modality missing, i.e., inference with only single-modality protein data, PSC-CPI still exhibits comparable or even better performance than previous approaches. \ No newline at end of file diff --git a/data/2024/aaai/PTMQ: Post-training Multi-Bit Quantization of Neural Networks b/data/2024/aaai/PTMQ: Post-training Multi-Bit Quantization of Neural Networks new file mode 100644 index 0000000000..edf383877b --- /dev/null +++ b/data/2024/aaai/PTMQ: Post-training Multi-Bit Quantization of Neural Networks @@ -0,0 +1 @@ +The ability of model quantization with arbitrary bit-width to dynamically meet diverse bit-width requirements during runtime has attracted significant attention. Recent research has focused on optimizing large-scale training methods to achieve robust bit-width adaptation, which is a time-consuming process requiring hundreds of GPU hours. Furthermore, converting bit-widths requires recalculating the statistical parameters of the norm layers, thereby impeding real-time switching of the bit-width. To overcome these challenges, we propose an efficient Post-Training Multi-bit Quantization (PTMQ) scheme that requires only a small amount of calibration data to perform block-wise reconstruction of multi-bit quantization errors. It eliminates the influence of statistical parameters by fusing norm layers, and supports real-time switching of bit-widths in uniform quantization and mixed-precision quantization. To improve quantization accuracy and robustness, we propose a Multi-bit Feature Mixer technique (MFM) for fusing features of different bit-widths to enhance robustness across varying bit-widths. Moreover, we introduce the Group-wise Distillation Loss (GD-Loss) to enhance the correlation between different bit-width groups and further improve the overall performance of PTMQ. Extensive experiments demonstrate that PTMQ achieves comparable performance to existing state-of-the-art post-training quantization methods, while its optimization is 100× faster than recent multi-bit quantization works. Code is available at https://github.com/xuke225/PTMQ. \ No newline at end of file diff --git a/data/2024/aaai/PTUS: Photo-Realistic Talking Upper-Body Synthesis via 3D-Aware Motion Decomposition Warping b/data/2024/aaai/PTUS: Photo-Realistic Talking Upper-Body Synthesis via 3D-Aware Motion Decomposition Warping new file mode 100644 index 0000000000..02d4dd6e03 --- /dev/null +++ b/data/2024/aaai/PTUS: Photo-Realistic Talking Upper-Body Synthesis via 3D-Aware Motion Decomposition Warping @@ -0,0 +1 @@ +Talking upper-body synthesis is a promising task due to its versatile potential for video creation and consists of animating the body and face from a source image with the motion from a given driving video. However, prior synthesis approaches fall short in addressing this task and have been either limited to animating heads of a target person only, or have animated the upper body but neglected the synthesis of precise facial details.
To tackle this task, we propose a Photo-realistic Talking Upper-body Synthesis method via 3D-aware motion decomposition warping, named PTUS, to precisely synthesize the upper body as well as recover the details of the face such as blinking and lip synchronization. In particular, the motion decomposition mechanism consists of a face-body motion decomposition, which decouples the 3D motion estimation of the face and body, and a local-global motion decomposition, which decomposes the 3D face motion into global and local motions, resulting in the transfer of facial expressions. The 3D-aware warping module transfers the large-scale and subtle 3D motions to the extracted 3D depth-aware features in a coarse-to-fine manner. Moreover, we present a new dataset, Talking-UB, which includes upper-body images with high-resolution faces, addressing the limitations of prior datasets that either consist of only facial images or upper-body images with blurry faces. Experimental results demonstrate that our proposed method can synthesize high-quality videos that preserve facial details, and achieves superior results compared to state-of-the-art cross-person motion transfer approaches. Code and the collected dataset are released at https://github.com/cooluoluo/PTUS. \ No newline at end of file diff --git a/data/2024/aaai/PVALane: Prior-Guided 3D Lane Detection with View-Agnostic Feature Alignment b/data/2024/aaai/PVALane: Prior-Guided 3D Lane Detection with View-Agnostic Feature Alignment new file mode 100644 index 0000000000..99ab6f5838 --- /dev/null +++ b/data/2024/aaai/PVALane: Prior-Guided 3D Lane Detection with View-Agnostic Feature Alignment @@ -0,0 +1 @@ +Monocular 3D lane detection is essential for a reliable autonomous driving system and has recently been rapidly developing. Existing popular methods mainly employ a predefined 3D anchor for lane detection based on front-viewed (FV) space, aiming to mitigate the effects of view transformations. However, the perspective geometric distortion between FV and 3D space in this FV-based approach introduces extremely dense anchor designs, which ultimately leads to confusing lane representations. In this paper, we introduce a novel prior-guided perspective on lane detection and propose an end-to-end framework named PVALane, which utilizes 2D prior knowledge to achieve precise and efficient 3D lane detection. Since 2D lane predictions can provide strong priors for lane existence, PVALane exploits FV features to generate sparse prior anchors with potential lanes in 2D space. These dynamic prior anchors help PVALane to achieve distinct lane representations and effectively improve the precision of PVALane due to the reduced lane search space. Additionally, by leveraging these prior anchors and representing lanes in both FV and bird-eye-viewed (BEV) spaces, we effectively align and merge semantic and geometric information from FV and BEV features. Extensive experiments conducted on the OpenLane and ONCE-3DLanes datasets demonstrate the superior performance of our method compared to existing state-of-the-art approaches and exhibit excellent robustness.
\ No newline at end of file diff --git a/data/2024/aaai/PaintHuman: Towards High-Fidelity Text-to-3D Human Texturing via Denoised Score Distillation b/data/2024/aaai/PaintHuman: Towards High-Fidelity Text-to-3D Human Texturing via Denoised Score Distillation new file mode 100644 index 0000000000..eaabdbe03b --- /dev/null +++ b/data/2024/aaai/PaintHuman: Towards High-Fidelity Text-to-3D Human Texturing via Denoised Score Distillation @@ -0,0 +1 @@ +Recent advances in zero-shot text-to-3D human generation, which employ the human model prior (e.g., SMPL) or Score Distillation Sampling (SDS) with pre-trained text-to-image diffusion models, have been groundbreaking. However, SDS may provide inaccurate gradient directions under the weak diffusion guidance, as it tends to produce over-smoothed results and generate body textures that are inconsistent with the detailed mesh geometry. Therefore, directly leveraging existing strategies for high-fidelity text-to-3D human texturing is challenging. In this work, we propose a model called PaintHuman to address the challenges from two perspectives. We first propose a novel score function, Denoised Score Distillation (DSD), which directly modifies the SDS by introducing negative gradient components to iteratively correct the gradient direction and generate high-quality textures. In addition, we use the depth map as a geometric guide to ensure that the texture is semantically aligned to human mesh surfaces. To guarantee the quality of rendered results, we employ geometry-aware networks to predict surface materials and render realistic human textures. Extensive experiments, benchmarked against state-of-the-art (SoTA) methods, validate the efficacy of our approach. Project page: https://painthuman.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/Painterly Image Harmonization by Learning from Painterly Objects b/data/2024/aaai/Painterly Image Harmonization by Learning from Painterly Objects new file mode 100644 index 0000000000..4c487abefd --- /dev/null +++ b/data/2024/aaai/Painterly Image Harmonization by Learning from Painterly Objects @@ -0,0 +1 @@ +Given a composite image with a photographic object and a painterly background, painterly image harmonization aims at stylizing the composite object to be compatible with the background. Despite the competitive performance of existing painterly harmonization works, they did not fully leverage the painterly objects in artistic paintings. In this work, we explore learning from painterly objects for painterly image harmonization. In particular, we learn a mapping from background style and object information to object style based on painterly objects in artistic paintings. With the learnt mapping, we can hallucinate the target style of the composite object, which is used to harmonize encoder feature maps to produce the harmonized image. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our proposed method.
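The harmonization of encoder feature maps described in the abstract above can be pictured with AdaIN-style statistics matching over the composite-object region. The sketch below is an illustrative assumption (the mask, feature shapes, and target statistics are made up, and the paper's actual operator may differ).

```python
import numpy as np

def adain_region(feat, mask, target_mean, target_std, eps=1e-5):
    """Match the channel-wise mean/std of the masked (composite-object)
    region of a feature map to hallucinated target-style statistics.
    feat: (C, H, W); mask: (H, W) booleans; target_mean/std: (C,)."""
    out = feat.copy()
    region = feat[:, mask]                           # (C, N) masked features
    mu = region.mean(axis=1, keepdims=True)
    sigma = region.std(axis=1, keepdims=True) + eps
    normalized = (region - mu) / sigma
    out[:, mask] = normalized * target_std[:, None] + target_mean[:, None]
    return out

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 16, 16))
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True                              # hypothetical object region
harmonized = adain_region(feat, mask, target_mean=np.ones(4), target_std=0.5 * np.ones(4))
print(harmonized[:, mask].mean(axis=1), harmonized[:, mask].std(axis=1))
```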
\ No newline at end of file diff --git a/data/2024/aaai/Pairwise-Label-Based Deep Incremental Hashing with Simultaneous Code Expansion b/data/2024/aaai/Pairwise-Label-Based Deep Incremental Hashing with Simultaneous Code Expansion new file mode 100644 index 0000000000..d6268d7cc3 --- /dev/null +++ b/data/2024/aaai/Pairwise-Label-Based Deep Incremental Hashing with Simultaneous Code Expansion @@ -0,0 +1,4 @@ +Deep incremental hashing has become a subject of considerable interest due to its capability to learn hash codes in an incremental manner, eliminating the need to generate codes for classes that have already been learned. However, accommodating more classes requires longer hash codes, and regenerating database codes becomes inevitable when code expansion is required. +In this paper, we present a unified deep hash framework that can simultaneously learn new classes and increase hash code capacity. Specifically, we design a triple-channel asymmetric framework to optimize a new CNN model with a target code length and a code projection matrix. This enables us to directly generate hash codes for new images, and efficiently generate expanded hash codes for original database images from the old ones with the learned projection matrix. +Meanwhile, we propose a pairwise-label-based incremental similarity-preserving loss to optimize the new CNN model, which can incrementally preserve new similarities while maintaining the old ones. Additionally, we design a double-end quantization loss to reduce the quantization error from new and original query images. As a result, our method efficiently embeds both new and original similarities into the expanded hash codes, while keeping the original database codes unchanged. +We conduct extensive experiments on three widely-used image retrieval benchmarks, demonstrating that our method can significantly reduce the time required to expand existing database codes, while maintaining state-of-the-art retrieval performance. \ No newline at end of file diff --git a/data/2024/aaai/Pandora's Problem with Deadlines b/data/2024/aaai/Pandora's Problem with Deadlines new file mode 100644 index 0000000000..e81324a7dd --- /dev/null +++ b/data/2024/aaai/Pandora's Problem with Deadlines @@ -0,0 +1,3 @@ +Pandora’s problem is a fundamental model that studies optimal search under costly inspection. In the classic version, there are n boxes, each associated with a known cost and a known distribution over values. A strategy inspects the boxes sequentially and obtains a utility that equals the difference between the maximum value of an inspected box and the total inspection cost. Weitzman (1979) presented a surprisingly simple strategy that obtains the optimal expected utility. + +In this work we introduce a new variant of Pandora’s problem in which every box is also associated with a publicly known deadline, indicating the final round by which its value may be chosen. This model captures many real-life scenarios where alternatives admit deadlines, such as candidate interviews and college admissions. Our main result is an efficient threshold-based strategy that achieves a constant approximation relative to the performance of the optimal strategy for the deadlines setting. 
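As background for the Pandora's-problem abstract above, here is a minimal sketch of Weitzman's index strategy for the classic (deadline-free) setting: each box gets a reservation value sigma solving c = E[max(X - sigma, 0)], boxes are inspected in decreasing sigma order, and search stops once the best observed value exceeds every remaining index (outside option 0). The example distributions, costs, and bisection tolerance are illustrative assumptions.

```python
import random

def reservation_value(values, probs, cost, lo=-1e6, hi=1e6, iters=100):
    """Solve c = E[max(X - sigma, 0)] for sigma by bisection.
    The expectation on the left is decreasing in sigma."""
    def excess(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if excess(mid) > cost:
            lo = mid          # sigma still too small
        else:
            hi = mid
    return (lo + hi) / 2.0

def weitzman_utility(boxes, rng=random.Random(0)):
    """boxes: list of (values, probs, cost). Open boxes in decreasing
    reservation-value order; stop once the best value found so far exceeds
    every remaining reservation value (outside option assumed to be 0)."""
    sigmas = [reservation_value(*b) for b in boxes]
    order = sorted(range(len(boxes)), key=lambda i: sigmas[i], reverse=True)
    best, total_cost = 0.0, 0.0
    for i in order:
        if best >= sigmas[i]:
            break
        values, probs, cost = boxes[i]
        total_cost += cost
        best = max(best, rng.choices(values, probs)[0])
    return best - total_cost

# Two hypothetical boxes: a cheap low-variance one and a costly risky one.
boxes = [([4.0, 6.0], [0.5, 0.5], 0.5),
         ([0.0, 20.0], [0.8, 0.2], 1.0)]
print(weitzman_utility(boxes))
```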
\ No newline at end of file diff --git a/data/2024/aaai/Pano-NeRF: Synthesizing High Dynamic Range Novel Views with Geometry from Sparse Low Dynamic Range Panoramic Images b/data/2024/aaai/Pano-NeRF: Synthesizing High Dynamic Range Novel Views with Geometry from Sparse Low Dynamic Range Panoramic Images new file mode 100644 index 0000000000..c5cc1da51a --- /dev/null +++ b/data/2024/aaai/Pano-NeRF: Synthesizing High Dynamic Range Novel Views with Geometry from Sparse Low Dynamic Range Panoramic Images @@ -0,0 +1 @@ +Panoramic imaging research on geometry recovery and High Dynamic Range (HDR) reconstruction is becoming a trend with the development of Extended Reality (XR). Neural Radiance Fields (NeRF) provide a promising scene representation for both tasks without requiring extensive prior data. However, in the case of inputting sparse Low Dynamic Range (LDR) panoramic images, NeRF often degrades with under-constrained geometry and is unable to reconstruct HDR radiance from LDR inputs. We observe that the radiance from each pixel in panoramic images can be modeled as both a signal to convey scene lighting information and a light source to illuminate other pixels. Hence, we propose the irradiance fields from sparse LDR panoramic images, which increases the observation counts for faithful geometry recovery and leverages the irradiance-radiance attenuation for HDR reconstruction. Extensive experiments demonstrate that the irradiance fields outperform state-of-the-art methods on both geometry recovery and HDR reconstruction and validate their effectiveness. Furthermore, we show a promising byproduct of spatially-varying lighting estimation. The code is available at https://github.com/Lu-Zhan/Pano-NeRF. \ No newline at end of file diff --git a/data/2024/aaai/Panoptic Scene Graph Generation with Semantics-Prototype Learning b/data/2024/aaai/Panoptic Scene Graph Generation with Semantics-Prototype Learning new file mode 100644 index 0000000000..695fed9873 --- /dev/null +++ b/data/2024/aaai/Panoptic Scene Graph Generation with Semantics-Prototype Learning @@ -0,0 +1,5 @@ +Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicates) to connect human language and visual scenes. +However, different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations in the dataset, i.e., different predicates for the same object pairs. +Biased predicate annotations make PSG models struggle in constructing a clear decision plane among predicates, which greatly hinders the real application of PSG models. +To address the intrinsic bias above, we propose a novel framework named ADTrans to adaptively transfer biased predicate annotations to informative and unified ones. To ensure consistency and accuracy during the transfer process, we propose to observe the invariance degree of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities. Meanwhile, we continuously measure the distribution changes between each representation and its prototype, and constantly screen potentially biased data. Finally, with the unbiased predicate-prototype representation embedding space, biased annotations are easily identified. +Experiments show that ADTrans significantly improves the performance of benchmark models, achieving a new state-of-the-art performance, and shows great generalization and effectiveness on multiple datasets.
Our code is released at https://github.com/lili0415/PSG-biased-annotation. \ No newline at end of file diff --git a/data/2024/aaai/Pantypes: Diverse Representatives for Self-Explainable Models b/data/2024/aaai/Pantypes: Diverse Representatives for Self-Explainable Models new file mode 100644 index 0000000000..f0c8c6e745 --- /dev/null +++ b/data/2024/aaai/Pantypes: Diverse Representatives for Self-Explainable Models @@ -0,0 +1,2 @@ +Prototypical self-explainable classifiers have emerged to meet the growing demand for interpretable AI systems. These classifiers are designed to incorporate high transparency in their decisions by basing inference on similarity with learned prototypical objects. While these models are designed with diversity in mind, the learned prototypes often do not sufficiently represent all aspects of the input distribution, particularly those in low density regions. +Such lack of sufficient data representation, known as representation bias, has been associated with various detrimental properties related to machine learning diversity and fairness. In light of this, we introduce pantypes, a new family of prototypical objects designed to capture the full diversity of the input distribution through a sparse set of objects. We show that pantypes can empower prototypical self-explainable models by occupying divergent regions of the latent space and thus fostering high diversity, interpretability and fairness. \ No newline at end of file diff --git a/data/2024/aaai/ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer b/data/2024/aaai/ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer new file mode 100644 index 0000000000..5870c9d8bd --- /dev/null +++ b/data/2024/aaai/ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer @@ -0,0 +1 @@ +Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. Target "styles" can be defined in numerous ways, ranging from single attributes (e.g. formality) to authorship (e.g. Shakespeare). Previous unsupervised style-transfer approaches generally rely on significant amounts of labeled data for only a fixed set of styles or require large language models. In contrast, we introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles at inference time. Our parameter-efficient approach, ParaGuide, leverages paraphrase-conditioned diffusion models alongside gradient-based guidance from both off-the-shelf classifiers and strong existing style embedders to transform the style of text while preserving semantic information. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer. 
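The gradient-based guidance mentioned in the ParaGuide abstract above can be pictured as nudging each denoising step along the gradient of an attribute scorer with respect to the latent. The sketch below uses toy stand-ins for the denoiser and the classifier and is not the paper's actual procedure; names and the guidance scale are illustrative assumptions.

```python
import torch

def guided_step(x_t, t, denoise, attr_log_prob, guidance_scale=1.0):
    """One denoising step with gradient-based attribute guidance: take the
    model's proposal, then nudge it along the gradient of an attribute
    scorer (e.g., a formality classifier) evaluated on the current latent."""
    with torch.no_grad():
        x_proposal = denoise(x_t, t)          # unguided, paraphrase-conditioned proposal
    x_t = x_t.detach().requires_grad_(True)
    score = attr_log_prob(x_t)                # log p(target attribute | x_t)
    grad, = torch.autograd.grad(score.sum(), x_t)
    return x_proposal + guidance_scale * grad

# Toy stand-ins: a "denoiser" that shrinks the latent and a linear attribute scorer.
denoise = lambda x, t: 0.9 * x
w = torch.randn(16)
attr_log_prob = lambda x: x @ w
x = torch.randn(4, 16)
print(guided_step(x, t=10, denoise=denoise, attr_log_prob=attr_log_prob).shape)
```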
\ No newline at end of file diff --git a/data/2024/aaai/Parallel Beam Search Algorithms for Domain-Independent Dynamic Programming b/data/2024/aaai/Parallel Beam Search Algorithms for Domain-Independent Dynamic Programming new file mode 100644 index 0000000000..7df6fad65e --- /dev/null +++ b/data/2024/aaai/Parallel Beam Search Algorithms for Domain-Independent Dynamic Programming @@ -0,0 +1 @@ +Domain-independent dynamic programming (DIDP), a model-based paradigm based on dynamic programming, has shown promising performance on multiple combinatorial optimization problems compared with mixed integer programming (MIP) and constraint programming (CP). The current DIDP solvers are based on heuristic search, and the state-of-the-art solver, complete anytime beam search (CABS), uses beam search. However, the current DIDP solvers cannot utilize multiple threads, unlike state-of-the-art MIP and CP solvers. In this paper, we propose three parallel beam search algorithms and develop multi-thread implementations of CABS. With 32 threads, our multi-thread DIDP solvers achieve a 9 to 39 times speedup on average and significant performance improvement over the sequential solver, finding new best solutions for two instances of the traveling salesperson problem with time windows. In addition, our solvers outperform multi-thread MIP and CP solvers in four of the six combinatorial optimization problems evaluated. \ No newline at end of file diff --git a/data/2024/aaai/Parallel Empirical Evaluations: Resilience despite Concurrency b/data/2024/aaai/Parallel Empirical Evaluations: Resilience despite Concurrency new file mode 100644 index 0000000000..cb7a6f9ff0 --- /dev/null +++ b/data/2024/aaai/Parallel Empirical Evaluations: Resilience despite Concurrency @@ -0,0 +1,2 @@ +Computational evaluations are crucial in modern problem-solving when we surpass theoretical algorithms or bounds. These experiments frequently require substantial effort, and the sheer amount of needed resources makes it impossible to execute them on a single personal computer or laptop. Cluster schedulers allow for automating these tasks and scaling them to many computers. However, when we evaluate implementations of combinatorial algorithms, we depend on stable runtime results. Common approaches either limit parallelism or suffer from unstable runtime measurements due to interference among jobs on modern hardware. The former is inefficient and not sustainable. The latter results in unreplicable experiments. +In this work, we address this issue and offer an acceptable balance between efficiency, software and hardware complexity, reliability, and replicability. We investigate effects on replicability and runtime stability and illustrate how to efficiently use widely employed cluster resources for parallel evaluations. Furthermore, we present solutions which mitigate issues that emerge from the concurrent execution of benchmark jobs. Our experimental evaluation shows that – despite parallel execution – our approach reduces the runtime instability on the majority of instances to one second. \ No newline at end of file diff --git a/data/2024/aaai/Parallel Ranking of Ads and Creatives in Real-Time Advertising Systems b/data/2024/aaai/Parallel Ranking of Ads and Creatives in Real-Time Advertising Systems new file mode 100644 index 0000000000..fa9c0482d7 --- /dev/null +++ b/data/2024/aaai/Parallel Ranking of Ads and Creatives in Real-Time Advertising Systems @@ -0,0 +1 @@ +Creativity is the heart and soul of advertising services.
Effective creatives can create a win-win scenario: advertisers each target users and achieve marketing objectives more effectively, users more quickly find products of interest, and platforms generate more advertising revenue. With the advent of AI-Generated Content, advertisers now can produce vast amounts of creative content at a minimal cost. The current challenge lies in how advertising systems can select the most pertinent creative in real-time for each user personally. Existing methods typically perform serial ranking of ads or creatives, limiting the creative module in terms of both effectiveness and efficiency. In this paper, we propose for the first time a novel architecture for online parallel estimation of ads and creatives ranking, as well as the corresponding offline joint optimization model. The online architecture enables sophisticated personalized creative modeling while reducing overall latency. The offline joint model for CTR estimation allows mutual awareness and collaborative optimization between ads and creatives. Additionally, we optimize the offline evaluation metrics for the implicit feedback sorting task involved in ad creative ranking. We conduct extensive experiments to compare ours with two state-of-the-art approaches. The results demonstrate the effectiveness of our approach in both offline evaluations and real-world advertising platforms online in terms of response time, CTR, and CPM. \ No newline at end of file diff --git a/data/2024/aaai/Parallel Vertex Diffusion for Unified Visual Grounding b/data/2024/aaai/Parallel Vertex Diffusion for Unified Visual Grounding new file mode 100644 index 0000000000..0a98e22ee8 --- /dev/null +++ b/data/2024/aaai/Parallel Vertex Diffusion for Unified Visual Grounding @@ -0,0 +1 @@ +Unified visual grounding (UVG) capitalizes on a wealth of task-related knowledge across various grounding tasks via one-shot training, which curtails retraining costs and task-specific architecture design efforts. Vertex generation-based UVG methods achieve this versatility by unified modeling object box and contour prediction and provide a text-powered interface to vast related multi-modal tasks, e.g., visual question answering and captioning. However, these methods typically generate vertexes sequentially through autoregression, which is prone to be trapped in error accumulation and heavy computation, especially for high-dimension sequence generation in complex scenarios. In this paper, we develop Parallel Vertex Diffusion (PVD) based on the parallelizability of diffusion models to accurately and efficiently generate vertexes in a parallel and scalable manner. Since the coordinates fluctuate greatly, it typically encounters slow convergence when training diffusion models without geometry constraints. Therefore, we consummate our PVD by two critical components, i.e., center anchor mechanism and angle summation loss, which serve to normalize coordinates and adopt a differentiable geometry descriptor from the point-in-polygon problem of computational geometry to constrain the overall difference of prediction and label vertexes. These innovative designs empower our PVD to demonstrate its superiority with state-of-the-art performance across various grounding tasks. 
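The angle summation loss in the Parallel Vertex Diffusion abstract above builds on the classical point-in-polygon descriptor from computational geometry. The snippet below illustrates that underlying test only (not the paper's differentiable loss); the example polygon and query points are illustrative.

```python
import math

def angle_summation(point, polygon):
    """Classical point-in-polygon descriptor: sum the signed angles that
    consecutive polygon edges subtend at `point`. The sum is roughly ±2π
    when the point lies inside the polygon and roughly 0 when outside."""
    px, py = point
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i][0] - px, polygon[i][1] - py
        x2, y2 = polygon[(i + 1) % n][0] - px, polygon[(i + 1) % n][1] - py
        # Signed angle between the two vectors via atan2(cross, dot).
        total += math.atan2(x1 * y2 - y1 * x2, x1 * x2 + y1 * y2)
    return total

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(angle_summation((0.5, 0.5), square))  # ~2π -> inside
print(angle_summation((2.0, 2.0), square))  # ~0  -> outside
```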
\ No newline at end of file diff --git a/data/2024/aaai/Parameterization of (Partial) Maximum Satisfiability above Matching in a Variable-Clause Graph b/data/2024/aaai/Parameterization of (Partial) Maximum Satisfiability above Matching in a Variable-Clause Graph new file mode 100644 index 0000000000..4cc80ead10 --- /dev/null +++ b/data/2024/aaai/Parameterization of (Partial) Maximum Satisfiability above Matching in a Variable-Clause Graph @@ -0,0 +1 @@ +In this paper, we study the Maximum Satisfiability and the Partial Maximum Satisfiability problems. Using Gallai–Edmonds decomposition, we significantly improve the upper bound for the Maximum Satisfiability problem parameterized above maximum matching in the variable-clause graph. Our algorithm operates with a runtime of O*(2.83^k'), a substantial improvement compared to the previous approach requiring O*(4^k'), where k' denotes the relevant parameter. Moreover, this result immediately implies O*(1.14977^m) and O*(1.27895^m) time algorithms for (n, 3)-MaxSAT and (n, 4)-MaxSAT, where m is the overall number of clauses. These upper bounds improve the previously known upper bounds of O*(1.1554^m) and O*(1.2872^m). We also adapt the algorithm so that it can handle instances of Partial Maximum Satisfiability without losing performance in some cases. Note that this is somewhat surprising, as the existence of even one hard clause can significantly increase the hardness of a problem. \ No newline at end of file diff --git a/data/2024/aaai/Parameterized Approximation Algorithms for Sum of Radii Clustering and Variants b/data/2024/aaai/Parameterized Approximation Algorithms for Sum of Radii Clustering and Variants new file mode 100644 index 0000000000..3d4bab0563 --- /dev/null +++ b/data/2024/aaai/Parameterized Approximation Algorithms for Sum of Radii Clustering and Variants @@ -0,0 +1,3 @@ +Clustering is one of the most fundamental tools in artificial intelligence, machine learning, and data mining. In this paper, we follow one of the recent mainstream topics of clustering, Sum of Radii (SoR), which naturally arises as a balance between the folklore k-center and k-median. SoR aims to determine a set of k balls, each centered at a point in a given dataset, such that their union covers the entire dataset while minimizing the sum of radii of the k balls. +We propose a general technical framework to overcome the challenge posed by varying radii in SoR, which yields fixed-parameter tractable (fpt) algorithms with respect to k (i.e., whose running time is f(k) poly(n) for some f). +Our framework is versatile and obtains fpt approximation algorithms with constant approximation ratios for SoR as well as its variants in general metrics, such as Fair SoR and Matroid SoR, which significantly improve the previous results. \ No newline at end of file diff --git a/data/2024/aaai/Parameterized Projected Bellman Operator b/data/2024/aaai/Parameterized Projected Bellman Operator new file mode 100644 index 0000000000..5f9100cf0a --- /dev/null +++ b/data/2024/aaai/Parameterized Projected Bellman Operator @@ -0,0 +1 @@ +Approximate value iteration (AVI) is a family of algorithms for reinforcement learning (RL) that aims to obtain an approximation of the optimal value function. Generally, AVI algorithms implement an iterated procedure where each step consists of (i) an application of the Bellman operator and (ii) a projection step into a considered function space.
Notoriously, the Bellman operator leverages transition samples, which strongly determine its behavior, as uninformative samples can result in negligible updates or long detours, whose detrimental effects are further exacerbated by the computationally intensive projection step. To address these issues, we propose a novel alternative approach based on learning an approximate version of the Bellman operator rather than estimating it through samples as in AVI approaches. This way, we are able to (i) generalize across transition samples and (ii) avoid the computationally intensive projection step. For this reason, we call our novel operator projected Bellman operator (PBO). We formulate an optimization problem to learn PBO for generic sequential decision-making problems, and we theoretically analyze its properties in two representative classes of RL problems. Furthermore, we theoretically study our approach under the lens of AVI and devise algorithmic implementations to learn PBO in offline and online settings by leveraging neural network parameterizations. Finally, we empirically showcase the benefits of PBO w.r.t. the regular Bellman operator on several RL problems. \ No newline at end of file diff --git a/data/2024/aaai/Pareto Front-Diverse Batch Multi-Objective Bayesian Optimization b/data/2024/aaai/Pareto Front-Diverse Batch Multi-Objective Bayesian Optimization new file mode 100644 index 0000000000..4b42cff480 --- /dev/null +++ b/data/2024/aaai/Pareto Front-Diverse Batch Multi-Objective Bayesian Optimization @@ -0,0 +1 @@ +We consider the problem of multi-objective optimization (MOO) of expensive black-box functions with the goal of discovering high-quality and diverse Pareto fronts where we are allowed to evaluate a batch of inputs. This problem arises in many real-world applications including penicillin production where diversity of solutions is critical. We solve this problem in the framework of Bayesian optimization (BO) and propose a novel approach referred to as Pareto front-Diverse Batch Multi-Objective BO (PDBO). PDBO tackles two important challenges: 1) How to automatically select the best acquisition function in each BO iteration, and 2) How to select a diverse batch of inputs by considering multiple objectives. We propose principled solutions to address these two challenges. First, PDBO employs a multi-armed bandit approach to select one acquisition function from a given library. We solve a cheap MOO problem by assigning the selected acquisition function for each expensive objective function to obtain a candidate set of inputs for evaluation. Second, it utilizes Determinantal Point Processes (DPPs) to choose a Pareto-front-diverse batch of inputs for evaluation from the candidate set obtained from the first step. The key parameters for the methods behind these two steps are updated after each round of function evaluations. Experiments on multiple MOO benchmarks demonstrate that PDBO outperforms prior methods in terms of both the quality and diversity of Pareto solutions. 
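The DPP-based batch selection in the PDBO abstract above can be pictured as greedy log-determinant maximization over a similarity kernel, which favors mutually dissimilar batch members. The RBF kernel and greedy loop below are illustrative assumptions rather than the paper's exact kernel, which also encodes Pareto-front quality and diversity.

```python
import numpy as np

def rbf_kernel(X, gamma=10.0):
    # Similarity kernel over candidate inputs; similar points score close to 1.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def greedy_dpp_batch(X, batch_size, gamma=10.0, jitter=1e-6):
    """Greedy MAP-style DPP selection: repeatedly add the candidate that
    maximizes the log-determinant of the selected kernel submatrix."""
    K = rbf_kernel(X, gamma) + jitter * np.eye(len(X))
    selected = []
    for _ in range(batch_size):
        best_i, best_logdet = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_i, best_logdet = i, logdet
        selected.append(best_i)
    return selected

rng = np.random.default_rng(0)
candidates = rng.random((50, 2))   # stand-in for candidates from the acquisition step
print(greedy_dpp_batch(candidates, batch_size=5))
```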
\ No newline at end of file diff --git a/data/2024/aaai/Parsing All Adverse Scenes: Severity-Aware Semantic Segmentation with Mask-Enhanced Cross-Domain Consistency b/data/2024/aaai/Parsing All Adverse Scenes: Severity-Aware Semantic Segmentation with Mask-Enhanced Cross-Domain Consistency new file mode 100644 index 0000000000..4364750894 --- /dev/null +++ b/data/2024/aaai/Parsing All Adverse Scenes: Severity-Aware Semantic Segmentation with Mask-Enhanced Cross-Domain Consistency @@ -0,0 +1,2 @@ +Although recent methods in Unsupervised Domain Adaptation (UDA) have achieved success in segmenting rainy or snowy scenes by improving consistency, they face limitations when dealing with more challenging scenarios like foggy and night scenes. We argue that these prior methods excessively focus on weather-specific features in adverse scenes, which exacerbates the existing domain gaps. +To address this issue, we propose a new metric to evaluate the severity of all adverse scenes and offer a novel perspective that enables task unification across all adverse scenarios. Our method focuses on Severity, allowing our model to learn more consistent features and facilitate domain distribution alignment, thereby alleviating domain gaps. Unlike the vague descriptions of consistency in previous methods, we introduce Cross-domain Consistency, which is quantified using the Structural Similarity Index Measure (SSIM) to measure the distance between the source and target domains. Specifically, our unified model consists of two key modules: the Merging Style Augmentation Module (MSA) and the Severity Perception Mask Module (SPM). The MSA module transforms all adverse scenes into augmented scenes, effectively eliminating weather-specific features and enhancing Cross-domain Consistency. The SPM module incorporates a Severity Perception mechanism, guiding a Mask operation that enables our model to learn highly consistent features from the augmented scenes. Our unified framework, named PASS (Parsing All adverSe Scenes), achieves significant performance improvements over state-of-the-art methods on widely-used benchmarks for all adverse scenes. Notably, the performance of PASS is superior to Semi-Unified models and even surpasses weather-specific models. \ No newline at end of file diff --git a/data/2024/aaai/Partial Multi-View Clustering via Self-Supervised Network b/data/2024/aaai/Partial Multi-View Clustering via Self-Supervised Network new file mode 100644 index 0000000000..ce63def494 --- /dev/null +++ b/data/2024/aaai/Partial Multi-View Clustering via Self-Supervised Network @@ -0,0 +1,2 @@ +Partial multi-view clustering is a challenging and practical research problem for data analysis in real-world applications, due to the potential missing-data issue in different views. However, most existing methods have not fully explored the correlation information among various incomplete views. In addition, these existing clustering methods tend to overlook discriminative features inside the data itself in this unsupervised task. To tackle these challenges, we propose Partial Multi-View Clustering via Self-Supervised Network (PVC-SSN) in this paper. +Specifically, we employ contrastive learning to obtain a more discriminative and consistent subspace representation, which is guided by a self-supervised module. Self-supervised learning can exploit effective cluster information through the data itself to guide the learning process of clustering tasks.
Thus, it can pull together embedding features from the same cluster and push apart those from different clusters. Extensive experiments on several benchmark datasets show that the proposed PVC-SSN method outperforms several state-of-the-art clustering methods. \ No newline at end of file diff --git a/data/2024/aaai/Partially Observable Hierarchical Reinforcement Learning with AI Planning (Student Abstract) b/data/2024/aaai/Partially Observable Hierarchical Reinforcement Learning with AI Planning (Student Abstract) new file mode 100644 index 0000000000..b4046fba00 --- /dev/null +++ b/data/2024/aaai/Partially Observable Hierarchical Reinforcement Learning with AI Planning (Student Abstract) @@ -0,0 +1 @@ +Partially observable Markov decision processes (POMDPs) challenge reinforcement learning agents due to incomplete knowledge of the environment. Even assuming monotonicity in uncertainty, it is difficult for an agent to know how and when to stop exploring for a given task. In this abstract, we discuss how to use hierarchical reinforcement learning (HRL) and AI Planning (AIP) to improve exploration when the agent knows possible valuations of unknown predicates and how to discover them. By encoding the uncertainty in an abstract planning model, the agent can derive a high-level plan which is then used to decompose the overall POMDP into a tree of semi-POMDPs for training. We evaluate our agent's performance on the MiniGrid domain and show how guided exploration may improve agent performance. \ No newline at end of file diff --git a/data/2024/aaai/Participation Incentives in Approval-Based Committee Elections b/data/2024/aaai/Participation Incentives in Approval-Based Committee Elections new file mode 100644 index 0000000000..262fa552c2 --- /dev/null +++ b/data/2024/aaai/Participation Incentives in Approval-Based Committee Elections @@ -0,0 +1,16 @@ +In approval-based committee (ABC) voting, the goal is to +choose a subset of predefined size of the candidates based on +the voters’ approval preferences over the candidates. While +this problem has attracted significant attention in recent years, +the incentives for voters to participate in an election for a +given ABC voting rule have been neglected so far. This paper +is thus the first to explicitly study this property, typically called +participation, for ABC voting rules. In particular, we show +that all ABC scoring rules even satisfy group participation, +whereas most sequential rules severely fail participation. We +furthermore explore several escape routes to the impossibility +for sequential ABC voting rules: we prove for many sequential +rules that (i) they satisfy participation on laminar profiles, (ii) +voters who approve none of the elected candidates cannot +benefit by abstaining, and (iii) it is NP-hard for a voter to +decide whether she benefits from abstaining. \ No newline at end of file diff --git a/data/2024/aaai/Pass-Efficient Algorithms for Graph Spectral Clustering (Student Abstract) b/data/2024/aaai/Pass-Efficient Algorithms for Graph Spectral Clustering (Student Abstract) new file mode 100644 index 0000000000..d527bca59c --- /dev/null +++ b/data/2024/aaai/Pass-Efficient Algorithms for Graph Spectral Clustering (Student Abstract) @@ -0,0 +1,2 @@ +Graph spectral clustering is a fundamental technique in data analysis, which utilizes eigenpairs of the Laplacian matrix to partition graph vertices into clusters.
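For context, the classical pipeline that the Pass-Efficient Algorithms abstract refers to (eigendecomposition of the graph Laplacian followed by a split on the leading eigenvectors) can be sketched as follows; the toy two-clique graph and the two-way sign split are illustrative only.

```python
import numpy as np

def two_way_spectral_cut(adj):
    """Classical spectral bipartitioning: eigendecompose the normalized graph
    Laplacian and split vertices by the sign of the Fiedler (2nd) eigenvector."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)                   # the cubic-time step
    fiedler = eigvecs[:, 1]                                # 2nd smallest eigenvalue
    return (fiedler > 0).astype(int)

# Two 4-cliques connected by a single edge should split into their cliques.
adj = np.zeros((8, 8))
adj[:4, :4] = 1
adj[4:, 4:] = 1
np.fill_diagonal(adj, 0)
adj[3, 4] = adj[4, 3] = 1
print(two_way_spectral_cut(adj))
```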
However, classical spectral clustering algorithms require eigendecomposition of the Laplacian matrix, which has cubic time complexity. In +this work, we describe pass-efficient spectral clustering algorithms that leverage recent advances in randomized eigendecomposition and the structure of the graph vertex-edge matrix. Furthermore, we derive formulas for their efficient implementation. The resulting algorithms have a linear time complexity with respect to the number of vertices and edges and pass over the graph a constant number of times, making them suitable for processing large graphs stored on slow memory. Experiments validate the accuracy and efficiency of the algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Patch-Aware Sample Selection for Efficient Masked Image Modeling b/data/2024/aaai/Patch-Aware Sample Selection for Efficient Masked Image Modeling new file mode 100644 index 0000000000..587ca20583 --- /dev/null +++ b/data/2024/aaai/Patch-Aware Sample Selection for Efficient Masked Image Modeling @@ -0,0 +1 @@ +Nowadays, sample selection is drawing increasing attention. By extracting and training only on the most informative subset, sample selection can effectively reduce the training cost. Although sample selection is effective in conventional supervised learning, applying it to Masked Image Modeling (MIM) still poses challenges due to the gap between sample-level selection and patch-level pre-training. In this paper, we inspect sample selection in MIM pre-training and find that basic selection suffers from performance degradation. We attribute this degradation primarily to two factors: the random mask strategy and the simple averaging function. We then propose Patch-Aware Sample Selection (PASS), including a low-cost Dynamic Trained Mask Predictor (DTMP) and Weighted Selection Score (WSS). DTMP consistently masks the informative patches in samples, ensuring a relatively accurate representation of the selection score. WSS enhances the selection score using patch-level disparity. Extensive experiments show the effectiveness of PASS in selecting the most informative subset and accelerating pretraining. PASS exhibits superior performance across various datasets, MIM methods, and downstream tasks. Particularly, PASS improves MAE by 0.7% on ImageNet-1K while utilizing only 37% of the data budget, and achieves a ~1.7x speedup. \ No newline at end of file diff --git a/data/2024/aaai/Patch-Wise Graph Contrastive Learning for Image Translation b/data/2024/aaai/Patch-Wise Graph Contrastive Learning for Image Translation new file mode 100644 index 0000000000..66e0b987ac --- /dev/null +++ b/data/2024/aaai/Patch-Wise Graph Contrastive Learning for Image Translation @@ -0,0 +1 @@ +Recently, patch-wise contrastive learning is drawing attention for image translation by exploring the semantic correspondence between the input image and the output image. To further explore the patch-wise topology for high-level semantic understanding, here we exploit the graph neural network to capture the topology-aware features. Specifically, we construct the graph based on the patch-wise similarity from a pretrained encoder, whose adjacency matrix is shared to enhance the consistency of patch-wise relation between the input and the output. Then, we obtain the node feature from the graph neural network, and enhance the correspondence between the nodes by increasing mutual information using the contrastive loss. In order to capture the hierarchical semantic structure, we further propose the graph pooling.
Experimental results demonstrate state-of-the-art results for image translation thanks to the semantic encoding provided by the constructed graphs. \ No newline at end of file diff --git a/data/2024/aaai/PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology b/data/2024/aaai/PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology new file mode 100644 index 0000000000..6c1a988880 --- /dev/null +++ b/data/2024/aaai/PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology @@ -0,0 +1 @@ +As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, offering significant applications in interpreting natural images. However, the field of pathology has largely remained untapped, particularly in gathering high-quality data and designing comprehensive model frameworks. To bridge the gap in pathology MLLMs, we present PathAsst, a multimodal generative foundation AI assistant to revolutionize diagnostic and predictive analytics in pathology. The development of PathAsst involves three pivotal steps: data acquisition, CLIP model adaptation, and the training of PathAsst's multimodal generative capabilities. Firstly, we collect over 207K high-quality pathology image-text pairs from authoritative sources. Leveraging the advanced power of ChatGPT, we generate over 180K instruction-following samples. Furthermore, we devise additional instruction-following data specifically tailored for invoking eight pathology-specific sub-models we prepared, allowing PathAsst to effectively collaborate with these models, enhancing its diagnostic ability. Secondly, by leveraging the collected data, we construct PathCLIP, a pathology-dedicated CLIP, to enhance PathAsst's capabilities in interpreting pathology images. Finally, we integrate PathCLIP with Vicuna-13b and utilize pathology-specific instruction-tuning data to enhance the multimodal generation capacity of PathAsst and bolster its synergistic interactions with sub-models. The experimental results of PathAsst show the potential of harnessing an AI-powered generative foundation model to improve pathology diagnosis and treatment processes. We open-source our dataset, as well as a comprehensive toolkit for extensive pathology data collection and preprocessing, at https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology. \ No newline at end of file diff --git a/data/2024/aaai/Paths, Proofs, and Perfection: Developing a Human-Interpretable Proof System for Constrained Shortest Paths b/data/2024/aaai/Paths, Proofs, and Perfection: Developing a Human-Interpretable Proof System for Constrained Shortest Paths new file mode 100644 index 0000000000..86d81cb70f --- /dev/null +++ b/data/2024/aaai/Paths, Proofs, and Perfection: Developing a Human-Interpretable Proof System for Constrained Shortest Paths @@ -0,0 +1 @@ +People want to rely on optimization algorithms for complex decisions, but verifying the optimality of the solutions can then become a valid concern, particularly for critical decisions taken by non-experts in optimization. One example is the shortest-path problem on a network, occurring in many contexts from transportation to logistics to telecommunications.
While the standard shortest-path problem is both solvable in polynomial time and certifiable by duality, introducing side constraints makes solving and certifying the solutions much harder. We propose a proof system for constrained shortest-path problems, which gives a set of logical rules to derive new facts about feasible solutions. The key trait of the proposed proof system is that it specifically includes high-level graph concepts within its reasoning steps (such as connectivity or path structure), in contrast to, e.g., using linear combinations of model constraints. Thus, using our proof system, we can provide a step-by-step, human-auditable explanation showing that the path given by an external solver cannot be improved. Additionally, to maximize the advantages of this setup, we propose a proof search procedure that specifically aims to find small proofs of this form using a procedure similar to A* search. We evaluate our proof system on constrained shortest path instances generated from real-world road networks and experimentally show that we may indeed derive more interpretable proofs compared to an integer programming approach, in some cases leading to much smaller proofs. \ No newline at end of file diff --git a/data/2024/aaai/Pay Attention to Target: Relation-Aware Temporal Consistency for Domain Adaptive Video Semantic Segmentation b/data/2024/aaai/Pay Attention to Target: Relation-Aware Temporal Consistency for Domain Adaptive Video Semantic Segmentation new file mode 100644 index 0000000000..0afbe772c9 --- /dev/null +++ b/data/2024/aaai/Pay Attention to Target: Relation-Aware Temporal Consistency for Domain Adaptive Video Semantic Segmentation @@ -0,0 +1 @@ +Video semantic segmentation has made conspicuous achievements thanks to the development of deep learning, but suffers from the labor-intensive gathering of annotated training data. To alleviate the data-hunger issue, domain adaptation approaches are developed in the hope of adapting the model trained on the labeled synthetic videos to the real videos in the absence of annotations. By analyzing the dominant paradigm, consistency regularization, in the domain adaptation task, we find that the bottlenecks exist in previous methods from the perspective of pseudo-labels. To take full advantage of the information contained in the pseudo-labels and empower more effective supervision signals, we propose a coherent PAT network including a target domain focalizer and relation-aware temporal consistency. The proposed PAT network enjoys several merits. First, the target domain focalizer is responsible for paying attention to the target domain, and increasing the accessibility of pseudo-labels in consistency training. Second, the relation-aware temporal consistency aims at modeling the inter-class consistent relationship across frames to equip the model with effective supervision signals. Extensive experiments on two challenging benchmarks demonstrate that our method performs favorably against state-of-the-art domain adaptive video semantic segmentation methods.
\ No newline at end of file diff --git a/data/2024/aaai/Peer Learning: Learning Complex Policies in Groups from Scratch via Action Recommendations b/data/2024/aaai/Peer Learning: Learning Complex Policies in Groups from Scratch via Action Recommendations new file mode 100644 index 0000000000..56c88ef0bb --- /dev/null +++ b/data/2024/aaai/Peer Learning: Learning Complex Policies in Groups from Scratch via Action Recommendations @@ -0,0 +1,2 @@ +Peer learning is a novel high-level reinforcement learning framework for agents learning in groups. While standard reinforcement learning trains an individual agent in a trial-and-error fashion, all on its own, peer learning addresses a related setting in which a group of agents, i.e., peers, learns to master a task together from scratch. Peers are allowed to communicate only about their own states and actions recommended by others: "What would you do in my situation?". Our motivation is to study the learning behavior of these agents. +We formalize the teacher selection process in the action advice setting as a multi-armed bandit problem and thereby highlight the need for exploration. Finally, we analyze the learning behavior of the peers and observe their ability to rank the agents' performance within the study group and to understand which agents give reliable advice. Further, we compare peer learning with single-agent learning and a state-of-the-art action advice baseline. We show that peer learning is able to outperform single-agent learning and the baseline in several challenging discrete and continuous OpenAI Gym domains. In doing so, we also show that, within such a framework, complex policies can evolve from action recommendations even beyond discrete action spaces. \ No newline at end of file diff --git a/data/2024/aaai/PerFedRLNAS: One-for-All Personalized Federated Neural Architecture Search b/data/2024/aaai/PerFedRLNAS: One-for-All Personalized Federated Neural Architecture Search new file mode 100644 index 0000000000..8056337972 --- /dev/null +++ b/data/2024/aaai/PerFedRLNAS: One-for-All Personalized Federated Neural Architecture Search @@ -0,0 +1 @@ +Personalized federated learning is a new paradigm to address heterogeneous problems (e.g., issues with non-i.i.d. data) in federated learning. However, existing personalized federated learning methods lack standards for how the personalized and shared parts of the models are designed. Sometimes, manual design can even lead to worse performance than non-personalization. As a result, we propose a new algorithm for personalized federated neural architecture search, called PerFedRLNAS, to automatically personalize the architectures and weights of models on each client. With such an algorithm, we can solve the issues of low efficiency and failure to adapt to new search spaces in previous federated neural architecture search work. We further show that automatically assigning different architectures to clients can address heterogeneity in data distribution, efficiency, and memory in federated learning. In our experiments, we empirically show that our framework achieves much better personalized accuracy and overall time than state-of-the-art methods. Furthermore, PerFedRLNAS generalizes well to new clients and is easy to deploy in practice.
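The peer-learning abstract above frames teacher selection for action advice as a multi-armed bandit; a minimal UCB1 sketch of that selection step is given below. The class name, exploration constant, and synthetic advice-quality values are illustrative assumptions, not the paper's algorithm.

```python
import math
import random

class UCBTeacherSelector:
    """Treat each peer as an arm of a multi-armed bandit.

    When an agent asks "What would you do in my situation?", it picks
    the peer to follow via UCB1, trading off peers whose past advice
    yielded high return against rarely consulted peers (exploration).
    """

    def __init__(self, n_peers, c=2.0):
        self.counts = [0] * n_peers
        self.values = [0.0] * n_peers   # running mean of advice quality
        self.c = c

    def select(self):
        # consult every peer at least once before applying UCB
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts) + 1
        ucb = [v + self.c * math.sqrt(math.log(total) / n)
               for v, n in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, peer, reward):
        self.counts[peer] += 1
        self.values[peer] += (reward - self.values[peer]) / self.counts[peer]

# toy usage: peer 2 gives the best advice on average
selector = UCBTeacherSelector(n_peers=3)
true_quality = [0.2, 0.5, 0.8]
for _ in range(500):
    peer = selector.select()
    selector.update(peer, random.random() < true_quality[peer])
print(selector.counts)  # most queries should go to peer 2
```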
\ No newline at end of file diff --git a/data/2024/aaai/Percentile Risk-Constrained Budget Pacing for Guaranteed Display Advertising in Online Optimization b/data/2024/aaai/Percentile Risk-Constrained Budget Pacing for Guaranteed Display Advertising in Online Optimization new file mode 100644 index 0000000000..7dde3b25be --- /dev/null +++ b/data/2024/aaai/Percentile Risk-Constrained Budget Pacing for Guaranteed Display Advertising in Online Optimization @@ -0,0 +1 @@ +Guaranteed display (GD) advertising is a critical component of advertising since it provides publishers with stable revenue and enables advertisers to target specific audiences with guaranteed impressions. However, smooth pacing control for online ad delivery presents a challenge due to significant budget disparities, user arrival distribution drift, and dynamic change between supply and demand. This paper presents robust risk-constrained pacing (RCPacing) that utilizes Lagrangian dual multipliers to fine-tune probabilistic throttling through monotonic mapping functions within the percentile space of impression performance distribution. RCPacing combines distribution drift resilience and compatibility with guaranteed allocation mechanism, enabling us to provide near-optimal online services. We also show that RCPacing achieves O(sqrt(T)) dynamic regret where T is the length of the horizon. RCPacing's effectiveness is validated through offline evaluations and online A/B testing conducted on Taobao brand advertising platform. \ No newline at end of file diff --git a/data/2024/aaai/Performative Federated Learning: A Solution to Model-Dependent and Heterogeneous Distribution Shifts b/data/2024/aaai/Performative Federated Learning: A Solution to Model-Dependent and Heterogeneous Distribution Shifts new file mode 100644 index 0000000000..494946466d --- /dev/null +++ b/data/2024/aaai/Performative Federated Learning: A Solution to Model-Dependent and Heterogeneous Distribution Shifts @@ -0,0 +1,3 @@ +We consider a federated learning (FL) system consisting of multiple clients and a server, where the clients aim to collaboratively learn a common decision model from their distributed data. Unlike the conventional FL framework that assumes the client's data is static, we consider scenarios where the clients' data distributions may be reshaped by the deployed decision model. In this work, we leverage the idea of distribution shift mappings in performative prediction to formalize this model-dependent data distribution shift and propose a performative FL framework. +We first introduce necessary and sufficient conditions for the existence of a unique performative stable solution and characterize its distance to the performative optimal solution. Then we propose the performative FedAvg algorithm and show that it converges to the performative stable solution at a rate of O(1/T) under both full and partial participation schemes. +In particular, we use novel proof techniques and show how the clients' heterogeneity influences the convergence. Numerical results validate our analysis and provide valuable insights into real-world applications. 
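As a toy illustration of the performative FedAvg idea described above, the sketch below runs FedAvg on a one-dimensional mean-estimation problem where each client's data distribution shifts linearly with the deployed model. The linear shift, step sizes, and client means are illustrative assumptions; the loop converges to the performative stable point of this toy problem and does not reproduce the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def client_data(theta, base_mean, eps, n=256):
    """Client data whose distribution shifts with the deployed model:
    the mean moves by eps * theta (a simple performative response)."""
    return rng.normal(base_mean + eps * theta, 1.0, size=n)

def local_update(theta, data, lr=0.1, steps=10):
    """A few local gradient steps on the squared loss (theta - x)^2."""
    for _ in range(steps):
        theta -= lr * 2.0 * np.mean(theta - data)
    return theta

base_means = [1.0, 2.0, 3.0]   # heterogeneous clients
eps = 0.3                      # strength of the performative shift
theta = 0.0                    # deployed global model
for rnd in range(50):
    # each round, clients draw fresh data that reacts to the deployed model
    locals_ = [local_update(theta, client_data(theta, m, eps)) for m in base_means]
    theta = float(np.mean(locals_))  # FedAvg aggregation

# the performative stable point of this toy problem solves
# theta = mean(base_means) + eps * theta
print(theta, np.mean(base_means) / (1 - eps))
```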
\ No newline at end of file diff --git a/data/2024/aaai/Permutation-Based Hypothesis Testing for Neural Networks b/data/2024/aaai/Permutation-Based Hypothesis Testing for Neural Networks new file mode 100644 index 0000000000..413cda6fd2 --- /dev/null +++ b/data/2024/aaai/Permutation-Based Hypothesis Testing for Neural Networks @@ -0,0 +1 @@ +Neural networks are powerful predictive models, but they provide little insight into the nature of relationships between predictors and outcomes. Although numerous methods have been proposed to quantify the relative contributions of input features, statistical inference and hypothesis testing of feature associations remain largely unexplored. We propose a permutation-based approach to testing that uses the partial derivatives of the network output with respect to specific inputs to assess both the significance of input features and whether significant features are linearly associated with the network output. These tests, which can be flexibly applied to a variety of network architectures, enhance the explanatory power of neural networks, and combined with powerful predictive capability, extend the applicability of these models. \ No newline at end of file diff --git a/data/2024/aaai/Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models b/data/2024/aaai/Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models new file mode 100644 index 0000000000..356700252f --- /dev/null +++ b/data/2024/aaai/Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models @@ -0,0 +1 @@ +Although recent personalization methods have democratized high-resolution image synthesis by enabling swift concept acquisition with minimal examples and lightweight computation, they also present an exploitable avenue for highly accessible backdoor attacks. This paper investigates a critical and unexplored aspect of text-to-image (T2I) diffusion models - their potential vulnerability to backdoor attacks via personalization. By studying the prompt processing of popular personalization methods (epitomized by Textual Inversion and DreamBooth), we have devised dedicated personalization-based backdoor attacks according to the different ways of dealing with unseen tokens and divide them into two families: nouveau-token and legacy-token backdoor attacks. In comparison to conventional backdoor attacks involving the fine-tuning of the entire text-to-image diffusion model, our proposed personalization-based backdoor attack method can facilitate more tailored, efficient, and few-shot attacks. Through comprehensive empirical study, we endorse the utilization of the nouveau-token backdoor attack due to its impressive effectiveness, stealthiness, and integrity, markedly outperforming the legacy-token backdoor attack. \ No newline at end of file diff --git a/data/2024/aaai/Personalized LoRA for Human-Centered Text Understanding b/data/2024/aaai/Personalized LoRA for Human-Centered Text Understanding new file mode 100644 index 0000000000..db1e8898c0 --- /dev/null +++ b/data/2024/aaai/Personalized LoRA for Human-Centered Text Understanding @@ -0,0 +1 @@ +Effectively and efficiently adapting a pre-trained language model (PLM) for human-centered text understanding (HCTU) is challenging since user tokens are million-level in most personalized applications and do not have concrete explicit semantics. 
A standard and parameter-efficient approach (e.g., LoRA) necessitates memorizing numerous suites of adapters for each user. In this work, we introduce a personalized LoRA (PLoRA) with a plug-and-play (PnP) framework for the HCTU task. PLoRA is effective, parameter-efficient, and dynamically deployable in PLMs. Moreover, personalized dropout and mutual-information-maximizing strategies are adopted, and hence the proposed PLoRA can be well adapted to few/zero-shot learning scenarios to address the cold-start issue. Experiments conducted on four benchmark datasets show that the proposed method outperforms existing methods in full/few/zero-shot learning scenarios for the HCTU task, even though it has fewer trainable parameters. For reproducibility, the code for this paper is available at: https://github.com/yoyo-yun/PLoRA. \ No newline at end of file diff --git a/data/2024/aaai/Personalized Reinforcement Learning with a Budget of Policies b/data/2024/aaai/Personalized Reinforcement Learning with a Budget of Policies new file mode 100644 index 0000000000..26012c9ed6 --- /dev/null +++ b/data/2024/aaai/Personalized Reinforcement Learning with a Budget of Policies @@ -0,0 +1 @@ +Personalization in machine learning (ML) tailors models' decisions to the individual characteristics of users. While this approach has seen success in areas like recommender systems, its expansion into high-stakes fields such as healthcare and autonomous driving is hindered by the extensive regulatory approval processes involved. To address this challenge, we propose a novel framework termed represented Markov Decision Processes (r-MDPs) that is designed to balance the need for personalization with the regulatory constraints. In an r-MDP, we cater to a diverse user population, each with unique preferences, through interaction with a small set of representative policies. Our objective is twofold: efficiently match each user to an appropriate representative policy and simultaneously optimize these policies to maximize overall social welfare. We develop two deep reinforcement learning algorithms that efficiently solve r-MDPs. These algorithms draw inspiration from the principles of classic K-means clustering and are underpinned by robust theoretical foundations. Our empirical investigations, conducted across a variety of simulated environments, showcase the algorithms' ability to facilitate meaningful personalization even under constrained policy budgets. Furthermore, they demonstrate scalability, efficiently adapting to larger policy budgets. \ No newline at end of file diff --git a/data/2024/aaai/Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off b/data/2024/aaai/Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off new file mode 100644 index 0000000000..47e3a5965b --- /dev/null +++ b/data/2024/aaai/Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off @@ -0,0 +1 @@ +Neural ranking models (NRMs) have shown great success in information retrieval (IR). However, their predictions can easily be manipulated using adversarial examples, which are crafted by adding imperceptible perturbations to legitimate documents. This vulnerability raises significant concerns about their reliability and hinders the widespread deployment of NRMs.
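For the personalized LoRA abstract above, the sketch below shows one common way to attach per-user low-rank updates to a frozen linear layer; the class, shapes, and initialization are illustrative assumptions and do not reproduce PLoRA's plug-and-play framework, personalized dropout, or mutual-information objective.

```python
import torch
import torch.nn as nn

class PersonalizedLoRALinear(nn.Module):
    """A frozen linear layer plus per-user low-rank updates W + B_u A_u."""

    def __init__(self, d_in, d_out, n_users, rank=4, alpha=8.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # shared PLM weight stays frozen
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(n_users, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_users, d_out, rank))
        self.scale = alpha / rank

    def forward(self, x, user_id):
        # x: (batch, d_in); user_id: (batch,) long tensor
        delta = torch.einsum("bor,bri,bi->bo", self.B[user_id], self.A[user_id], x)
        return self.base(x) + self.scale * delta

layer = PersonalizedLoRALinear(d_in=16, d_out=8, n_users=100)
x = torch.randn(4, 16)
uid = torch.tensor([0, 3, 3, 42])
print(layer(x, uid).shape)  # torch.Size([4, 8])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)            # only the low-rank user adapters are trainable
```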
By incorporating adversarial examples into training data, adversarial training has become the de facto defense approach to adversarial attacks against NRMs. However, this defense mechanism is subject to a trade-off between effectiveness and adversarial robustness. In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs. We decompose the robust ranking error into two components, i.e., a natural ranking error for effectiveness evaluation and a boundary ranking error for assessing adversarial robustness. Then, we define the perturbation invariance of a ranking model and prove it to be a differentiable, computationally attainable upper bound on the boundary ranking error. Informed by our theoretical analysis, we design a novel perturbation-invariant adversarial training (PIAT) method for ranking models to achieve a better effectiveness-robustness trade-off. We design a regularized surrogate loss, in which one term encourages the effectiveness to be maximized while the regularization term encourages the output to be smooth, so as to improve adversarial robustness. Experimental results on several ranking models demonstrate the superiority of PIAT compared to existing adversarial defenses. \ No newline at end of file diff --git a/data/2024/aaai/Pharmacokinetics-Informed Neural Network for Predicting Opioid Administration Moments with Wearable Sensors b/data/2024/aaai/Pharmacokinetics-Informed Neural Network for Predicting Opioid Administration Moments with Wearable Sensors new file mode 100644 index 0000000000..3d520959dd --- /dev/null +++ b/data/2024/aaai/Pharmacokinetics-Informed Neural Network for Predicting Opioid Administration Moments with Wearable Sensors @@ -0,0 +1 @@ +Long-term and high-dose prescription opioid use places individuals at risk for opioid misuse, opioid use disorder (OUD), and overdose. Existing methods for monitoring opioid use and detecting misuse rely on self-reports, which are prone to reporting bias, and toxicology testing, which may be infeasible in outpatient settings. Although wearable technologies for monitoring day-to-day health metrics have gained significant traction in recent years due to their ease of use, flexibility, and advancements in sensor technology, their application within the opioid use space remains underexplored. In the current work, we demonstrate that oral opioid administrations can be detected using physiological signals collected from a wrist sensor. More importantly, we show that models informed by opioid pharmacokinetics increase reliability in predicting the timing of opioid administrations. Forty-two individuals who were prescribed opioids as a part of their medical treatment in-hospital and after discharge were enrolled. Participants wore a wrist sensor throughout the study, while opioid administrations were tracked using electronic medical records and self-reports. We collected 1,983 hours of sensor data containing 187 opioid administrations from the inpatient setting and 927 hours of sensor data containing 40 opioid administrations from the outpatient setting. We demonstrate that a self-supervised pre-trained model, capable of learning the canonical time series of plasma concentration of the drug derived from opioid pharmacokinetics, can reliably detect opioid administration in both settings. Our work suggests the potential of pharmacokinetic-informed, data-driven models to objectively detect opioid use in daily life.
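The pharmacokinetics-informed model above relies on a canonical plasma-concentration time series; a standard one-compartment oral-absorption curve of the kind such models use is sketched below (dose, rate constants, and volume of distribution are illustrative assumptions, not values from the study).

```python
import numpy as np

def plasma_concentration(t, dose=10.0, ka=1.2, ke=0.25, vd=50.0):
    """Standard one-compartment oral-absorption pharmacokinetic curve.

    C(t) = (D * ka) / (Vd * (ka - ke)) * (exp(-ke * t) - exp(-ka * t))
    with absorption rate ka, elimination rate ke, and volume of
    distribution Vd; t is hours since administration.
    """
    coeff = dose * ka / (vd * (ka - ke))
    return coeff * (np.exp(-ke * t) - np.exp(-ka * t))

# canonical target time series for the hours after a (hypothetical) dose
hours = np.linspace(0.0, 12.0, 49)
curve = plasma_concentration(hours)
print(float(hours[np.argmax(curve)]))  # time of the concentration peak
```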
\ No newline at end of file diff --git a/data/2024/aaai/Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion b/data/2024/aaai/Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion new file mode 100644 index 0000000000..239bf541ff --- /dev/null +++ b/data/2024/aaai/Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion @@ -0,0 +1 @@ +Voice conversion (VC) aims at altering a person's voice to make it sound similar to the voice of another person while preserving linguistic content. Existing methods suffer from a dilemma between content intelligibility and speaker similarity; i.e., methods with higher intelligibility usually have a lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose a novel method Phoneme Hallucinator that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target speaker phonemes based just on a short target speaker voice (e.g. 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model that requires no text annotations and supports conversion to any unseen speaker. Quantitative and qualitative evaluations show that Phoneme Hallucinator outperforms existing VC methods for both intelligibility and speaker similarity. \ No newline at end of file diff --git a/data/2024/aaai/Physics-Informed Graph Neural Networks for Water Distribution Systems b/data/2024/aaai/Physics-Informed Graph Neural Networks for Water Distribution Systems new file mode 100644 index 0000000000..1950e43fba --- /dev/null +++ b/data/2024/aaai/Physics-Informed Graph Neural Networks for Water Distribution Systems @@ -0,0 +1 @@ +Water distribution systems (WDS) are an integral part of critical infrastructure which is pivotal to urban development. As 70% of the world's population will likely live in urban environments in 2050, efficient simulation and planning tools for WDS play a crucial role in reaching UN's sustainable developmental goal (SDG) 6 - "Clean water and sanitation for all". In this realm, we propose a novel and efficient machine learning emulator, more precisely, a physics-informed deep learning (DL) model, for hydraulic state estimation in WDS. Using a recursive approach, our model only needs a few graph convolutional neural network (GCN) layers and employs an innovative algorithm based on message passing. Unlike conventional machine learning tasks, the model uses hydraulic principles to infer two additional hydraulic state features in the process of reconstructing the available ground truth feature in an unsupervised manner. To the best of our knowledge, this is the first DL approach to emulate the popular hydraulic simulator EPANET, utilizing no additional information. Like most DL models and unlike the hydraulic simulator, our model demonstrates vastly faster emulation times that do not increase drastically with the size of the WDS. Moreover, we achieve high accuracy on the ground truth and very similar results compared to the hydraulic simulator as demonstrated through experiments on five real-world WDS datasets. 
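The water-distribution abstract above describes recursive message passing over the pipe network with only a few graph-convolution layers; the sketch below shows a generic normalized graph-convolution round on a toy network. The graph, feature choices, and weights are illustrative assumptions, and the physics-informed reconstruction of the hydraulic features is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy water-distribution network: junctions as nodes, pipes as edges.
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
n_nodes = 5
A = np.zeros((n_nodes, n_nodes))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_hat = A + np.eye(n_nodes)                # self-loops
d = A_hat.sum(axis=1)
A_norm = A_hat / np.sqrt(np.outer(d, d))   # symmetric normalization

def gcn_layer(h, w):
    """One message-passing round: average neighbor states, mix, ReLU."""
    return np.maximum(A_norm @ h @ w, 0.0)

# Node inputs, e.g. demand plus a rough head estimate at each junction.
h = rng.normal(size=(n_nodes, 2))
w1 = 0.1 * rng.normal(size=(2, 16))
w2 = 0.1 * rng.normal(size=(16, 16))
w_out = 0.1 * rng.normal(size=(16, 1))

# A few recursive rounds let information travel several pipes away,
# after which a linear readout predicts one hydraulic state per node.
h = gcn_layer(gcn_layer(h, w1), w2)
pred = h @ w_out
print(pred.shape)   # (5, 1), e.g. one predicted head per junction
```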
\ No newline at end of file diff --git a/data/2024/aaai/Physics-Informed Representation and Learning: Control and Risk Quantification b/data/2024/aaai/Physics-Informed Representation and Learning: Control and Risk Quantification new file mode 100644 index 0000000000..6833bb943a --- /dev/null +++ b/data/2024/aaai/Physics-Informed Representation and Learning: Control and Risk Quantification @@ -0,0 +1,2 @@ +Optimal and safety-critical control are fundamental problems for stochastic systems, and are widely considered in real-world scenarios such as robotic manipulation and autonomous driving. In this paper, we consider the problem of efficiently finding optimal and safe control for high-dimensional systems. Specifically, we propose to use dimensionality reduction techniques from a comparison theorem for stochastic differential equations together with a generalizable physics-informed neural network to estimate the optimal value function and the safety probability of the system. The proposed framework results in substantial sample efficiency improvement compared to existing methods. We further develop an autoencoder-like neural network to automatically identify the low-dimensional features in the system to enhance the ease of design for system integration. We also provide experiments and quantitative analysis to validate the efficacy of the proposed method. +Source code is available at https://github.com/jacobwang925/path-integral-PINN. \ No newline at end of file diff --git a/data/2024/aaai/Piecewise Linear Transformation - Propagating Aleatoric Uncertainty in Neural Networks b/data/2024/aaai/Piecewise Linear Transformation - Propagating Aleatoric Uncertainty in Neural Networks new file mode 100644 index 0000000000..32ff4861b1 --- /dev/null +++ b/data/2024/aaai/Piecewise Linear Transformation - Propagating Aleatoric Uncertainty in Neural Networks @@ -0,0 +1 @@ +Real-world data typically exhibit aleatoric uncertainty which has to be considered during data-driven decision-making to assess the confidence of the decision provided by machine learning models. To propagate aleatoric uncertainty represented by probability distributions (PDs) through neural networks (NNs), both sampling-based and function approximation-based methods have been proposed. However, these methods suffer from significant approximation errors and are not able to accurately represent predictive uncertainty in the NN output. In this paper, we present a novel method, Piecewise Linear Transformation (PLT), for propagating PDs through NNs with piecewise linear activation functions (e.g., ReLU NNs). PLT does not require sampling or specific assumptions about the PDs. Instead, it harnesses the piecewise linear structure of such NNs to determine the propagated PD in the output space. In this way, PLT supports the accurate quantification of predictive uncertainty based on the criterion exactness of the propagated PD. We assess this exactness in theory by showing error bounds for our propagated PD. Further, our experimental evaluation validates that PLT outperforms competing methods on publicly available real-world classification and regression datasets regarding exactness. Thus, the PDs propagated by PLT allow to assess the uncertainty of the provided decisions, offering valuable support. 
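The PLT abstract above contrasts its exact propagation with sampling-based propagation of aleatoric uncertainty; a minimal Monte-Carlo baseline of that kind for a small ReLU network is sketched below, assuming a Gaussian input distribution (the network sizes and input covariance are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_net(x, w1, b1, w2, b2):
    """A small ReLU network: the kind of piecewise-linear map PLT targets."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

w1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

# Aleatoric input uncertainty represented as a Gaussian over the 2-D input.
mu, cov = np.array([0.5, -0.2]), np.diag([0.05, 0.02])

# Sampling-based propagation (the baseline PLT is compared against):
# draw inputs, push them through the network, summarize the output PD.
samples = rng.multivariate_normal(mu, cov, size=100_000)
out = relu_net(samples, w1, b1, w2, b2)
print(out.mean(), out.std())   # Monte-Carlo estimate of the propagated PD
```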
\ No newline at end of file diff --git a/data/2024/aaai/Plug-In Diffusion Model for Sequential Recommendation b/data/2024/aaai/Plug-In Diffusion Model for Sequential Recommendation new file mode 100644 index 0000000000..f314da9d18 --- /dev/null +++ b/data/2024/aaai/Plug-In Diffusion Model for Sequential Recommendation @@ -0,0 +1 @@ +Pioneering efforts have verified the effectiveness of diffusion models in exploring the informative uncertainty for recommendation. Considering the difference between recommendation and image synthesis tasks, existing methods have undertaken tailored refinements to the diffusion and reverse process. However, these approaches typically use the highest-scoring item in the corpus for user interest prediction, neglecting the user's generalized preferences contained within other items and thereby remaining constrained by the data sparsity issue. To address this issue, this paper presents a novel Plug-in Diffusion Model for Recommendation (PDRec) framework, which employs the diffusion model as a flexible plugin to jointly take full advantage of the diffusion-generated user preferences on all items. Specifically, PDRec first infers the users' dynamic preferences on all items via a time-interval diffusion model and proposes a Historical Behavior Reweighting (HBR) mechanism to identify the high-quality behaviors and suppress noisy behaviors. In addition to the observed items, PDRec proposes a Diffusion-based Positive Augmentation (DPA) strategy to leverage the top-ranked unobserved items as the potential positive samples, bringing in informative and diverse soft signals to alleviate data sparsity. To alleviate the false negative sampling issue, PDRec employs Noise-free Negative Sampling (NNS) to select stable negative samples for ensuring effective model optimization. Extensive experiments and analyses on four datasets have verified the superiority of the proposed PDRec over the state-of-the-art baselines and showcased the universality of PDRec as a flexible plugin for commonly-used sequential encoders in different recommendation scenarios. The code is available at https://github.com/hulkima/PDRec. \ No newline at end of file diff --git a/data/2024/aaai/PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in Poetry Generation b/data/2024/aaai/PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in Poetry Generation new file mode 100644 index 0000000000..7c88e61bea --- /dev/null +++ b/data/2024/aaai/PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in Poetry Generation @@ -0,0 +1,2 @@ +Controllable text generation is a challenging and meaningful field in natural language generation (NLG). In particular, poetry generation is a typical task with well-defined and strict conditions on the generated text, which makes it an ideal playground for the assessment of current methodologies. While prior works succeeded in controlling either semantic or metrical aspects of poetry generation, simultaneously addressing both remains a challenge. In this paper, we pioneer the use of the Diffusion model for generating sonnets and Chinese SongCi poetry to tackle such challenges. In terms of semantics, our PoetryDiffusion model, built upon the Diffusion model, generates entire sentences or poetry by comprehensively considering the entirety of sentence information. This approach enhances semantic expression, distinguishing it from autoregressive and large language models (LLMs).
For metrical control, its constraint control module, which can be trained individually, enables us to flexibly incorporate a novel metrical controller to manipulate and evaluate metrics (format and rhythm). +The denoising process in PoetryDiffusion allows for the gradual enhancement of semantics and flexible integration of the metrical controller, which can calculate and impose penalties on states that stray significantly from the target control distribution. Experimental results on two datasets demonstrate that our model outperforms existing models in terms of automatic evaluation of semantic, metrical, and overall performance as well as human evaluation. Code is released at https://github.com/ChorlingLau/PoetryDiffusion. \ No newline at end of file diff --git "a/data/2024/aaai/Poincar\303\251 Differential Privacy for Hierarchy-Aware Graph Embedding" "b/data/2024/aaai/Poincar\303\251 Differential Privacy for Hierarchy-Aware Graph Embedding" new file mode 100644 index 0000000000..57cc5e4b34 --- /dev/null +++ "b/data/2024/aaai/Poincar\303\251 Differential Privacy for Hierarchy-Aware Graph Embedding" @@ -0,0 +1 @@ +Hierarchy is an important and commonly observed topological property in real-world graphs that indicates the relationships between supervisors and subordinates or the organizational behavior of human groups. As hierarchy is introduced as a new inductive bias into Graph Neural Networks (GNNs) in various tasks, it implies latent topological relations that attackers can exploit to improve their inference attack performance, leading to serious privacy leakage issues. In addition, existing privacy-preserving frameworks suffer from reduced protection ability in hierarchical propagation due to the deficiency of adaptive upper-bound estimation of the hierarchical perturbation boundary. It is of great urgency to effectively leverage the hierarchical property of data while satisfying privacy guarantees. To solve this problem, we propose the Poincar\'e Differential Privacy framework, named PoinDP, to protect the hierarchy-aware graph embedding based on hyperbolic geometry. Specifically, PoinDP first learns the hierarchy weights for each entity based on the Poincar\'e model in hyperbolic space. Then, the Personalized Hierarchy-aware Sensitivity is designed to measure the sensitivity of the hierarchical structure and adaptively allocate the privacy protection strength. Besides, the Hyperbolic Gaussian Mechanism (HGM) is proposed to extend the Gaussian mechanism in Euclidean space to hyperbolic space to realize random perturbations that satisfy differential privacy under the hyperbolic space metric. Extensive experimental results on five real-world datasets demonstrate the proposed PoinDP’s advantages of effective privacy protection while maintaining good performance on the node classification task. \ No newline at end of file diff --git a/data/2024/aaai/Point Cloud Part Editing: Segmentation, Generation, Assembly, and Selection b/data/2024/aaai/Point Cloud Part Editing: Segmentation, Generation, Assembly, and Selection new file mode 100644 index 0000000000..7887be3690 --- /dev/null +++ b/data/2024/aaai/Point Cloud Part Editing: Segmentation, Generation, Assembly, and Selection @@ -0,0 +1 @@ +Ideal part editing should guarantee the diversity of edited parts, the fidelity to the remaining parts, and the quality of the results. However, previous methods do not disentangle each part completely, which means the edited parts will affect the others, resulting in poor diversity and fidelity.
In addition, some methods lack constraints between parts, which necessitates manual selection of edited results to ensure quality. Therefore, we propose a four-stage process for point cloud part editing: Segmentation, Generation, Assembly, and Selection. Based on this process, we introduce SGAS, a model for part editing that employs two strategies: feature disentanglement and constraint. By independently fitting part-level feature distributions, we realize the feature disentanglement. By explicitly modeling the transformation from object-level distribution to part-level distributions, we realize the feature constraint. Extensive experiments on different datasets demonstrate the efficiency and effectiveness of SGAS on point cloud part editing. In addition, SGAS can be pruned to realize unsupervised part-aware point cloud generation and achieves state-of-the-art results. \ No newline at end of file diff --git a/data/2024/aaai/Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images b/data/2024/aaai/Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images new file mode 100644 index 0000000000..6378e668b6 --- /dev/null +++ b/data/2024/aaai/Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images @@ -0,0 +1 @@ +Directly predicting human epidermal growth factor receptor 2 (HER2) status from widely available hematoxylin and eosin (HE)-stained whole slide images (WSIs) can reduce technical costs and expedite treatment selection. Accurately predicting HER2 requires large collections of multi-site WSIs. Federated learning enables collaborative training on these WSIs without transporting gigabyte-sized WSIs or raising data privacy concerns. However, federated learning encounters challenges in addressing label imbalance in multi-site WSIs from the real world. Moreover, existing WSI classification methods cannot simultaneously exploit local context information and long-range dependencies in the site-end feature representation of federated learning. To address these issues, we present a point transformer with federated learning for multi-site HER2 status prediction from HE-stained WSIs. Our approach incorporates two novel designs. We propose a dynamic label distribution strategy and an auxiliary classifier, which helps to establish a well-initialized model and mitigate label distribution variations across sites. Additionally, we propose a farthest cosine sampling based on cosine distance. It can sample the most distinctive features and capture the long-range dependencies. Extensive experiments and analysis show that our method achieves state-of-the-art performance at four sites with a total of 2687 WSIs. Furthermore, we demonstrate that our model can generalize to two unseen sites with 229 WSIs.
Code is available at: https://github.com/boyden/PointTransformerFL \ No newline at end of file diff --git a/data/2024/aaai/Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models b/data/2024/aaai/Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models new file mode 100644 index 0000000000..ceb9017c4a --- /dev/null +++ b/data/2024/aaai/Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models @@ -0,0 +1 @@ +The popularity of pre-trained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaption cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques are proposed for language and 2D image pre-trained models. However, the specialized PEFT method for 3D pre-trained models is still under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters, and only tune the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we propose to construct a memory bank with domain-specific knowledge, and utilize a parameter-free attention to enhance the prompt tokens. The Geometry-aware Adapter aims to aggregate point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that our Point-PEFT can achieve better performance than the full fine-tuning on various downstream tasks, while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code is released at https://github.com/Ivan-Tang-3D/Point-PEFT. \ No newline at end of file diff --git a/data/2024/aaai/Point-to-Spike Residual Learning for Energy-Efficient 3D Point Cloud Classification b/data/2024/aaai/Point-to-Spike Residual Learning for Energy-Efficient 3D Point Cloud Classification new file mode 100644 index 0000000000..c1acaf271e --- /dev/null +++ b/data/2024/aaai/Point-to-Spike Residual Learning for Energy-Efficient 3D Point Cloud Classification @@ -0,0 +1 @@ +Spiking neural networks (SNNs) have revolutionized neural learning and are making remarkable strides in image analysis and robot control tasks with ultra-low power consumption advantages. Inspired by this success, we investigate the application of spiking neural networks to 3D point cloud processing. We present a point-to-spike residual learning network for point cloud classification, which operates on points with binary spikes rather than floating-point numbers. Specifically, we first design a spatial-aware kernel point spiking neuron to relate spiking generation to point position in 3D space. On this basis, we then design a 3D spiking residual block for effective feature learning based on spike sequences. By stacking the 3D spiking residual blocks, we build the point-to-spike residual classification network, which achieves low computation cost and low accuracy loss on two benchmark datasets, ModelNet40 and ScanObjectNN. Moreover, the classifier strikes a good balance between classification accuracy and biological characteristics, allowing us to explore the deployment of 3D processing to neuromorphic chips for developing energy-efficient 3D robotic perception systems. 
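As a minimal illustration of the freeze-the-backbone, tune-small-modules recipe that the Point-PEFT abstract above follows, the sketch below freezes a stand-in backbone and trains only a bottleneck adapter and task head; the module names, sizes, and the simple residual adapter are illustrative assumptions, not the paper's Point-prior Prompt or Geometry-aware Adapter.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck adapter inserted after a frozen block."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual update

# Stand-in for a pre-trained point-cloud backbone (frozen during tuning).
backbone = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, 256))
for p in backbone.parameters():
    p.requires_grad_(False)

adapter = Adapter(256)           # only these parameters are updated
head = nn.Linear(256, 40)        # task head, e.g. 40 ModelNet classes

x = torch.randn(8, 1024, 3)      # a batch of point clouds
logits = head(adapter(backbone(x)).mean(dim=1))
print(logits.shape)              # torch.Size([8, 40])

trainable = sum(p.numel() for p in list(adapter.parameters()) + list(head.parameters()))
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```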
\ No newline at end of file diff --git a/data/2024/aaai/Point2Real: Bridging the Gap between Point Cloud and Realistic Image for Open-World 3D Recognition b/data/2024/aaai/Point2Real: Bridging the Gap between Point Cloud and Realistic Image for Open-World 3D Recognition new file mode 100644 index 0000000000..01cc1e2cf4 --- /dev/null +++ b/data/2024/aaai/Point2Real: Bridging the Gap between Point Cloud and Realistic Image for Open-World 3D Recognition @@ -0,0 +1 @@ +Recognition in open-world scenarios is an important and challenging field, where Vision-Language Pre-training paradigms have greatly impacted the 2D domain. This inspires a growing interest in introducing 2D pre-trained models, such as CLIP, into the 3D domain to enhance point cloud understanding. Considering the difference between discrete 3D point clouds and real-world 2D images, reducing the domain gap is crucial. Some recent works project point clouds onto a 2D plane to enable 3D zero-shot capabilities without training. However, this simplistic approach leads to an unclear or even distorted geometric structure, limiting the potential of 2D pre-trained models in 3D. To address the domain gap, we propose Point2Real, a training-free framework based on the realistic rendering technique to automate the transformation of the 3D point cloud domain into the Vision-Language domain. Specifically, Point2Real leverages a shape recovery module that devises an iterative ball-pivoting algorithm to convert point clouds into meshes, narrowing the gap in shape at first. To simulate photo-realistic images, a set of refined textures as candidates is applied for rendering, where the CLIP confidence is utilized to select the suitable one. Moreover, to tackle the viewpoint challenge, a heuristic multi-view adapter is implemented for feature aggregation, which exploits the depth surface as an effective indicator of view-specific discriminability for recognition. We conduct experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets, and the results demonstrate that Point2Real outperforms other approaches in zero-shot and few-shot tasks by a large margin. \ No newline at end of file diff --git a/data/2024/aaai/PointAttN: You Only Need Attention for Point Cloud Completion b/data/2024/aaai/PointAttN: You Only Need Attention for Point Cloud Completion new file mode 100644 index 0000000000..30bac25adb --- /dev/null +++ b/data/2024/aaai/PointAttN: You Only Need Attention for Point Cloud Completion @@ -0,0 +1 @@ +Point cloud completion, which refers to completing 3D shapes from partial 3D point clouds, is a fundamental problem for 3D point cloud analysis tasks. Benefiting from the development of deep neural networks, research on point cloud completion has made great progress in recent years. However, the explicit local region partition (e.g., via kNN) involved in existing methods makes them sensitive to the density distribution of point clouds. Moreover, it provides only limited receptive fields, preventing the capture of long-range contextual features. To solve these problems, we leverage cross-attention and self-attention mechanisms to design a novel neural network for point cloud completion with implicit local region partition. Two basic units, Geometric Details Perception (GDP) and Self-Feature Augment (SFA), are proposed to establish the structural relationships directly among points in a simple yet effective way via the attention mechanism.
Then based on GDP and SFA, we construct a new framework with popular encoder-decoder architecture for point cloud completion. The proposed framework, namely PointAttN, is simple, neat and effective, which can precisely capture the structural information of 3D shapes and predict complete point clouds with detailed geometry. Experimental results demonstrate that our PointAttN outperforms state-of-the-art methods on multiple challenging benchmarks. Code is available at: https://github.com/ohhhyeahhh/PointAttN \ No newline at end of file diff --git a/data/2024/aaai/PointCVaR: Risk-Optimized Outlier Removal for Robust 3D Point Cloud Classification b/data/2024/aaai/PointCVaR: Risk-Optimized Outlier Removal for Robust 3D Point Cloud Classification new file mode 100644 index 0000000000..55c8f825d5 --- /dev/null +++ b/data/2024/aaai/PointCVaR: Risk-Optimized Outlier Removal for Robust 3D Point Cloud Classification @@ -0,0 +1 @@ +With the growth of 3D sensing technology, the deep learning system for 3D point clouds has become increasingly important, especially in applications such as autonomous vehicles where safety is a primary concern. However, there are growing concerns about the reliability of these systems when they encounter noisy point clouds, either occurring naturally or introduced with malicious intent. This paper highlights the challenges of point cloud classification posed by various forms of noise, from simple background noise to malicious adversarial/backdoor attacks that can intentionally skew model predictions. While there's an urgent need for optimized point cloud denoising, current point outlier removal approaches, an essential step for denoising, rely heavily on handcrafted strategies and are not adapted for higher-level tasks, such as classification. To address this issue, we introduce an innovative point outlier cleansing method that harnesses the power of downstream classification models. Using gradient-based attribution analysis, we define a novel concept: point risk. Drawing inspiration from tail risk minimization in finance, we recast the outlier removal process as an optimization problem, named PointCVaR. Extensive experiments show that our proposed technique not only robustly filters diverse point cloud outliers but also consistently and significantly enhances existing robust methods for point cloud classification. A notable feature of our approach is its effectiveness in defending against the latest threat of backdoor attacks in point clouds. \ No newline at end of file diff --git a/data/2024/aaai/PointPatchMix: Point Cloud Mixing with Patch Scoring b/data/2024/aaai/PointPatchMix: Point Cloud Mixing with Patch Scoring new file mode 100644 index 0000000000..24c0065493 --- /dev/null +++ b/data/2024/aaai/PointPatchMix: Point Cloud Mixing with Patch Scoring @@ -0,0 +1 @@ +Data augmentation is an effective regularization strategy for mitigating overfitting in deep neural networks, and it plays a crucial role in 3D vision tasks, where the point cloud data is relatively limited. While mixing-based augmentation has shown promise for point clouds, previous methods mix point clouds either on block level or point level, which has constrained their ability to strike a balance between generating diverse training samples and preserving the local characteristics of point clouds. 
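The PointCVaR abstract above recasts outlier removal as tail-risk minimization over per-point risk scores; a minimal sketch of CVaR-style filtering given such scores is shown below (the risk scores are synthetic here, whereas the paper derives them from gradient-based attribution).

```python
import numpy as np

def cvar(risks, alpha=0.1):
    """Conditional value-at-risk: the mean of the worst alpha fraction."""
    var = np.quantile(risks, 1.0 - alpha)          # value-at-risk threshold
    tail = risks[risks >= var]
    return var, tail.mean()

def filter_by_risk(points, risks, alpha=0.1):
    """Drop the alpha fraction of points with the highest risk scores."""
    var, _ = cvar(risks, alpha)
    keep = risks < var
    return points[keep]

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))     # a point cloud
risks = rng.exponential(size=1024)      # per-point risk (synthetic stand-in)
clean = filter_by_risk(points, risks, alpha=0.05)
print(points.shape, clean.shape)        # roughly 5% of points removed
```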
The significance of each component of the point clouds has not been fully considered, as not all parts contribute equally to the classification task, and some parts may contain unimportant or redundant information. To overcome these challenges, we propose PointPatchMix, a novel approach that mixes point clouds at the patch level and integrates a patch scoring module to generate content-based targets for mixed point clouds. Our approach preserves local features at the patch level, while the patch scoring module assigns targets based on the content-based significance score from a pre-trained teacher model. We evaluate PointPatchMix on two benchmark datasets, ModelNet40 and ScanObjectNN, and demonstrate significant improvements over various baselines on both synthetic and real-world datasets, as well as in few-shot settings. With Point-MAE as our baseline, our model surpasses previous methods by a significant margin. Furthermore, our approach shows strong generalization across various point cloud methods and enhances the robustness of the baseline model. Code is available at https://jiazewang.com/projects/pointpatchmix.html. \ No newline at end of file diff --git a/data/2024/aaai/Polyper: Boundary Sensitive Polyp Segmentation b/data/2024/aaai/Polyper: Boundary Sensitive Polyp Segmentation new file mode 100644 index 0000000000..4c7aeac869 --- /dev/null +++ b/data/2024/aaai/Polyper: Boundary Sensitive Polyp Segmentation @@ -0,0 +1 @@ +We present a new boundary sensitive framework for polyp segmentation, termed Polyper. Our method is motivated by a clinical practice: seasoned medical practitioners often leverage the inherent features of interior polyp regions to tackle blurred boundaries. Inspired by this, we propose to explicitly leverage boundary regions to bolster the model's boundary discrimination capability while minimizing computational resource wastage. Our approach first extracts low-confidence boundary regions and high-confidence prediction regions from an initial segmentation map through differentiable morphological operators. Then, we design the boundary sensitive attention that concentrates on augmenting the features near the boundary regions using the high-confidence prediction region's characteristics to generate good segmentation results. Our proposed method can be seamlessly integrated with classical encoder networks, like ResNet-50, MiT-B1, and Swin Transformer. To evaluate the effectiveness of Polyper, we conduct experiments on five publicly available challenging datasets, and achieve state-of-the-art performance on all of them. Code is available at https://github.com/haoshao-nku/medical_seg.git. \ No newline at end of file diff --git a/data/2024/aaai/PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF b/data/2024/aaai/PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF new file mode 100644 index 0000000000..6964011f6b --- /dev/null +++ b/data/2024/aaai/PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF @@ -0,0 +1 @@ +This paper proposes an end-to-end framework for generating 3D human pose datasets using Neural Radiance Fields (NeRF). Public datasets generally have limited diversity in terms of human poses and camera viewpoints, largely due to the resource-intensive nature of collecting 3D human pose data. As a result, pose estimators trained on public datasets significantly underperform when applied to unseen out-of-distribution samples.
Previous works proposed augmenting public datasets by generating 2D-3D pose pairs or rendering a large amount of random data. Such approaches either overlook image rendering or result in suboptimal datasets for pre-trained models. Here we propose PoseGen, which learns to generate a dataset (human 3D poses and images) with a feedback loss from a given pre-trained pose estimator. In contrast to prior art, our generated data is optimized to improve the robustness of the pre-trained model. The objective of PoseGen is to learn a distribution of data that maximizes the prediction error of a given pre-trained model. As the learned data distribution contains OOD samples of the pre-trained model, sampling data from such a distribution to further fine-tune the pre-trained model improves its generalizability. This is the first work that proposes NeRFs for 3D human data generation. NeRFs are data-driven and do not require 3D scans of humans. Therefore, using NeRF for data generation is a new direction for convenient user-specific data generation. Our extensive experiments show that the proposed PoseGen improves two baseline models (SPIN and HybrIK) on four datasets with an average 6% relative improvement. \ No newline at end of file diff --git a/data/2024/aaai/Post-trained Convolution Networks for Single Image Super-resolution (Abstract Reprint) b/data/2024/aaai/Post-trained Convolution Networks for Single Image Super-resolution (Abstract Reprint) new file mode 100644 index 0000000000..0e7b6861cf --- /dev/null +++ b/data/2024/aaai/Post-trained Convolution Networks for Single Image Super-resolution (Abstract Reprint) @@ -0,0 +1 @@ +A new method is proposed to increase the accuracy of state-of-the-art single image super-resolution (SISR) using a novel training procedure. The proposed method, named post-trained convolutional neural network (CNN), applies a stochastic dual simplex algorithm (SDSA) in the last reconstruction layer. The method utilizes contextual information to update the last reconstruction layer of the CNN. The extracted contextual information is projected to the last reconstruction layer by optimized weights, and the bias is managed through the SDSA. The post-trained CNN is applied to the very deep super-resolution (VDSR) method to show its performance. The quantitative and visual results demonstrate that the proposed post-trained VDSR (PTVDSR) exhibits excellent and competitive performance when compared with the VDSR and other super-resolution methods. \ No newline at end of file diff --git a/data/2024/aaai/Potential-Based Reward Shaping for Intrinsic Motivation (Student Abstract) b/data/2024/aaai/Potential-Based Reward Shaping for Intrinsic Motivation (Student Abstract) new file mode 100644 index 0000000000..12fcfa65e7 --- /dev/null +++ b/data/2024/aaai/Potential-Based Reward Shaping for Intrinsic Motivation (Student Abstract) @@ -0,0 +1 @@ +Recently, there has been a proliferation of intrinsic motivation (IM) reward shaping methods to learn in complex and sparse-reward environments. These methods can often inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior. Previous work on mitigating the risks of reward shaping, particularly through potential-based reward shaping (PBRS), has not been applicable to many IM methods, as they are often complex, trainable functions themselves, and therefore dependent on a wider set of variables than the traditional reward functions that PBRS was developed for.
We present an extension to PBRS that we show preserves the set of optimal policies under a more general set of functions than has been previously demonstrated. We also present Potential-Based Intrinsic Motivation (PBIM), a method for converting IM rewards into a potential-based form that is usable without altering the set of optimal policies. Testing in the MiniGrid DoorKey environment, we demonstrate that PBIM successfully prevents the agent from converging to a suboptimal policy and can speed up training. \ No newline at end of file diff --git a/data/2024/aaai/Power Grid Anomaly Detection via Hybrid LSTM-GIN Model (Student Abstract) b/data/2024/aaai/Power Grid Anomaly Detection via Hybrid LSTM-GIN Model (Student Abstract) new file mode 100644 index 0000000000..abe3fe19b1 --- /dev/null +++ b/data/2024/aaai/Power Grid Anomaly Detection via Hybrid LSTM-GIN Model (Student Abstract) @@ -0,0 +1 @@ +Cyberattacks on power grids pose significant risks to national security. Power grid attacks typically lead to abnormal readings in power output, frequency, current, and voltage. Due to the interconnected structure of power grids, abnormalities can spread throughout the system and cause widespread power outages if not detected and dealt with promptly. Our research proposes a novel anomaly detection system for power grids that prevents overfitting. We created a network graph to represent the structure of the power grid, where nodes represent power grid components like generators and edges represent connections between nodes such as overhead power lines. We combine the capabilities of Long Short-Term Memory (LSTM) models with a Graph Isomorphism Network (GIN) in a hybrid model to pinpoint anomalies in the grid. We train our model on each category of nodes that serves a similar structural purpose to prevent overfitting of the model. We then assign each node in the graph a unique signature using a GIN. Our model achieved a 99.92% accuracy rate, which is significantly higher than a version of our model without structural encoding, which had an accuracy level of 97.30%. Our model allows us to capture structural and temporal components of power grids and develop an attack detection system with high accuracy without overfitting. \ No newline at end of file diff --git a/data/2024/aaai/Power-Aware Inverse-Search Machine Learning for Low Resource Multi-Objective Unmanned Underwater Vehicle Control (Student Abstract) b/data/2024/aaai/Power-Aware Inverse-Search Machine Learning for Low Resource Multi-Objective Unmanned Underwater Vehicle Control (Student Abstract) new file mode 100644 index 0000000000..acf956ad61 --- /dev/null +++ b/data/2024/aaai/Power-Aware Inverse-Search Machine Learning for Low Resource Multi-Objective Unmanned Underwater Vehicle Control (Student Abstract) @@ -0,0 +1 @@ +Flapping-fin unmanned underwater vehicle (UUV) propulsion systems enable high maneuverability for tasks ranging from station-keeping to surveillance but are often constrained by their limited computational power and battery capacity. Previous research has demonstrated that time-series neural network models can accurately predict the thrust and power of certain fin kinematics based on the specified gait coupled with the fin configuration, but cannot fit an inverse neural network that takes a thrust request and tunes the kinematics by weighting thrust generation, smooth movement transitions, and power attributes.
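The PBRS/PBIM abstract above relies on the classic potential-based shaping form; the sketch below states that form directly (the goal-distance potential and discount factor are illustrative assumptions, and converting a trainable intrinsic-motivation reward into this form, as PBIM does, is not reproduced).

```python
def shaped_reward(reward, s, s_next, potential, gamma=0.99, done=False):
    """Classic potential-based reward shaping:
        r' = r + gamma * Phi(s') - Phi(s)
    which is the form that provably leaves the set of optimal policies
    unchanged; terminal states use a zero potential."""
    phi_next = 0.0 if done else potential(s_next)
    return reward + gamma * phi_next - potential(s)

# toy usage: a potential that rewards being close to a goal position
goal = 10.0
potential = lambda s: -abs(goal - s)
print(shaped_reward(0.0, s=3.0, s_next=4.0, potential=potential))  # 1.06
```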
We study various combinations of the three weights and fin materials to create different ‘modes’ of movement for a multi-objective UUV, based on controller intent using an inverse neural network. Finally, we implement and validate an enhanced power-aware inverse model by benchmarking on the Raspberry Pi Model 4B system and testing through generated simulated movements. \ No newline at end of file diff --git a/data/2024/aaai/Practical Privacy-Preserving MLaaS: When Compressive Sensing Meets Generative Networks b/data/2024/aaai/Practical Privacy-Preserving MLaaS: When Compressive Sensing Meets Generative Networks new file mode 100644 index 0000000000..3635eec22a --- /dev/null +++ b/data/2024/aaai/Practical Privacy-Preserving MLaaS: When Compressive Sensing Meets Generative Networks @@ -0,0 +1 @@ +The Machine-Learning-as-a-Service (MLaaS) framework allows one to grab low-hanging fruit of machine learning techniques and data science, without either much expertise for this sophisticated sphere or provision of specific infrastructures. However, the requirement of revealing all training data to the service provider raises new concerns in terms of privacy leakage, storage consumption, efficiency, bandwidth, etc. In this paper, we propose a lightweight privacy-preserving MLaaS framework by combining Compressive Sensing (CS) and Generative Networks. It’s constructed on the favorable facts observed in recent works that general inference tasks could be fulfilled with generative networks and classifier trained on compressed measurements, since the generator could model the data distribution and capture discriminative information which are useful for classification. To improve the performance of the MLaaS framework, the supervised generative models of the server are trained and optimized with prior knowledge provided by the client. In order to prevent the service provider from recovering the original data as well as identifying the queried results, a noise-addition mechanism is designed and adopted into the compressed data domain. Empirical results confirmed its performance superiority in accuracy and resource consumption against the state-of-the-art privacy preserving MLaaS frameworks. \ No newline at end of file diff --git a/data/2024/aaai/Practical Sentiment Analysis for Education: The Power of Student Crowdsourcing b/data/2024/aaai/Practical Sentiment Analysis for Education: The Power of Student Crowdsourcing new file mode 100644 index 0000000000..e2a281629a --- /dev/null +++ b/data/2024/aaai/Practical Sentiment Analysis for Education: The Power of Student Crowdsourcing @@ -0,0 +1 @@ +Sentiment analysis provides a promising tool to automatically assess the emotions voiced in written student feedback such as periodically collected unit-of-study reflections. The commonly used dictionary-based approaches are limited to major languages and fail to capture contextual differences. Pretrained large language models have been shown to be biased and online versions raise privacy concerns. Hence, we resort to traditional supervised machine learning (ML) approaches which are designed to overcome these issues by learning from domain-specific labeled data. However, these labels are hard to come by -- in our case manually annotating student feedback is prone to bias and time-consuming, especially in high-enrollment courses. In this work, we investigate the use of student crowdsourced labels for supervised sentiment analysis for education. 
Specifically, we compare crowdsourced and student self-reported labels with human expert annotations and use them in various ML approaches to evaluate the performance on predicting emotions of written student feedback collected from large computer science classes. We find that the random forest model trained with student-crowdsourced labels tremendously improves the identification of reflections with negative sentiment. In addition to our quantitative study, we describe our crowdsourcing experiment which was intentionally designed to be an educational activity in an introduction to data science course. \ No newline at end of file diff --git a/data/2024/aaai/Pre-trained Online Contrastive Learning for Insurance Fraud Detection b/data/2024/aaai/Pre-trained Online Contrastive Learning for Insurance Fraud Detection new file mode 100644 index 0000000000..0530627052 --- /dev/null +++ b/data/2024/aaai/Pre-trained Online Contrastive Learning for Insurance Fraud Detection @@ -0,0 +1 @@ +Medical insurance fraud has always been a crucial challenge in the field of healthcare industry. Existing fraud detection models mostly focus on offline learning scenes. However, fraud patterns are constantly evolving, making it difficult for models trained on past data to detect newly emerging fraud patterns, posing a severe challenge in medical fraud detection. Moreover, current incremental learning models are mostly designed to address catastrophic forgetting, but often exhibit suboptimal performance in fraud detection. To address this challenge, this paper proposes an innovative online learning method for medical insurance fraud detection, named POCL. This method combines contrastive learning pre-training with online updating strategies. In the pre-training stage, we leverage contrastive learning pre-training to learn on historical data, enabling deep feature learning and obtaining rich risk representations. In the online learning stage, we adopt a Temporal Memory Aware Synapses online updating strategy, allowing the model to perform incremental learning and optimization based on continuously emerging new data. This ensures timely adaptation to fraud patterns and reduces forgetting of past knowledge. Our model undergoes extensive experiments and evaluations on real-world insurance fraud datasets. The results demonstrate our model has significant advantages in accuracy compared to the state-of-the-art baseline methods, while also exhibiting lower running time and space consumption. Our sources are released at https://github.com/finint/POCL. \ No newline at end of file diff --git a/data/2024/aaai/PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling b/data/2024/aaai/PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling new file mode 100644 index 0000000000..e7bb6a1ebb --- /dev/null +++ b/data/2024/aaai/PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling @@ -0,0 +1 @@ +Pre-routing timing prediction has been recently studied for evaluating the quality of a candidate cell placement in chip design. It involves directly estimating the timing metrics for both pin-level (slack, slew) and edge-level (net delay, cell delay), without time-consuming routing. 
However, it often suffers from signal decay and error accumulation due to the long timing paths in large-scale industrial circuits. To address these challenges, we propose a two-stage approach. First, we propose global circuit training to pre-train a graph auto-encoder that learns the global graph embedding from the circuit netlist. Second, we use a novel node updating scheme for message passing on the GCN, following the topological sorting sequence of the learned graph embedding and circuit graph. This scheme residually models the local time delay between two adjacent pins in the updating sequence, and extracts the lookup table information inside each cell via a new attention mechanism. To handle large-scale circuits efficiently, we introduce an order preserving partition scheme that reduces memory consumption while maintaining the topological dependencies. Experiments on 21 real-world circuits achieve a new SOTA R2 of 0.93 for slack prediction, significantly surpassing the 0.59 of the previous SOTA method. Code will be available at: https://github.com/Thinklab-SJTU/EDA-AI. \ No newline at end of file diff --git a/data/2024/aaai/Predicting Real-World Penny Auction Durations by Integrating Game Theory and Machine Learning b/data/2024/aaai/Predicting Real-World Penny Auction Durations by Integrating Game Theory and Machine Learning new file mode 100644 index 0000000000..605202a69c --- /dev/null +++ b/data/2024/aaai/Predicting Real-World Penny Auction Durations by Integrating Game Theory and Machine Learning @@ -0,0 +1 @@ +Game theory and machine learning are two widely used techniques for predicting the outcomes of strategic interactions among humans. However, the game theory-based approach often relies on strong rationality and informational assumptions, while the machine learning-based approach typically requires the testing data to come from the same distribution as the training data. Our work studies how to integrate the two techniques to address these weaknesses. We focus on the interactions among real bidders in penny auctions, and develop a three-stage framework to predict the distributions of auction durations, which indicate the numbers of bids and auctioneer revenues. Specifically, we first leverage a pre-trained neural network to encode the descriptions of products in auctions into embeddings. Second, we apply game theory models to make preliminary predictions of auction durations. In particular, we tackle the challenge of accurately inferring parameters in game theory models. Third, we develop a Multi-Branch Mixture Density Network to learn the mapping from product embeddings and game-theoretic predictions to the distributions of actual auction durations. Experiments on real-world penny auction data demonstrate that our framework outperforms both game theory-based and machine learning-based prediction approaches. \ No newline at end of file diff --git a/data/2024/aaai/PrefAce: Face-Centric Pretraining with Self-Structure Aware Distillation b/data/2024/aaai/PrefAce: Face-Centric Pretraining with Self-Structure Aware Distillation new file mode 100644 index 0000000000..78b49c01ec --- /dev/null +++ b/data/2024/aaai/PrefAce: Face-Centric Pretraining with Self-Structure Aware Distillation @@ -0,0 +1 @@ +Video-based facial analysis is important for autonomous agents to understand human expressions and sentiments. However, limited labeled data is available to learn effective facial representations.
This paper proposes a novel self-supervised face-centric pretraining framework, called PrefAce, which learns transferable video facial representations without labels. The self-supervised learning is performed with an effective landmark-guided global-local tube distillation. Meanwhile, a novel instance-wise update FaceFeat Cache is built to enforce more discriminative and diverse representations for downstream tasks. Extensive experiments demonstrate that the proposed framework learns universal instance-aware facial representations with fine-grained landmark details from videos. Notably, it can transfer across various facial analysis tasks, e.g., Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our framework also outperforms the state-of-the-art on various downstream tasks, even in low-data regimes. Code is available at https://github.com/siyuan-h/PrefAce. \ No newline at end of file diff --git a/data/2024/aaai/Preference Aware Dual Contrastive Learning for Item Cold-Start Recommendation b/data/2024/aaai/Preference Aware Dual Contrastive Learning for Item Cold-Start Recommendation new file mode 100644 index 0000000000..400a540e1a --- /dev/null +++ b/data/2024/aaai/Preference Aware Dual Contrastive Learning for Item Cold-Start Recommendation @@ -0,0 +1 @@ +Existing cold-start recommendation methods often adopt item-level alignment strategies to align the content feature and the collaborative feature of warm items for model training; however, cold items in the test stage have no historical interactions with users to obtain the collaborative feature. These existing models ignore the aforementioned condition of cold items in the training stage, resulting in limited performance. In this paper, we propose a preference aware dual contrastive learning based recommendation model (PAD-CLRec), where the user preference is explored to take into account the condition of cold items for feature alignment. Here, the user preference is obtained by aggregating a group of collaborative features of the warm items in the user's purchase records. Then, a group-level alignment between the user preference and the item's content feature can be realized via a proposed preference aware contrastive function for enhancing cold-item recommendation. In addition, a joint objective function is introduced to achieve a better trade-off between the recommendation performance of warm items and cold items from both item-level and group-level perspectives, yielding better overall recommendation performance. Extensive experiments are conducted to demonstrate the effectiveness of the proposed method, and the results show the superiority of our method, as compared with the state of the art. \ No newline at end of file diff --git a/data/2024/aaai/Preference-Aware Constrained Multi-Objective Bayesian Optimization (Student Abstract) b/data/2024/aaai/Preference-Aware Constrained Multi-Objective Bayesian Optimization (Student Abstract) new file mode 100644 index 0000000000..ddd4a60b35 --- /dev/null +++ b/data/2024/aaai/Preference-Aware Constrained Multi-Objective Bayesian Optimization (Student Abstract) @@ -0,0 +1 @@ +This paper addresses the problem of constrained multi-objective optimization over black-box objective functions with practitioner-specified preferences over the objectives when a large fraction of the input space is infeasible (i.e., violates constraints).
This problem arises in many engineering design problems, including analog circuits and electric power system design. We aim to approximate the optimal Pareto set over the small fraction of feasible input designs. The key challenges include the massive size of the design space, multiple objectives, a large number of constraints, and the small fraction of feasible input designs, which can be identified only after performing expensive experiments/simulations. We propose a novel and efficient preference-aware constrained multi-objective Bayesian optimization approach referred to as PAC-MOO to address these challenges. The key idea is to learn surrogate models for both output objectives and constraints, and select the candidate input for evaluation in each iteration that maximizes the information gained about the optimal constrained Pareto front while factoring in the preferences over objectives. Our experiments on synthetic and challenging real-world analog circuit design optimization problems demonstrate the efficacy of PAC-MOO over baseline methods. \ No newline at end of file diff --git a/data/2024/aaai/Preparing Lessons for Progressive Training on Language Models b/data/2024/aaai/Preparing Lessons for Progressive Training on Language Models new file mode 100644 index 0000000000..fb4f68511a --- /dev/null +++ b/data/2024/aaai/Preparing Lessons for Progressive Training on Language Models @@ -0,0 +1 @@ +The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prepares lessons for expanding operations by learning high-layer functionality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs. \ No newline at end of file diff --git a/data/2024/aaai/Preventing Eviction-Caused Homelessness through ML-Informed Distribution of Rental Assistance b/data/2024/aaai/Preventing Eviction-Caused Homelessness through ML-Informed Distribution of Rental Assistance new file mode 100644 index 0000000000..0fbefc0fb4 --- /dev/null +++ b/data/2024/aaai/Preventing Eviction-Caused Homelessness through ML-Informed Distribution of Rental Assistance @@ -0,0 +1 @@ +Rental assistance programs provide individuals with financial assistance to prevent housing instabilities caused by evictions and avert homelessness. Since these programs operate under resource constraints, they must decide who to prioritize. Typically, funding is distributed by a reactive allocation process that does not systematically consider risk of future homelessness. 
We partnered with Anonymous County (PA) to explore a proactive and preventative allocation approach that prioritizes individuals facing eviction based on their risk of future homelessness. Our ML models, trained on state and county administrative data accurately identify at-risk individuals, outperforming simpler prioritization approaches by at least 20% while meeting our equity and fairness goals across race and gender. Furthermore, our approach would reach 28% of individuals who are overlooked by the current process and end up homeless. Beyond improvements to the rental assistance program in Anonymous County, this study can inform the development of evidence-based decision support tools in similar contexts, including lessons about data needs, model design, evaluation, and field validation. \ No newline at end of file diff --git a/data/2024/aaai/Primitive-Based 3D Human-Object Interaction Modelling and Programming b/data/2024/aaai/Primitive-Based 3D Human-Object Interaction Modelling and Programming new file mode 100644 index 0000000000..6c218bf2fb --- /dev/null +++ b/data/2024/aaai/Primitive-Based 3D Human-Object Interaction Modelling and Programming @@ -0,0 +1,3 @@ +Embedding Human and Articulated Object Interaction (HAOI) in 3D is an important direction for a deeper human activity understanding. Different from previous works that use parametric and CAD models to represent humans and objects, in this work, we propose a novel 3D geometric primitive-based language to encode both humans and objects. Given our new paradigm, humans and objects are all compositions of primitives instead of heterogeneous entities. Thus, mutual information learning may be achieved between the limited 3D data of humans and different object categories. Moreover, considering the simplicity of the expression and the richness of the information it contains, we choose the superquadric as the primitive representation. +To explore an effective embedding of HAOI for the machine, we build a new benchmark on 3D HAOI consisting of primitives together with their images and propose a task requiring machines to recover 3D HAOI using primitives from images. +Moreover, we propose a baseline of single-view 3D reconstruction on HAOI. We believe this primitive-based 3D HAOI representation would pave the way for 3D HAOI studies. Our code and data are available at https://mvig-rhos.com/p3haoi. \ No newline at end of file diff --git a/data/2024/aaai/Principal-Agent Reward Shaping in MDPs b/data/2024/aaai/Principal-Agent Reward Shaping in MDPs new file mode 100644 index 0000000000..b3f2072b39 --- /dev/null +++ b/data/2024/aaai/Principal-Agent Reward Shaping in MDPs @@ -0,0 +1 @@ +Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. 
Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: Stochastic trees and deterministic decision processes with a finite horizon. \ No newline at end of file diff --git a/data/2024/aaai/Principle Component Trees and Their Persistent Homology b/data/2024/aaai/Principle Component Trees and Their Persistent Homology new file mode 100644 index 0000000000..a1ab512a49 --- /dev/null +++ b/data/2024/aaai/Principle Component Trees and Their Persistent Homology @@ -0,0 +1 @@ +Low dimensional models like PCA are often used to simplify complex datasets by learning a single approximating subspace. This paradigm has expanded to union of subspaces models, like those learned by subspace clustering. In this paper, we present Principal Component Trees (PCTs), a graph structure that generalizes these ideas to identify mixtures of components that together describe the subspace structure of high-dimensional datasets. Each node in a PCT corresponds to a principal component of the data, and the edges between nodes indicate the components that must be mixed to produce a subspace that approximates a portion of the data. In order to construct PCTs, we propose two angle-distribution hypothesis tests to detect subspace clusters in the data. To analyze, compare, and select the best PCT model, we define two persistent homology measures that describe their shape. We show our construction yields two key properties of PCTs, namely ancestral orthogonality and non-decreasing singular values. Our main theoretical results show that learning PCTs reduces to PCA under multivariate normality, and that PCTs are efficient parameterizations of intersecting union of subspaces. Finally, we use PCTs to analyze neural network latent space, word embeddings, and reference image datasets. \ No newline at end of file diff --git a/data/2024/aaai/Prior and Prediction Inverse Kernel Transformer for Single Image Defocus Deblurring b/data/2024/aaai/Prior and Prediction Inverse Kernel Transformer for Single Image Defocus Deblurring new file mode 100644 index 0000000000..0bd57f0bb2 --- /dev/null +++ b/data/2024/aaai/Prior and Prediction Inverse Kernel Transformer for Single Image Defocus Deblurring @@ -0,0 +1,2 @@ +Defocus blur, due to spatially-varying sizes and shapes, is hard to remove. Existing methods either are unable to effectively handle irregular defocus blur or fail to generalize well on other datasets. +In this work, we propose a divide-and-conquer approach to tackling this issue, which gives rise to a novel end-to-end deep learning method, called prior-and-prediction inverse kernel transformer (P2IKT), for single image defocus deblurring. Since most defocus blur can be approximated as Gaussian blur or its variants, we construct an inverse Gaussian kernel module in our method to enhance its generalization ability. At the same time, an inverse kernel prediction module is introduced in order to flexibly address the irregular blur that cannot be approximated by Gaussian blur. We further design a scale recurrent transformer, which estimates mixing coefficients for adaptively combining the results from the two modules and runs the scale recurrent ``coarse-to-fine" procedure for progressive defocus deblurring. Extensive experimental results demonstrate that our P2IKT outperforms previous methods in terms of PSNR on multiple defocus deblurring datasets. 
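For intuition, the classical frequency-domain inverse of a Gaussian blur (a Wiener-style textbook baseline that assumes a known, uniform blur sigma; not P2IKT's learned inverse kernel modules) can be sketched as:

```python
# Textbook Wiener-style inverse Gaussian filtering; assumes a known, uniform blur
# sigma and a single-channel float image. Not the paper's learned modules.
import numpy as np

def gaussian_kernel(size, sigma):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def wiener_deblur(blurred, sigma=2.0, ksize=21, eps=1e-2):
    pad = np.zeros_like(blurred, dtype=float)
    pad[:ksize, :ksize] = gaussian_kernel(ksize, sigma)
    pad = np.roll(pad, -(ksize // 2), axis=(0, 1))   # center the kernel at the origin
    H = np.fft.fft2(pad)
    Y = np.fft.fft2(blurred)
    X = Y * np.conj(H) / (np.abs(H) ** 2 + eps)      # regularized inverse kernel
    return np.real(np.fft.ifft2(X))
```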
\ No newline at end of file diff --git a/data/2024/aaai/Privacy Amplification by Iteration for ADMM with (Strongly) Convex Objective Functions b/data/2024/aaai/Privacy Amplification by Iteration for ADMM with (Strongly) Convex Objective Functions new file mode 100644 index 0000000000..2d8c686517 --- /dev/null +++ b/data/2024/aaai/Privacy Amplification by Iteration for ADMM with (Strongly) Convex Objective Functions @@ -0,0 +1,7 @@ +We examine a private ADMM variant for (strongly) convex objectives which is a primal-dual iterative method. Each iteration has a user with a private function used to update the primal variable, masked by Gaussian noise for local privacy, without directly adding noise to the dual variable. Privacy amplification by iteration explores if noises from later iterations can enhance the privacy guarantee when releasing final variables after the last iteration. + +Cyffers et al. explored privacy amplification by iteration for the proximal ADMM variant, where a user's entire private function is accessed and noise is added to the primal variable. In contrast, we examine a private ADMM variant requiring just one gradient access to a user's function, but both primal and dual variables must be passed between successive iterations. + +To apply Balle et al.'s coupling framework to the gradient ADMM variant, we tackle technical challenges with novel ideas. First, we address the non-expansive mapping issue in ADMM iterations by using a customized norm. Second, because the dual variables are not masked with any noise directly, their privacy guarantees are achieved by treating two consecutive noisy ADMM iterations as a Markov operator. + +Our main result is that the privacy guarantee for the gradient ADMM variant can be amplified proportionally to the number of iterations. For strongly convex objective functions, this amplification exponentially increases with the number of iterations. These amplification results align with the previously studied special case of stochastic gradient descent. \ No newline at end of file diff --git a/data/2024/aaai/Privileged Prior Information Distillation for Image Matting b/data/2024/aaai/Privileged Prior Information Distillation for Image Matting new file mode 100644 index 0000000000..b68e9a8fe2 --- /dev/null +++ b/data/2024/aaai/Privileged Prior Information Distillation for Image Matting @@ -0,0 +1 @@ +Performance of trimap-free image matting methods is limited when trying to decouple the deterministic and undetermined regions, especially in the scenes where foregrounds are semantically ambiguous, chromaless, or high transmittance. In this paper, we propose a novel framework named Privileged Prior Information Distillation for Image Matting (PPID-IM) that can effectively transfer privileged prior environment-aware information to improve the performance of trimap-free students in solving hard foregrounds. The prior information of trimap regulates only the teacher model during the training stage, while not being fed into the student network during actual inference. To achieve effective privileged cross-modality (i.e. trimap and RGB) information distillation, we introduce a Cross-Level Semantic Distillation (CLSD) module that reinforces the students with more knowledgeable semantic representations and environment-aware information. We also propose an Attention-Guided Local Distillation module that efficiently transfers privileged local attributes from the trimap-based teacher to trimap-free students for the guidance of local-region optimization. 
Extensive experiments demonstrate the effectiveness and superiority of our PPID on image matting. The code will be released soon. \ No newline at end of file diff --git a/data/2024/aaai/ProAgent: Building Proactive Cooperative Agents with Large Language Models b/data/2024/aaai/ProAgent: Building Proactive Cooperative Agents with Large Language Models new file mode 100644 index 0000000000..e4be643cf7 --- /dev/null +++ b/data/2024/aaai/ProAgent: Building Proactive Cooperative Agents with Large Language Models @@ -0,0 +1 @@ +Building agents with adaptive behavior in cooperative tasks stands as a paramount goal in the realm of multi-agent systems. Current approaches to developing cooperative agents rely primarily on learning-based methods, whose policy generalization depends heavily on the diversity of teammates they interact with during the training phase. Such reliance, however, constrains the agents' capacity for strategic adaptation when cooperating with unfamiliar teammates, which becomes a significant challenge in zero-shot coordination scenarios. To address this challenge, we propose ProAgent, a novel framework that harnesses large language models (LLMs) to create proactive agents capable of dynamically adapting their behavior to enhance cooperation with teammates. ProAgent can analyze the present state and infer the intentions of teammates from observations. It then updates its beliefs in alignment with the teammates' subsequent actual behaviors. Moreover, ProAgent exhibits a high degree of modularity and interpretability, making it easily integrated into various coordination scenarios. Experimental evaluations conducted within the Overcooked-AI environment unveil the remarkable performance superiority of ProAgent, outperforming five methods based on self-play and population-based training when cooperating with AI agents. Furthermore, when partnered with human proxy models, its performance exhibits an average improvement exceeding 10% compared to the current state-of-the-art method. For more information about our project, please visit https://pku-proagent.github.io. \ No newline at end of file diff --git a/data/2024/aaai/ProCC: Progressive Cross-Primitive Compatibility for Open-World Compositional Zero-Shot Learning b/data/2024/aaai/ProCC: Progressive Cross-Primitive Compatibility for Open-World Compositional Zero-Shot Learning new file mode 100644 index 0000000000..9f32b29e1c --- /dev/null +++ b/data/2024/aaai/ProCC: Progressive Cross-Primitive Compatibility for Open-World Compositional Zero-Shot Learning @@ -0,0 +1 @@ +Open-World Compositional Zero-shot Learning (OW-CZSL) aims to recognize novel compositions of state and object primitives in images with no priors on the compositional space, which induces a tremendously large output space containing all possible state-object compositions. Existing works either learn the joint compositional state-object embedding or predict simple primitives with separate classifiers. However, the former method heavily relies on external word embedding methods, while the latter ignores the interactions of interdependent primitives. In this paper, we revisit the primitive prediction approach and propose a novel method, termed Progressive Cross-primitive Compatibility (ProCC), to mimic the human learning process for OW-CZSL tasks.
Specifically, the cross-primitive compatibility module explicitly learns to model the interactions of state and object features with the trainable memory units, which efficiently acquires cross-primitive visual attention to reason high-feasibility compositions, without the aid of external knowledge. Moreover, to alleviate the invalid cross-primitive interactions, especially for partial-supervision conditions (pCZSL), we design a progressive training paradigm to optimize the primitive classifiers conditioned on pre-trained features in an easy-to-hard manner. Extensive experiments on three widely used benchmark datasets demonstrate that our method outperforms other representative methods on both OW-CZSL and pCZSL settings by large margins. \ No newline at end of file diff --git a/data/2024/aaai/Probabilistic Neural Circuits b/data/2024/aaai/Probabilistic Neural Circuits new file mode 100644 index 0000000000..7d7012c7ec --- /dev/null +++ b/data/2024/aaai/Probabilistic Neural Circuits @@ -0,0 +1 @@ +Probabilistic circuits (PCs) have gained prominence in recent years as a versatile framework for discussing probabilistic models that support tractable queries and are yet expressive enough to model complex probability distributions. Nevertheless, tractability comes at a cost: PCs are less expressive than neural networks. In this paper we introduce probabilistic neural circuits (PNCs), which strike a balance between PCs and neural nets in terms of tractability and expressive power. Theoretically, we show that PNCs can be interpreted as deep mixtures of Bayesian networks. Experimentally, we demonstrate that PNCs constitute powerful function approximators. \ No newline at end of file diff --git a/data/2024/aaai/Probabilistic Offline Policy Ranking with Approximate Bayesian Computation b/data/2024/aaai/Probabilistic Offline Policy Ranking with Approximate Bayesian Computation new file mode 100644 index 0000000000..958a08f433 --- /dev/null +++ b/data/2024/aaai/Probabilistic Offline Policy Ranking with Approximate Bayesian Computation @@ -0,0 +1 @@ +In practice, it is essential to compare and rank candidate policies offline before real-world deployment for safety and reliability. Prior work seeks to solve this offline policy ranking (OPR) problem through value-based methods, such as Off-policy evaluation (OPE). However, they fail to analyze special case performance (e.g., worst or best cases), due to the lack of holistic characterization of policies’ performance. It is even more difficult to estimate precise policy values when the reward is not fully accessible under sparse settings. In this paper, we present Probabilistic Offline Policy Ranking (POPR), a framework to address OPR problems by leveraging expert data to characterize the probability of a candidate policy behaving like experts, and approximating its entire performance posterior distribution to help with ranking. POPR does not rely on value estimation, and the derived performance posterior can be used to distinguish candidates in worst-, best-, and average-cases. To estimate the posterior, we propose POPR-EABC, an Energy-based Approximate Bayesian Computation (ABC) method conducting likelihood-free inference. POPR-EABC reduces the heuristic nature of ABC by a smooth energy function, and improves the sampling efficiency by a pseudo-likelihood. 
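For reference, the plain rejection-ABC loop that such energy-based variants refine looks roughly like this (an illustrative sketch; the prior, simulator, and distance function are hypothetical placeholders, not POPR-EABC's components):

```python
# Plain rejection ABC: keep parameter draws whose simulated statistics land close to
# the observed ones. The prior, simulator, and distance are hypothetical placeholders.
import numpy as np

def rejection_abc(observed_stat, sample_prior, simulate, distance,
                  n_draws=10_000, tolerance=0.1, seed=0):
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)           # draw candidate parameters from the prior
        sim_stat = simulate(theta, rng)     # forward-simulate summary statistics
        if distance(sim_stat, observed_stat) < tolerance:
            accepted.append(theta)          # keep draws that roughly match the data
    return np.asarray(accepted)             # samples approximating the posterior
```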
We empirically demonstrate that POPR-EABC is adequate for evaluating policies in both discrete and continuous action spaces across various experiment environments, and facilitates probabilistic comparisons of candidate policies before deployment. \ No newline at end of file diff --git a/data/2024/aaai/Probabilities of Causation with Nonbinary Treatment and Effect b/data/2024/aaai/Probabilities of Causation with Nonbinary Treatment and Effect new file mode 100644 index 0000000000..8bf5cd4776 --- /dev/null +++ b/data/2024/aaai/Probabilities of Causation with Nonbinary Treatment and Effect @@ -0,0 +1 @@ +Probabilities of causation are proven to be critical in modern decision-making. This paper deals with the problem of estimating the probabilities of causation when treatment and effect are not binary. Pearl defined the binary probabilities of causation, such as the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). Tian and Pearl then derived sharp bounds for these probabilities of causation using experimental and observational data. In this paper, we define and provide theoretical bounds for all types of probabilities of causation with multivalued treatments and effects. We further discuss examples where our bounds guide practical decisions and use simulation studies to evaluate how informative the bounds are for various data combinations. \ No newline at end of file diff --git a/data/2024/aaai/Probability-Polarized Optimal Transport for Unsupervised Domain Adaptation b/data/2024/aaai/Probability-Polarized Optimal Transport for Unsupervised Domain Adaptation new file mode 100644 index 0000000000..4ab800aff6 --- /dev/null +++ b/data/2024/aaai/Probability-Polarized Optimal Transport for Unsupervised Domain Adaptation @@ -0,0 +1 @@ +Optimal transport (OT) is an important methodology to measure distribution discrepancy, which has achieved promising performance in artificial intelligence applications, e.g., unsupervised domain adaptation. However, from the view of transportation, there are still limitations: 1) the local discriminative structures for downstream tasks, e.g., cluster structure for classification, cannot be explicitly admitted by the learned OT plan; 2) the entropy regularization induces a dense OT plan with increasing uncertainty. To tackle these issues, we propose a novel Probability-Polarized OT (PPOT) framework, which can characterize the structure of OT plan explicitly. Specifically, the probability polarization mechanism is proposed to guide the optimization direction of OT plan, which generates a clear margin between similar and dissimilar transport pairs and reduces the uncertainty. Further, a dynamic mechanism for margin is developed by incorporating task-related information into the polarization, which directly captures the intra/inter class correspondence for knowledge transportation. A mathematical understanding for PPOT is provided from the view of gradient, which ensures interpretability. Extensive experiments on several datasets validate the effectiveness and empirical efficiency of PPOT. 
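For context, the entropy-regularized OT baseline, whose increasingly dense and uncertain plans motivate probability polarization, can be sketched with standard Sinkhorn scaling (illustrative only, not the PPOT algorithm):

```python
# Standard Sinkhorn iterations for entropy-regularized OT between two discrete
# marginals a and b with cost matrix C; larger reg yields denser, more uncertain plans.
import numpy as np

def sinkhorn_plan(a, b, C, reg=0.05, n_iter=200):
    K = np.exp(-C / reg)                   # Gibbs kernel
    u = np.ones_like(a, dtype=float)
    for _ in range(n_iter):
        v = b / (K.T @ u)                  # alternate scaling of the two marginals
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]     # transport plan with marginals (a, b)
```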
\ No newline at end of file diff --git a/data/2024/aaai/Procedural Level Generation with Diffusion Models from a Single Example b/data/2024/aaai/Procedural Level Generation with Diffusion Models from a Single Example new file mode 100644 index 0000000000..6cb843c5d6 --- /dev/null +++ b/data/2024/aaai/Procedural Level Generation with Diffusion Models from a Single Example @@ -0,0 +1 @@ +Level generation is a central focus of Procedural Content Generation (PCG), yet deep learning-based approaches are limited by scarce training data, i.e., human-designed levels. Despite being a dominant framework, Generative Adversarial Networks (GANs) exhibit a substantial quality gap between generated and human-authored levels, alongside rising training costs, particularly with increasing token complexity. In this paper, we introduce a diffusion-based generative model that learns from just one example. Our approach involves two core components: 1) an efficient yet expressive level representation, and 2) a latent denoising network with constrained receptive fields. To start with, our method utilizes token semantic labels, similar to word embeddings, to provide dense representations. This strategy not only surpasses one-hot encoding in representing larger game levels but also improves stability and accelerates convergence in latent diffusion. In addition, we adapt the denoising network architecture to confine the receptive field to localized patches of the data, aiming to facilitate single-example learning. Extensive experiments demonstrate that our model is capable of generating stylistically congruent samples of arbitrary sizes compared to manually designed levels. It suits a wide range of level structures with fewer artifacts than GAN-based approaches. The source code is available at https://github.com/shiqi-dai/diffusioncraft. \ No newline at end of file diff --git a/data/2024/aaai/Program Synthesis with Best-First Bottom-Up Search (Abstract Reprint) b/data/2024/aaai/Program Synthesis with Best-First Bottom-Up Search (Abstract Reprint) new file mode 100644 index 0000000000..e7b33ad68d --- /dev/null +++ b/data/2024/aaai/Program Synthesis with Best-First Bottom-Up Search (Abstract Reprint) @@ -0,0 +1 @@ +Cost-guided bottom-up search (BUS) algorithms use a cost function to guide the search to solve program synthesis tasks. In this paper, we show that current state-of-the-art cost-guided BUS algorithms suffer from a common problem: they can lose useful information given by the model and fail to perform the search in a best-first order according to a cost function. We introduce a novel best-first bottom-up search algorithm, which we call Bee Search, that does not suffer information loss and is able to perform cost-guided bottom-up synthesis in a best-first manner. Importantly, Bee Search performs best-first search with respect to the generation of programs, i.e., it does not even create in memory programs that are more expensive than the solution program. It attains best-first ordering with respect to generation by performing a search in an abstract space of program costs. We also introduce a new cost function that better uses the information provided by an existing cost model. Empirical results on string manipulation and bit-vector tasks show that Bee Search can outperform existing cost-guided BUS approaches when employing more complex domain-specific languages (DSLs); Bee Search and previous approaches perform equally well with simpler DSLs. 
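As a generic point of reference, a cost-ordered best-first enumeration over programs can be written with a priority queue (a plain cost-guided BUS skeleton; unlike Bee Search it does materialize programs costlier than the solution, and `expand`, `cost`, and `is_solution` are hypothetical callables):

```python
# Generic best-first program enumeration ordered by a cost function. expand(p) is
# assumed to yield larger programs built from p; programs must be hashable.
import heapq

def best_first_synthesis(initial_programs, expand, cost, is_solution, budget=100_000):
    frontier = [(cost(p), i, p) for i, p in enumerate(initial_programs)]
    heapq.heapify(frontier)
    tie = len(frontier)                 # tie-breaker so heapq never compares programs
    seen = set()
    while frontier and budget > 0:
        budget -= 1
        _, _, program = heapq.heappop(frontier)   # cheapest program first
        if program in seen:
            continue
        seen.add(program)
        if is_solution(program):
            return program
        for child in expand(program):
            tie += 1
            heapq.heappush(frontier, (cost(child), tie, child))
    return None
```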
Furthermore, our new cost function with Bee Search outperforms previous cost functions on string manipulation tasks. \ No newline at end of file diff --git a/data/2024/aaai/Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion b/data/2024/aaai/Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion new file mode 100644 index 0000000000..e2bf5c17c9 --- /dev/null +++ b/data/2024/aaai/Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion @@ -0,0 +1 @@ +In recent years, knowledge graph completion (KGC) models based on pre-trained language models (PLMs) have shown promising results. However, the large number of parameters and high computational cost of PLMs pose challenges for their application in downstream tasks. This paper proposes a progressive distillation method based on masked generation features for the KGC task, aiming to significantly reduce the complexity of pre-trained models. Specifically, we perform pre-distillation on the PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models. However, traditional feature distillation suffers from the limitation of having a single representation of information in teacher models. To solve this problem, we propose masked generation of teacher-student features, which contain richer representation information. Furthermore, there is a significant gap in representation ability between teacher and student. Therefore, we design a progressive distillation method to distill student models at each grade level, enabling efficient knowledge transfer from teachers to students. The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods. Furthermore, in the progressive distillation stage, the model significantly reduces the model parameters while maintaining a certain level of performance. Specifically, the model parameters of the lower-grade student model are reduced by 56.7% compared to the baseline. \ No newline at end of file diff --git a/data/2024/aaai/Progressive Feature Self-Reinforcement for Weakly Supervised Semantic Segmentation b/data/2024/aaai/Progressive Feature Self-Reinforcement for Weakly Supervised Semantic Segmentation new file mode 100644 index 0000000000..3e6f06446a --- /dev/null +++ b/data/2024/aaai/Progressive Feature Self-Reinforcement for Weakly Supervised Semantic Segmentation @@ -0,0 +1 @@ +Compared to conventional semantic segmentation with pixel-level supervision, weakly supervised semantic segmentation (WSSS) with image-level labels poses the challenge that it commonly focuses on the most discriminative regions, resulting in a disparity between weakly and fully supervised scenarios. A typical manifestation is the diminished precision on object boundaries, leading to deteriorated accuracy of WSSS. To alleviate this issue, we propose to adaptively partition the image content into certain regions (e.g., confident foreground and background) and uncertain regions (e.g., object boundaries and misclassified categories) for separate processing. For uncertain cues, we propose an adaptive masking strategy and seek to recover the local information with self-distilled knowledge.
We further assume that confident regions should be robust enough to preserve the global semantics, and introduce a complementary self-distillation method that constrains semantic consistency between confident regions and an augmented view with the same class labels. Extensive experiments conducted on PASCAL VOC 2012 and MS COCO 2014 demonstrate that our proposed single-stage approach for WSSS not only outperforms state-of-the-art counterparts but also surpasses multi-stage methods that trade complexity for accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Progressive High-Frequency Reconstruction for Pan-Sharpening with Implicit Neural Representation b/data/2024/aaai/Progressive High-Frequency Reconstruction for Pan-Sharpening with Implicit Neural Representation new file mode 100644 index 0000000000..e9454369f1 --- /dev/null +++ b/data/2024/aaai/Progressive High-Frequency Reconstruction for Pan-Sharpening with Implicit Neural Representation @@ -0,0 +1 @@ +Pan-sharpening aims to leverage the high-frequency signal of the panchromatic (PAN) image to enhance the resolution of its corresponding multi-spectral (MS) image. However, deep neural networks (DNNs) tend to prioritize learning the low-frequency components during the training process, which limits the restoration of high-frequency edge details in MS images. To overcome this limitation, we treat pan-sharpening as a coarse-to-fine high-frequency restoration problem and propose a novel method for achieving high-quality restoration of edge information in MS images. Specifically, to effectively obtain fine-grained multi-scale contextual features, we design a Band-limited Multi-scale High-frequency Generator (BMHG) that generates high-frequency signals from the PAN image within different bandwidths. During training, higher-frequency signals are progressively injected into the MS image, and corresponding residual blocks are introduced into the network simultaneously. This design enables gradients to flow from later to earlier blocks smoothly, encouraging intermediate blocks to concentrate on missing details. Furthermore, to address the issue of pixel position misalignment arising from multi-scale features fusion, we propose a Spatial-spectral Implicit Image Function (SIIF) that employs implicit neural representation to effectively represent and fuse spatial and spectral features in the continuous domain. Extensive experiments on different datasets demonstrate that our method outperforms existing approaches in terms of quantitative and visual measurements for high-frequency detail recovery. \ No newline at end of file diff --git a/data/2024/aaai/Progressive Painterly Image Harmonization from Low-Level Styles to High-Level Styles b/data/2024/aaai/Progressive Painterly Image Harmonization from Low-Level Styles to High-Level Styles new file mode 100644 index 0000000000..704e337b6e --- /dev/null +++ b/data/2024/aaai/Progressive Painterly Image Harmonization from Low-Level Styles to High-Level Styles @@ -0,0 +1 @@ +Painterly image harmonization aims to harmonize a photographic foreground object on the painterly background. Different from previous auto-encoder based harmonization networks, we develop a progressive multi-stage harmonization network, which harmonizes the composite foreground from low-level styles (e.g., color, simple texture) to high-level styles (e.g., complex texture). Our network has better interpretability and harmonization performance. 
Moreover, we design an early-exit strategy to automatically decide the proper stage to exit, which can skip the unnecessary and even harmful late stages. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our progressive harmonization network. \ No newline at end of file diff --git a/data/2024/aaai/Progressive Text-to-Image Diffusion with Soft Latent Direction b/data/2024/aaai/Progressive Text-to-Image Diffusion with Soft Latent Direction new file mode 100644 index 0000000000..501afe3161 --- /dev/null +++ b/data/2024/aaai/Progressive Text-to-Image Diffusion with Soft Latent Direction @@ -0,0 +1 @@ +In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations—namely insertion, editing, and erasing—we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field's performance standards. \ No newline at end of file diff --git a/data/2024/aaai/Progressively Knowledge Distillation via Re-parameterizing Diffusion Reverse Process b/data/2024/aaai/Progressively Knowledge Distillation via Re-parameterizing Diffusion Reverse Process new file mode 100644 index 0000000000..f78ffff1d9 --- /dev/null +++ b/data/2024/aaai/Progressively Knowledge Distillation via Re-parameterizing Diffusion Reverse Process @@ -0,0 +1,9 @@ +Knowledge distillation aims at transferring knowledge from the teacher model to the student one by aligning their distributions. +Feature-level distillation often uses L2 distance or its variants as the loss function, based on the assumption that outputs follow normal distributions. +This poses a significant challenge when distribution gaps are substantial since this loss function ignores the variance term. +To address the problem, we propose to decompose the transfer objective into small parts and optimize it progressively. +This process is inspired by diffusion models from which the noise distribution is mapped to the target distribution step by step. +However, directly employing diffusion models is impractical in the distillation scenario due to its heavy reverse process. +To overcome this challenge, we adopt the structural re-parameterization technique to generate multiple student features to approximate the teacher features sequentially. 
+The multiple student features are combined linearly in inference time without extra cost. +We present extensive experiments performed on various transfer scenarios, such as CNN-to-CNN and Transformer-to-CNN, that validate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/Project-Fair and Truthful Mechanisms for Budget Aggregation b/data/2024/aaai/Project-Fair and Truthful Mechanisms for Budget Aggregation new file mode 100644 index 0000000000..6e55f7ef83 --- /dev/null +++ b/data/2024/aaai/Project-Fair and Truthful Mechanisms for Budget Aggregation @@ -0,0 +1 @@ +We study the budget aggregation problem in which a set of strategic voters must split a finite divisible resource (such as money or time) among a set of competing projects. Our goal is twofold: We seek truthful mechanisms that provide fairness guarantees to the projects. For the first objective, we focus on the class of moving phantom mechanisms, which are -- to this day -- essentially the only known truthful mechanisms in this setting. For project fairness, we consider the mean division as a fair baseline, and bound the maximum difference between the funding received by any project and this baseline. We propose a novel and simple moving phantom mechanism that provides optimal project fairness guarantees. As a corollary of our results, we show that our new mechanism minimizes the L1 distance to the mean for three projects and gives the first non-trivial bounds on this quantity for more than three projects. \ No newline at end of file diff --git a/data/2024/aaai/Promoting Counterfactual Robustness through Diversity b/data/2024/aaai/Promoting Counterfactual Robustness through Diversity new file mode 100644 index 0000000000..519fc8a93e --- /dev/null +++ b/data/2024/aaai/Promoting Counterfactual Robustness through Diversity @@ -0,0 +1,12 @@ +Counterfactual explanations shed light on the decisions of black-box models by explaining +how an input can be altered to obtain a favourable decision from the model (e.g., when a loan application has been rejected). +However, as noted recently, counterfactual explainers may lack robustness in the sense that a minor change +in the input can cause a major change in the explanation. This can cause confusion on the user side and +open the door for adversarial attacks. In this paper, we study some sources of non-robustness. +While there are fundamental reasons for why an explainer that returns a single counterfactual cannot be +robust in all instances, we show that some interesting robustness guarantees can be given by reporting +multiple rather than a single counterfactual. Unfortunately, the number of counterfactuals that need to +be reported for the theoretical guarantees to hold can be prohibitively large. We therefore propose an approximation +algorithm that uses a diversity criterion to select a feasible number of most relevant explanations and study its robustness empirically. Our experiments indicate that our method improves the +state-of-the-art in generating robust explanations, while maintaining other desirable properties +and providing competitive computational performance. 
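One simple way to instantiate a diversity criterion over candidate counterfactuals is greedy max-min selection (a generic heuristic sketch, not the paper's approximation algorithm; candidates are assumed to be numeric feature vectors):

```python
# Greedy max-min (farthest-point) selection of k diverse candidates from a pool of
# counterfactual feature vectors; a generic heuristic, not the paper's algorithm.
import numpy as np

def select_diverse(candidates, k):
    candidates = np.asarray(candidates, dtype=float)
    chosen = [0]                                     # seed with the first candidate
    while len(chosen) < min(k, len(candidates)):
        diff = candidates[:, None, :] - candidates[chosen][None, :, :]
        dists = np.min(np.linalg.norm(diff, axis=-1), axis=1)
        dists[chosen] = -np.inf                      # never re-pick a chosen point
        chosen.append(int(np.argmax(dists)))         # add the farthest remaining one
    return candidates[chosen]
```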
\ No newline at end of file diff --git a/data/2024/aaai/Promoting Fair Vaccination Strategies through Influence Maximization: A Case Study on COVID-19 Spread b/data/2024/aaai/Promoting Fair Vaccination Strategies through Influence Maximization: A Case Study on COVID-19 Spread new file mode 100644 index 0000000000..cc129fb57d --- /dev/null +++ b/data/2024/aaai/Promoting Fair Vaccination Strategies through Influence Maximization: A Case Study on COVID-19 Spread @@ -0,0 +1 @@ +The aftermath of the Covid-19 pandemic saw more severe outcomes for racial minority groups and economically-deprived communities. Such disparities can be explained by several factors, including unequal access to healthcare, as well as the inability of low-income groups to reduce their mobility due to work or social obligations. Moreover, senior citizens were found to be more susceptible to severe symptoms, largely due to age-related health reasons. Adapting vaccine distribution strategies to consider a range of demographics is therefore essential to address these disparities. In this study, we propose a novel approach that utilizes influence maximization (IM) on mobility networks to develop vaccination strategies which incorporate demographic fairness. By considering factors such as race, social status, age, and associated risk factors, we aim to optimize vaccine distribution to achieve various fairness definitions for one or more protected attributes at a time. Through extensive experiments conducted on Covid-19 spread in three major metropolitan areas across the United States, we demonstrate the effectiveness of our proposed approach in reducing disease transmission and promoting fairness in vaccination distribution. \ No newline at end of file diff --git a/data/2024/aaai/Promoting Research Collaboration with Open Data Driven Team Recommendation in Response to Call for Proposals b/data/2024/aaai/Promoting Research Collaboration with Open Data Driven Team Recommendation in Response to Call for Proposals new file mode 100644 index 0000000000..9a9e9b7010 --- /dev/null +++ b/data/2024/aaai/Promoting Research Collaboration with Open Data Driven Team Recommendation in Response to Call for Proposals @@ -0,0 +1 @@ +Building teams and promoting collaboration are two very common business activities. An example of these is seen in the TeamingForFunding problem, where research institutions and researchers are interested in identifying collaborative opportunities when applying to funding agencies in response to the latter's calls for proposals. We describe a novel deployed system to recommend teams using a variety of AI methods, such that (1) each team achieves the highest possible skill coverage that is demanded by the opportunity, and (2) the workload of distributing the opportunities is balanced amongst the candidate members. We address these questions by extracting skills latent in open data of proposal calls (demand) and researcher profiles (supply), normalizing them using taxonomies, and creating efficient algorithms that match demand to supply. We create teams to maximize goodness along a novel metric balancing short- and long-term objectives. We validate the success of our algorithms (1) quantitatively, by evaluating the recommended teams using a goodness score and find that more informed methods lead to recommendations of a smaller number of teams but higher goodness, and (2) qualitatively, by conducting a large-scale user study at a college-wide level, and demonstrate that users overall found the tool very useful and relevant.
Lastly, we evaluate our system in two diverse settings in US and India (of researchers and proposal calls) to establish generality of our approach, and deploy it at a major US university for routine use. \ No newline at end of file diff --git a/data/2024/aaai/Prompt-Based Distribution Alignment for Unsupervised Domain Adaptation b/data/2024/aaai/Prompt-Based Distribution Alignment for Unsupervised Domain Adaptation new file mode 100644 index 0000000000..df2ce8dc40 --- /dev/null +++ b/data/2024/aaai/Prompt-Based Distribution Alignment for Unsupervised Domain Adaptation @@ -0,0 +1 @@ +Recently, despite the unprecedented success of large pre-trained visual-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that the unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target domains, thereby improving the performance of UDA. However, a major challenge for directly deploying such models on downstream UDA tasks is prompt engineering, which requires aligning the domain knowledge of source and target domains, since the performance of UDA is severely influenced by a good domain-invariant representation. We further propose a Prompt-based Distribution Alignment (PDA) method to incorporate the domain knowledge into prompt learning. Specifically, PDA employs a two-branch prompt-tuning paradigm, namely base branch and alignment branch. The base branch focuses on integrating class-related representation into prompts, ensuring discrimination among different classes. To further minimize domain discrepancy, for the alignment branch, we construct feature banks for both the source and target domains and propose image-guided feature tuning (IFT) to make the input attend to feature banks, which effectively integrates self-enhanced and cross-domain features into the model. In this way, these two branches can be mutually promoted to enhance the adaptation of VLMs for UDA. We conduct extensive experiments on three benchmarks to demonstrate that our proposed PDA achieves state-of-the-art performance. The code is available at https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment. \ No newline at end of file diff --git a/data/2024/aaai/PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation b/data/2024/aaai/PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation new file mode 100644 index 0000000000..91caaf5dd9 --- /dev/null +++ b/data/2024/aaai/PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation @@ -0,0 +1 @@ +Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and disease identification. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnosis unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on encoder-decoder architecture with an extra disease classification branch. 
When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalance issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of the text decoder's inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets. \ No newline at end of file diff --git a/data/2024/aaai/Prompting Multi-Modal Image Segmentation with Semantic Grouping b/data/2024/aaai/Prompting Multi-Modal Image Segmentation with Semantic Grouping new file mode 100644 index 0000000000..a03096d770 --- /dev/null +++ b/data/2024/aaai/Prompting Multi-Modal Image Segmentation with Semantic Grouping @@ -0,0 +1 @@ +Multi-modal image segmentation is one of the core issues in computer vision. The main challenge lies in integrating common information between modalities while retaining specific patterns for each modality. Existing methods typically perform full fine-tuning on RGB-based pre-trained parameters to inherit the powerful representation of the foundation model. Although effective, such a paradigm is not optimal due to weak transferability and scarce downstream data. Inspired by the recent success of prompt learning in language models, we propose the Grouping Prompt Tuning Framework (GoPT), which introduces explicit semantic grouping to learn modal-related prompts, adapting the frozen pre-trained foundation model to various downstream multi-modal segmentation tasks. Specifically, a class-aware uni-modal prompter is designed to balance intra- and inter-modal semantic propagation by grouping modality-specific class tokens, thereby improving the adaptability of spatial information. Furthermore, an alignment-induced cross-modal prompter is introduced to aggregate class-aware representations and share prompt parameters among different modalities to assist in modeling common statistics. Extensive experiments show the superiority of our GoPT, which achieves SOTA performance on various downstream multi-modal image segmentation tasks by training only < 1% of the model parameters. \ No newline at end of file diff --git a/data/2024/aaai/Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer b/data/2024/aaai/Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer new file mode 100644 index 0000000000..0378af69f8 --- /dev/null +++ b/data/2024/aaai/Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer @@ -0,0 +1 @@ +Never having seen an object and heard its sound simultaneously, can a model still accurately localize the object's visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks, but under the demanding zero-shot and few-shot scenarios.
To achieve this goal, different from existing approaches that mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, aiming to better fit the data scarcity and varying data distribution dilemmas with the help of abundant knowledge from pre-trained models. Specifically, we first propose to construct a Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects; meanwhile, the semantic gap between the visual and audio modalities is also encouraged to shrink. Then, we develop a Correlation Adapter (ColA) to keep training effort minimal while maintaining adequate knowledge of the visual foundation model. Equipped with these components, the new paradigm outperforms other fusion-based methods in extensive experiments in both the unseen-class and cross-dataset settings. We hope that our work can further promote the generalization study of Audio-Visual Localization and Segmentation in practical application scenarios. Project page: https://github.com/GeWu-Lab/Generalizable-Audio-Visual-Segmentation \ No newline at end of file diff --git a/data/2024/aaai/Proportional Aggregation of Preferences for Sequential Decision Making b/data/2024/aaai/Proportional Aggregation of Preferences for Sequential Decision Making new file mode 100644 index 0000000000..663b00ba1a --- /dev/null +++ b/data/2024/aaai/Proportional Aggregation of Preferences for Sequential Decision Making @@ -0,0 +1 @@ +We study the problem of fair sequential decision making given voter preferences. In each round, a decision rule must choose a decision from a set of alternatives where each voter reports which of these alternatives they approve. Instead of going with the most popular choice in each round, we aim for proportional representation, using axioms inspired by the multi-winner voting literature. The axioms require that if a group of α% of the voters agrees in every round (i.e., approves a common alternative), then those voters must approve at least α% of the decisions. A stronger version of the axioms requires that every group of α% of the voters that agrees in a β fraction of rounds must approve β⋅α% of the decisions. We show that three attractive voting rules satisfy axioms of this style. One of them (Sequential Phragmén) makes its decisions online, and the other two satisfy strengthened versions of the axioms but make decisions semi-online (Method of Equal Shares) or fully offline (Proportional Approval Voting). We present empirical results for these rules based on synthetic data and U.S. political elections. We also run experiments using the Moral Machine dataset about ethical dilemmas. We train preference models on user responses from different countries and let the models cast votes. We find that aggregating these votes using our rules leads to a more equal utility distribution across demographics than making decisions using a single global preference model.
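[Editorial note] To make the basic proportionality axiom above concrete, the following is a minimal brute-force sketch for toy instances. The function name, the data layout, and the reading of "those voters must approve" (counted here as decisions approved by every member of the group) are our own assumptions, not the paper's code; the stronger β-version is not checked.

from itertools import combinations

def satisfies_basic_axiom(ballots, decisions, n_voters):
    # ballots[r][v]: set of alternatives voter v approves in round r
    # decisions[r]: alternative chosen in round r
    T = len(decisions)
    for size in range(1, n_voters + 1):
        for group in combinations(range(n_voters), size):
            # the group "agrees" if some common alternative exists in every round
            if not all(set.intersection(*(ballots[r][v] for v in group)) for r in range(T)):
                continue
            # decisions approved by every member of the group (one plausible reading)
            approved = sum(1 for r in range(T)
                           if all(decisions[r] in ballots[r][v] for v in group))
            # a group making up size/n_voters of the electorate must approve at least that share
            if approved < (size / n_voters) * T:
                return False
    return True

The enumeration over all voter groups is exponential and only meant to illustrate the axiom on small examples, not to evaluate the paper's rules.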
\ No newline at end of file diff --git a/data/2024/aaai/Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers b/data/2024/aaai/Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers new file mode 100644 index 0000000000..e0cf246e89 --- /dev/null +++ b/data/2024/aaai/Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers @@ -0,0 +1,7 @@ +In recent years, significant progress has been made in the field of protein function prediction with the development of various machine-learning approaches. +However, most existing methods formulate the task as a multi-classification problem, i.e., assigning predefined labels to proteins. +In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free text style, moving beyond the conventional binary or categorical classifications. +By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types including protein sequence, structure, and textual annotation and description. +This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions. +To evaluate our model, we extract a multimodal protein dataset from SwissProt and demonstrate empirically the effectiveness of Prot2Text. +These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as first-to-see proteins. \ No newline at end of file diff --git a/data/2024/aaai/Protect Your Score: Contact-Tracing with Differential Privacy Guarantees b/data/2024/aaai/Protect Your Score: Contact-Tracing with Differential Privacy Guarantees new file mode 100644 index 0000000000..fe43d5c761 --- /dev/null +++ b/data/2024/aaai/Protect Your Score: Contact-Tracing with Differential Privacy Guarantees @@ -0,0 +1 @@ +The pandemic in 2020 and 2021 had enormous economic and societal consequences, and studies show that contact tracing algorithms can be key in the early containment of the virus. While large strides have been made towards more effective contact tracing algorithms, we argue that privacy concerns currently hold deployment back. The essence of a contact tracing algorithm is the communication of a risk score. Yet, it is precisely the communication and release of this score to a user that an adversary can leverage to gauge the private health status of an individual. We pinpoint a realistic attack scenario and propose a contact tracing algorithm with differential privacy guarantees against this attack. The algorithm is tested on the two most widely used agent-based COVID19 simulators and demonstrates superior performance in a wide range of settings. Especially for realistic test scenarios and while releasing each risk score with epsilon=1 differential privacy, we achieve a two- to ten-fold reduction in the infection rate of the virus. To the best of our knowledge, this presents the first contact tracing algorithm with differential privacy guarantees when revealing risk scores for COVID19.
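[Editorial note] For intuition about what "releasing each risk score with epsilon=1 differential privacy" entails, here is a generic Laplace-mechanism sketch. It is not the paper's actual mechanism, and the sensitivity bound and [0, 1] score range are stated assumptions.

import numpy as np

def release_risk_score(true_score, epsilon=1.0, sensitivity=1.0, rng=None):
    # Laplace mechanism: adding Laplace(sensitivity / epsilon) noise gives epsilon-DP
    # for any quantity whose value changes by at most `sensitivity` when one
    # individual's data changes. Clipping keeps the released score in [0, 1].
    rng = rng or np.random.default_rng()
    noisy = true_score + rng.laplace(scale=sensitivity / epsilon)
    return float(np.clip(noisy, 0.0, 1.0))

Smaller epsilon means more noise and stronger privacy; the paper's contribution lies in making contact tracing effective despite this noise, which the sketch does not capture.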
\ No newline at end of file diff --git a/data/2024/aaai/Provable Robustness against a Union of L_0 Adversarial Attacks b/data/2024/aaai/Provable Robustness against a Union of L_0 Adversarial Attacks new file mode 100644 index 0000000000..3c20995f29 --- /dev/null +++ b/data/2024/aaai/Provable Robustness against a Union of L_0 Adversarial Attacks @@ -0,0 +1 @@ +Sparse or L0 adversarial attacks arbitrarily perturb an unknown subset of the features. L0 robustness analysis is particularly well-suited for heterogeneous (tabular) data where features have different types or scales. State-of-the-art L0 certified defenses are based on randomized smoothing and apply to evasion attacks only. This paper proposes feature partition aggregation (FPA) -- a certified defense against the union of L0 evasion, backdoor, and poisoning attacks. FPA generates its stronger robustness guarantees via an ensemble whose submodels are trained on disjoint feature sets. Compared to state-of-the-art L0 defenses, FPA is up to 3,000x faster and provides larger median robustness guarantees (e.g., median certificates of 13 pixels over 10 for CIFAR10, 12 pixels over 10 for MNIST, 4 features over 1 for Weather, and 3 features over 1 for Ames), meaning FPA provides the additional dimensions of robustness essentially for free. \ No newline at end of file diff --git a/data/2024/aaai/Provably Convergent Federated Trilevel Learning b/data/2024/aaai/Provably Convergent Federated Trilevel Learning new file mode 100644 index 0000000000..d62f871c23 --- /dev/null +++ b/data/2024/aaai/Provably Convergent Federated Trilevel Learning @@ -0,0 +1 @@ +Trilevel learning, also called trilevel optimization (TLO), has been recognized as a powerful modelling tool for hierarchical decision process and widely applied in many machine learning applications, such as robust neural architecture search, hyperparameter optimization, and domain adaptation. Tackling TLO problems has presented a great challenge due to their nested decision-making structure. In addition, existing works on TLO face the following key challenges: 1) they all focus on the non-distributed setting, which may lead to privacy breach; 2) they do not offer any non-asymptotic convergence analysis which characterizes how fast an algorithm converges. To address the aforementioned challenges, this paper proposes an asynchronous federated trilevel optimization method to solve TLO problems. The proposed method utilizes u-cuts to construct a hyper-polyhedral approximation for the TLO problem and solve it in an asynchronous manner. We demonstrate that the proposed u-cuts are applicable to not only convex functions but also a wide range of non-convex functions that meet the u-weakly convex assumption. Furthermore, we theoretically analyze the non-asymptotic convergence rate for the proposed method by showing its iteration complexity to obtain ϵ-stationary point is upper bounded by O(1/ϵ²). Extensive experiments on real-world datasets have been conducted to elucidate the superiority of the proposed method, e.g., it has a faster convergence rate with a maximum acceleration of approximately 80%. 
\ No newline at end of file diff --git a/data/2024/aaai/Provably Powerful Graph Neural Networks for Directed Multigraphs b/data/2024/aaai/Provably Powerful Graph Neural Networks for Directed Multigraphs new file mode 100644 index 0000000000..e6d6342d7a --- /dev/null +++ b/data/2024/aaai/Provably Powerful Graph Neural Networks for Directed Multigraphs @@ -0,0 +1,2 @@ +This paper analyses a set of simple adaptations that transform standard message-passing Graph Neural Networks (GNN) into provably powerful directed multigraph neural networks. The adaptations include multigraph port numbering, ego IDs, and reverse message passing. We prove that the combination of these theoretically enables the detection of any directed subgraph pattern. To validate the effectiveness of our proposed adaptations in practice, we conduct experiments on synthetic subgraph detection tasks, which demonstrate outstanding performance with almost perfect results. +Moreover, we apply our proposed adaptations to two financial crime analysis tasks. We observe dramatic improvements in detecting money laundering transactions, improving the minority-class F1 score of a standard message-passing GNN by up to 30%, and closely matching or outperforming tree-based and GNN baselines. Similarly impressive results are observed on a real-world phishing detection dataset, boosting three standard GNNs’ F1 scores by around 15% and outperforming all baselines. An extended version with appendices can be found on arXiv: https://arxiv.org/abs/2306.11586. \ No newline at end of file diff --git a/data/2024/aaai/Providing Fair Recourse over Plausible Groups b/data/2024/aaai/Providing Fair Recourse over Plausible Groups new file mode 100644 index 0000000000..6efee70b36 --- /dev/null +++ b/data/2024/aaai/Providing Fair Recourse over Plausible Groups @@ -0,0 +1 @@ +Machine learning models now automate decisions in applications where we may wish to provide recourse to adversely affected individuals. In practice, existing methods to provide recourse return actions that fail to account for latent characteristics that are not captured in the model (e.g., age, sex, marital status). In this paper, we study how the cost and feasibility of recourse can change across these latent groups. We introduce a notion of group-level plausibility to identify groups of individuals with a shared set of latent characteristics. We develop a general-purpose clustering procedure to identify groups from samples. Further, we propose a constrained optimization approach to learn models that equalize the cost of recourse over latent groups. We evaluate our approach through an empirical study on simulated and real-world datasets, showing that it can produce models that have better performance in terms of overall costs and feasibility at a group level. \ No newline at end of file diff --git a/data/2024/aaai/ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection b/data/2024/aaai/ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection new file mode 100644 index 0000000000..3e71823393 --- /dev/null +++ b/data/2024/aaai/ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection @@ -0,0 +1 @@ +Open-vocabulary object detection (OVOD) aims to recognize novel objects whose categories are not included in the training set. 
In order to classify these unseen classes during training, many OVOD frameworks leverage the zero-shot capability of large-scale pretrained vision and language models, such as CLIP. To further improve generalization on the unseen novel classes, several approaches have proposed to additionally train with pseudo region labeling on external data sources that contain a substantial number of novel category labels beyond the existing training data. Despite their simplicity, these pseudo-labeling methods still exhibit limited improvement with regard to the truly unseen novel classes that were not pseudo-labeled. In this paper, we present a novel, yet simple technique that helps generalization on the overall distribution of novel classes. Inspired by our observation that numerous novel classes reside within the convex hull constructed by the base (seen) classes in the CLIP embedding space, we propose to synthesize proxy-novel classes approximating novel classes via linear mixup between a pair of base classes. By training our detector with these synthetic proxy-novel classes, we effectively explore the embedding space of novel classes. The experimental results on various OVOD benchmarks such as LVIS and COCO demonstrate superior performance on novel classes compared to the other state-of-the-art methods. Code is available at https://github.com/clovaai/ProxyDet. \ No newline at end of file diff --git "a/data/2024/aaai/Proxyformer: Nystr\303\266m-Based Linear Transformer with Trainable Proxy Tokens" "b/data/2024/aaai/Proxyformer: Nystr\303\266m-Based Linear Transformer with Trainable Proxy Tokens" new file mode 100644 index 0000000000..6345eef54c --- /dev/null +++ "b/data/2024/aaai/Proxyformer: Nystr\303\266m-Based Linear Transformer with Trainable Proxy Tokens" @@ -0,0 +1 @@ +Transformer-based models have demonstrated remarkable performance in various domains, including natural language processing, image processing and generative modeling. The most significant contributor to the successful performance of Transformer models is the self-attention mechanism, which allows for a comprehensive understanding of the interactions between tokens in the input sequence. However, there is a well-known scalability issue, the quadratic dependency (i.e., O(n^2)) of self-attention operations on the input sequence length n, making the handling of lengthy sequences challenging. To address this limitation, there has been a surge of research on efficient transformers, aiming to alleviate the quadratic dependency on the input sequence length. Among these, the Nyströmformer, which utilizes the Nyström method to decompose the attention matrix, achieves superior performance in both accuracy and throughput. However, its landmark selection exhibits redundancy, and the model incurs computational overhead when calculating the pseudo-inverse matrix. We propose a novel Nyström method-based transformer, called Proxyformer. Unlike the traditional approach of selecting landmarks from input tokens, the Proxyformer utilizes trainable neural memory, called proxy tokens, for landmarks. By integrating contrastive learning, input injection, and a specialized dropout for the decomposed matrix, Proxyformer achieves top-tier performance for long sequence tasks in the Long Range Arena benchmark.
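[Editorial note] The following is a minimal PyTorch sketch of Nyström-style attention in which the landmarks are trainable proxy tokens rather than tokens selected from the input. The class name, shapes, and the use of an exact pseudo-inverse are our own illustrative choices, not the authors' implementation (which, per the abstract, also involves contrastive learning, input injection, and a specialized dropout).

import torch
import torch.nn as nn

class ProxyAttention(nn.Module):
    # Nystrom-style attention with m trainable proxy tokens as landmarks.
    def __init__(self, dim, num_proxies=32):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim) * dim ** -0.5)
        self.scale = dim ** -0.5

    def forward(self, q, k, v):                                # q, k, v: (batch, n, dim)
        p = self.proxies.unsqueeze(0).expand(q.size(0), -1, -1)
        kernel1 = torch.softmax(q @ p.transpose(1, 2) * self.scale, dim=-1)  # (b, n, m)
        kernel2 = torch.softmax(p @ p.transpose(1, 2) * self.scale, dim=-1)  # (b, m, m)
        kernel3 = torch.softmax(p @ k.transpose(1, 2) * self.scale, dim=-1)  # (b, m, n)
        # softmax(QK^T)V is approximated by kernel1 @ pinv(kernel2) @ (kernel3 @ V),
        # reducing the cost from O(n^2) to O(n * m) for m << n tokens.
        return kernel1 @ torch.linalg.pinv(kernel2) @ (kernel3 @ v)

Because the m landmarks are learned parameters shared across inputs, redundancy among them can be controlled by training rather than by a selection heuristic.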
\ No newline at end of file diff --git a/data/2024/aaai/Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment b/data/2024/aaai/Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment new file mode 100644 index 0000000000..ab95463ea3 --- /dev/null +++ b/data/2024/aaai/Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment @@ -0,0 +1 @@ +Multi-modal entity alignment (MMEA) aims to identify equivalent entities between two multi-modal knowledge graphs for integration. Unfortunately, prior works have focused on improving the interaction and fusion of multi-modal information while overlooking the influence of modal-specific noise and the usage of labeled and unlabeled data in semi-supervised settings. In this work, we introduce Pseudo-label Calibration Multi-modal Entity Alignment (PCMEA), a semi-supervised approach. Specifically, in order to generate holistic entity representations, we first devise various embedding modules and attention mechanisms to extract visual, structural, relational, and attribute features. Different from the prior direct fusion methods, we next propose to exploit mutual information maximization to filter the modal-specific noise and to augment modal-invariant commonality. Then, we combine pseudo-label calibration with momentum-based contrastive learning to make full use of the labeled and unlabeled data, which improves the quality of pseudo-labels and pulls aligned entities closer. Finally, extensive experiments on two MMEA datasets demonstrate the effectiveness of our PCMEA, which yields state-of-the-art performance. \ No newline at end of file diff --git a/data/2024/aaai/Pure-Past Action Masking b/data/2024/aaai/Pure-Past Action Masking new file mode 100644 index 0000000000..df3f98d0b4 --- /dev/null +++ b/data/2024/aaai/Pure-Past Action Masking @@ -0,0 +1 @@ +We present Pure-Past Action Masking (PPAM), a lightweight approach to action masking for safe reinforcement learning. In PPAM, actions are disallowed (“masked”) according to specifications expressed in Pure-Past Linear Temporal Logic (PPLTL). PPAM can enforce non-Markovian constraints, i.e., constraints based on the history of the system, rather than just the current state of the (possibly hidden) MDP. The features used in the safety constraint need not be the same as those used by the learning agent, allowing a clear separation of concerns between the safety constraints and reward specifications of the (learning) agent. We prove formally that an agent trained with PPAM can learn any optimal policy that satisfies the safety constraints, and that PPAMs are as expressive as shields, another approach to enforce non-Markovian constraints in RL. Finally, we provide empirical results showing how PPAM can guarantee constraint satisfaction in practice. \ No newline at end of file diff --git a/data/2024/aaai/Pushing the Limit of Fine-Tuning for Few-Shot Learning: Where Feature Reusing Meets Cross-Scale Attention b/data/2024/aaai/Pushing the Limit of Fine-Tuning for Few-Shot Learning: Where Feature Reusing Meets Cross-Scale Attention new file mode 100644 index 0000000000..9708ee260e --- /dev/null +++ b/data/2024/aaai/Pushing the Limit of Fine-Tuning for Few-Shot Learning: Where Feature Reusing Meets Cross-Scale Attention @@ -0,0 +1 @@ +Due to the scarcity of training samples, Few-Shot Learning (FSL) poses a significant challenge to capture discriminative object features effectively.
The combination of transfer learning and meta-learning has recently been explored by pre-training the backbone features using labeled base data and subsequently fine-tuning the model with target data. However, existing meta-learning methods, which use embedding networks, suffer from scaling limitations when dealing with a few labeled samples, resulting in suboptimal results. Inspired by the latest advances in FSL, we further advance the approach of fine-tuning a pre-trained architecture by a strengthened hierarchical feature representation. The technical contributions of this work include: 1) a hybrid design named Intra-Block Fusion (IBF) to strengthen the extracted features within each convolution block; and 2) a novel Cross-Scale Attention (CSA) module to mitigate the scaling inconsistencies arising from the limited training samples, especially for cross-domain tasks. We conducted comprehensive evaluations on standard benchmarks, including three in-domain tasks (miniImageNet, CIFAR-FS, and FC100), as well as two cross-domain tasks (CDFSL and Meta-Dataset). The results have improved significantly over existing state-of-the-art approaches on all benchmark datasets. In particular, the FSL performance on the in-domain FC100 dataset is more than three points better than the latest PMF (Hu et al. 2022). \ No newline at end of file diff --git a/data/2024/aaai/Q-SENN: Quantized Self-Explaining Neural Networks b/data/2024/aaai/Q-SENN: Quantized Self-Explaining Neural Networks new file mode 100644 index 0000000000..3fd5e0b566 --- /dev/null +++ b/data/2024/aaai/Q-SENN: Quantized Self-Explaining Neural Networks @@ -0,0 +1 @@ +Explanations in Computer Vision are often desired, but most Deep Neural Networks can only provide saliency maps with questionable faithfulness. Self-Explaining Neural Networks (SENN) extract interpretable concepts with fidelity, diversity, and grounding to combine them linearly for decision-making. While they can explain what was recognized, initial realizations lack accuracy and general applicability. We propose the Quantized-Self-Explaining Neural Network “Q-SENN”. Q-SENN satisfies or exceeds the desiderata of SENN while being applicable to more complex datasets and maintaining most or all of the accuracy of an uninterpretable baseline model, outperforming previous work in all considered metrics. Q-SENN describes the relationship between every class and feature as either positive, negative or neutral instead of an arbitrary number of possible relations, enforcing more binary human-friendly features. Since every class is assigned just 5 interpretable features on average, Q-SENN shows convincing local and global interpretability. Additionally, we propose a feature alignment method, capable of aligning learned features with human language-based concepts without additional supervision. Thus, what is learned can be more easily verbalized. The code is published: https://github.com/ThomasNorr/Q-SENN \ No newline at end of file diff --git a/data/2024/aaai/QCS-SGM+: Improved Quantized Compressed Sensing with Score-Based Generative Models b/data/2024/aaai/QCS-SGM+: Improved Quantized Compressed Sensing with Score-Based Generative Models new file mode 100644 index 0000000000..95633d2ef7 --- /dev/null +++ b/data/2024/aaai/QCS-SGM+: Improved Quantized Compressed Sensing with Score-Based Generative Models @@ -0,0 +1 @@ +In practical compressed sensing (CS), the obtained measurements typically necessitate quantization to a limited number of bits prior to transmission or storage. 
This nonlinear quantization process poses significant recovery challenges, particularly with extremely coarse quantization such as 1-bit. Recently, an efficient algorithm called QCS-SGM was proposed for quantized CS (QCS), which utilizes score-based generative models (SGM) as an implicit prior. Due to the adeptness of SGM in capturing the intricate structures of natural signals, QCS-SGM substantially outperforms previous QCS methods. However, QCS-SGM is constrained to (approximately) row-orthogonal sensing matrices as the computation of the likelihood score becomes intractable otherwise. To address this limitation, we introduce an advanced variant of QCS-SGM, termed QCS-SGM+, capable of handling general matrices effectively. The key idea is a Bayesian inference perspective on the likelihood score computation, wherein expectation propagation is employed for its approximate computation. Extensive experiments are conducted, demonstrating the substantial superiority of QCS-SGM+ over QCS-SGM for general sensing matrices beyond mere row-orthogonality. \ No newline at end of file diff --git a/data/2024/aaai/QDETRv: Query-Guided DETR for One-Shot Object Localization in Videos b/data/2024/aaai/QDETRv: Query-Guided DETR for One-Shot Object Localization in Videos new file mode 100644 index 0000000000..5fb85b35db --- /dev/null +++ b/data/2024/aaai/QDETRv: Query-Guided DETR for One-Shot Object Localization in Videos @@ -0,0 +1 @@ +In this work, we study the one-shot video object localization problem, which aims to localize instances of unseen objects in the target video using a single query image of the object. Toward addressing this challenging problem, we extend a popular and successful object detection method, namely DETR (Detection Transformer), and introduce a novel approach, the query-guided detection transformer for videos (QDETRv). A distinctive feature of QDETRv is its capacity to exploit information from the query image and spatio-temporal context of the target video, which significantly aids in precisely pinpointing the desired object in the video. We incorporate cross-attention mechanisms that capture temporal relationships across adjacent frames to handle the dynamic context in videos effectively. Further, to ensure strong initialization for QDETRv, we also introduce a novel unsupervised pretraining technique tailored to videos. This involves training our model on synthetic object trajectories with an objective analogous to the query-guided localization task. During this pretraining phase, we incorporate recurrent object queries and loss functions that encourage accurate patch feature reconstruction. These additions enable better temporal understanding and robust representation learning. Our experiments show that the proposed model significantly outperforms the competitive baselines on two public benchmarks, VidOR and ImageNet-VidVRD, extended for one-shot open-set localization tasks.
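[Editorial note] As a generic illustration of query-guided cross-attention over video context (not the QDETRv architecture itself; the class and argument names are our own), object queries can jointly attend to the query-image embedding and to tokens from the current and adjacent frames:

import torch
import torch.nn as nn

class QueryGuidedCrossAttention(nn.Module):
    # Object queries attend to a context built from the query-image embedding
    # and flattened features of adjacent frames.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, object_queries, query_img_feat, frame_feats):
        # object_queries: (b, q, d); query_img_feat: (b, 1, d); frame_feats: (b, t*hw, d)
        context = torch.cat([query_img_feat, frame_feats], dim=1)
        out, _ = self.attn(object_queries, context, context)
        return out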
\ No newline at end of file diff --git a/data/2024/aaai/QI-IRA: Quantum-Inspired Interactive Ranking Aggregation for Person Re-identification b/data/2024/aaai/QI-IRA: Quantum-Inspired Interactive Ranking Aggregation for Person Re-identification new file mode 100644 index 0000000000..a326f7ce26 --- /dev/null +++ b/data/2024/aaai/QI-IRA: Quantum-Inspired Interactive Ranking Aggregation for Person Re-identification @@ -0,0 +1 @@ +Ranking aggregation (RA), the process of aggregating multiple rankings derived from multiple search strategies, has been proven effective in person re-identification (re-ID) because a single re-ID method cannot always achieve consistent superiority across different scenarios. Existing RA research mainly focuses on unsupervised and fully-supervised methods. The former lack external supervision to optimize performance, while the latter are costly because of the expensive labeling effort required for training. To address the above challenges, this paper proposes a quantum-inspired interactive ranking aggregation (QI-IRA) method, which (1) utilizes quantum theory to interpret and model the generation and aggregation of multiple basic rankings, (2) approximates or even exceeds the performance of fully-supervised RA methods with much less labeling cost, even as low as only two feedbacks per query on Market1501, MARS and DukeMTMC-VideoReID datasets. Comparative experiments conducted on six public re-ID datasets validate the superiority of the proposed QI-IRA method over existing unsupervised, interactive, and fully-supervised RA approaches. \ No newline at end of file diff --git a/data/2024/aaai/QLABGrad: A Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning b/data/2024/aaai/QLABGrad: A Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning new file mode 100644 index 0000000000..197e841dff --- /dev/null +++ b/data/2024/aaai/QLABGrad: A Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning @@ -0,0 +1,18 @@ +The learning rate is a critical hyperparameter for deep learning +tasks since it determines the extent to which the model +parameters are adjusted during the learning course. However, +the choice of learning rates typically depends on empirical +judgment, which may not result in satisfactory outcomes +without intensive trial-and-error experiments. In this +study, we propose a novel learning rate adaptation scheme +called QLABGrad. Without any user-specified hyperparameter, +QLABGrad automatically determines the learning rate by +optimizing the quadratic loss approximation-based (QLAB) +function for a given gradient descent direction, where only +one extra forward propagation is required. We theoretically +prove the convergence of QLABGrad under the smooth Lipschitz +condition on the loss function. Experimental results on +multiple architectures, including MLP, CNN, and ResNet, on +MNIST, CIFAR10, and ImageNet datasets, demonstrate that +QLABGrad outperforms widely adopted schemes for deep +learning.
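[Editorial note] One plausible reading of a quadratic-loss-approximation line search that needs only one extra forward propagation is sketched below; the formula, the probe step, and the fallback are our own assumptions for illustration, not the authors' exact QLAB derivation.

import torch

def qlab_learning_rate(loss_fn, params, grads, probe_lr=1e-3):
    # Along the negative gradient direction, model phi(eta) = L(theta - eta * g)
    # by a quadratic phi(eta) ~= L0 - g2*eta + a*eta^2 using phi(0) = L0,
    # phi'(0) = -g2 (squared gradient norm), and one extra forward pass at eta = probe_lr,
    # then return the minimizer of the fitted quadratic.
    with torch.no_grad():
        L0 = loss_fn()                              # loss at the current parameters
        g2 = sum((g * g).sum() for g in grads)      # squared gradient norm
        for p, g in zip(params, grads):             # trial step along -g
            p -= probe_lr * g
        L1 = loss_fn()                              # the single extra forward pass
        for p, g in zip(params, grads):             # undo the trial step
            p += probe_lr * g
        a = (L1 - L0 + g2 * probe_lr) / probe_lr ** 2
        if a <= 0:                                  # no usable curvature; fall back
            return probe_lr
        return (g2 / (2 * a)).item()

Here loss_fn is a closure that recomputes the loss on the current minibatch, and params/grads are the model parameters and their gradients from the preceding backward pass.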
\ No newline at end of file diff --git a/data/2024/aaai/QPEN: Quantum Projection and Quantum Entanglement Enhanced Network for Cross-Lingual Aspect-Based Sentiment Analysis b/data/2024/aaai/QPEN: Quantum Projection and Quantum Entanglement Enhanced Network for Cross-Lingual Aspect-Based Sentiment Analysis new file mode 100644 index 0000000000..016d8264eb --- /dev/null +++ b/data/2024/aaai/QPEN: Quantum Projection and Quantum Entanglement Enhanced Network for Cross-Lingual Aspect-Based Sentiment Analysis @@ -0,0 +1 @@ +Aspect-based sentiment analysis (ABSA) has attracted much attention due to its wide application scenarios. Most previous studies have focused solely on monolingual ABSA, posing a formidable challenge when extending ABSA applications to multilingual scenarios. In this paper, we study upgrading monolingual ABSA to cross-lingual ABSA. Existing methods usually exploit pre-trained cross-lingual language models to model cross-lingual ABSA, and enhance the model with translation data. However, the low-resource languages might be under-represented during the pre-training phase, and the translation-enhanced methods heavily rely on the quality of the translation and label projection. Inspired by the observation that quantum entanglement can correlate multiple single systems, we map the monolingual expression to the quantum Hilbert space as a single quantum system, and then utilize quantum entanglement and quantum measurement to achieve cross-lingual ABSA. Specifically, we propose a novel quantum neural model named QPEN (short for quantum projection and quantum entanglement enhanced network). It is equipped with a proposed quantum projection module that projects aspects as a quantum superposition on a complex-valued Hilbert space. Furthermore, a quantum entanglement module is proposed in QPEN to share language-specific features between different languages without transmission. We conducted simulation experiments on a classical computer, and experimental results on the SemEval-2016 dataset demonstrate that our method achieves state-of-the-art performance in terms of F1-scores for five languages. \ No newline at end of file diff --git a/data/2024/aaai/Quad Bayer Joint Demosaicing and Denoising Based on Dual Encoder Network with Joint Residual Learning b/data/2024/aaai/Quad Bayer Joint Demosaicing and Denoising Based on Dual Encoder Network with Joint Residual Learning new file mode 100644 index 0000000000..0488dcb972 --- /dev/null +++ b/data/2024/aaai/Quad Bayer Joint Demosaicing and Denoising Based on Dual Encoder Network with Joint Residual Learning @@ -0,0 +1 @@ +The recent imaging technology Quad Bayer CFA brings better imaging PSNR and higher visual quality compared to traditional Bayer CFA, but also poses serious challenges for demosaicing and denoising during the ISP pipeline. In this paper, we propose a novel dual encoder network, namely DRNet, to achieve joint demosaicing and denoising for Quad Bayer CFA. The dual encoders are carefully designed in that one is mainly constructed with a joint residual block to jointly estimate the residuals for demosaicing and denoising separately. In contrast, the other one starts with a pixel modulation block that is specially designed to match the characteristics of the Quad Bayer pattern for better feature extraction. We demonstrate the effectiveness of each proposed component through detailed ablation investigations.
The comparison results on public benchmarks illustrate that our DRNet achieves a clear performance gain over the state-of-the-art method (0.38 dB over the 2nd best) and balances performance and efficiency well. The experiments on real-world images show that the proposed method could enhance the reconstruction quality relative to the native ISP algorithm. \ No newline at end of file diff --git a/data/2024/aaai/Quality-Diversity Generative Sampling for Learning with Synthetic Data b/data/2024/aaai/Quality-Diversity Generative Sampling for Learning with Synthetic Data new file mode 100644 index 0000000000..6a9ea87ae1 --- /dev/null +++ b/data/2024/aaai/Quality-Diversity Generative Sampling for Learning with Synthetic Data @@ -0,0 +1 @@ +Generative models can serve as surrogates for some real data sources by creating synthetic training datasets, but in doing so they may transfer biases to downstream tasks. We focus on protecting quality and diversity when generating synthetic training datasets. We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space, despite the data coming from a biased generator. QDGS is a model-agnostic framework that uses prompt guidance to optimize a quality objective across measures of diversity for synthetically generated data, without fine-tuning the generative model. Using balanced synthetic datasets generated by QDGS, we first debias classifiers trained on color-biased shape datasets as a proof-of-concept. By applying QDGS to facial data synthesis, we prompt for desired semantic concepts, such as skin tone and age, to create an intersectional dataset with a combined blend of visual features. Leveraging this balanced data for training classifiers improves fairness while maintaining accuracy on facial recognition benchmarks. Code available at: https://github.com/Cylumn/qd-generative-sampling. \ No newline at end of file diff --git a/data/2024/aaai/Quantifying Political Polarization through the Lens of Machine Translation and Vicarious Offense b/data/2024/aaai/Quantifying Political Polarization through the Lens of Machine Translation and Vicarious Offense new file mode 100644 index 0000000000..dbcb06eb88 --- /dev/null +++ b/data/2024/aaai/Quantifying Political Polarization through the Lens of Machine Translation and Vicarious Offense @@ -0,0 +1,5 @@ +This talk surveys three related research contributions that shed light on the current US political divide: + +1. a novel machine-translation-based framework to quantify political polarization; +2. an analysis of disparate media portrayal of US policing in major cable news outlets; and +3. a novel perspective of vicarious offense that examines a timely and important question -- how well do Democratic-leaning users perceive what content would be deemed as offensive by their Republican-leaning counterparts or vice-versa? \ No newline at end of file diff --git a/data/2024/aaai/Quantifying and Analyzing Entity-Level Memorization in Large Language Models b/data/2024/aaai/Quantifying and Analyzing Entity-Level Memorization in Large Language Models new file mode 100644 index 0000000000..4adaa06230 --- /dev/null +++ b/data/2024/aaai/Quantifying and Analyzing Entity-Level Memorization in Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) have been proven capable of memorizing their training data, which can be extracted through specifically designed prompts.
As the scale of datasets continues to grow, privacy risks arising from memorization have attracted increasing attention. Quantifying language model memorization helps evaluate potential privacy risks. However, prior works on quantifying memorization require access to the precise original data or incur substantial computational overhead, making it difficult to apply them to real-world language models. To this end, we propose a fine-grained, entity-level definition to quantify memorization with conditions and metrics closer to real-world scenarios. In addition, we also present an approach for efficiently extracting sensitive entities from autoregressive language models. We conduct extensive experiments based on the proposed definition and approach, probing language models' ability to reconstruct sensitive entities under different settings. We find that language models have strong memorization at the entity level and are able to reproduce the training data even with partial leakages. The results demonstrate that LLMs not only memorize their training data but also understand associations between entities. These findings necessitate that trainers of LLMs exercise greater prudence regarding model memorization, adopting memorization mitigation techniques to preclude privacy violations. \ No newline at end of file diff --git a/data/2024/aaai/Quantile-Based Maximum Likelihood Training for Outlier Detection b/data/2024/aaai/Quantile-Based Maximum Likelihood Training for Outlier Detection new file mode 100644 index 0000000000..c8f9994bf4 --- /dev/null +++ b/data/2024/aaai/Quantile-Based Maximum Likelihood Training for Outlier Detection @@ -0,0 +1 @@ +Discriminative learning effectively predicts the true object class for image classification. However, it often results in false positives for outliers, posing critical concerns in applications like autonomous driving and video surveillance systems. Previous attempts to address this challenge involved training image classifiers through contrastive learning using actual outlier data or synthesizing outliers for self-supervised learning. Furthermore, unsupervised generative modeling of inliers in pixel space has shown limited success for outlier detection. In this work, we introduce a quantile-based maximum likelihood objective for learning the inlier distribution to improve the outlier separation during inference. Our approach fits a normalizing flow to pre-trained discriminative features and detects the outliers according to the evaluated log-likelihood. The experimental evaluation demonstrates the effectiveness of our method as it surpasses the performance of the state-of-the-art unsupervised methods for outlier detection. The results are also competitive compared with a recent self-supervised approach for outlier detection. Our work reduces the dependency on well-sampled negative training data, which is especially important for domains like medical diagnostics or remote sensing. \ No newline at end of file diff --git a/data/2024/aaai/Quantile-Regression-Ensemble: A Deep Learning Algorithm for Downscaling Extreme Precipitation b/data/2024/aaai/Quantile-Regression-Ensemble: A Deep Learning Algorithm for Downscaling Extreme Precipitation new file mode 100644 index 0000000000..43a17f3add --- /dev/null +++ b/data/2024/aaai/Quantile-Regression-Ensemble: A Deep Learning Algorithm for Downscaling Extreme Precipitation @@ -0,0 +1 @@ +Global Climate Models (GCMs) simulate low-resolution climate projections on a global scale.
The native resolution of GCMs is generally too low for societal-level decision-making. To enhance the spatial resolution, downscaling is often applied to GCM output. Statistical downscaling techniques, in particular, are well-established as a cost-effective approach. They require significantly less computational time than physics-based dynamical downscaling. In recent years, deep learning has gained prominence in statistical downscaling, demonstrating significantly lower error rates compared to traditional statistical methods. However, a drawback of regression-based deep learning techniques is their tendency to overfit to the mean sample intensity. Extreme values as a result are often underestimated. Problematically, extreme events have the largest societal impact. We propose Quantile-Regression-Ensemble (QRE), an innovative deep learning algorithm inspired by boosting methods. Its primary objective is to avoid trade-offs between fitting to sample means and extreme values by training independent models on a partitioned dataset. Our QRE is robust to redundant models and not susceptible to explosive ensemble weights, ensuring a reliable training process. QRE achieves lower Mean Squared Error (MSE) compared to various baseline models. In particular, our algorithm has a lower error for high-intensity precipitation events over New Zealand, highlighting the ability to represent extreme events accurately. \ No newline at end of file diff --git a/data/2024/aaai/Quantum Interference Model for Semantic Biases of Glosses in Word Sense Disambiguation b/data/2024/aaai/Quantum Interference Model for Semantic Biases of Glosses in Word Sense Disambiguation new file mode 100644 index 0000000000..ccc47249c3 --- /dev/null +++ b/data/2024/aaai/Quantum Interference Model for Semantic Biases of Glosses in Word Sense Disambiguation @@ -0,0 +1 @@ +Word Sense Disambiguation (WSD) aims to determine the meaning of the target word according to the given context. Currently, a single representation enhanced by glosses from different dictionaries or languages is used to characterize each word sense. By analyzing the similarity between glosses of the same word sense, we find semantic biases among them, revealing that the glosses have their own descriptive perspectives. Therefore, the traditional approach of integrating all glosses by a single representation results in failing to present the unique semantics revealed by the individual glosses. In this paper, a quantum superposition state is employed to formalize the representations of multiple glosses of the same word sense to reveal their distributions. Furthermore, the quantum interference model is leveraged to calculate the probability that the target word belongs to this superposition state. The advantage is that the interference term can be regarded as a confidence level to guide word sense recognition. Finally, experiments are performed under standard WSD evaluation framework and the latest cross-lingual datasets, and the results verify the effectiveness of our model. 
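[Editorial note] For reference, the interference term mentioned in the word sense disambiguation abstract above is the standard cross term of quantum probability; in our notation (not the paper's), if two gloss states $g_1, g_2$ of a sense are superposed with amplitudes $a_1, a_2$ and $w$ denotes the target-word context state, then

P(w) = \left| a_1\langle w|g_1\rangle + a_2\langle w|g_2\rangle \right|^2 = |a_1\langle w|g_1\rangle|^2 + |a_2\langle w|g_2\rangle|^2 + 2\,\mathrm{Re}\!\left( a_1 a_2^{*} \langle w|g_1\rangle \langle w|g_2\rangle^{*} \right),

where the last term is the interference contribution that the abstract interprets as a confidence signal for word sense recognition.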
\ No newline at end of file diff --git a/data/2024/aaai/Quantum-Inspired Neural Network with Runge-Kutta Method b/data/2024/aaai/Quantum-Inspired Neural Network with Runge-Kutta Method new file mode 100644 index 0000000000..146a5b943b --- /dev/null +++ b/data/2024/aaai/Quantum-Inspired Neural Network with Runge-Kutta Method @@ -0,0 +1 @@ +In recent years, researchers have developed novel Quantum-Inspired Neural Network (QINN) frameworks for Natural Language Processing (NLP) tasks, inspired by the theoretical investigations of quantum cognition. However, we have found that the training efficiency of QINNs is significantly lower than that of classical networks. We analyze the unitary transformation modules of existing QINNs based on the time displacement symmetry of quantum mechanics and discover that they take a mathematical form similar to the first-order Euler method. The high truncation error associated with the Euler method affects the training efficiency of QINNs. In order to enhance the training efficiency of QINNs, we generalize QINNs' unitary transformation modules to Quantum-like high-order Runge-Kutta methods (QRKs). Moreover, we present the results of experiments on conversation emotion recognition and text classification tasks to validate the effectiveness of the proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/QuerySum: A Multi-Document Query-Focused Summarization Dataset Augmented with Similar Query Clusters b/data/2024/aaai/QuerySum: A Multi-Document Query-Focused Summarization Dataset Augmented with Similar Query Clusters new file mode 100644 index 0000000000..f71436a148 --- /dev/null +++ b/data/2024/aaai/QuerySum: A Multi-Document Query-Focused Summarization Dataset Augmented with Similar Query Clusters @@ -0,0 +1 @@ +Query-focused summarization (QFS) aims to summarize the source document(s) with regard to a specific aspect of information given in a query. It plays an important role in presenting users with a concise answer summary from a set of query-relevant documents retrieved by the information retrieval system. Nonetheless, QFS research has long been hampered by the lack of adequate datasets in terms of both quality and quantity. In this paper, we introduce a large-scale multi-document query-focused summarization dataset, called QuerySum, which contains 27,041 data samples covering diverse topics, with quality guaranteed through human verification. Unlike some previous QFS datasets constructed directly from question answering datasets, 74% of the queries in our dataset are challenging non-factoid What-, Why-, and How- questions. More importantly, we also provide a set of similar queries together with their corresponding summaries for each query as the retrieved context, presenting a new feature of QuerySum. We aim to encourage research efforts in query intention understanding in the context of QFS. Leveraging QuerySum's depth, we propose a model for query-aware multi-document summarization and set a new QFS benchmark.
\ No newline at end of file diff --git a/data/2024/aaai/Question Calibration and Multi-Hop Modeling for Temporal Question Answering b/data/2024/aaai/Question Calibration and Multi-Hop Modeling for Temporal Question Answering new file mode 100644 index 0000000000..4f6da87f30 --- /dev/null +++ b/data/2024/aaai/Question Calibration and Multi-Hop Modeling for Temporal Question Answering @@ -0,0 +1 @@ +Many models that leverage knowledge graphs (KGs) have recently demonstrated remarkable success in question answering (QA) tasks. In the real world, many facts contained in KGs are time-constrained thus temporal KGQA has received increasing attention. Despite the fruitful efforts of previous models in temporal KGQA, they still have several limitations. (I) They adopt pre-trained language models (PLMs) to obtain question representations, while PLMs tend to focus on entity information and ignore entity transfer caused by temporal constraints, and finally fail to learn specific temporal representations of entities. (II) They neither emphasize the graph structure between entities nor explicitly model the multi-hop relationship in the graph, which will make it difficult to solve complex multi-hop question answering. To alleviate this problem, we propose a novel Question Calibration and Multi-Hop Modeling (QC-MHM) network. Specifically, We first calibrate the question representation by fusing the question and the time-constrained concepts in KG. Then, we construct the GNN layer to complete multi-hop message passing. Finally, the question representation is combined with the embedding output by the GNN to generate the final prediction. Empirical results verify that the proposed model achieves better performance than the state-of-the-art models in the benchmark dataset. Notably, the Hits@1 and Hits@10 results of QC-MHM on the CronQuestions dataset's complex questions are absolutely improved by 5.1% and 1.2% compared to the best-performing baseline. Moreover, QC-MHM can generate interpretable and trustworthy predictions. \ No newline at end of file diff --git a/data/2024/aaai/QuickRender: A Photorealistic Procedurally Generated Dataset with Applications to Super Resolution (Student Abstract) b/data/2024/aaai/QuickRender: A Photorealistic Procedurally Generated Dataset with Applications to Super Resolution (Student Abstract) new file mode 100644 index 0000000000..d422b5bc95 --- /dev/null +++ b/data/2024/aaai/QuickRender: A Photorealistic Procedurally Generated Dataset with Applications to Super Resolution (Student Abstract) @@ -0,0 +1,5 @@ +Rendering of complex scenes from software such as Blender is time consuming, but corresponding auxiliary data such as depth or object segmentation maps are relatively fast to generate. The auxiliary data also provides a wealth of information for tasks such as optical flow prediction. + +In this paper we present the QuickRender dataset, a collection of procedurally generated scenes rendered into over 5,000 sequential image triplets along with accompanying auxiliary data. The goal of this dataset is to provide a diversity of scenes and motion while maintaining realistic behaviours. A sample application using this dataset to perform single image super resolution is also presented. + +The dataset and related source code can be found at https://github.com/MP-mtroyal/MetaSRGAN. 
\ No newline at end of file diff --git a/data/2024/aaai/Quilt: Robust Data Segment Selection against Concept Drifts b/data/2024/aaai/Quilt: Robust Data Segment Selection against Concept Drifts new file mode 100644 index 0000000000..1ceaac5925 --- /dev/null +++ b/data/2024/aaai/Quilt: Robust Data Segment Selection against Concept Drifts @@ -0,0 +1 @@ +Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams. Unfortunately, concept drifts may occur in data streams where the joint distribution of the data X and label y, P(X, y), changes over time, possibly degrading model accuracy. Existing concept drift adaptation approaches mostly focus on updating the model to the new data, possibly using ensembles of previous models, and tend to discard the drifted historical data. However, we contend that explicitly utilizing the drifted data together leads to much better model accuracy and propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy. To address the potential downside in efficiency, Quilt extends existing data subset selection techniques, which can be used to reduce the training data without compromising model accuracy. These techniques cannot be used as-is because they only consider virtual drifts, where the posterior probabilities P(y|X) are assumed not to change. In contrast, a key challenge in our setup is to also discard undesirable data segments with concept drifts. Quilt thus discards drifted data segments and selects data segment subsets holistically for accurate and efficient model training. The two operations use gradient-based scores, which have little computation overhead. In our experiments, we show that Quilt outperforms state-of-the-art drift adaptation and data selection baselines on synthetic and real datasets. \ No newline at end of file diff --git a/data/2024/aaai/R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion b/data/2024/aaai/R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion new file mode 100644 index 0000000000..6b5376477c --- /dev/null +++ b/data/2024/aaai/R3CD: Scene Graph to Image Generation with Relation-Aware Compositional Contrastive Control Diffusion @@ -0,0 +1 @@ +Image generation tasks have achieved remarkable performance using large-scale diffusion models. However, these models are limited in capturing the abstract relations (viz., interactions excluding positional relations) among multiple entities of complex scene graphs. Two main problems exist: 1) they fail to depict more concise and accurate interactions via abstract relations; and 2) they fail to generate complete entities. To address that, we propose a novel Relation-aware Compositional Contrastive Control Diffusion method, dubbed R3CD, that leverages large-scale diffusion models to learn abstract interactions from scene graphs. Herein, a scene graph transformer based on node and edge encoding is first designed to perceive both local and global information from input scene graphs, whose embeddings are initialized by a T5 model. Then a joint contrastive loss based on attention maps and denoising steps is developed to control the diffusion model to understand and further generate images, whose spatial structures and interaction features are consistent with the a priori relations.
Extensive experiments conducted on two datasets, Visual Genome and COCO-Stuff, demonstrate that the proposed method outperforms existing models on both quantitative and qualitative metrics, generating more realistic and diverse images according to different scene graph specifications. \ No newline at end of file diff --git a/data/2024/aaai/READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling b/data/2024/aaai/READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling new file mode 100644 index 0000000000..384a11b9f1 --- /dev/null +++ b/data/2024/aaai/READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling @@ -0,0 +1 @@ +Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such a full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter’s low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability. Second, we propose a Partial Video-Language Alignment (PVLA) objective via the use of partial optimal transport to maintain task-related information flowing into our READ modules. We validate our READ-PVLA framework through extensive experiments where READ-PVLA significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/REGLO: Provable Neural Network Repair for Global Robustness Properties b/data/2024/aaai/REGLO: Provable Neural Network Repair for Global Robustness Properties new file mode 100644 index 0000000000..96e390a82b --- /dev/null +++ b/data/2024/aaai/REGLO: Provable Neural Network Repair for Global Robustness Properties @@ -0,0 +1 @@ +We present REGLO, a novel methodology for repairing pretrained neural networks to satisfy global robustness and individual fairness properties. A neural network is said to be globally robust with respect to a given input region if and only if all the input points in the region are locally robust. This notion of global robustness also captures the notion of individual fairness as a special case. We prove that any counterexample to a global robustness property must exhibit a corresponding large gradient. For ReLU networks, this result allows us to efficiently identify the linear regions that violate a given global robustness property. By formulating and solving a suitable robust convex optimization problem, REGLO then computes a minimal weight change that will provably repair these violating linear regions.
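[Editorial note] The observation behind the linear-region analysis can be illustrated as follows: within a fixed ReLU activation pattern the network is affine, so its Jacobian is a constant matrix product that can be inspected directly. The sketch below is our own illustration of this observation for a two-layer network, not REGLO's actual certificate or repair procedure, and the robustness notion (bounded output change under bounded input change within one region) is a stated assumption.

import numpy as np

def region_jacobian(W1, W2, activation_pattern):
    # Inside one linear region of f(x) = W2 @ relu(W1 @ x), the network is affine
    # and its Jacobian is constant: J = W2 @ diag(pattern) @ W1,
    # where pattern[i] = 1 iff hidden unit i is active in that region.
    return W2 @ (activation_pattern[:, None] * W1)

def region_may_violate(W1, W2, pattern, eps, gamma):
    # Within a single region, if the spectral norm of J is at most eps / gamma,
    # then moving the input by at most gamma changes the output by at most eps;
    # a counterexample contained in the region therefore requires a larger gain.
    J = region_jacobian(W1, W2, pattern)
    return np.linalg.norm(J, 2) > eps / gamma

Regions flagged this way are candidates for repair; REGLO's contribution is then to compute a provably minimal weight change over those regions via robust convex optimization, which this sketch does not attempt.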
\ No newline at end of file diff --git a/data/2024/aaai/REPrune: Channel Pruning via Kernel Representative Selection b/data/2024/aaai/REPrune: Channel Pruning via Kernel Representative Selection new file mode 100644 index 0000000000..16826bb9fd --- /dev/null +++ b/data/2024/aaai/REPrune: Channel Pruning via Kernel Representative Selection @@ -0,0 +1 @@ +Channel pruning is widely adopted to accelerate modern convolutional neural networks (CNNs). The resulting pruned model benefits from immediate deployment on general-purpose software and hardware resources. However, its large pruning granularity, specifically at the unit of a convolution filter, often leads to undesirable accuracy drops due to the inflexibility of deciding how and where to introduce sparsity into the CNNs. In this paper, we propose REPrune, a novel channel pruning technique that emulates kernel pruning, fully exploiting the finer but structured granularity. REPrune identifies similar kernels within each channel using agglomerative clustering. Then, it selects filters that maximize the incorporation of kernel representatives while optimizing the maximum cluster coverage problem. By integrating with a simultaneous training-pruning paradigm, REPrune promotes efficient, progressive pruning throughout training CNNs, avoiding the conventional train-prune-finetune sequence. Experimental results highlight that REPrune performs better in computer vision tasks than existing methods, effectively achieving a balance between acceleration ratio and performance retention. \ No newline at end of file diff --git a/data/2024/aaai/RG-GAN: Dynamic Regenerative Pruning for Data-Efficient Generative Adversarial Networks b/data/2024/aaai/RG-GAN: Dynamic Regenerative Pruning for Data-Efficient Generative Adversarial Networks new file mode 100644 index 0000000000..444dcbdb78 --- /dev/null +++ b/data/2024/aaai/RG-GAN: Dynamic Regenerative Pruning for Data-Efficient Generative Adversarial Networks @@ -0,0 +1 @@ +Training Generative Adversarial Networks (GANs) to generate high-quality images typically requires large datasets. Network pruning during training has recently emerged as a significant advancement for data-efficient GANs. However, simple and straightforward pruning carries the risk of losing key information, resulting in suboptimal results due to the GAN’s competitive dynamics between generator (G) and discriminator (D). Addressing this, we present RG-GAN, a novel approach that marks the first incorporation of dynamic weight regeneration and pruning in GAN training to improve the quality of the generated samples, even with limited data. Specifically, RG-GAN initiates layer-wise dynamic pruning by removing weights that are less important to the quality of the generated images. While pruning enhances efficiency, excessive sparsity within layers can pose a risk of model collapse. To mitigate this issue, RG-GAN applies a dynamic regeneration method to reintroduce specific weights when they become important, ensuring a balance between sparsity and image quality. Though effective, the sparse network achieved through this process might eliminate some weights important to the combined G and D performance, a crucial aspect for achieving stable and effective GAN training. RG-GAN addresses this loss of weights by integrating learned sparse network weights back into the dense network at the previous stage during a follow-up regeneration step.
Our results consistently demonstrate RG-GAN’s robust performance across a variety of scenarios, including different GAN architectures, datasets, and degrees of data scarcity, reinforcing its value as a generic training methodology. Results also show that data augmentation exhibits improved performance in conjunction with RG-GAN. Furthermore, RG-GAN can achieve fewer parameters without compromising, and even enhancing, the quality of the generated samples. Code can be found at this link: https://github.com/IntellicentAI-Lab/RG-GAN \ No newline at end of file diff --git a/data/2024/aaai/RGMComm: Return Gap Minimization via Discrete Communications in Multi-Agent Reinforcement Learning b/data/2024/aaai/RGMComm: Return Gap Minimization via Discrete Communications in Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..c84f89d60b --- /dev/null +++ b/data/2024/aaai/RGMComm: Return Gap Minimization via Discrete Communications in Multi-Agent Reinforcement Learning @@ -0,0 +1,2 @@ +Communication is crucial for solving cooperative Multi-Agent Reinforcement Learning tasks in partially observable Markov Decision Processes. Existing works often rely on black-box methods to encode local information/features into messages shared with other agents, leading to the generation of continuous messages with high communication overhead and poor interpretability. Prior attempts at discrete communication methods generate one-hot vectors trained as part of agents' actions and use the Gumbel softmax operation for calculating message gradients, which are all heuristic designs that do not provide any quantitative guarantees on the expected return. +This paper establishes an upper bound on the return gap between an ideal policy with full observability and an optimal partially observable policy with discrete communication. This result enables us to recast multi-agent communication into a novel online clustering problem over the local observations at each agent, with messages as cluster labels and the upper bound on the return gap as clustering loss. To minimize the return gap, we propose the Return-Gap-Minimization Communication (RGMComm) algorithm, which is a surprisingly simple design of discrete message generation functions and is integrated with reinforcement learning through the utilization of a novel Regularized Information Maximization loss function, which incorporates cosine-distance as the clustering metric. Evaluations show that RGMComm significantly outperforms state-of-the-art multi-agent communication baselines and can achieve nearly optimal returns with few-bit messages that are naturally interpretable. \ No newline at end of file diff --git a/data/2024/aaai/RL-SeqISP: Reinforcement Learning-Based Sequential Optimization for Image Signal Processing b/data/2024/aaai/RL-SeqISP: Reinforcement Learning-Based Sequential Optimization for Image Signal Processing new file mode 100644 index 0000000000..022b79a027 --- /dev/null +++ b/data/2024/aaai/RL-SeqISP: Reinforcement Learning-Based Sequential Optimization for Image Signal Processing @@ -0,0 +1 @@ +Hardware image signal processing (ISP), aiming at converting RAW inputs to RGB images, consists of a series of processing blocks, each with multiple parameters. Traditionally, ISP parameters are manually tuned in isolation by imaging experts according to application-specific quality and performance metrics, which is time-consuming and biased towards human perception due to complex interaction with the output image. 
Since the relationship between any single parameter’s variation and the output performance metric is a complex, non-linear function, optimizing such a large number of ISP parameters is challenging. To address this challenge, we propose a novel Sequential ISP parameter optimization model, called the RL-SeqISP model, which utilizes deep reinforcement learning to jointly optimize all ISP parameters for a variety of imaging applications. Concretely, inspired by the sequential tuning process of human experts, the proposed model can progressively enhance image quality by seamlessly integrating information from both the image feature space and the parameter space. Furthermore, a dynamic parameter optimization module is introduced to avoid ISP parameters getting stuck into local optima, which is able to more effectively guarantee the optimal parameters resulting from the sequential learning strategy. These merits of the RL-SeqISP model as well as its high efficiency are substantiated by comprehensive experiments on a wide range of downstream tasks, including two visual analysis tasks (instance segmentation and object detection), and image quality assessment (IQA), as compared with representative methods both quantitatively and qualitatively. In particular, even using only 10% of the training data, our model outperforms other SOTA methods by an average of 7% mAP on two visual analysis tasks. \ No newline at end of file diff --git a/data/2024/aaai/RLPeri: Accelerating Visual Perimetry Test with Reinforcement Learning and Convolutional Feature Extraction b/data/2024/aaai/RLPeri: Accelerating Visual Perimetry Test with Reinforcement Learning and Convolutional Feature Extraction new file mode 100644 index 0000000000..58a9073be0 --- /dev/null +++ b/data/2024/aaai/RLPeri: Accelerating Visual Perimetry Test with Reinforcement Learning and Convolutional Feature Extraction @@ -0,0 +1,3 @@ +Visual perimetry is an important eye examination that helps detect vision problems caused by ocular or neurological conditions. During the test, a patient's gaze is fixed at a specific location while light stimuli of varying intensities are presented in central and peripheral vision. Based on the patient's responses to the stimuli, the visual field mapping and sensitivity are determined. However, maintaining high levels of concentration throughout the test can be challenging for patients, leading to increased examination times and decreased accuracy. + +In this work, we present RLPeri, a reinforcement learning-based approach to optimize visual perimetry testing. By determining the optimal sequence of locations and initial stimulus values, we aim to reduce the examination time without compromising accuracy. Additionally, we incorporate reward shaping techniques to further improve the testing performance. To monitor the patient's responses over time during testing, we represent the test's state as a pair of 3D matrices. We apply two different convolutional kernels to extract spatial features across locations as well as features across different stimulus values for each location. Through experiments, we demonstrate that our approach results in a 10-20% reduction in examination time while maintaining the accuracy as compared to state-of-the-art methods. With the presented approach, we aim to make visual perimetry testing more efficient and patient-friendly, while still providing accurate results. 
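As a rough illustration of the state encoding described in the RLPeri abstract above, the sketch below assumes the test state is a tensor of shape (batch, 2, S, H, W): a pair of 3D matrices over S stimulus values and an H x W grid of visual-field locations. The channel counts and kernel sizes are assumptions chosen for illustration, not the paper's configuration.

import torch
import torch.nn as nn

class PerimetryStateEncoder(nn.Module):
    # Two convolutional branches over a pair of 3D state matrices:
    # one kernel looks across neighbouring locations, the other across stimulus values.
    def __init__(self, hidden=16):
        super().__init__()
        self.spatial = nn.Conv3d(2, hidden, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.stimulus = nn.Conv3d(2, hidden, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, state):
        feats = torch.cat([self.spatial(state), self.stimulus(state)], dim=1)
        return torch.relu(feats).flatten(1)   # flat feature vector for the RL policy head

state = torch.zeros(1, 2, 4, 8, 9)            # dummy state: 4 stimulus values, 8 x 9 locations
print(PerimetryStateEncoder()(state).shape)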
\ No newline at end of file diff --git a/data/2024/aaai/RLfOLD: Reinforcement Learning from Online Demonstrations in Urban Autonomous Driving b/data/2024/aaai/RLfOLD: Reinforcement Learning from Online Demonstrations in Urban Autonomous Driving new file mode 100644 index 0000000000..f543f96420 --- /dev/null +++ b/data/2024/aaai/RLfOLD: Reinforcement Learning from Online Demonstrations in Urban Autonomous Driving @@ -0,0 +1 @@ +Reinforcement Learning from Demonstrations (RLfD) has emerged as an effective method that fuses expert demonstrations into Reinforcement Learning (RL) training, harnessing the strengths of both Imitation Learning (IL) and RL. However, existing algorithms rely on offline demonstrations, which can introduce a distribution gap between the demonstrations and the actual training environment, limiting their performance. In this paper, we propose a novel approach, Reinforcement Learning from Online Demonstrations (RLfOLD), that leverages online demonstrations to address this limitation, ensuring the agent learns from relevant and up-to-date scenarios, thus effectively bridging the distribution gap. Unlike conventional policy networks used in typical actor-critic algorithms, RLfOLD introduces a policy network that outputs two standard deviations: one for exploration and the other for IL training. This novel design allows the agent to adapt to varying levels of uncertainty inherent in both RL and IL. Furthermore, we introduce an exploration process guided by an online expert, incorporating an uncertainty-based technique. Our experiments on the CARLA NoCrash benchmark demonstrate the effectiveness and efficiency of RLfOLD. Notably, even with a significantly smaller encoder and a single camera setup, RLfOLD surpasses state-of-the-art methods in this evaluation. These results, achieved with limited resources, highlight RLfOLD as a highly promising solution for real-world applications. \ No newline at end of file diff --git a/data/2024/aaai/ROG_PL: Robust Open-Set Graph Learning via Region-Based Prototype Learning b/data/2024/aaai/ROG_PL: Robust Open-Set Graph Learning via Region-Based Prototype Learning new file mode 100644 index 0000000000..88a6390e14 --- /dev/null +++ b/data/2024/aaai/ROG_PL: Robust Open-Set Graph Learning via Region-Based Prototype Learning @@ -0,0 +1 @@ +Open-set graph learning is a practical task that aims to classify the known class nodes and to identify unknown class samples as unknowns. Conventional node classification methods usually perform unsatisfactorily in open-set scenarios due to the complex data they encounter, such as out-of-distribution (OOD) data and in-distribution (IND) noise. OOD data are samples that do not belong to any known classes. They are outliers if they occur in training (OOD noise), and open-set samples if they occur in testing. IND noise consists of training samples that are assigned incorrect labels. IND noise and OOD noise are prevalent and usually cause the ambiguity problem, including the intra-class variety problem and the inter-class confusion problem. Thus, exploring robust open-set learning methods is both necessary and difficult, and it becomes even more difficult for non-IID graph data. To this end, we propose a unified framework named ROG_PL to achieve robust open-set learning on complex noisy graph data by introducing prototype learning. Specifically, ROG_PL consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions.
The first module corrects noisy labels through similarity-based label propagation and removes low-confidence samples to solve the intra-class variety problem caused by noise. The second module learns open-set prototypes for each known class via non-overlapping regions and retains both interior and border prototypes to remedy the inter-class confusion problem. The two modules are iteratively updated under the constraints of classification loss and prototype diversity loss. To the best of our knowledge, the proposed ROG_PL is the first robust open-set node classification method for graph data with complex noise. Experimental evaluations of ROG_PL on several benchmark graph datasets demonstrate that it achieves good performance. \ No newline at end of file diff --git a/data/2024/aaai/RPSC: Robust Pseudo-Labeling for Semantic Clustering b/data/2024/aaai/RPSC: Robust Pseudo-Labeling for Semantic Clustering new file mode 100644 index 0000000000..7307d47456 --- /dev/null +++ b/data/2024/aaai/RPSC: Robust Pseudo-Labeling for Semantic Clustering @@ -0,0 +1 @@ +Clustering methods achieve performance improvement by jointly learning representation and cluster assignment. However, they do not consider the confidence of pseudo-labels, which are not optimal as supervised information, resulting in error accumulation. To address this issue, we propose a Robust Pseudo-labeling for Semantic Clustering (RPSC) approach, which includes two stages. In the first stage (RPSC-Self), we design a semantic pseudo-labeling scheme by using the consistency of samples, i.e., samples with the same semantics should be close to each other in the embedding space. To exploit robust semantic pseudo-labels for self-supervised learning, we propose a soft contrastive loss (SCL) which encourages the model to trust high-confidence semantic pseudo-labels and be less driven by low-confidence pseudo-labels. In the second stage (RPSC-Semi), we first determine the semantic pseudo-label of a sample based on the distance between itself and cluster centers, followed by selecting reliable semantic pseudo-labels by exploiting consistency. These reliable pseudo-labels are used as supervised information in the pseudo-semi-supervised learning algorithm to further improve the performance. Experimental results show that RPSC significantly outperforms 18 competitive clustering algorithms on six challenging image benchmarks. In particular, RPSC achieves an accuracy of 0.688 on ImageNet-Dogs, which is up to a 24% improvement compared with the second-best method. Meanwhile, we conduct ablation studies to investigate the effects of different augmentation strategies on RPSC as well as the contributions of terms in SCL to clustering performance. Besides, experimental results indicate that SCL can be easily integrated into existing clustering methods and bring performance improvements. \ No newline at end of file diff --git a/data/2024/aaai/RR-PU: A Synergistic Two-Stage Positive and Unlabeled Learning Framework for Robust Tax Evasion Detection b/data/2024/aaai/RR-PU: A Synergistic Two-Stage Positive and Unlabeled Learning Framework for Robust Tax Evasion Detection new file mode 100644 index 0000000000..c7d5b4dff3 --- /dev/null +++ b/data/2024/aaai/RR-PU: A Synergistic Two-Stage Positive and Unlabeled Learning Framework for Robust Tax Evasion Detection @@ -0,0 +1 @@ +Tax evasion, an unlawful practice in which taxpayers deliberately conceal information to avoid paying tax liabilities, poses significant challenges for tax authorities.
Effective tax evasion detection is critical for assisting tax authorities in mitigating tax revenue loss. Recently, machine-learning-based methods, particularly those employing positive and unlabeled (PU) learning, have been adopted for tax evasion detection, achieving notable success. However, these methods exhibit two major practical limitations. First, their success heavily relies on the strong assumption that the label frequency (the fraction of identified taxpayers among tax evaders) is known in advance. Second, although some methods attempt to estimate label frequency using approaches like Mixture Proportion Estimation (MPE) without making any assumptions, they subsequently construct a classifier based on the error-prone label frequency obtained from the previous estimation. This two-stage approach may not be optimal, as it neglects error accumulation in classifier training resulting from the estimation bias in the first stage. To address these limitations, we propose a novel PU learning-based tax evasion detection framework called RR-PU, which can revise the bias in a two-stage synergistic manner. Specifically, RR-PU refines the label frequency initialization by leveraging a regrouping technique to fortify the MPE perspective. Subsequently, we integrate a trainable slack variable to fine-tune the initial label frequency, concurrently optimizing this variable and the classifier to eliminate latent bias in the initial stage. Experimental results on three real-world tax datasets demonstrate that RR-PU outperforms state-of-the-art methods in tax evasion detection tasks. \ No newline at end of file diff --git a/data/2024/aaai/RRL: Recommendation Reverse Learning b/data/2024/aaai/RRL: Recommendation Reverse Learning new file mode 100644 index 0000000000..3e24879c39 --- /dev/null +++ b/data/2024/aaai/RRL: Recommendation Reverse Learning @@ -0,0 +1 @@ +As societies become increasingly aware of data privacy, regulations require that private information about users must be removed from both databases and ML models, which is more colloquially called `the right to be forgotten`. Such privacy problems of recommendation systems, which hold large amounts of private data, are drawing increasing attention. Recent research suggests dividing the preference data into multiple shards, training submodels with these shards, and forgetting users' personal preference data by retraining the submodels of the marked shards. Despite the gain in computational efficiency compared with retraining from scratch, the overall recommendation performance deteriorates after dividing the shards because the collaborative information contained in the training data is broken. In this paper, we propose a forgetting framework for recommendation models, named Recommendation Reverse Learning (RRL), that neither separates the training data nor jeopardizes the recommendation performance. Given the trained recommendation model and marked preference data, we devise a Reverse BPR Objective (RBPR Objective) to fine-tune the recommendation model to force it to forget the marked data. Nevertheless, as the recommendation model encodes complex collaborative information among users, we propose to utilize the Fisher Information Matrix (FIM) to estimate the influence of reverse learning on other users' collaborative information and guide the updates of representations. We conduct experiments on two representative recommendation models and three public benchmark datasets to verify the efficiency of RRL.
To verify the completeness of forgetting, we use RRL to make a recommendation model poisoned by shilling attacks forget the malicious users. \ No newline at end of file diff --git a/data/2024/aaai/RWMS: Reliable Weighted Multi-Phase for Semi-supervised Segmentation b/data/2024/aaai/RWMS: Reliable Weighted Multi-Phase for Semi-supervised Segmentation new file mode 100644 index 0000000000..40b1fbab6a --- /dev/null +++ b/data/2024/aaai/RWMS: Reliable Weighted Multi-Phase for Semi-supervised Segmentation @@ -0,0 +1 @@ +Semantic segmentation is one of the central tasks in computer vision. However, capturing large numbers of pixel-level annotations is expensive. Semi-supervised learning can utilize labeled and unlabeled data, providing new ideas for solving the problem of insufficient labeled data. In this work, we propose a data-reliability weighted multi-phase learning method for semi-supervised segmentation (RWMS). Under the framework of self-training, we train two different teacher models to evaluate the reliability of pseudo labels. By selecting reliable data at the image level and reweighting pseudo labels at the pixel level, multi-phase training is guided to focus on more reliable knowledge. In addition, we inject strong data augmentations on unlabeled images during training. Through extensive experiments, we demonstrate that our method performs remarkably well compared to baseline methods, substantially outperforming them by more than 3% on VOC and Cityscapes. \ No newline at end of file diff --git a/data/2024/aaai/Racing Control Variable Genetic Programming for Symbolic Regression b/data/2024/aaai/Racing Control Variable Genetic Programming for Symbolic Regression new file mode 100644 index 0000000000..0d8902c5d0 --- /dev/null +++ b/data/2024/aaai/Racing Control Variable Genetic Programming for Symbolic Regression @@ -0,0 +1 @@ +Symbolic regression, as one of the most crucial tasks in AI for science, discovers governing equations from experimental data. Popular approaches based on genetic programming, Monte Carlo tree search, or deep reinforcement learning learn symbolic regression from a fixed dataset. These methods require massive datasets and long training times, especially when learning complex equations involving many variables. Recently, Control Variable Genetic Programming (CVGP) has been introduced, which accelerates the regression process by discovering equations from designed control variable experiments. However, the set of experiments is fixed a priori in CVGP, and we observe that sub-optimal selection of experiment schedules delays the discovery process significantly. To overcome this limitation, we propose Racing Control Variable Genetic Programming (Racing-CVGP), which carries out multiple experiment schedules simultaneously. A selection scheme similar to that used for selecting good symbolic equations in the genetic programming process is implemented to ensure that promising experiment schedules eventually win over the average ones. The unfavorable schedules are terminated early to save time for the promising ones. We evaluate Racing-CVGP on several synthetic and real-world datasets corresponding to true physics laws. We demonstrate that Racing-CVGP outperforms CVGP and a series of symbolic regressors which discover equations from fixed datasets.
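The racing idea in the Racing-CVGP abstract above can be sketched generically: several experiment schedules are evaluated in parallel, and the worst-scoring ones are terminated early so that the budget concentrates on promising schedules. The evaluate function and the schedule labels below are placeholders, not the authors' implementation.

import random

def race(schedules, evaluate, rounds=10, keep_fraction=0.5):
    # Generic racing loop: accumulate scores and periodically cull the weakest schedules.
    survivors = list(schedules)
    scores = {s: 0.0 for s in survivors}
    for _ in range(rounds):
        for s in survivors:
            scores[s] += evaluate(s)                 # e.g. fitness of the best equation found so far
        survivors.sort(key=lambda s: scores[s], reverse=True)
        survivors = survivors[:max(1, int(len(survivors) * keep_fraction))]
    return survivors[0]

best = race(["schedule-A", "schedule-B", "schedule-C", "schedule-D"],
            evaluate=lambda s: random.random())      # stand-in for a real schedule evaluation
print("winning schedule:", best)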
\ No newline at end of file diff --git a/data/2024/aaai/RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation b/data/2024/aaai/RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation new file mode 100644 index 0000000000..28e8a7a3ab --- /dev/null +++ b/data/2024/aaai/RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation @@ -0,0 +1 @@ +3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images. However, image-based scene perception encounters significant challenges in achieving accurate prediction due to the absence of geometric priors. In this paper, we address this issue by exploring cross-modal knowledge distillation in this task, i.e., we leverage a stronger multi-modal model to guide the visual model during training. In practice, we observe that directly applying features or logits alignment, proposed and widely used in bird's-eye-view (BEV) perception, does not yield satisfactory results. To overcome this problem, we introduce RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction. By employing differentiable volume rendering, we generate depth and semantic maps in perspective views and propose two novel consistency criteria between the rendered outputs of teacher and student models. Specifically, the depth consistency loss aligns the termination distributions of the rendered rays, while the semantic consistency loss mimics the intra-segment similarity guided by vision foundation models (VLMs). Experimental results on the nuScenes dataset demonstrate the effectiveness of our proposed method in improving various 3D occupancy prediction approaches, e.g., our proposed methodology enhances our baseline by 2.2% in the metric of mIoU and achieves 50% in Occ3D benchmark. \ No newline at end of file diff --git a/data/2024/aaai/RadarMOSEVE: A Spatial-Temporal Transformer Network for Radar-Only Moving Object Segmentation and Ego-Velocity Estimation b/data/2024/aaai/RadarMOSEVE: A Spatial-Temporal Transformer Network for Radar-Only Moving Object Segmentation and Ego-Velocity Estimation new file mode 100644 index 0000000000..62d3164992 --- /dev/null +++ b/data/2024/aaai/RadarMOSEVE: A Spatial-Temporal Transformer Network for Radar-Only Moving Object Segmentation and Ego-Velocity Estimation @@ -0,0 +1 @@ +Moving object segmentation (MOS) and Ego velocity estimation (EVE) are vital capabilities for mobile systems to achieve full autonomy. Several approaches have attempted to achieve MOSEVE using a LiDAR sensor. However, LiDAR sensors are typically expensive and susceptible to adverse weather conditions. Instead, millimeter-wave radar (MWR) has gained popularity in robotics and autonomous driving for real applications due to its cost-effectiveness and resilience to bad weather. Nonetheless, publicly available MOSEVE datasets and approaches using radar data are limited. Some existing methods adopt point convolutional networks from LiDAR-based approaches, ignoring the specific artifacts and the valuable radial velocity information of radar measurements, leading to suboptimal performance. In this paper, we propose a novel transformer network that effectively addresses the sparsity and noise issues and leverages the radial velocity measurements of radar points using our devised radar self- and cross-attention mechanisms. 
Based on that, our method achieves accurate EVE of the robot and performs MOS using only radar data simultaneously. To thoroughly evaluate the MOSEVE performance of our method, we annotated the radar points in the public View-of-Delft (VoD) dataset and additionally constructed a new radar dataset in various environments. The experimental results demonstrate the superiority of our approach over existing state-of-the-art methods. The code is available at https://github.com/ORCAUboat/RadarMOSEVE. \ No newline at end of file diff --git a/data/2024/aaai/ReGCL: Rethinking Message Passing in Graph Contrastive Learning b/data/2024/aaai/ReGCL: Rethinking Message Passing in Graph Contrastive Learning new file mode 100644 index 0000000000..a3fd87e2e1 --- /dev/null +++ b/data/2024/aaai/ReGCL: Rethinking Message Passing in Graph Contrastive Learning @@ -0,0 +1 @@ +Graph contrastive learning (GCL) has demonstrated remarkable efficacy in graph representation learning. However, previous studies have overlooked the inherent conflict that arises when employing graph neural networks (GNNs) as encoders for node-level contrastive learning. This conflict pertains to the partial incongruity between the feature aggregation mechanism of graph neural networks and the embedding distinction characteristic of contrastive learning. Theoretically, to investigate the location and extent of the conflict, we analyze the participation of message-passing from the gradient perspective of InfoNCE loss. Different from contrastive learning in other domains, the conflict in GCL arises due to the presence of certain samples that contribute to both the gradients of positive and negative simultaneously under the manner of message passing, which are opposite optimization directions. To further address the conflict issue, we propose a practical framework called ReGCL, which utilizes theoretical findings of GCL gradients to effectively improve graph contrastive learning. Specifically, two gradient-based strategies are devised in terms of both message passing and loss function to mitigate the conflict. Firstly, a gradient-guided structure learning method is proposed in order to acquire a structure that is adapted to contrastive learning principles. Secondly, a gradient-weighted InfoNCE loss function is designed to reduce the impact of false negative samples with high probabilities, specifically from the standpoint of the graph encoder. Extensive experiments demonstrate the superiority of the proposed method in comparison to state-of-the-art baselines across various node classification benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Reachability of Fair Allocations via Sequential Exchanges b/data/2024/aaai/Reachability of Fair Allocations via Sequential Exchanges new file mode 100644 index 0000000000..6ceeb0f411 --- /dev/null +++ b/data/2024/aaai/Reachability of Fair Allocations via Sequential Exchanges @@ -0,0 +1 @@ +In the allocation of indivisible goods, a prominent fairness notion is envy-freeness up to one good (EF1). We initiate the study of reachability problems in fair division by investigating the problem of whether one EF1 allocation can be reached from another EF1 allocation via a sequence of exchanges such that every intermediate allocation is also EF1. We show that two EF1 allocations may not be reachable from each other even in the case of two agents, and deciding their reachability is PSPACE-complete in general. 
On the other hand, we prove that reachability is guaranteed for two agents with identical or binary utilities as well as for any number of agents with identical binary utilities. We also examine the complexity of deciding whether there is an EF1 exchange sequence that is optimal in the number of exchanges required. \ No newline at end of file diff --git a/data/2024/aaai/Reading between the Lines: Image-Based Order Detection in OCR for Chinese Historical Documents b/data/2024/aaai/Reading between the Lines: Image-Based Order Detection in OCR for Chinese Historical Documents new file mode 100644 index 0000000000..68f3dcfa04 --- /dev/null +++ b/data/2024/aaai/Reading between the Lines: Image-Based Order Detection in OCR for Chinese Historical Documents @@ -0,0 +1 @@ +Cursive written text still represents a challenge for researchers. Latin and Chinese Optical Character Recognition (OCR) systems have been studied extensively in the literature, yet little work has been done on Arabic character recognition, and powerful, stable text segmentation is still needed. In this paper, a segmentation technique capable of processing vowelized Arabic text is introduced. The technique is size and font independent and does not require detection of the centerline. It can also process typeset text and can segment a cursive text line even if the line suffers from skewness. I. INTRODUCTION One of the most important characteristics of Arabic text written on a horizontal line is that characters are connected by a connection line called the "centerline", as shown in fig. 1. Most previous and existing techniques [1-6] depend heavily on the detection of this line, since by deleting the centerline the primitives forming the cursive text are separated. Existing segmentation techniques for cursive written text are font and size dependent, and they differ according to whether the font is typewritten, typeset or handwritten. However, a new technique is presented in this paper that does not require detection of the centerline. Moreover, it is size and font independent, and it can segment the cursive text line even if the line suffers from skewness. II. ARABIC TEXT SEGMENTATION Due to its cursive nature, Arabic text needs to be segmented before the recognition phase in most Arabic OCR systems. This segmentation needs to be accurate and stable for different sizes and fonts, since any segmentation error will propagate into the recognition phase. Some research on Arabic text segmentation used thresholding of the word histogram to detect and eliminate the connection part between two consecutive characters [7, 8]. Others [9, 10] used thresholding on the word outer contour rather than the histogram. However, the use of thresholding in text segmentation needs a priori information about the average character size in the page in order to determine a threshold value. This will not work for omni-size character recognition, where different character sizes may exist on the same text line. It has also been found [4] that the connection parts between characters have different widths for different Arabic fonts. In the following subsections some details about such techniques are given. A. Second Moment Segmentation Technique. The first step in this technique is to detect and isolate the different lines in a given text. Once a line is isolated, a moment histogram is generated. By choosing the right threshold, the histogram is partitioned into segments; for Arabic written text, the resulting segments correspond to the primitives in the Arabic word. Calculation of the centerline: in a region that contains text only, the centerline is normally the row that contains the maximum number of black pixels, as shown in fig. 2. However, the row containing the maximum number of black pixels does not always correspond to the centerline, especially if the line is skewed. The second moment histogram: the centerline generally divides the Arabic word into two parts, and the second moment of each pixel above or below the centerline is computed, as shown in fig. 3. The segmentation threshold: segmenting the Arabic word using the second moment histogram depends on choosing the right threshold. In general, it is not possible to choose a fixed threshold to segment all Arabic words because the threshold is font and size dependent. Figure 3 shows an example of a word in different sizes and the corresponding second moment histograms, and figure 4 shows another example of three words in different fonts, where the difference in the threshold is again clear. B. Contour Segmentation Technique. The contour segmentation algorithm [4] starts with word contour tracing. Figure 5 shows an Arabic word after the elimination of all internal black pixels; it also shows that the connection lines are formed of only two lines (i.e., the columns of the connection line are formed of only two pixels). However, other columns inside the primitives that are not part of the connection line may also contain only two pixels after the elimination of internal pixels. The problem of detecting only those two-pixel columns that belong to the connection line is considered in the sequel. The contour tracing operation leaves the connection line columns with only two pixels, and it is then required to remove this part. To achieve that, the thickness of the text line is divided into three regions, as shown in fig. 6, where the height of each region depends on the font size. The connection parts in the text should lie in the middle region, and hence any column containing two pixels outside this region is not removed. The second moment technique suffers from the fact that it needs prior information about the font size and type: the threshold value must be adjusted to match the size and type of the text to be segmented, or else a trial-and-error procedure is needed to find a suitable threshold. This technique will also fail when different font types and sizes appear on the same line, or if the line is skewed. The contour segmentation technique, in turn, suffers from its dependence on the detection of the centerline and needs special enhancement to handle vowelized text. These factors justify the need for another segmentation technique that is type and size independent (i.e., omni) and independent of the centerline. Such a technique is described in the next section. III. THE CENTERLINE INDEPENDENT TECHNIQUE The new technique is based on the detection of upward spikes present in the written text. It scans each line from right to left and segments it into isolated regions, giving each region an index, as shown in fig. 7. In Arabic cursive text, the resulting regions can be clustered into four types: isolated characters; isolated diacritics and Hamza; isolated vowelization marks; and isolated sub-words (a whole word or a part of it). A specific filter applied during the scanning detects the upward spikes in the sub-words, which further divides these sub-words into primitives. Each primitive is given a different index, as shown in figs. 7 and 8. IV. CONCLUSIONS In this work, a new segmentation technique was presented that does not depend on centerline detection. Preliminary experiments showed promising results when processing either vowelized Arabic or typeset text (figs. 9 and 10). The technique is able to segment text lines into the corresponding primitives independently of the text font and size. It works even if more than one character set is present in the text line and can also tolerate line skewness. Figure 11 shows promising results when applied to handwritten Arabic text. The work presented in this paper can be integrated with different recognition techniques, whether based on group classification [11], neural networks [12], HMMs [13] or any other recognition system, to build a complete Arabic OCR. V. REFERENCES [1] S. Mori, C. Y. Suen, and K. Yamamoto, "Historical review of OCR research and development," Proceedings of the IEEE 80(7), pp. 1029-1058, 1992. [2] A. Amin, "Off-line Arabic character recognition: The state of the art," Pattern Recognition 31(5), pp. 517-530, 1998. [3] S. M. Yamany, "A complete analysis of an Arabic text reader," Master's thesis, Systems and Biomedical Dept., Faculty of Eng., Cairo Univ., Egypt, Mar. 1995. [4] M. A. Hashsish, A. R. El-Bialy, A. H. Kandil, and S. M. Yamany, "A novel segmentation technique for cursive written text," Proc. Al-Azhar Eng. 4th Int. Conf., 1995. [5] M. A. Hashsish, A. R. El-Bialy, A. H. Kandil, and S. M. Yamany, "Topological features: Towards an omni Arabic text reader," Proc. Al-Azhar Eng. 5th Int. Conf., 1997. [6] H. Abdelazim and M. A. Hashish, "Arabic reading machine," Proc. 10th National Computer Conf., pp. 733-740, 1988, Riyadh, Saudi Arabia. [7] H. Abdelazim and M. A. Hashish, "Automatic recognition of Arabic text," 10th Image/ITL Conf. in IBM Toronto Lab., 1987, Canada. [8] A. Amin and S. Al-Fedaghi, "Machine recognition of printed Arabic text utilizing a natural language morphology," Int. J. Man-Machine Stud. 35, pp. 769-788, 1991. [9] T. El-Sheikh and R. Guindi, "Computer recognition of Arabic cursive script," Pattern Recognition 21, pp. 293-302, 1988. [10] V. Margner, "SARAT - a system for the recognition of Arabic printed text," Proc. 11th Int. Conf. on Pattern Recognition, pp. 561-564, 1992. [11] A. El-Bialy, A. H. Kandil, M. Hashish and S. Yamany, "Arabic OCR: Toward a Complete System," Document Recognition and Retrieval VII, SPIE Vol. 3967, pp. 42-51, Jan. 2000. [12] A. Amin and H. Al-Sadoun, "Handprinted Arabic character recognition system using an artificial neural network," Pattern Recognition 29, pp. 663-675, 1996. [13] L. Zhidong, I. Bazzi, A. Kornai, J. Makhoul, P. Natarajan and R. Schwartz, "A Robust, Language Independent OCR," AIPR 98, SPIE Vol. 3584, pp. 96-104, Oct. 1998.
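A minimal numpy sketch of the second-moment idea described above, assuming a binary line image with 1 marking ink pixels: the centerline is taken as the row with the most ink, the per-column second moment is computed about that row, and columns below a fixed threshold are treated as connection parts between primitives. The image and threshold are illustrative; as the text notes, such a fixed threshold is exactly what makes the technique font- and size-dependent.

import numpy as np

def second_moment_segments(line_img, threshold):
    # Illustrative sketch of the second-moment segmentation technique described above.
    centerline = int(np.argmax(line_img.sum(axis=1)))            # row with the most black pixels
    rows = np.arange(line_img.shape[0])[:, None]
    moment = ((rows - centerline) ** 2 * line_img).sum(axis=0)    # per-column second moment
    is_primitive = moment > threshold                             # low-moment columns ~ connection parts
    segments, start = [], None
    for col, flag in enumerate(is_primitive):                     # collect runs of primitive columns
        if flag and start is None:
            start = col
        elif not flag and start is not None:
            segments.append((start, col - 1)); start = None
    if start is not None:
        segments.append((start, len(is_primitive) - 1))
    return segments

# Tiny synthetic example: two blobs joined by a thin connection stroke along the centerline.
img = np.zeros((10, 20), dtype=int)
img[3:8, 2:7] = 1; img[3:8, 13:18] = 1; img[5, 7:13] = 1
print(second_moment_segments(img, threshold=3))                   # expected: [(2, 6), (13, 17)]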
\ No newline at end of file diff --git a/data/2024/aaai/Real3D: The Curious Case of Neural Scene Degeneration b/data/2024/aaai/Real3D: The Curious Case of Neural Scene Degeneration new file mode 100644 index 0000000000..a8807518f2 --- /dev/null +++ b/data/2024/aaai/Real3D: The Curious Case of Neural Scene Degeneration @@ -0,0 +1,5 @@ +Despite significant progress in utilizing pre-trained text-to-image diffusion models to guide the creation of 3D scenes, these methods often struggle to generate scenes that are sufficiently realistic, leading to "neural scene degeneration". +In this work, we propose a new 3D scene generation model called Real3D. +Specifically, Real3D designs a pipeline from a NeRF-like implicit renderer to a tetrahedrons-based explicit renderer, greatly improving the neural network's ability to generate various neural scenes. +Moreover, Real3D introduces an additional discriminator to prevent neural scenes from falling into undesirable local optima, thus avoiding the degeneration phenomenon. +Our experimental results demonstrate that Real3D outperforms all existing state-of-the-art text-to-3D generation methods, providing valuable insights to facilitate the development of learning-based 3D scene generation approaches. \ No newline at end of file diff --git a/data/2024/aaai/Reasoning about Causality in Games (Abstract Reprint) b/data/2024/aaai/Reasoning about Causality in Games (Abstract Reprint) new file mode 100644 index 0000000000..69e60ab3a3 --- /dev/null +++ b/data/2024/aaai/Reasoning about Causality in Games (Abstract Reprint) @@ -0,0 +1,11 @@ +Causal reasoning and game-theoretic reasoning are fundamental topics in artificial intelligence, among many other disciplines: this paper is concerned with their intersection. Despite their importance, a formal framework that supports both these forms of reasoning has, until now, been lacking. We offer a solution in the form of (structural) causal games, which can be seen as extending Pearl's causal hierarchy to the game-theoretic domain, or as extending Koller and Milch's multi-agent influence diagrams to the causal domain. We then consider three key questions: +i) +How can the (causal) dependencies in games – either between variables, or between strategies – be modelled in a uniform, principled manner? + +ii) +How may causal queries be computed in causal games, and what assumptions does this require? + +iii) +How do causal games compare to existing formalisms? + +To address question i), we introduce mechanised games, which encode dependencies between agents' decision rules and the distributions governing the game. In response to question ii), we present definitions of predictions, interventions, and counterfactuals, and discuss the assumptions required for each. Regarding question iii), we describe correspondences between causal games and other formalisms, and explain how causal games can be used to answer queries that other causal or game-theoretic models do not support. Finally, we highlight possible applications of causal games, aided by an extensive open-source Python library. 
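As a generic illustration of the interventional queries mentioned in the causal-games abstract above, the toy structural causal model below contrasts an observational estimate of P(Y=1) with estimates under do(X=1) and do(X=0). It is plain Python with made-up mechanisms and does not use or reflect the paper's causal-games formalism or its accompanying library.

import random

def sample_scm(do_x=None, n=10000):
    # Tiny structural causal model: X := fair coin, Y := X xor noise.
    # Passing do_x overrides X's mechanism, mimicking an intervention do(X = x).
    ys = []
    for _ in range(n):
        x = do_x if do_x is not None else random.random() < 0.5
        noise = random.random() < 0.1
        ys.append(x != noise)
    return sum(ys) / n

print("P(Y=1)           =", sample_scm())
print("P(Y=1 | do(X=1)) =", sample_scm(do_x=True))
print("P(Y=1 | do(X=0)) =", sample_scm(do_x=False))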
\ No newline at end of file diff --git a/data/2024/aaai/RecWizard: A Toolkit for Conversational Recommendation with Modular, Portable Models and Interactive User Interface b/data/2024/aaai/RecWizard: A Toolkit for Conversational Recommendation with Modular, Portable Models and Interactive User Interface new file mode 100644 index 0000000000..544f78602c --- /dev/null +++ b/data/2024/aaai/RecWizard: A Toolkit for Conversational Recommendation with Modular, Portable Models and Interactive User Interface @@ -0,0 +1 @@ +We present a new Python toolkit called RecWizard for Conversational Recommender Systems (CRS). RecWizard offers support for the development of models and an interactive user interface, drawing from the best practices of the Huggingface ecosystem. CRS built with RecWizard are modular, portable, interactive and Large Language Model (LLM)-friendly, streamlining the learning process and reducing the additional effort required for CRS research. For more comprehensive information about RecWizard, please check our GitHub https://github.com/McAuley-Lab/RecWizard. \ No newline at end of file diff --git a/data/2024/aaai/Recall-Oriented Continual Learning with Generative Adversarial Meta-Model b/data/2024/aaai/Recall-Oriented Continual Learning with Generative Adversarial Meta-Model new file mode 100644 index 0000000000..1f45caeb6d --- /dev/null +++ b/data/2024/aaai/Recall-Oriented Continual Learning with Generative Adversarial Meta-Model @@ -0,0 +1 @@ +The stability-plasticity dilemma is a major challenge in continual learning, as it involves balancing the conflicting objectives of maintaining performance on previous tasks while learning new tasks. In this paper, we propose the recall-oriented continual learning framework to address this challenge. Inspired by the human brain’s ability to separate the mechanisms responsible for stability and plasticity, our framework consists of a two-level architecture where an inference network effectively acquires new knowledge and a generative network recalls past knowledge when necessary. In particular, to maximize the stability of past knowledge, we investigate the complexity of knowledge under different representations, thereby introducing the generative adversarial meta-model (GAMM), which incrementally learns task-specific parameters instead of input data samples of the task. Through our experiments, we show that our framework not only effectively learns new knowledge without any disruption but also achieves high stability of previous knowledge in both task-aware and task-agnostic learning scenarios. Our code is available at: https://github.com/bigdata-inha/recall-orientedcl-framework. \ No newline at end of file diff --git a/data/2024/aaai/Recasting Regional Lighting for Shadow Removal b/data/2024/aaai/Recasting Regional Lighting for Shadow Removal new file mode 100644 index 0000000000..ec7d5f786d --- /dev/null +++ b/data/2024/aaai/Recasting Regional Lighting for Shadow Removal @@ -0,0 +1,2 @@ +Removing shadows requires an understanding of both lighting conditions and object textures in a scene. Existing methods typically learn pixel-level color mappings between +shadow and non-shadow images, in which the joint modeling of lighting and object textures is implicit and inadequate. We observe that in a shadow region, the degradation degree of object textures depends on the local illumination, while simply enhancing the local illumination cannot fully recover the attenuated textures.
Based on this observation, we propose to condition the restoration of attenuated textures on the corrected local lighting in the shadow region. Specifically, we first design a shadow-aware decomposition network to estimate the illumination and reflectance layers of shadow regions explicitly. We then propose a novel bilateral correction network to recast the lighting of shadow regions in the illumination layer via a local lighting correction module, and to restore the textures conditioned on the corrected illumination layer via an illumination-guided texture restoration module. We further annotate pixel-wise shadow masks for the public SRD dataset, which originally contains only image pairs. Experiments on three benchmarks show that our method outperforms existing state-of-the-art shadow removal methods. Project page: yuhaoliu7456.github.io/RRL-Net. \ No newline at end of file diff --git a/data/2024/aaai/Recent Advancements in Inverse Reinforcement Learning b/data/2024/aaai/Recent Advancements in Inverse Reinforcement Learning new file mode 100644 index 0000000000..5ce7fabd49 --- /dev/null +++ b/data/2024/aaai/Recent Advancements in Inverse Reinforcement Learning @@ -0,0 +1,9 @@ +Inverse reinforcement learning (IRL) has seen significant advancements in recent years. This class of approaches aims to efficiently learn the underlying reward function that rationalizes the behavior exhibited by expert agents, often represented by humans. In contrast to mere behavioral cloning, the reconstruction of a reward function yields appealing implications, as it allows for more effective interpretability of the expert’s decisions and provides a transferable specification of the expert’s objectives for application even in different environments. Unlike reinforcement learning (RL), which is well understood from a theoretical perspective, IRL still grapples with limited theoretical understanding, significantly constraining its applicability. A fundamental challenge in IRL is the inherent ambiguity in selecting a reward function, given the existence of multiple candidate functions, all explaining the expert’s behavior. + +In this talk, I will survey three of my papers that have made notable contributions to the IRL field: “Provably Efficient Learning of Transferable Rewards”, “Towards Theoretical Understanding of Inverse Reinforcement Learning”, and “Inverse Reinforcement Learning with Sub-optimal Experts”. + +The central innovation introduced by the first paper is a novel formulation of the IRL problem that overcomes the issue of ambiguity. IRL is reframed as the problem of learning the feasible reward set, which is the set of all rewards that can explain the expert’s behavior. This approach postpones the selection of the reward function, thereby circumventing the ambiguity issues. Furthermore, the feasible reward set exhibits convenient geometric properties that enable the development of efficient algorithms for its computation. + +Building on this novel formulation of IRL, the second paper addresses the problem of efficiently learning the feasible reward set when the environment and the expert’s policy are not known in advance. It introduces a novel way to assess the dissimilarity between feasible reward sets based on the Hausdorff distance and presents a new PAC (probably approximately correct) framework. The most significant contribution of this paper is the introduction of the first sample complexity lower bound, which highlights the challenges inherent in the IRL problem.
Deriving this lower bound necessitated the development of novel technical tools. The paper also demonstrates that when a generative model of the environment is available, a uniform sampling strategy achieves a sample complexity that matches the lower bound, up to logarithmic factors. + +Finally, in the third paper, the IRL problem in the presence of sub-optimal experts is investigated. Specifically, the paper assumes the availability of multiple sub-optimal experts, in addition to the expert agent, which provides additional demonstrations, associated with a known quantification of the maximum amount of sub-optimality. The paper shows that this richer information mitigates the ambiguity problem, significantly reducing the size of the feasible reward set while retaining its favorable geometric properties. Furthermore, the paper explores the associated statistical problem and derives novel lower bounds for sample complexity, along with almost matching algorithms. These selected papers represent notable advancements in IRL, contributing to the establishment of a solid theoretical foundation for IRL and extending the framework to accommodate scenarios with sub-optimal experts. \ No newline at end of file diff --git a/data/2024/aaai/Recognizing Ultra-High-Speed Moving Objects with Bio-Inspired Spike Camera b/data/2024/aaai/Recognizing Ultra-High-Speed Moving Objects with Bio-Inspired Spike Camera new file mode 100644 index 0000000000..3e2f419716 --- /dev/null +++ b/data/2024/aaai/Recognizing Ultra-High-Speed Moving Objects with Bio-Inspired Spike Camera @@ -0,0 +1 @@ +Bio-inspired spike camera mimics the sampling principle of primate fovea. It presents high temporal resolution and dynamic range, showing great promise in fast-moving object recognition. However, the physical limit of CMOS technology in spike cameras still hinders their capability of recognizing ultra-high-speed moving objects, e.g., extremely fast motions cause blur during the imaging process of spike cameras. This paper presents the first theoretical analysis for the causes of spiking motion blur and proposes a robust representation that addresses this issue through temporal-spatial context learning. The proposed method leverages multi-span feature aggregation to capture temporal cues and employs residual deformable convolution to model spatial correlation among neighbouring pixels. Additionally, this paper contributes an original real-captured spiking recognition dataset consisting of 12,000 ultra-high-speed (equivalent speed > 500 km/h) moving objects. Experimental results show that the proposed method achieves 73.2% accuracy in recognizing 10 classes of ultra-high-speed moving objects, outperforming all existing spike-based recognition methods. Resources will be available at https://github.com/Evin-X/UHSR. \ No newline at end of file diff --git a/data/2024/aaai/Recommender Ecosystems: A Mechanism Design Perspective on Holistic Modeling and Optimization b/data/2024/aaai/Recommender Ecosystems: A Mechanism Design Perspective on Holistic Modeling and Optimization new file mode 100644 index 0000000000..eea62a1fad --- /dev/null +++ b/data/2024/aaai/Recommender Ecosystems: A Mechanism Design Perspective on Holistic Modeling and Optimization @@ -0,0 +1 @@ +Modern recommender systems lie at the heart of complex recommender ecosystems that couple the behavior of users, content providers, vendors, advertisers, and other actors. 
Despite this, the focus of much recommender systems research and deployment is on the local, myopic optimization of the recommendations made to individual users. This comes at a significant cost to the long-term utility that recommender systems generate for their users. We argue that modeling the incentives and behaviors of these actors, and the interactions among them induced by the recommender systems, is needed to maximize value and improve overall ecosystem health. Moreover, we propose the use of economic mechanism design, an area largely overlooked in recommender systems research, as a framework for developing such models. That said, one cannot apply “vanilla” mechanism design to recommender ecosystem modeling optimization out of the box—the use of mechanism design raises a number of subtle and interesting research challenges. We outline a number of these in this talk (and paper), emphasizing the need to develop nonstandard approaches to mechanism design that intersect with numerous areas of research, including preference modeling, reinforcement learning and exploration, behavioral economics, and generative AI, among others. \ No newline at end of file diff --git a/data/2024/aaai/Reconciling Predictive and Statistical Parity: A Causal Approach b/data/2024/aaai/Reconciling Predictive and Statistical Parity: A Causal Approach new file mode 100644 index 0000000000..4b3b23b0f6 --- /dev/null +++ b/data/2024/aaai/Reconciling Predictive and Statistical Parity: A Causal Approach @@ -0,0 +1 @@ +Since the rise of fair machine learning as a critical field of inquiry, many different notions on how to quantify and measure discrimination have been proposed in the literature. Some of these notions, however, were shown to be mutually incompatible. Such findings make it appear that numerous different kinds of fairness exist, thereby making a consensus on the appropriate measure of fairness harder to reach, hindering the applications of these tools in practice. In this paper, we investigate one of these key impossibility results that relates the notions of statistical and predictive parity. Specifically, we derive a new causal decomposition formula for the fairness measures associated with predictive parity, and obtain a novel insight into how this criterion is related to statistical parity through the legal doctrines of disparate treatment, disparate impact, and the notion of business necessity. Our results show that through a more careful causal analysis, the notions of statistical and predictive parity are not really mutually exclusive, but complementary and spanning a spectrum of fairness notions through the concept of business necessity. Finally, we demonstrate the importance of our findings on a real-world example. \ No newline at end of file diff --git a/data/2024/aaai/Rectangle Search: An Anytime Beam Search b/data/2024/aaai/Rectangle Search: An Anytime Beam Search new file mode 100644 index 0000000000..6e1b7c21e1 --- /dev/null +++ b/data/2024/aaai/Rectangle Search: An Anytime Beam Search @@ -0,0 +1 @@ +Anytime heuristic search algorithms try to find a (potentially suboptimal) solution as quickly as possible and then work to find better and better solutions until an optimal solution is obtained or time is exhausted. The most widely-known anytime search algorithms are based on best-first search. In this paper, we propose a new algorithm, rectangle search, that is instead based on beam search, a variant of breadth-first search. 
It repeatedly explores alternatives at all depth levels and is thus best-suited to problems featuring deep local minima. Experiments using a variety of popular search benchmarks suggest that rectangle search is competitive with fixed-width beam search and often performs better than the previous best anytime search algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Recurrent Graph Neural Networks and Their Connections to Bisimulation and Logic b/data/2024/aaai/Recurrent Graph Neural Networks and Their Connections to Bisimulation and Logic new file mode 100644 index 0000000000..6326a622e0 --- /dev/null +++ b/data/2024/aaai/Recurrent Graph Neural Networks and Their Connections to Bisimulation and Logic @@ -0,0 +1 @@ +The success of Graph Neural Networks (GNNs) in practice has motivated extensive research on their theoretical properties. This includes recent results that characterise node classifiers expressible by GNNs in terms of first order logic. Most of the analysis, however, has been focused on GNNs with fixed number of message-passing iterations (i.e., layers), which cannot realise many simple classifiers such as reachability of a node with a given label. In this paper, we start to fill this gap and study the foundations of GNNs that can perform more than a fixed number of message-passing iterations. We first formalise two generalisations of the basic GNNs: recurrent GNNs (RecGNNs), which repeatedly apply message-passing iterations until the node classifications become stable, and graph-size GNNs (GSGNNs), which exploit a built-in function of the input graph size to decide the number of message-passings. We then formally prove that GNN classifiers are strictly less expressive than RecGNN ones, and RecGNN classifiers are strictly less expressive than GSGNN ones. To get this result, we identify novel semantic characterisations of the three formalisms in terms of suitable variants of bisimulation, which we believe have their own value for our understanding of GNNs. Finally, we prove syntactic logical characterisations of RecGNNs and GSGNNs analogous to the logical characterisation of plain GNNs, where we connect the two formalisms to monadic monotone fixpoint logic---a generalisation of first-order logic that supports recursion. \ No newline at end of file diff --git a/data/2024/aaai/Recurrent Partial Kernel Network for Efficient Optical Flow Estimation b/data/2024/aaai/Recurrent Partial Kernel Network for Efficient Optical Flow Estimation new file mode 100644 index 0000000000..dfced8c69c --- /dev/null +++ b/data/2024/aaai/Recurrent Partial Kernel Network for Efficient Optical Flow Estimation @@ -0,0 +1 @@ +Optical flow estimation is a challenging task consisting of predicting per-pixel motion vectors between images. Recent methods have employed larger and more complex models to improve the estimation accuracy. However, this impacts the widespread adoption of optical flow methods and makes it harder to train more general models since the optical flow data is hard to obtain. This paper proposes a small and efficient model for optical flow estimation. We design a new spatial recurrent encoder that extracts discriminative features at a significantly reduced size. Unlike standard recurrent units, we utilize Partial Kernel Convolution (PKConv) layers to produce variable multi-scale features with a single shared block. We also design efficient Separable Large Kernels (SLK) to capture large context information with low computational cost. 
Experiments on public benchmarks show that we achieve state-of-the-art generalization performance while requiring significantly fewer parameters and less memory than competing methods. Our model ranks first in the Spring benchmark without finetuning, improving the results by over 10% while requiring an order of magnitude fewer FLOPs and over four times less memory than the next best published method without finetuning. The code is available at github.com/hmorimitsu/ptlflow/tree/main/ptlflow/models/rpknet. \ No newline at end of file diff --git a/data/2024/aaai/RedCore: Relative Advantage Aware Cross-Modal Representation Learning for Missing Modalities with Imbalanced Missing Rates b/data/2024/aaai/RedCore: Relative Advantage Aware Cross-Modal Representation Learning for Missing Modalities with Imbalanced Missing Rates new file mode 100644 index 0000000000..805eb8fd49 --- /dev/null +++ b/data/2024/aaai/RedCore: Relative Advantage Aware Cross-Modal Representation Learning for Missing Modalities with Imbalanced Missing Rates @@ -0,0 +1 @@ +Multimodal learning is susceptible to modality missing, which poses a major obstacle to its practical applications and, thus, invigorates increasing research interest. In this paper, we investigate two challenging problems: 1) when modality missing exists in the training data, how to exploit the incomplete samples while guaranteeing that they are properly supervised? 2) when the missing rates of different modalities vary, causing or exacerbating the imbalance among modalities, how to address the imbalance and ensure that all modalities are well-trained? To tackle these two challenges, we first introduce the variational information bottleneck (VIB) method for the cross-modal representation learning of missing modalities, which capitalizes on the available modalities and the labels as supervision. Then, accounting for the imbalanced missing rates, we define relative advantage to quantify the advantage of each modality over others. Accordingly, a bi-level optimization problem is formulated to adaptively regulate the supervision of all modalities during training. As a whole, the proposed approach features Relative advantage aware Cross-modal representation learning (abbreviated as RedCore) for missing modalities with imbalanced missing rates. Extensive empirical results demonstrate that RedCore outperforms competing models in that it exhibits superior robustness against either large or imbalanced missing rates. The code is available at: https://github.com/sunjunaimer/RedCore. \ No newline at end of file diff --git a/data/2024/aaai/Redefining ABA+ Semantics via Abstract Set-to-Set Attacks b/data/2024/aaai/Redefining ABA+ Semantics via Abstract Set-to-Set Attacks new file mode 100644 index 0000000000..f33466b81b --- /dev/null +++ b/data/2024/aaai/Redefining ABA+ Semantics via Abstract Set-to-Set Attacks @@ -0,0 +1 @@ +Assumption-based argumentation (ABA) is a powerful defeasible reasoning formalism which is based on the interplay of assumptions, their contraries, and inference rules. ABA with preferences (ABA+) generalizes the basic model by allowing qualitative comparison between assumptions. The integration of preferences however comes with a cost. In ABA+, the evaluation under two central and well-established semantics---grounded and complete semantics---is not guaranteed to yield an outcome.
Moreover, while ABA frameworks without preferences allow for a graph-based representation in Dung-style frameworks, a corresponding instantiation for general ABA+ frameworks has not been established so far. In this work, we tackle both issues: First, we develop a novel abstract argumentation formalism based on set-to-set attacks. We show that our so-called Hyper Argumentation Frameworks (HYPAFs) capture ABA+. Second, we propose relaxed variants of complete and grounded semantics for HYPAFs that yield an extension for all frameworks by design, while still faithfully generalizing the established semantics of Dung-style Argumentation Frameworks. We exploit the newly established correspondence between ABA+ and HYPAFs to obtain variants for grounded and complete ABA+ semantics that are guaranteed to yield an outcome. Finally, we discuss basic properties and provide a complexity analysis. Along the way, we settle the computational complexity of several ABA+ semantics. \ No newline at end of file diff --git a/data/2024/aaai/Redefining the Laparoscopic Spatial Sense: AI-Based Intra- and Postoperative Measurement from Stereoimages b/data/2024/aaai/Redefining the Laparoscopic Spatial Sense: AI-Based Intra- and Postoperative Measurement from Stereoimages new file mode 100644 index 0000000000..ddc2ccc112 --- /dev/null +++ b/data/2024/aaai/Redefining the Laparoscopic Spatial Sense: AI-Based Intra- and Postoperative Measurement from Stereoimages @@ -0,0 +1 @@ +A significant challenge in image-guided surgery is the accurate measurement of relevant structures such as vessel segments, resection margins, or bowel lengths. While this task is an essential component of many surgeries, it involves substantial human effort and is prone to inaccuracies. In this paper, we develop a novel human-AI-based method for laparoscopic measurements utilizing stereo vision, whose design has been guided by practicing surgeons. Based on a holistic qualitative requirements analysis, this work proposes a comprehensive measurement method, which comprises state-of-the-art machine learning architectures, such as RAFT-Stereo and YOLOv8. The developed method is assessed in various realistic experimental evaluation environments. Our results highlight the potential of our method, which achieves high accuracy in distance measurements with errors below 1 mm. Furthermore, on-surface measurements demonstrate robustness when applied in challenging environments with textureless regions. Overall, by addressing the inherent challenges of image-guided surgery, we lay the foundation for a more robust and accurate solution for intra- and postoperative measurements, enabling more precise, safe, and efficient surgical procedures. \ No newline at end of file diff --git a/data/2024/aaai/Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models b/data/2024/aaai/Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models new file mode 100644 index 0000000000..1999851dfa --- /dev/null +++ b/data/2024/aaai/Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models @@ -0,0 +1 @@ +Denoising Diffusion models have exhibited remarkable capabilities in image generation. However, generating high-quality samples requires a large number of iterations. Knowledge distillation for diffusion models is an effective method to address this limitation with a shortened sampling process but causes degraded generative quality.
Based on our analysis with bias-variance decomposition and experimental observations, we attribute the degradation to the spatial fitting error occurring in the training of both the teacher and student model in the distillation. Accordingly, we propose the Spatial Fitting-Error Reduction Distillation model (SFERD). SFERD utilizes attention guidance from the teacher model and a designed semantic gradient predictor to reduce the student's fitting error. Empirically, our proposed model facilitates high-quality sample generation in a few function evaluations. We achieve an FID of 5.31 on CIFAR-10 and 9.39 on ImageNet 64x64 with only one step, outperforming existing diffusion methods. Our study provides a new perspective on diffusion distillation by highlighting the intrinsic denoising ability of models. \ No newline at end of file diff --git a/data/2024/aaai/Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation b/data/2024/aaai/Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation new file mode 100644 index 0000000000..d711c9700c --- /dev/null +++ b/data/2024/aaai/Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation @@ -0,0 +1,5 @@ +Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. It is challenging to explore the semantic alignment within modalities and the visual correspondence across frames. +However, existing methods adopt separate network architectures for different modalities, and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. +Firstly, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. +Secondly, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. +On Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating the significance of our method for unified multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR. \ No newline at end of file diff --git a/data/2024/aaai/Refined Characterizations of Approval-Based Committee Scoring Rules b/data/2024/aaai/Refined Characterizations of Approval-Based Committee Scoring Rules new file mode 100644 index 0000000000..7730a253f3 --- /dev/null +++ b/data/2024/aaai/Refined Characterizations of Approval-Based Committee Scoring Rules @@ -0,0 +1 @@ +In approval-based committee (ABC) elections, the goal is to select a fixed-size subset of the candidates, a so-called committee, based on the voters' approval ballots over the candidates.
One of the most popular classes of ABC voting rules is ABC scoring rules, for which voters give points to each committee and the committees with maximal total points are chosen. While the set of ABC scoring rules has recently been characterized in a model where the output is a ranking of all committees, no full characterization of these rules exists in the standard model where a set of winning committees is returned. We address this issue by characterizing two important subclasses of ABC scoring rules in the standard ABC election model, thereby both extending the result for ABC ranking rules to the standard setting and refining it to subclasses. In more detail, by relying on a consistency axiom for variable electorates, we characterize (i) the prominent class of Thiele rules and (ii) a new class of ABC voting rules called ballot size weighted approval voting. Based on these theorems, we also infer characterizations of three well-known ABC voting rules, namely multi-winner approval voting, proportional approval voting, and satisfaction approval voting. \ No newline at end of file diff --git a/data/2024/aaai/Refining Latent Homophilic Structures over Heterophilic Graphs for Robust Graph Convolution Networks b/data/2024/aaai/Refining Latent Homophilic Structures over Heterophilic Graphs for Robust Graph Convolution Networks new file mode 100644 index 0000000000..8d9e86d84f --- /dev/null +++ b/data/2024/aaai/Refining Latent Homophilic Structures over Heterophilic Graphs for Robust Graph Convolution Networks @@ -0,0 +1 @@ +Graph convolution networks (GCNs) are extensively utilized in various graph tasks to mine knowledge from spatial data. Our study marks the first attempt to quantitatively investigate GCN robustness over omnipresent heterophilic graphs for node classification. We uncover that the predominant vulnerability is caused by the structural out-of-distribution (OOD) issue. This finding motivates us to present a novel method that aims to harden GCNs by automatically learning Latent Homophilic Structures over heterophilic graphs. We term this methodology LHS. To elaborate, our initial step involves learning a latent structure by employing a novel self-expressive technique based on multi-node interactions. Subsequently, the structure is refined using a pairwise-constrained dual-view contrastive learning approach. We iteratively perform the above procedure, enabling a GCN model to aggregate information in a homophilic way on heterophilic graphs. Armed with such an adaptable structure, we can properly mitigate the structural OOD threats over heterophilic graphs. Experiments on various benchmarks show the effectiveness of the proposed LHS approach for robust GCNs. \ No newline at end of file diff --git a/data/2024/aaai/Region-Aware Exposure Consistency Network for Mixed Exposure Correction b/data/2024/aaai/Region-Aware Exposure Consistency Network for Mixed Exposure Correction new file mode 100644 index 0000000000..de7158473a --- /dev/null +++ b/data/2024/aaai/Region-Aware Exposure Consistency Network for Mixed Exposure Correction @@ -0,0 +1 @@ +Exposure correction aims to enhance images suffering from improper exposure to achieve satisfactory visual effects. Despite recent progress, existing methods generally mitigate either overexposure or underexposure in input images, and they still struggle to handle images with mixed exposure, i.e., one image incorporates both overexposed and underexposed regions.
The mixed exposure distribution is non-uniform and leads to varying representation, which makes it challenging to address in a unified process. In this paper, we introduce an effective Region-aware Exposure Correction Network (RECNet) that can handle mixed exposure by adaptively learning and bridging different regional exposure representations. Specifically, to address the challenge posed by mixed exposure disparities, we develop a region-aware de-exposure module that effectively translates regional features of mixed exposure scenarios into an exposure-invariant feature space. Simultaneously, as de-exposure operation inevitably reduces discriminative information, we introduce a mixed-scale restoration unit that integrates exposure-invariant features and unprocessed features to recover local information. To further achieve a uniform exposure distribution in the global image, we propose an exposure contrastive regularization strategy under the constraints of intra-regional exposure consistency and inter-regional exposure continuity. Extensive experiments are conducted on various datasets, and the experimental results demonstrate the superiority and generalization of our proposed method. The code is released at: https://github.com/kravrolens/RECNet. \ No newline at end of file diff --git a/data/2024/aaai/Region-Disentangled Diffusion Model for High-Fidelity PPG-to-ECG Translation b/data/2024/aaai/Region-Disentangled Diffusion Model for High-Fidelity PPG-to-ECG Translation new file mode 100644 index 0000000000..5c66eedb10 --- /dev/null +++ b/data/2024/aaai/Region-Disentangled Diffusion Model for High-Fidelity PPG-to-ECG Translation @@ -0,0 +1 @@ +The high prevalence of cardiovascular diseases (CVDs) calls for accessible and cost-effective continuous cardiac monitoring tools. Despite Electrocardiography (ECG) being the gold standard, continuous monitoring remains a challenge, leading to the exploration of Photoplethysmography (PPG), a promising but more basic alternative available in consumer wearables. This notion has recently spurred interest in translating PPG to ECG signals. In this work, we introduce Region-Disentangled Diffusion Model (RDDM), a novel diffusion model designed to capture the complex temporal dynamics of ECG. Traditional Diffusion models like Denoising Diffusion Probabilistic Models (DDPM) face challenges in capturing such nuances due to the indiscriminate noise addition process across the entire signal. Our proposed RDDM overcomes such limitations by incorporating a novel forward process that selectively adds noise to specific regions of interest (ROI) such as QRS complex in ECG signals, and a reverse process that disentangles the denoising of ROI and non-ROI regions. Quantitative experiments demonstrate that RDDM can generate high-fidelity ECG from PPG in as few as 10 diffusion steps, making it highly effective and computationally efficient. Additionally, to rigorously validate the usefulness of the generated ECG signals, we introduce CardioBench, a comprehensive evaluation benchmark for a variety of cardiac-related tasks including heart rate and blood pressure estimation, stress classification, and the detection of atrial fibrillation and diabetes. Our thorough experiments show that RDDM achieves state-of-the-art performance on CardioBench. To the best of our knowledge, RDDM is the first diffusion model for cross-modal signal-to-signal translation in the bio-signal domain. 
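To make the region-selective forward process described in the Region-Disentangled Diffusion Model (RDDM) abstract above easier to picture, the following Python sketch shows one way a forward noising step could be gated by a region-of-interest mask, so that noise is injected around the QRS complex while the rest of the signal stays clean. The schedule, the mask, and all function and variable names here are illustrative assumptions for exposition, not the authors' implementation, and the actual RDDM formulation may gate the noise differently.

import numpy as np

def cosine_alpha_bar(t, T):
    # Cumulative signal-retention coefficient from a cosine schedule
    # (any monotone schedule would do for this illustration).
    s = 0.008
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def region_selective_forward(x0, roi_mask, t, T, rng):
    # One possible reading of a region-selective forward step: diffuse the
    # whole signal as usual, but keep the noised values only inside the ROI
    # mask; outside the ROI the signal stays clean. Illustrative only.
    a_bar = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    x_noisy = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return roi_mask * x_noisy + (1.0 - roi_mask) * x0, eps

rng = np.random.default_rng(0)
ecg = rng.standard_normal(512)      # stand-in for a 1-D ECG segment
qrs = np.zeros(512)
qrs[240:272] = 1.0                  # toy ROI mask around a "QRS complex"
x_t, eps = region_selective_forward(ecg, qrs, t=5, T=10, rng=rng)

A matching reverse process would then treat the masked and unmasked regions separately, which is the disentangled denoising the abstract refers to.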
\ No newline at end of file diff --git a/data/2024/aaai/Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes b/data/2024/aaai/Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes new file mode 100644 index 0000000000..8f6b4a2b83 --- /dev/null +++ b/data/2024/aaai/Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes @@ -0,0 +1 @@ +In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Distinguishing itself from existing works within this context, our approach harnesses the power of the general policy gradient-based algorithm, liberating it from the constraints of assuming a linear MDP structure. We propose a vanilla policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has O(T^3/4) regret. Remarkably, this paper marks a pioneering effort by presenting the first exploration into regret bound computation for the general parameterized policy gradient algorithm in the context of average reward scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Regret Analysis of Repeated Delegated Choice b/data/2024/aaai/Regret Analysis of Repeated Delegated Choice new file mode 100644 index 0000000000..f178449ee1 --- /dev/null +++ b/data/2024/aaai/Regret Analysis of Repeated Delegated Choice @@ -0,0 +1 @@ +We present a study on a repeated delegated choice problem, which is the first to consider an online learning variant of Kleinberg and Kleinberg, EC'18. In this model, a principal interacts repeatedly with an agent who possesses an exogenous set of solutions to search for efficient ones. Each solution can yield varying utility for both the principal and the agent, and the agent may propose a solution to maximize its own utility in a selfish manner. To mitigate this behavior, the principal announces an eligible set which screens out a certain set of solutions. The principal, however, does not have any information on the distribution of solutions nor the number of solutions in advance. Therefore, the principal dynamically announces various eligible sets to efficiently learn the distribution. The principal's objective is to minimize cumulative regret compared to the optimal eligible set in hindsight. We explore two dimensions of the problem setup, whether the agent behaves myopically or strategizes across the rounds, and whether the solutions yield deterministic or stochastic utility. We obtain sublinear regret upper bounds in various regimes, and derive corresponding lower bounds which implies the tightness of the results. Overall, we bridge a well-known problem in economics to the evolving area of online learning, and present a comprehensive study in this problem. \ No newline at end of file diff --git a/data/2024/aaai/Regroup Median Loss for Combating Label Noise b/data/2024/aaai/Regroup Median Loss for Combating Label Noise new file mode 100644 index 0000000000..f2869a1d58 --- /dev/null +++ b/data/2024/aaai/Regroup Median Loss for Combating Label Noise @@ -0,0 +1 @@ +The deep model training procedure requires large-scale datasets of annotated data. Due to the difficulty of annotating a large number of samples, label noise caused by incorrect annotations is inevitable, resulting in low model performance and poor model generalization. 
To combat label noise, current methods usually select clean samples based on the small-loss criterion and use these samples for training. Because some noisy samples are similar to clean ones, these small-loss criterion-based methods are still affected by label noise. To address this issue, in this work, we propose Regroup Median Loss (RML) to reduce the probability of selecting noisy samples and to correct the losses of noisy samples. RML randomly selects samples with the same label as the training samples based on a new loss processing method. Then, we combine the stable mean loss and the robust median loss through a proposed regrouping strategy to obtain robust loss estimation for noisy samples. To further improve the model performance against label noise, we propose a new sample selection strategy and build a semi-supervised method based on RML. Compared to state-of-the-art methods, for both the traditionally trained and semi-supervised models, RML achieves a significant improvement on synthetic and complex real-world datasets. The source code is available at https://github.com/Feng-peng-Li/Regroup-Loss-Median-to-Combat-Label-Noise. \ No newline at end of file diff --git a/data/2024/aaai/Regulating AI: Applying Insights from Behavioural Economics and Psychology to the Application of Article 5 of the EU AI Act b/data/2024/aaai/Regulating AI: Applying Insights from Behavioural Economics and Psychology to the Application of Article 5 of the EU AI Act new file mode 100644 index 0000000000..568c47c21b --- /dev/null +++ b/data/2024/aaai/Regulating AI: Applying Insights from Behavioural Economics and Psychology to the Application of Article 5 of the EU AI Act @@ -0,0 +1 @@ +Article 5 of the European Union’s Artificial Intelligence Act is intended to regulate AI use to prevent potentially harmful consequences. Nevertheless, applying this legislation practically is likely to be challenging because of ambiguously used terminologies and because it fails to specify which manipulation techniques may be invoked by AI, potentially leading to significant harm. This paper aims to bridge this gap by defining key terms and demonstrating how AI may invoke these techniques, drawing from insights in psychology and behavioural economics. First, this paper provides definitions of the terms “subliminal techniques”, “manipulative techniques” and “deceptive techniques”. Second, we identify from the literature in cognitive psychology and behavioural economics three subliminal and five manipulative techniques and exemplify how AI might implement these techniques to manipulate users in real-world case scenarios. These illustrations may serve as a practical guide for stakeholders to detect cases of AI manipulation and consequently devise preventive measures. Article 5 has also been criticised for offering inadequate protection. We critically assess the protection offered by Article 5, proposing specific revisions to paragraph 1, points (a) and (b) of Article 5 to increase its protective effectiveness. \ No newline at end of file diff --git a/data/2024/aaai/Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving b/data/2024/aaai/Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving new file mode 100644 index 0000000000..306ee98472 --- /dev/null +++ b/data/2024/aaai/Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving @@ -0,0 +1 @@ +Multi-camera perception tasks have gained significant attention in the field of autonomous driving.
However, existing frameworks based on Lift-Splat-Shoot (LSS) in the multi-camera setting cannot produce suitable dense 3D features due to the projection nature and uncontrollable densification process. To resolve this problem, we propose to regulate intermediate dense 3D features with the help of volume rendering. Specifically, we employ volume rendering to process the dense 3D features to obtain corresponding 2D features (e.g., depth maps, semantic maps), which are supervised by associated labels in the training. In this manner, we regulate the generation of dense 3D features at the feature level, providing appropriate dense and unified features for multiple perception tasks. Therefore, our approach is termed Vampire, which stands for ``Volume rendering As Multi-camera Perception Intermediate feature REgulator''. Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features, and is competitive with existing SOTA methods across diverse downstream perception tasks like 3D occupancy prediction, LiDAR segmentation and 3D object detection, while utilizing moderate GPU resources. We provide a video demonstration in the supplementary materials, and code is available at github.com/cskkxjk/Vampire. \ No newline at end of file diff --git a/data/2024/aaai/Reinforced Adaptive Knowledge Learning for Multimodal Fake News Detection b/data/2024/aaai/Reinforced Adaptive Knowledge Learning for Multimodal Fake News Detection new file mode 100644 index 0000000000..ece771fd6c --- /dev/null +++ b/data/2024/aaai/Reinforced Adaptive Knowledge Learning for Multimodal Fake News Detection @@ -0,0 +1 @@ +Nowadays, detecting multimodal fake news has emerged as a foremost concern since the widespread dissemination of fake news may incur adverse societal impact. Conventional methods generally focus on capturing the linguistic and visual semantics within the multimodal content, which fall short in effectively distinguishing the heightened level of meticulous fabrications. Recently, external knowledge has been introduced to provide valuable background facts as a complement to facilitate news detection. Nevertheless, existing knowledge-enhanced endeavors directly incorporate all knowledge contexts through static entity embeddings, resulting in potentially noisy and content-irrelevant knowledge. Moreover, the integration of knowledge entities makes it intractable to model the sophisticated correlations between multimodal semantics and knowledge entities. In light of these limitations, we propose a novel Adaptive Knowledge-Aware Fake News Detection model, dubbed AKA-Fake. For each news item, AKA-Fake learns a compact knowledge subgraph under a reinforcement learning paradigm, which consists of a subset of entities and contextual neighbors in the knowledge graph, restoring the most informative knowledge facts. A novel heterogeneous graph learning module is further proposed to capture the reliable cross-modality correlations via topology refinement and modality-attentive pooling. Our proposal is extensively evaluated over three popular datasets, and experimental results demonstrate the superiority of AKA-Fake.
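As a concrete reference point for the volume-rendering regulation idea in the "Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving" abstract above, the Python sketch below shows the generic NeRF-style alpha compositing that turns per-sample densities along a camera ray into a rendered depth value; rendered 2D quantities of this kind are what can then be supervised by labels. The function names, shapes, and toy density are assumptions for illustration, not the Vampire implementation.

import numpy as np

def render_depth_along_ray(sigmas, depths):
    # NeRF-style alpha compositing of per-sample densities into an expected
    # depth along one ray; rendered depth (and, analogously, rendered
    # semantics) can be supervised by 2D labels to shape the 3D features.
    deltas = np.diff(depths, append=depths[-1] + 1e10)   # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)              # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas))[:-1])  # transmittance
    weights = trans * alphas                             # compositing weights
    return np.sum(weights * depths)                      # expected ray depth

depths = np.linspace(0.5, 50.0, 64)                          # sample positions along one ray
sigmas = 4.0 * np.exp(-0.5 * ((depths - 12.0) / 0.8) ** 2)   # toy density bump near 12 m
print(round(float(render_depth_along_ray(sigmas, depths)), 2))

Because the compositing weights are differentiable in the densities, a loss on the rendered depth propagates back into the dense 3D feature volume, which is the sense in which rendering "regulates" the intermediate features.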
\ No newline at end of file diff --git a/data/2024/aaai/Reinforcement Learning and Data-Generation for Syntax-Guided Synthesis b/data/2024/aaai/Reinforcement Learning and Data-Generation for Syntax-Guided Synthesis new file mode 100644 index 0000000000..91661f6c6b --- /dev/null +++ b/data/2024/aaai/Reinforcement Learning and Data-Generation for Syntax-Guided Synthesis @@ -0,0 +1 @@ +Program synthesis is the task of automatically generating code based on a specification. In Syntax-Guided Synthesis (SyGuS) this specification is a combination of a syntactic template and a logical formula, and the result is guaranteed to satisfy both. We present a reinforcement-learning guided algorithm for SyGuS which uses Monte-Carlo Tree Search (MCTS) to search the space of candidate solutions. Our algorithm learns policy and value functions which, combined with the upper confidence bound for trees, allow it to balance exploration and exploitation. A common challenge in applying machine learning approaches to syntax-guided synthesis is the scarcity of training data. To address this, we present a method for automatically generating training data for SyGuS based on anti-unification of existing first-order satisfiability problems, which we use to train our MCTS policy. We implement and evaluate this setup and demonstrate that learned policy and value improve the synthesis performance over a baseline by over 26 percentage points in the training and testing sets. Our tool outperforms state-of-the-art tool cvc5 on the training set and performs comparably in terms of the total number of problems solved on the testing set (solving 23% of the benchmarks on which cvc5 fails). We make our data set publicly available, to enable further application of machine learning methods to the SyGuS problem. \ No newline at end of file diff --git a/data/2024/aaai/Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation b/data/2024/aaai/Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation new file mode 100644 index 0000000000..cb91f76d6c --- /dev/null +++ b/data/2024/aaai/Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation @@ -0,0 +1 @@ +Deep learning architectures have achieved state-of-the-art (SOTA) performance on computer vision tasks such as object detection and image segmentation. This may be attributed to the use of over-parameterized, monolithic deep learning architectures executed on large datasets. Although such large architectures lead to increased accuracy, this is usually accompanied by a larger increase in computation and memory requirements during inference. While this is a non-issue in traditional machine learning (ML) pipelines, the recent confluence of machine learning and fields like the Internet of Things (IoT) has rendered such large architectures infeasible for execution in low-resource settings. For some datasets, large monolithic pipelines may be overkill for simpler inputs. To address this problem, previous efforts have proposed decision cascades where inputs are passed through models of increasing complexity until desired performance is achieved. However, we argue that cascaded prediction leads to sub-optimal throughput and increased computational cost due to wasteful intermediate computations. 
To address this, we propose PaSeR (Parsimonious Segmentation with Reinforcement Learning), a non-cascading, cost-aware learning pipeline as an efficient alternative to cascaded decision architectures. Through experimental evaluation on both real-world and standard datasets, we demonstrate that PaSeR achieves better accuracy while minimizing computational cost relative to cascaded models. Further, we introduce a new metric, IoU/GigaFlop, to evaluate the balance between cost and performance. On the real-world task of battery material phase segmentation, PaSeR yields a minimum performance improvement of 174% on the IoU/GigaFlop metric with respect to baselines. We also demonstrate PaSeR's adaptability to complementary models trained on a noisy MNIST dataset, where it achieved a minimum performance improvement on IoU/GigaFlop of 13.4% over SOTA models. Code and data are available at github.com/scailab/paser. \ No newline at end of file diff --git a/data/2024/aaai/Relational Distant Supervision for Image Captioning without Image-Text Pairs b/data/2024/aaai/Relational Distant Supervision for Image Captioning without Image-Text Pairs new file mode 100644 index 0000000000..d8ec28e20e --- /dev/null +++ b/data/2024/aaai/Relational Distant Supervision for Image Captioning without Image-Text Pairs @@ -0,0 +1 @@ +Unsupervised image captioning aims to generate descriptions of images without relying on any image-sentence pairs for training. Most existing works use detected visual objects or concepts as a bridge to connect images and texts. Considering that the relationship between objects carries more information, we use the object relationship as a more accurate connection between images and texts. In this paper, we adapt the idea of distant supervision that extracts knowledge about object relationships from an external corpus and imparts it to images to facilitate inferring visual object relationships, without introducing any extra pre-trained relationship detectors. Based on these learned informative relationships, we construct pseudo image-sentence pairs for captioning model training. Specifically, our method consists of three modules: (1) a relationship learning module that learns to infer relationships from images under distant supervision; (2) a relationship-to-sentence module that transforms the inferred relationships into sentences to generate pseudo image-sentence pairs; (3) an image captioning module that is trained by using the generated image-sentence pairs. Promising results on three datasets show that our method outperforms the state-of-the-art methods of unsupervised image captioning. \ No newline at end of file diff --git a/data/2024/aaai/Relational Programming with Foundational Models b/data/2024/aaai/Relational Programming with Foundational Models new file mode 100644 index 0000000000..ba6b3e72a6 --- /dev/null +++ b/data/2024/aaai/Relational Programming with Foundational Models @@ -0,0 +1 @@ +Foundation models have vast potential to enable diverse AI applications. The powerful yet incomplete nature of these models has spurred a wide range of mechanisms to augment them with capabilities such as in-context learning, information retrieval, and code interpreting. We propose Vieira, a declarative framework that unifies these mechanisms in a general solution for programming with foundation models. Vieira follows a probabilistic relational paradigm and treats foundation models as stateless functions with relational inputs and outputs.
It supports neuro-symbolic applications by enabling the seamless combination of such models with logic programs, as well as complex, multi-modal applications by streamlining the composition of diverse sub-models. We implement Vieira by extending the Scallop compiler with a foreign interface that supports foundation models as plugins. We implement plugins for 12 foundation models including GPT, CLIP, and SAM. We evaluate Vieira on 9 challenging tasks that span language, vision, and structured and vector databases. Our evaluation shows that programs in Vieira are concise, can incorporate modern foundation models, and have comparable or better accuracy than competitive baselines. \ No newline at end of file diff --git a/data/2024/aaai/Relative Policy-Transition Optimization for Fast Policy Transfer b/data/2024/aaai/Relative Policy-Transition Optimization for Fast Policy Transfer new file mode 100644 index 0000000000..b1678da602 --- /dev/null +++ b/data/2024/aaai/Relative Policy-Transition Optimization for Fast Policy Transfer @@ -0,0 +1 @@ +We consider the problem of policy transfer between two Markov Decision Processes (MDPs). We introduce a lemma based on existing theoretical results in reinforcement learning to measure the relativity gap between two arbitrary MDPs, that is the difference between any two cumulative expected returns defined on different policies and environment dynamics. Based on this lemma, we propose two new algorithms referred to as Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modelling, respectively. RPO transfers the policy evaluated in one environment to maximize the return in another, while RTO updates the parameterized dynamics model to reduce the gap between the dynamics of the two environments. Integrating the two algorithms results in the complete Relative Policy-Transition Optimization (RPTO) algorithm, in which the policy interacts with the two environments simultaneously, such that data collections from two environments, policy and transition updates are completed in one closed loop to form a principled learning framework for policy transfer. We demonstrate the effectiveness of RPTO on a set of MuJoCo continuous control tasks by creating policy transfer problems via variant dynamics. \ No newline at end of file diff --git a/data/2024/aaai/Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization b/data/2024/aaai/Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization new file mode 100644 index 0000000000..34a8cc3acf --- /dev/null +++ b/data/2024/aaai/Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization @@ -0,0 +1 @@ +One of the major challenges of offline reinforcement learning (RL) is dealing with distribution shifts that stem from the mismatch between the trained policy and the data collection policy. Stationary distribution correction estimation algorithms (DICE) have addressed this issue by regularizing the policy optimization with f-divergence between the state-action visitation distributions of the data collection policy and the optimized policy. While such regularization naturally integrates to derive an objective to get optimal state-action visitation, such an implicit policy optimization framework has shown limited performance in practice. 
We observe that the reduced performance is attributed to the biased estimate and the properties of conjugate functions of f-divergence regularization. In this paper, we improve the regularized implicit policy optimization framework by relieving the bias and reshaping the conjugate function by relaxing the constraints. We show that the relaxation adjusts the degree of involvement of the sub-optimal samples in optimization, and we derive a new offline RL algorithm that benefits from the relaxed framework, improving from a previous implicit policy optimization algorithm by a large margin. \ No newline at end of file diff --git a/data/2024/aaai/Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation b/data/2024/aaai/Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation new file mode 100644 index 0000000000..38d0bc3f1f --- /dev/null +++ b/data/2024/aaai/Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation @@ -0,0 +1 @@ +For few-shot semantic segmentation, the primary task is to extract class-specific intrinsic information from limited labeled data. However, the semantic ambiguity and inter-class similarity of previous methods limit the accuracy of pixel-level foreground-background classification. To alleviate these issues, we propose the Relevant Intrinsic Feature Enhancement Network (RiFeNet). To improve the semantic consistency of foreground instances, we propose an unlabeled branch as an efficient data utilization method, which teaches the model how to extract intrinsic features robust to intra-class differences. Notably, during testing, the proposed unlabeled branch is excluded without extra unlabeled data and computation. Furthermore, we extend the inter-class variability between foreground and background by proposing a novel multi-level prototype generation and interaction module. The different-grained complementarity between global and local prototypes allows for better distinction between similar categories. The qualitative and quantitative performance of RiFeNet surpasses the state-of-the-art methods on PASCAL-5i and COCO benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Reliable Conflictive Multi-View Learning b/data/2024/aaai/Reliable Conflictive Multi-View Learning new file mode 100644 index 0000000000..b0c9235c6d --- /dev/null +++ b/data/2024/aaai/Reliable Conflictive Multi-View Learning @@ -0,0 +1 @@ +Multi-view learning aims to combine multiple features to achieve more comprehensive descriptions of data. Most previous works assume that multiple views are strictly aligned. However, real-world multi-view data may contain low-quality conflictive instances, which show conflictive information in different views. Previous methods for this problem mainly focus on eliminating the conflictive data instances by removing them or replacing conflictive views. Nevertheless, real-world applications usually require making decisions for conflictive instances rather than only eliminating them. To solve this, we point out a new Reliable Conflictive Multi-view Learning (RCML) problem, which requires the model to provide decision results and attached reliabilities for conflictive multi-view data. We develop an Evidential Conflictive Multi-view Learning (ECML) method for this problem. ECML first learns view-specific evidence, which could be termed as the amount of support to each category collected from data. Then, we can construct view-specific opinions consisting of decision results and reliability. 
In the multi-view fusion stage, we propose a conflictive opinion aggregation strategy and theoretically prove this strategy can exactly model the relation of multi-view common and view-specific reliabilities. Experiments performed on 6 datasets verify the effectiveness of ECML. The code is released at https://github.com/jiajunsi/RCML. \ No newline at end of file diff --git a/data/2024/aaai/Reliable Data Generation and Selection for Low-Resource Relation Extraction b/data/2024/aaai/Reliable Data Generation and Selection for Low-Resource Relation Extraction new file mode 100644 index 0000000000..10b83ada60 --- /dev/null +++ b/data/2024/aaai/Reliable Data Generation and Selection for Low-Resource Relation Extraction @@ -0,0 +1 @@ +Automated construction of annotated data holds significant importance in Relation Extraction (RE) tasks due to the hardness and cost of human annotation. In this work, we propose Self-RDGS, a method for Self-supervised Reliable Data Generation and Selection in low-resource RE tasks. At first, we fully utilize the knowledge of triplets as prompts to generate sentences by employing the Large Language Models (LLMs). Since the auto-generated data contains noise, we then propose a ranking-based data selection method to select reliable sentences. Finally, we integrate the data selection and RE model training within a self-supervised iterative framework. Through experimentation on three datasets with low-resource settings, we demonstrate the effectiveness of our proposed approach in constructing annotated data and achieving noteworthy improvements in comparison to multiple baselines. Code, data and models are available at https://github.com/jjyunlp/GenerationRE. \ No newline at end of file diff --git a/data/2024/aaai/Relightable and Animatable Neural Avatars from Videos b/data/2024/aaai/Relightable and Animatable Neural Avatars from Videos new file mode 100644 index 0000000000..b7686e6c11 --- /dev/null +++ b/data/2024/aaai/Relightable and Animatable Neural Avatars from Videos @@ -0,0 +1 @@ +Lightweight creation of 3D digital avatars is a highly desirable but challenging task. With only sparse videos of a person under unknown illumination, we propose a method to create relightable and animatable neural avatars, which can be used to synthesize photorealistic images of humans under novel viewpoints, body poses, and lighting. The key challenge here is to disentangle the geometry, material of the clothed body, and lighting, which becomes more difficult due to the complex geometry and shadow changes caused by body motions. To solve this ill-posed problem, we propose novel techniques to better model the geometry and shadow changes. For geometry change modeling, we propose an invertible deformation field, which helps to solve the inverse skinning problem and leads to better geometry quality. To model the spatial and temporal varying shading cues, we propose a pose-aware part-wise light visibility network to estimate light occlusion. Extensive experiments on synthetic and real datasets show that our approach reconstructs high-quality geometry and generates realistic shadows under different body poses. Code and data are available at https://wenbin-lin.github.io/RelightableAvatar-page. 
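For readers unfamiliar with evidential opinions, the Python sketch below illustrates the kind of objects the "Reliable Conflictive Multi-View Learning" abstract above works with: per-view evidence turned into class beliefs plus an uncertainty mass, and two views fused into one opinion. The fusion shown is the standard cumulative fusion rule from subjective logic, used here only as a generic baseline; it is not the conflictive opinion aggregation strategy proposed by ECML, and all names and numbers are assumptions.

import numpy as np

def opinion(evidence):
    # Turn non-negative per-class evidence into a subjective-logic opinion:
    # a belief mass per class plus one overall uncertainty mass.
    k = evidence.shape[-1]
    strength = evidence.sum() + k            # Dirichlet strength (alpha = evidence + 1)
    return evidence / strength, k / strength

def cumulative_fusion(b1, u1, b2, u2):
    # Standard cumulative belief fusion of two opinions (subjective logic).
    # A generic baseline only, not ECML's conflictive aggregation strategy.
    denom = u1 + u2 - u1 * u2
    return (b1 * u2 + b2 * u1) / denom, (u1 * u2) / denom

# Two views of the same instance that disagree on a 3-class problem.
b_a, u_a = opinion(np.array([9.0, 1.0, 0.0]))    # view A: confident in class 0
b_b, u_b = opinion(np.array([1.0, 6.0, 1.0]))    # view B: leans towards class 1
b, u = cumulative_fusion(b_a, u_a, b_b, u_b)
print(np.round(b, 3), round(float(u), 3))

The decision result corresponds to the largest belief mass and the attached reliability to the (complement of the) uncertainty mass, which is the kind of output the RCML problem statement asks for.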
\ No newline at end of file diff --git a/data/2024/aaai/Removing Interference and Recovering Content Imaginatively for Visible Watermark Removal b/data/2024/aaai/Removing Interference and Recovering Content Imaginatively for Visible Watermark Removal new file mode 100644 index 0000000000..f8384cd4fe --- /dev/null +++ b/data/2024/aaai/Removing Interference and Recovering Content Imaginatively for Visible Watermark Removal @@ -0,0 +1 @@ +Visible watermarks, while instrumental in protecting image copyrights, frequently distort the underlying content, complicating tasks like scene interpretation and image editing. Visible watermark removal aims to eliminate the interference of watermarks and restore the background content. However, existing methods often implement watermark component removal and background restoration tasks within a singular branch, leading to residual watermarks in the predictions and ignoring cases where watermarks heavily obscure the background. To address these limitations, this study introduces the Removing Interference and Recovering Content Imaginatively (RIRCI) framework. RIRCI embodies a two-stage approach: the initial phase centers on discerning and segregating the watermark component, while the subsequent phase focuses on background content restoration. To achieve meticulous background restoration, our proposed model employs a dual-path network capable of fully exploring the intrinsic background information beneath semi-transparent watermarks and peripheral contextual information from unaffected regions. Moreover, a Global and Local Context Interaction module is built upon multi-layer perceptrons and bidirectional feature transformation for comprehensive representation modeling in the background restoration phase. The efficacy of our approach is empirically validated across two large-scale datasets, and our findings reveal a marked enhancement over existing watermark removal techniques. \ No newline at end of file diff --git a/data/2024/aaai/Representation-Based Robustness in Goal-Conditioned Reinforcement Learning b/data/2024/aaai/Representation-Based Robustness in Goal-Conditioned Reinforcement Learning new file mode 100644 index 0000000000..398bc0b238 --- /dev/null +++ b/data/2024/aaai/Representation-Based Robustness in Goal-Conditioned Reinforcement Learning @@ -0,0 +1 @@ +While Goal-Conditioned Reinforcement Learning (GCRL) has gained attention, its algorithmic robustness against adversarial perturbations remains unexplored. The attacks and robust representation training methods that are designed for traditional RL become less effective when applied to GCRL. To address this challenge, we first propose the Semi-Contrastive Representation attack, a novel approach inspired by the adversarial contrastive attack. Unlike existing attacks in RL, it only necessitates information from the policy function and can be seamlessly implemented during deployment. Then, to mitigate the vulnerability of existing GCRL algorithms, we introduce Adversarial Representation Tactics, which combines Semi-Contrastive Adversarial Augmentation with Sensitivity-Aware Regularizer to improve the adversarial robustness of the underlying RL agent against various types of perturbations. Extensive experiments validate the superior performance of our attack and defence methods across multiple state-of-the-art GCRL algorithms. Our code is available at https://github.com/TrustAI/ReRoGCRL. \ No newline at end of file diff --git a/data/2024/aaai/Reproduce, Replicate, Reevaluate. 
The Long but Safe Way to Extend Machine Learning Methods b/data/2024/aaai/Reproduce, Replicate, Reevaluate. The Long but Safe Way to Extend Machine Learning Methods new file mode 100644 index 0000000000..3e0a56c7d2 --- /dev/null +++ b/data/2024/aaai/Reproduce, Replicate, Reevaluate. The Long but Safe Way to Extend Machine Learning Methods @@ -0,0 +1 @@ +Reproducibility is a desirable property of scientific research. On the one hand, it increases confidence in results. On the other hand, reproducible results can be extended on a solid basis. In rapidly developing fields such as machine learning, the latter is particularly important to ensure the reliability of research. In this paper, we present a systematic approach to reproducing (using the available implementation), replicating (using an alternative implementation) and reevaluating (using different datasets) state-of-the-art experiments. This approach enables the early detection and correction of deficiencies and thus the development of more robust and transparent machine learning methods. We detail the independent reproduction, replication, and reevaluation of the initially published experiments with a method that we want to extend. For each step, we identify issues and draw lessons learned. We further discuss solutions that have proven effective in overcoming the encountered problems. This work can serve as a guide for further reproducibility studies and generally improve reproducibility in machine learning. \ No newline at end of file diff --git a/data/2024/aaai/ResDiff: Combining CNN and Diffusion Model for Image Super-resolution b/data/2024/aaai/ResDiff: Combining CNN and Diffusion Model for Image Super-resolution new file mode 100644 index 0000000000..4084f6a8f4 --- /dev/null +++ b/data/2024/aaai/ResDiff: Combining CNN and Diffusion Model for Image Super-resolution @@ -0,0 +1 @@ +Adapting the Diffusion Probabilistic Model (DPM) for direct image super-resolution is wasteful, given that a simple Convolutional Neural Network (CNN) can recover the main low-frequency content. Therefore, we present ResDiff, a novel Diffusion Probabilistic Model based on Residual structure for Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN, which restores primary low-frequency components, and a DPM, which predicts the residual between the ground-truth image and the CNN-predicted image. In contrast to the common diffusion-based methods that directly use LR space to guide the noise towards HR space, ResDiff utilizes the CNN’s initial prediction to direct the noise towards the residual space between HR space and CNN-predicted space, which not only accelerates the generation process but also acquires superior sample quality. Additionally, a frequency-domain-based loss function for the CNN is introduced to facilitate its restoration, and a frequency-domain-guided diffusion process is designed to help the DPM predict high-frequency details. Extensive experiments on multiple benchmark datasets demonstrate that ResDiff outperforms previous diffusion-based methods in terms of shorter model convergence time, superior generation quality, and more diverse samples.
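As a rough illustration of the residual idea in the ResDiff abstract above, the Python sketch below performs one simplified training step: a coarse predictor stands in for the low-frequency CNN, and the diffusion loss is computed on the residual between the ground truth and that coarse prediction. The upsampler, the schedule, and the dummy noise predictor are placeholders assumed for exposition; the paper's actual architecture, conditioning, and frequency-domain losses are not reproduced here.

import numpy as np

rng = np.random.default_rng(1)

def coarse_predictor(lr_img):
    # Placeholder for ResDiff's low-frequency CNN; here just 2x nearest-
    # neighbour upsampling of the low-resolution input.
    return lr_img.repeat(2, axis=0).repeat(2, axis=1)

def eps_predictor(x_t, t, cond):
    # Placeholder for the conditional denoising network (returns a dummy estimate).
    return np.zeros_like(x_t)

def residual_diffusion_loss(hr_img, lr_img, t, T=1000):
    # One simplified training step: the diffusion model operates on the
    # residual (ground truth minus coarse prediction), not the full image.
    coarse = coarse_predictor(lr_img)
    residual = hr_img - coarse                               # target lives in residual space
    a_bar = np.prod(1.0 - np.linspace(1e-4, 0.02, T)[:t])    # linear beta schedule
    eps = rng.standard_normal(residual.shape)
    x_t = np.sqrt(a_bar) * residual + np.sqrt(1.0 - a_bar) * eps
    eps_hat = eps_predictor(x_t, t, cond=coarse)
    return float(np.mean((eps - eps_hat) ** 2))              # noise-prediction loss

hr = rng.standard_normal((16, 16))
lr = hr[::2, ::2]                                            # toy low-resolution input
print(round(residual_diffusion_loss(hr, lr, t=250), 3))

At sampling time the denoised residual would simply be added back onto the coarse prediction, which is why the residual formulation can converge faster than diffusing the full image.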
\ No newline at end of file diff --git a/data/2024/aaai/ResMatch: Residual Attention Learning for Feature Matching b/data/2024/aaai/ResMatch: Residual Attention Learning for Feature Matching new file mode 100644 index 0000000000..3935f6219e --- /dev/null +++ b/data/2024/aaai/ResMatch: Residual Attention Learning for Feature Matching @@ -0,0 +1 @@ +Attention-based graph neural networks have made great progress in feature matching. However, the literature lacks a comprehensive understanding of how the attention mechanism operates for feature matching. In this paper, we rethink cross- and self-attention from the viewpoint of traditional feature matching and filtering. To facilitate the learning of matching and filtering, we incorporate the similarity of descriptors into cross-attention and relative positions into self-attention. In this way, the attention can concentrate on learning residual matching and filtering functions with reference to the basic functions of measuring visual and spatial correlation. Moreover, we leverage descriptor similarity and relative positions to extract inter- and intra-neighbors. Sparse attention for each point can then be performed only within its neighborhoods to achieve higher computational efficiency. Extensive experiments, including feature matching, pose estimation and visual localization, confirm the superiority of the proposed method. Our code is available at https://github.com/ACuOoOoO/ResMatch. \ No newline at end of file diff --git a/data/2024/aaai/Research of Event Reconstruct Based on Multi-View Contrastive Learning (Student Abstract) b/data/2024/aaai/Research of Event Reconstruct Based on Multi-View Contrastive Learning (Student Abstract) new file mode 100644 index 0000000000..05927e81aa --- /dev/null +++ b/data/2024/aaai/Research of Event Reconstruct Based on Multi-View Contrastive Learning (Student Abstract) @@ -0,0 +1 @@ +The proliferation of social media exacerbates information fragmentation, posing challenges to understanding public events. We address the problem of event reconstruction with a novel Multi-view Contrast Event Reconstruction (MCER) model. MCER maximizes feature dissimilarity between different views of the same event using contrastive learning, while minimizing mutual information between distinct events. This aggregates fragmented views to reconstruct comprehensive event representations. MCER employs momentum and weight-sharing encoders in a three-tower architecture with supervised contrastive loss for multi-view representation learning. Due to the scarcity of multi-view public datasets, we construct a new Mul-view-data benchmark. Experiments demonstrate MCER’s superior performance on public data and our Mul-view-data, significantly outperforming self-supervised methods by incorporating supervised contrastive techniques. MCER advances multi-view representation learning to counter information fragmentation and enable robust event understanding. \ No newline at end of file diff --git a/data/2024/aaai/Residual Hyperbolic Graph Convolution Networks b/data/2024/aaai/Residual Hyperbolic Graph Convolution Networks new file mode 100644 index 0000000000..e6b6c0b4f7 --- /dev/null +++ b/data/2024/aaai/Residual Hyperbolic Graph Convolution Networks @@ -0,0 +1,2 @@ +Hyperbolic graph convolutional networks (HGCNs) have demonstrated representational capabilities of modeling hierarchical-structured graphs.
However, as in general GCNs, over-smoothing may occur as the number of model layers increases, limiting the representation capabilities of most current HGCN models. In this paper, we propose residual hyperbolic graph convolutional networks (R-HGCNs) to address the over-smoothing problem. We introduce a hyperbolic residual connection function to overcome the over-smoothing problem, and also theoretically prove the effectiveness of the hyperbolic residual function. Moreover, we use product manifolds and HyperDrop to facilitate the R-HGCNs. The distinctive features of the R-HGCNs are as follows: (1) The hyperbolic residual connection preserves the initial node information in each layer and adds a hyperbolic identity mapping to prevent node features from being indistinguishable. (2) Product manifolds in R-HGCNs have been set up with different origin points in different components to facilitate the extraction of feature information from a wider range of perspectives, which enhances the representing capability of R-HGCNs. (3) HyperDrop adds multiplicative Gaussian noise into hyperbolic representations, such that perturbations can be added to alleviate the over-fitting problem without deconstructing the hyperbolic geometry. +Experiment results demonstrate the effectiveness of R-HGCNs under various graph convolution layers and different structures of product manifolds. \ No newline at end of file diff --git a/data/2024/aaai/Resisting Backdoor Attacks in Federated Learning via Bidirectional Elections and Individual Perspective b/data/2024/aaai/Resisting Backdoor Attacks in Federated Learning via Bidirectional Elections and Individual Perspective new file mode 100644 index 0000000000..b8dad70230 --- /dev/null +++ b/data/2024/aaai/Resisting Backdoor Attacks in Federated Learning via Bidirectional Elections and Individual Perspective @@ -0,0 +1 @@ +Existing approaches defend against backdoor attacks in federated learning (FL) mainly through a) mitigating the impact of infected models, or b) excluding infected models. The former negatively impacts model accuracy, while the latter usually relies on globally clear boundaries between benign and infected model updates. However, in reality, model updates can easily become mixed and scattered throughout due to the diverse distributions of local data. This work focuses on excluding infected models in FL. Unlike previous perspectives from a global view, we propose Snowball, a novel anti-backdoor FL framework through bidirectional elections from an individual perspective inspired by one principle deduced by us and two principles in FL and deep learning. It is characterized by a) bottom-up election, where each candidate model update votes to several peer ones such that a few model updates are elected as selectees for aggregation; and b) top-down election, where selectees progressively enlarge themselves through picking up from the candidates. We compare Snowball with state-of-the-art defenses to backdoor attacks in FL on five real-world datasets, demonstrating its superior resistance to backdoor attacks and slight impact on the accuracy of the global model. \ No newline at end of file diff --git a/data/2024/aaai/Resource Democratization: Is Compute the Binding Constraint on AI Research? b/data/2024/aaai/Resource Democratization: Is Compute the Binding Constraint on AI Research? new file mode 100644 index 0000000000..178de21fa6 --- /dev/null +++ b/data/2024/aaai/Resource Democratization: Is Compute the Binding Constraint on AI Research? 
@@ -0,0 +1 @@ +Access to compute is widely viewed as a primary barrier to AI research progress. Compute resource stratification between academic and industry researchers is therefore a source of concern. Yet the experiences of researchers who might encounter resource constraints in their work have received no direct study. We addressed this gap by conducting a large survey of AI researchers that posed questions about project inputs, outcomes, and challenges. Contrary to popular narratives, responses from more than 500 participants revealed more concern about talent and data limitations than compute access. There were few differences between academic and industry researchers in this regard. The exception was researchers who already use large amounts of compute and expressed a need for more. These findings suggest that interventions to subsidize compute without addressing the limitations on talent and data availability reported by our respondents might cause or exacerbate commonly cited resource inequalities, with unknown impact on the future of equitable research. \ No newline at end of file diff --git a/data/2024/aaai/Resource Efficient Deep Learning Hardware Watermarks with Signature Alignment b/data/2024/aaai/Resource Efficient Deep Learning Hardware Watermarks with Signature Alignment new file mode 100644 index 0000000000..f15294aaa5 --- /dev/null +++ b/data/2024/aaai/Resource Efficient Deep Learning Hardware Watermarks with Signature Alignment @@ -0,0 +1 @@ +Deep learning intellectual properties (IPs) are high-value assets that are frequently susceptible to theft. This vulnerability has led to significant interest in defending the field's intellectual properties from theft. Recently, watermarking techniques have been extended to protect deep learning hardware from piracy. These techniques embed modifications that change the hardware's behavior when activated. In this work, we propose the first method for embedding watermarks in deep learning hardware that incorporates the owner's key samples into the embedding methodology. This improves our watermarks' reliability and efficiency in identifying the hardware over those generated using randomly selected key samples. Our experimental results demonstrate that by considering the target key samples when generating the hardware modifications, we can significantly increase the embedding success rate while targeting fewer functional blocks, decreasing the hardware overhead needed to defend it. \ No newline at end of file diff --git a/data/2024/aaai/Responding to the Call: Exploring Automatic Music Composition Using a Knowledge-Enhanced Model b/data/2024/aaai/Responding to the Call: Exploring Automatic Music Composition Using a Knowledge-Enhanced Model new file mode 100644 index 0000000000..d244eb3db5 --- /dev/null +++ b/data/2024/aaai/Responding to the Call: Exploring Automatic Music Composition Using a Knowledge-Enhanced Model @@ -0,0 +1 @@ +Call-and-response is a musical technique that enriches the creativity of music, crafting coherent musical ideas that mirror the back-and-forth nature of human dialogue with distinct musical characteristics. Although this technique is integral to numerous musical compositions, it remains largely uncharted in automatic music composition. To enhance the creativity of machine-composed music, we first introduce the Call-Response Dataset (CRD) containing 19,155 annotated musical pairs and craft comprehensive objective evaluation metrics for musical assessment.
Then, we design a knowledge-enhanced learning-based method to bridge the gap between human and machine creativity. Specifically, we train the composition module using the call-response pairs, supplementing it with musical knowledge in terms of rhythm, melody, and harmony. Our experimental results underscore that our proposed model adeptly produces a wide variety of creative responses for various musical calls. \ No newline at end of file diff --git a/data/2024/aaai/Response Enhanced Semi-supervised Dialogue Query Generation b/data/2024/aaai/Response Enhanced Semi-supervised Dialogue Query Generation new file mode 100644 index 0000000000..74eb4a7517 --- /dev/null +++ b/data/2024/aaai/Response Enhanced Semi-supervised Dialogue Query Generation @@ -0,0 +1,4 @@ +Leveraging vast and continually updated knowledge from the Internet has been considered an important ability for a dialogue system. Therefore, the dialogue query generation task is proposed for generating search queries from dialogue histories, which will be submitted to a search engine for retrieving relevant websites on the Internet. In this regard, previous efforts were devoted to collecting conversations with annotated queries and training a query producer (QP) via standard supervised learning. However, these studies still face the challenges of data scarcity and domain adaptation. +To address these issues, in this paper, we propose a semi-supervised learning framework -- SemiDQG, to improve model performance with unlabeled conversations. Based on the observation that the search query is typically related to the topic of dialogue response, we train a response-augmented query producer (RA) to provide rich and effective training signals for QP. +We first apply a similarity-based query selection strategy to select high-quality RA-generated pseudo queries, which are used to construct pseudo instances for training QP and RA. +Then, we adopt the REINFORCE algorithm to further enhance QP, with RA-provided rewards as fine-grained training signals. Experimental results and in-depth analysis of three benchmarks show the effectiveness of our framework in cross-domain and low-resource scenarios. Particularly, SemiDQG significantly surpasses ChatGPT and competitive baselines. Our code is available at \url{https://github.com/DeepLearnXMU/SemiDQG}. \ No newline at end of file diff --git a/data/2024/aaai/Responsibility in Extensive Form Games b/data/2024/aaai/Responsibility in Extensive Form Games new file mode 100644 index 0000000000..84f42287db --- /dev/null +++ b/data/2024/aaai/Responsibility in Extensive Form Games @@ -0,0 +1,3 @@ +Two different forms of responsibility, counterfactual and seeing-to-it, have been extensively discussed in philosophy and AI in the context of a single agent or multiple agents acting simultaneously. Although the generalisation of counterfactual responsibility to a setting where multiple agents act in some order is relatively straightforward, the same cannot be said about seeing-to-it responsibility. Two versions of seeing-to-it modality applicable to such settings have been proposed in the literature. Neither of them perfectly captures the intuition of responsibility. The paper proposes a definition of seeing-to-it responsibility for such settings that amalgamate the two modalities. + +The paper shows that the newly proposed notion of responsibility and counterfactual responsibility are not definable through each other and studies the responsibility gap for these two forms of responsibility. 
It shows that although these two forms of responsibility are not enough to ascribe responsibility in each possible situation, this gap does not exist if higher-order responsibility is taken into account. \ No newline at end of file diff --git a/data/2024/aaai/Responsible Bandit Learning via Privacy-Protected Mean-Volatility Utility b/data/2024/aaai/Responsible Bandit Learning via Privacy-Protected Mean-Volatility Utility new file mode 100644 index 0000000000..9bbf3b847a --- /dev/null +++ b/data/2024/aaai/Responsible Bandit Learning via Privacy-Protected Mean-Volatility Utility @@ -0,0 +1 @@ +To ensure the safety of users by protecting their privacy, the traditional privacy-preserving bandit algorithm aiming to maximize the mean reward has been widely studied in scenarios such as online ride-hailing, advertising recommendations, and personalized healthcare. However, classical bandit learning is irresponsible in such practical applications as it fails to account for risks in online decision-making and ignores external system information. This paper first proposes the privacy-protected mean-volatility utility as the objective of bandit learning and proves its responsibility, because it aims at achieving the maximum probability of utility by considering the risk. Theoretically, our proposed responsible bandit learning is expected to achieve the fastest convergence rate among current bandit algorithms and generates more statistical power than the classical normality-based test. Finally, simulation studies provide supporting evidence for the theoretical results and demonstrate stronger performance when using stricter privacy budgets. \ No newline at end of file diff --git a/data/2024/aaai/Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition b/data/2024/aaai/Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition new file mode 100644 index 0000000000..974b0111c5 --- /dev/null +++ b/data/2024/aaai/Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition @@ -0,0 +1 @@ +Prior studies on audio-visual speech recognition typically assume the visibility of speaking lips, ignoring the fact that visual occlusion occurs in real-world videos, thus adversely affecting recognition performance. To address this issue, we propose a framework that restores occluded lips in a video by utilizing both the video itself and the corresponding noisy audio. Specifically, the framework aims to achieve these three tasks: detecting occluded frames, masking occluded areas, and reconstructing masked regions. We tackle the first two issues by utilizing the Class Activation Map (CAM) obtained from occluded frame detection to facilitate the masking of occluded areas. Additionally, we introduce a novel synthesis-matching strategy for the reconstruction to ensure the compatibility of audio features with different levels of occlusion. Our framework is evaluated in terms of Word Error Rate (WER) on the original videos, the videos corrupted by concealed lips, and the videos restored using the framework with several existing state-of-the-art audio-visual speech recognition methods. Experimental results substantiate that our framework significantly mitigates performance degradation resulting from lip occlusion. Under -5dB noise conditions, AV-Hubert's WER increases from 10.62% to 13.87% due to lip occlusion, but recovers to 11.87% in conjunction with the proposed framework.
Furthermore, the framework also demonstrates its capacity to produce natural synthesized images in qualitative assessments. \ No newline at end of file diff --git a/data/2024/aaai/RetLLM-E: Retrieval-Prompt Strategy for Question-Answering on Student Discussion Forums b/data/2024/aaai/RetLLM-E: Retrieval-Prompt Strategy for Question-Answering on Student Discussion Forums new file mode 100644 index 0000000000..659b8a9c06 --- /dev/null +++ b/data/2024/aaai/RetLLM-E: Retrieval-Prompt Strategy for Question-Answering on Student Discussion Forums @@ -0,0 +1,5 @@ +This paper focuses on using Large Language Models to support teaching assistants in answering questions on large student forums such as Piazza and EdSTEM. Since student questions on these forums are often closely tied to specific aspects of the institution, instructor, and course delivery, general-purpose LLMs do not directly do well on this task. +We introduce RetLLM-E, a method that combines text-retrieval and prompting approaches to enable LLMs to provide precise and high-quality answers to student questions. When presented with a student question, our system initiates a two-step process. First, it retrieves relevant context from (i) a dataset of student questions addressed by course instructors +(Q&A Retrieval) and (ii) relevant segments of course materials (Document Retrieval). RetLLM-E then prompts LLM using the retrieved text and an engineered prompt structure to +yield an answer optimized for the student question. +We present a set of quantitative and human evaluation experiments, comparing our method to ground truth answers to questions in a test set of actual student questions. Our results demonstrate that our approach provides higher-quality responses to course-related questions than an LLM operating without context or relying solely on retrieval-based context. RetLLM-E can easily be adopted in different courses, providing instructors and students with context-aware automatic responses. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers (Student Abstract) b/data/2024/aaai/Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers (Student Abstract) new file mode 100644 index 0000000000..de8d7431a9 --- /dev/null +++ b/data/2024/aaai/Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers (Student Abstract) @@ -0,0 +1,2 @@ +This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these ”attentionless Transformers” to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. 
This not only sheds light on the adaptability of shallow feed-forward +networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Causal Relationships Learning in Graph Neural Networks b/data/2024/aaai/Rethinking Causal Relationships Learning in Graph Neural Networks new file mode 100644 index 0000000000..2886d51b01 --- /dev/null +++ b/data/2024/aaai/Rethinking Causal Relationships Learning in Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) demonstrate their significance by effectively modeling complex interrelationships within graph-structured data. To enhance the credibility and robustness of GNNs, it becomes exceptionally crucial to bolster their ability to capture causal relationships. However, despite recent advancements that have indeed strengthened GNNs from a causal learning perspective, conducting an in-depth analysis specifically targeting the causal modeling prowess of GNNs remains an unresolved issue. In order to comprehensively analyze various GNN models from a causal learning perspective, we constructed an artificially synthesized dataset with known and controllable causal relationships between data and labels. The rationality of the generated data is further ensured through theoretical foundations. Drawing insights from analyses conducted using our dataset, we introduce a lightweight and highly adaptable GNN module designed to strengthen GNNs' causal learning capabilities across a diverse range of tasks. Through a series of experiments conducted on both synthetic datasets and other real-world datasets, we empirically validate the effectiveness of the proposed module. The codes are available at https://github.com/yaoyao-yaoyao-cell/CRCG. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Dimensional Rationale in Graph Contrastive Learning from Causal Perspective b/data/2024/aaai/Rethinking Dimensional Rationale in Graph Contrastive Learning from Causal Perspective new file mode 100644 index 0000000000..5ea191f5fb --- /dev/null +++ b/data/2024/aaai/Rethinking Dimensional Rationale in Graph Contrastive Learning from Causal Perspective @@ -0,0 +1 @@ +Graph contrastive learning is a general learning paradigm excelling at capturing invariant information from diverse perturbations in graphs. Recent works focus on exploring the structural rationale from graphs, thereby increasing the discriminability of the invariant information. However, such methods may incur in the mis-learning of graph models towards the interpretability of graphs, and thus the learned noisy and task-agnostic information interferes with the prediction of graphs. To this end, with the purpose of exploring the intrinsic rationale of graphs, we accordingly propose to capture the dimensional rationale from graphs, which has not received sufficient attention in the literature. The conducted exploratory experiments attest to the feasibility of the aforementioned roadmap. To elucidate the innate mechanism behind the performance improvement arising from the dimensional rationale, we rethink the dimensional rationale in graph contrastive learning from a causal perspective and further formalize the causality among the variables in the pre-training stage to build the corresponding structural causal model. 
On the basis of the understanding of the structural causal model, we propose the dimensional rationale-aware graph contrastive learning approach, which introduces a learnable dimensional rationale acquiring network and a redundancy reduction constraint. The learnable dimensional rationale acquiring network is updated by leveraging a bi-level meta-learning technique, and the redundancy reduction constraint disentangles the redundant features through a decorrelation process during learning. Empirically, compared with state-of-the-art methods, our method can yield significant performance boosts on various benchmarks with respect to discriminability and transferability. The code implementation of our method is available at https://github.com/ByronJi/DRGCL. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Graph Masked Autoencoders through Alignment and Uniformity b/data/2024/aaai/Rethinking Graph Masked Autoencoders through Alignment and Uniformity new file mode 100644 index 0000000000..36b9e85163 --- /dev/null +++ b/data/2024/aaai/Rethinking Graph Masked Autoencoders through Alignment and Uniformity @@ -0,0 +1 @@ +Self-supervised learning on graphs can be bifurcated into contrastive and generative methods. Contrastive methods, also known as graph contrastive learning (GCL), have dominated graph self-supervised learning in the past few years, but the recent advent of graph masked autoencoder (GraphMAE) rekindles the momentum behind generative methods. Despite the empirical success of GraphMAE, there is still a dearth of theoretical understanding regarding its efficacy. Moreover, while both generative and contrastive methods have been shown to be effective, their connections and differences have yet to be thoroughly investigated. Therefore, we theoretically build a bridge between GraphMAE and GCL, and prove that the node-level reconstruction objective in GraphMAE implicitly performs context-level GCL. Based on our theoretical analysis, we further identify the limitations of the GraphMAE from the perspectives of alignment and uniformity, which have been considered as two key properties of high-quality representations in GCL. We point out that GraphMAE's alignment performance is restricted by the masking strategy, and the uniformity is not strictly guaranteed. To remedy the aforementioned limitations, we propose an Alignment-Uniformity enhanced Graph Masked AutoEncoder, named AUG-MAE. Specifically, we propose an easy-to-hard adversarial masking strategy to provide hard-to-align samples, which improves the alignment performance. Meanwhile, we introduce an explicit uniformity regularizer to ensure the uniformity of the learned representations. Experimental results on benchmark datasets demonstrate the superiority of our model over existing state-of-the-art methods. The code is available at: https://github.com/AzureLeon1/AUG-MAE. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Mesh Watermark: Towards Highly Robust and Adaptable Deep 3D Mesh Watermarking b/data/2024/aaai/Rethinking Mesh Watermark: Towards Highly Robust and Adaptable Deep 3D Mesh Watermarking new file mode 100644 index 0000000000..bb7dc08a74 --- /dev/null +++ b/data/2024/aaai/Rethinking Mesh Watermark: Towards Highly Robust and Adaptable Deep 3D Mesh Watermarking @@ -0,0 +1 @@ +The goal of 3D mesh watermarking is to embed the message in 3D meshes that can withstand various attacks imperceptibly and reconstruct the message accurately from watermarked meshes. 
The watermarking algorithm is supposed to withstand multiple attacks, and the complexity should not grow significantly with the mesh size. Unfortunately, previous methods are less robust against attacks and lack adaptability. In this paper, we propose a robust and adaptable deep 3D mesh watermarking method, Deep3DMark, that leverages attention-based convolutions in watermarking tasks to embed binary messages in vertex distributions without texture assistance. Furthermore, our Deep3DMark exploits the property that simplified meshes inherit similar relations from the original ones, where the relation is the offset vector directed from one vertex to its neighbor. By doing so, our method can be trained on simplified meshes but remains effective on large-size meshes (size adaptable) and unseen categories of meshes (geometry adaptable). Extensive experiments demonstrate our method remains efficient and effective even when the mesh size is increased by 190×. Under mesh attacks, Deep3DMark achieves 10%∼50% higher accuracy than traditional methods, and 2× higher SNR and 8% higher accuracy than previous DNN-based methods. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Multi-Scale Representations in Deep Deraining Transformer b/data/2024/aaai/Rethinking Multi-Scale Representations in Deep Deraining Transformer new file mode 100644 index 0000000000..8742e1fd81 --- /dev/null +++ b/data/2024/aaai/Rethinking Multi-Scale Representations in Deep Deraining Transformer @@ -0,0 +1 @@ +Existing Transformer-based image deraining methods depend mostly on a fixed single-input single-output U-Net architecture. In fact, this not only neglects the potentially explicit information from multiple image scales, but also lacks the capability of exploring the complementary implicit information across different scales. In this work, we rethink the multi-scale representations and design an effective multi-input multi-output framework that constructs intra- and inter-scale hierarchical modulation to better facilitate rain removal and help image restoration. We observe that rain levels reduce dramatically in coarser image scales, thus proposing to restore rain-free results from the coarsest scale to the finest scale in image pyramid inputs, which also alleviates the difficulty of model learning. Specifically, we integrate a sparsity-compensated Transformer block and a frequency-enhanced convolutional block into a coupled representation module, in order to jointly learn the intra-scale content-aware features. To enable representations learned at different scales to communicate with each other, we leverage a gated fusion module to adaptively aggregate the inter-scale spatial-aware features, which are rich in correlated information of rain appearances, leading to high-quality results. Extensive experiments demonstrate that our model achieves consistent gains on five benchmarks.
These breakthroughs have paved the way for constructing datasets via generative artificial intelligence (AI), offering immense potential for various applications. However, two critical challenges hinder the widespread adoption of synthesized data: computational cost and the generation of peculiar images. While computational costs have improved through various approaches, the issue of peculiar image generation remains relatively unexplored. Existing solutions rely on heuristics, extra training, or AI-based post-processing to mitigate this problem. In this paper, we present a novel approach to address both issues simultaneously. We establish that both gradient descent and diffusion sampling are specific cases of the generalized expectation maximization algorithm. We hypothesize and empirically demonstrate that peculiar image generation is akin to the local minima problem in optimization. Inspired by optimization techniques, we apply naive momentum and positive-negative momentum to diffusion sampling. Last, we propose new metrics to evaluate the peculiarity. Experimental results show momentum effectively prevents peculiar image generation without extra computation. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Propagation for Unsupervised Graph Domain Adaptation b/data/2024/aaai/Rethinking Propagation for Unsupervised Graph Domain Adaptation new file mode 100644 index 0000000000..61819a924e --- /dev/null +++ b/data/2024/aaai/Rethinking Propagation for Unsupervised Graph Domain Adaptation @@ -0,0 +1 @@ +Unsupervised Graph Domain Adaptation (UGDA) aims to transfer knowledge from a labelled source graph to an unlabelled target graph in order to address the distribution shifts between graph domains. Previous works have primarily focused on aligning data from the source and target graph in the representation space learned by graph neural networks (GNNs). However, the inherent generalization capability of GNNs has been largely overlooked. Motivated by our empirical analysis, we reevaluate the role of GNNs in graph domain adaptation and uncover the pivotal role of the propagation process in GNNs for adapting to different graph domains. We provide a comprehensive theoretical analysis of UGDA and derive a generalization bound for multi-layer GNNs. By formulating GNN Lipschitz for k-layer GNNs, we show that the target risk bound can be tighter by removing propagation layers in source graph and stacking multiple propagation layers in target graph. Based on the empirical and theoretical analysis mentioned above, we propose a simple yet effective approach called A2GNN for graph domain adaptation. Through extensive experiments on real-world datasets, we demonstrate the effectiveness of our proposed A2GNN framework. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Reverse Distillation for Multi-Modal Anomaly Detection b/data/2024/aaai/Rethinking Reverse Distillation for Multi-Modal Anomaly Detection new file mode 100644 index 0000000000..321800f2a2 --- /dev/null +++ b/data/2024/aaai/Rethinking Reverse Distillation for Multi-Modal Anomaly Detection @@ -0,0 +1 @@ +In recent years, there has been significant progress in employing color images for anomaly detection in industrial scenarios, but it is insufficient for identifying anomalies that are invisible in RGB images alone. As a supplement, introducing extra modalities such as depth and surface normal maps can be helpful to detect these anomalies. 
To this end, we present a novel Multi-Modal Reverse Distillation (MMRD) paradigm that consists of a frozen multi-modal teacher encoder to generate distillation targets and a learnable student decoder that aims to restore multi-modal representations from the teacher. Specifically, the teacher extracts complementary visual features from different modalities via a siamese architecture and then fuses this information from multiple levels in a parameter-free manner as the targets of distillation. The student, in turn, learns modality-related priors from the teacher representations of normal training data and performs interaction between them to form multi-modal representations for target reconstruction. Extensive experiments show that our MMRD outperforms recent state-of-the-art methods on both anomaly detection and localization on MVTec-3D AD and Eyecandies benchmarks. Codes will be available upon acceptance. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Robustness of Model Attributions b/data/2024/aaai/Rethinking Robustness of Model Attributions new file mode 100644 index 0000000000..0427478321 --- /dev/null +++ b/data/2024/aaai/Rethinking Robustness of Model Attributions @@ -0,0 +1 @@ +For machine learning models to be reliable and trustworthy, their decisions must be interpretable. As these models find increasing use in safety-critical applications, it is important that not just the model predictions but also their explanations (as feature attributions) be robust to small human-imperceptible input perturbations. Recent works have shown that many attribution methods are fragile and have proposed improvements in either these methods or the model training. We observe two main causes for fragile attributions: first, the existing metrics of robustness (e.g., top-k intersection) overpenalize even reasonable local shifts in attribution, thereby making random perturbations appear to be a strong attack, and second, the attribution can be concentrated in a small region even when there are multiple important parts in an image. To rectify this, we propose simple ways to strengthen existing metrics and attribution methods that incorporate locality of pixels in robustness metrics and diversity of pixel locations in attributions. Regarding the role of model training in attributional robustness, we empirically observe that adversarially trained models have more robust attributions on smaller datasets; however, this advantage disappears on larger datasets. Code is made available at https://github.com/ksandeshk/LENS. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point b/data/2024/aaai/Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point new file mode 100644 index 0000000000..7c91fbc004 --- /dev/null +++ b/data/2024/aaai/Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point @@ -0,0 +1 @@ +As a fundamental and challenging task in the vision and language domain, Referring Expression Comprehension (REC) has shown impressive improvements recently. However, for a complex task that couples the comprehension of abstract concepts and the localization of concrete instances, one-stage approaches are bottlenecked by computing and data resources.
To obtain a low-cost solution, the prevailing two-stage approaches decouple REC into localization (region proposal) and comprehension (region-expression matching) at the region level, but the solution based on isolated regions cannot sufficiently utilize the context and is usually limited by the quality of proposals. Therefore, it is necessary to rebuild an efficient two-stage solution system. In this paper, we propose a point-based two-stage framework for REC, in which the two stages are redefined as point-based cross-modal comprehension and point-based instance localization. Specifically, we reconstruct the raw bounding box and segmentation mask into center and mass scores as soft ground-truth for measuring point-level cross-modal correlations. With the soft ground-truth, REC can be approximated as a binary classification problem, which fundamentally avoids the impact of isolated regions on the optimization process. Remarkably, the consistent metrics between center and mass scores allow our system to directly optimize grounding and segmentation by utilizing the same architecture. Experiments on multiple benchmarks show the feasibility and potential of our point-based paradigm. Our code is available at https://github.com/VILAN-Lab/PBREC-MT. \ No newline at end of file diff --git a/data/2024/aaai/Rethinking the Development of Large Language Models from the Causal Perspective: A Legal Text Prediction Case Study b/data/2024/aaai/Rethinking the Development of Large Language Models from the Causal Perspective: A Legal Text Prediction Case Study new file mode 100644 index 0000000000..d46228068d --- /dev/null +++ b/data/2024/aaai/Rethinking the Development of Large Language Models from the Causal Perspective: A Legal Text Prediction Case Study @@ -0,0 +1 @@ +While large language models (LLMs) exhibit impressive performance on a wide range of NLP tasks, most of them fail to learn causality from correlation, which prevents them from learning rationales for prediction. Rethinking the whole development process of LLMs is of great urgency as they are adopted in various critical tasks that need rationales, including legal text prediction (e.g., legal judgment prediction). In this paper, we first explain the underlying theoretical mechanism of their failure and argue that both the data imbalance and the omission of causality in model design and selection render the current training-testing paradigm unable to select the unique causality-based model from correlation-based models. Second, we take the legal text prediction task as the testbed and reconstruct the development process of LLMs by simultaneously infusing causality into model architectures and organizing causality-based adversarial attacks for evaluation. Specifically, we base our reconstruction on our theoretical analysis and propose a causality-aware self-attention mechanism (CASAM), which prevents LLMs from entangling causal and non-causal information by restricting the interaction between causal and non-causal words. Meanwhile, we propose eight kinds of legal-specific attacks to form causality-based model selection. Our extensive experimental results demonstrate that our proposed CASAM achieves state-of-the-art (SOTA) performances and the strongest robustness on three commonly used legal text prediction benchmarks. We make our code publicly available at https://github.com/Carrot-Red/Rethink-LLM-development.
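The general idea of restricting interaction between two token groups inside self-attention, as described for CASAM, can be sketched with a plain additive/boolean attention mask. The snippet below is a generic single-head illustration under our own assumptions (token grouping, shapes, and function name are ours); it is not the actual CASAM architecture from the paper:

```python
import torch

def group_restricted_self_attention(x, causal_idx):
    """Single-head dot-product self-attention where tokens marked 'causal' and
    the remaining 'non-causal' tokens cannot attend to each other.
    x: (seq_len, d) token representations; causal_idx: list of token positions.
    Illustrative sketch only, not the paper's CASAM implementation."""
    seq_len, d = x.shape
    scores = x @ x.T / d ** 0.5                     # (seq_len, seq_len) attention scores

    group = torch.zeros(seq_len, dtype=torch.bool)  # False = non-causal, True = causal
    group[causal_idx] = True
    cross_group = group.unsqueeze(0) != group.unsqueeze(1)  # True where tokens differ in group

    scores = scores.masked_fill(cross_group, float("-inf"))
    attn = torch.softmax(scores, dim=-1)            # each token attends only within its group
    return attn @ x

# Toy usage: 6 tokens of dimension 16, positions 0, 2, 3 treated as 'causal'.
out = group_restricted_self_attention(torch.randn(6, 16), causal_idx=[0, 2, 3])
```

Blocking the cross-group entries before the softmax is what keeps causal and non-causal representations from mixing during aggregation, which is the property the abstract attributes to the restricted interaction.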
\ No newline at end of file diff --git a/data/2024/aaai/Rethinking the Paradigm of Content Constraints in Unpaired Image-to-Image Translation b/data/2024/aaai/Rethinking the Paradigm of Content Constraints in Unpaired Image-to-Image Translation new file mode 100644 index 0000000000..6effe05fc9 --- /dev/null +++ b/data/2024/aaai/Rethinking the Paradigm of Content Constraints in Unpaired Image-to-Image Translation @@ -0,0 +1 @@ +In an unpaired setting, lacking sufficient content constraints for image-to-image translation (I2I) tasks, GAN-based approaches are usually prone to model collapse. Current solutions can be divided into two categories, reconstruction-based and Siamese network-based. The former requires that the transformed or transforming image can be perfectly converted back to the original image, which is sometimes too strict and limits the generative performance. The latter involves feeding the original and generated images into a feature extractor and then matching their outputs. This is not efficient enough, and a universal feature extractor is not easily available. In this paper, we propose EnCo, a simple but efficient way to maintain the content by constraining the representational similarity in the latent space of patch-level features from the same stage of the encoder and decoder of the generator. For the similarity function, we use a simple MSE loss instead of contrastive loss, which is currently widely used in I2I tasks. Benefiting from this design, EnCo training is extremely efficient, while the features from the encoder produce a more positive effect on the decoding, leading to more satisfying generations. In addition, we rethink the role played by discriminators in sampling patches and propose a discriminative attention-guided (DAG) patch sampling strategy to replace random sampling. DAG is parameter-free and only requires negligible computational overhead, while significantly improving the performance of the model. Extensive experiments on multiple datasets demonstrate the effectiveness and advantages of EnCo, and we achieve multiple state-of-the-art results compared to previous methods. \ No newline at end of file diff --git a/data/2024/aaai/RetouchFormer: Semi-supervised High-Quality Face Retouching Transformer with Prior-Based Selective Self-Attention b/data/2024/aaai/RetouchFormer: Semi-supervised High-Quality Face Retouching Transformer with Prior-Based Selective Self-Attention new file mode 100644 index 0000000000..5bc5a9c4e8 --- /dev/null +++ b/data/2024/aaai/RetouchFormer: Semi-supervised High-Quality Face Retouching Transformer with Prior-Based Selective Self-Attention @@ -0,0 +1 @@ +Face retouching aims to beautify a face image while preserving the image content as much as possible. It is a promising yet challenging task to remove face imperfections and fill them with normal skin. Generic image enhancement methods are hampered by the lack of imperfection localization, which often results in incomplete removal of blemishes at large scales. To address this issue, we propose a transformer-based approach, RetouchFormer, which simultaneously identifies imperfections and synthesizes realistic content in the corresponding regions. Specifically, we learn a latent dictionary to capture the clean face priors, and predict the imperfection regions via a reconstruction-oriented localization module.
Also based on this, we can realize face retouching by explicitly suppressing imperfections in our selective self-attention computation, such that local content will be synthesized from normal skin. On the other hand, multi-scale feature tokens lead to increased flexibility in dealing with the imperfections at various scales. The design elements bring greater effectiveness and efficiency. RetouchFormer outperforms the advanced face retouching methods and synthesizes clean face images with high fidelity in our list of extensive experiments performed. \ No newline at end of file diff --git a/data/2024/aaai/Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning b/data/2024/aaai/Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning new file mode 100644 index 0000000000..954a0af034 --- /dev/null +++ b/data/2024/aaai/Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning @@ -0,0 +1 @@ +Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. Composing the learned knowledge of seen primitives, i.e., attributes or objects, into novel compositions is critical for CZSL. In this work, we propose to explicitly retrieve knowledge of seen primitives for compositional zero-shot learning. We present a retrieval-augmented method, which augments standard multi-path classification methods with two retrieval modules. Specifically, we construct two databases storing the attribute and object representations of training images, respectively. For an input training/testing image, we use two retrieval modules to retrieve representations of training images with the same attribute and object, respectively. The primitive representations of the input image are augmented by using the retrieved representations, for composition recognition. By referencing semantically similar images, the proposed method is capable of recalling knowledge of seen primitives for compositional generalization. Experiments on three widely-used datasets show the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/RetroOOD: Understanding Out-of-Distribution Generalization in Retrosynthesis Prediction b/data/2024/aaai/RetroOOD: Understanding Out-of-Distribution Generalization in Retrosynthesis Prediction new file mode 100644 index 0000000000..3cc7d6a0ff --- /dev/null +++ b/data/2024/aaai/RetroOOD: Understanding Out-of-Distribution Generalization in Retrosynthesis Prediction @@ -0,0 +1 @@ +Machine learning-assisted retrosynthesis prediction models have been gaining widespread adoption, though their performances oftentimes degrade significantly when deployed in real-world applications embracing out-of-distribution (OOD) molecules or reactions. Despite steady progress on standard benchmarks, our understanding of existing retrosynthesis prediction models under the premise of distribution shifts remains stagnant. To this end, we first formally sort out two types of distribution shifts in retrosynthesis prediction and construct two groups of benchmark datasets. Next, through comprehensive experiments, we systematically compare state-of-the-art retrosynthesis prediction models on the two groups of benchmarks, revealing the limitations of previous in-distribution evaluation and re-examining the advantages of each model. 
More remarkably, we are motivated by the above empirical insights to propose two model-agnostic techniques that can improve the OOD generalization of arbitrary off-the-shelf retrosynthesis prediction algorithms. Our preliminary experiments show their high potential with an average performance improvement of 4.6%, and the established benchmarks serve as a foothold for further retrosynthesis prediction research towards OOD generalization. \ No newline at end of file diff --git a/data/2024/aaai/Revealing the Proximate Long-Tail Distribution in Compositional Zero-Shot Learning b/data/2024/aaai/Revealing the Proximate Long-Tail Distribution in Compositional Zero-Shot Learning new file mode 100644 index 0000000000..3af71fba7e --- /dev/null +++ b/data/2024/aaai/Revealing the Proximate Long-Tail Distribution in Compositional Zero-Shot Learning @@ -0,0 +1 @@ +Compositional Zero-Shot Learning (CZSL) aims to transfer knowledge from seen state-object pairs to novel unseen pairs. In this process, visual bias caused by the diverse interrelationship of state-object combinations blurs their visual features, hindering the learning of distinguishable class prototypes. Prevailing methods concentrate on disentangling states and objects directly from visual features, disregarding potential enhancements that could arise from a data viewpoint. Experimentally, we unveil the results caused by the above problem closely approximate the long-tailed distribution. As a solution, we transform CZSL into a proximate class imbalance problem. We mathematically deduce the role of class prior within the long-tailed distribution in CZSL. Building upon this insight, we incorporate visual bias caused by compositions into the classifier's training and inference by estimating it as a proximate class prior. This enhancement encourages the classifier to acquire more discernible class prototypes for each composition, thereby achieving more balanced predictions. Experimental results demonstrate that our approach elevates the model's performance to the state-of-the-art level, without introducing additional parameters. \ No newline at end of file diff --git a/data/2024/aaai/Reverse Multi-Choice Dialogue Commonsense Inference with Graph-of-Thought b/data/2024/aaai/Reverse Multi-Choice Dialogue Commonsense Inference with Graph-of-Thought new file mode 100644 index 0000000000..f377027290 --- /dev/null +++ b/data/2024/aaai/Reverse Multi-Choice Dialogue Commonsense Inference with Graph-of-Thought @@ -0,0 +1,8 @@ +With the proliferation of dialogic data across the Internet, the Dialogue Commonsense Multi-choice Question Answering (DC-MCQ) task has emerged as a response to the challenge of comprehending user queries and intentions. +Although prevailing methodologies exhibit effectiveness in addressing single-choice questions, they encounter difficulties in handling multi-choice queries due to the heightened intricacy and informational density. +In this paper, inspired by the human cognitive process of progressively excluding options, we propose a three-step Reverse Exclusion Graph-of-Thought (ReX-GoT) framework, including Option Exclusion, Error Analysis, and Combine Information. +Specifically, our ReX-GoT mimics human reasoning by gradually excluding irrelevant options and learning the reasons for option errors to choose the optimal path of the GoT and ultimately infer the correct answer. 
+By progressively integrating intricate clues, our method effectively reduces the difficulty of multi-choice reasoning and provides a novel solution for DC-MCQ. +Extensive experiments on the CICERO and CICERO_v2 datasets validate the significant improvement of our approach on the DC-MCQ task. +In the zero-shot setting, our model outperforms the best baseline by 17.67% in terms of F1 score for the multi-choice task. +Most strikingly, our GPT3.5-based ReX-GoT framework achieves a remarkable 39.44% increase in F1 score. \ No newline at end of file diff --git a/data/2024/aaai/Review-Enhanced Hierarchical Contrastive Learning for Recommendation b/data/2024/aaai/Review-Enhanced Hierarchical Contrastive Learning for Recommendation new file mode 100644 index 0000000000..a1e4fcf9fd --- /dev/null +++ b/data/2024/aaai/Review-Enhanced Hierarchical Contrastive Learning for Recommendation @@ -0,0 +1 @@ +Designed to establish potential relations and distill high-order representations, graph-based recommendation systems continue to reveal promising results by jointly modeling ratings and reviews. However, existing studies capture simple review relations, failing to (1) completely explore hidden connections between users (or items), (2) filter out redundant information derived from reviews, and (3) model the behavioral association between rating and review interactions. To address these challenges, we propose a review-enhanced hierarchical contrastive learning method, namely ReHCL. First, ReHCL constructs topic and semantic graphs to fully mine review relations from different views. Moreover, cross-view graph contrastive learning is used to achieve enhancement of node representations and extract useful review knowledge. Meanwhile, we design a neighbor-based positive sampling strategy to capture the graph-structured similarity between topic and semantic views, further performing efficient contrast and reducing redundant noise. Next, we propose cross-modal contrastive learning to match the rating and review representations, by exploring the association between ratings and reviews. Lastly, these two contrastive learning modes form a hierarchical contrastive learning task, which is applied to enhance the final recommendation task. Extensive experiments verify the superiority of ReHCL compared with state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Reviewing the Forgotten Classes for Domain Adaptation of Black-Box Predictors b/data/2024/aaai/Reviewing the Forgotten Classes for Domain Adaptation of Black-Box Predictors new file mode 100644 index 0000000000..acfaf708c6 --- /dev/null +++ b/data/2024/aaai/Reviewing the Forgotten Classes for Domain Adaptation of Black-Box Predictors @@ -0,0 +1 @@ +To address the data privacy and portability issues of domain adaptation, Domain Adaptation of Black-box Predictors (DABP) aims to adapt a black-box source model to an unlabeled target domain without accessing either the source-domain data or the details of the source model. Although existing DABP approaches based on knowledge distillation (KD) have achieved promising results, we experimentally find that these methods all have the minority class forgetting issue, which means that the trained model completely forgets some minority classes. To address this issue, we propose a method called Reviewing the Forgotten Classes (RFC), which includes two main modules. Firstly, we propose a simple but effective component called selection training (ST).
ST selects classes that the model tends to forget according to the learning status of the model and obtains clean samples of the selected classes with the small-loss criterion for enhanced training. ST is orthogonal to previous methods and can effectively alleviate their minority class forgetting issue. Secondly, we find that neighborhood clustering (NC) can help the model learn in a more balanced way than KD, which further alleviates the minority class forgetting issue. However, NC is based on the fact that target features from the source model already form some semantic structure, while DABP is unable to obtain the source model. Thus, we use KD and ST to warm up the target model to form a certain semantic structure. Overall, our method inherits the merits of both ST and NC, and achieves state-of-the-art results on three DABP benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning b/data/2024/aaai/Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning new file mode 100644 index 0000000000..ff00923b3e --- /dev/null +++ b/data/2024/aaai/Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning @@ -0,0 +1 @@ +In representation learning, a disentangled representation is highly desirable as it encodes generative factors of data in a separable and compact pattern. Researchers have advocated leveraging disentangled representations to complete downstream tasks with encouraging empirical evidence. This paper further investigates the necessity of disentangled representation in downstream applications. Specifically, we show that dimension-wise disentangled representations are unnecessary on a fundamental downstream task, abstract visual reasoning. We provide extensive empirical evidence against the necessity of disentanglement, covering multiple datasets, representation learning methods, and downstream network architectures. Furthermore, our findings suggest that the informativeness of representations is a better indicator of downstream performance than disentanglement. Finally, the positive correlation between informativeness and disentanglement explains the claimed usefulness of disentangled representations in previous works. The source code is available at https://github.com/Richard-coder-Nai/disentanglement-lib-necessity.git \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Document-Level Relation Extraction with Context-Guided Link Prediction b/data/2024/aaai/Revisiting Document-Level Relation Extraction with Context-Guided Link Prediction new file mode 100644 index 0000000000..adaa628171 --- /dev/null +++ b/data/2024/aaai/Revisiting Document-Level Relation Extraction with Context-Guided Link Prediction @@ -0,0 +1 @@ +Document-level relation extraction (DocRE) poses the challenge of identifying relationships between entities within a document. Existing approaches rely on logical reasoning or contextual cues from entities. This paper reframes document-level RE as link prediction over a Knowledge Graph (KG) with distinct benefits: 1) Our approach amalgamates entity context and document-derived logical reasoning, enhancing link prediction quality. 2) Predicted links between entities offer interpretability, elucidating employed reasoning. We evaluate our approach on benchmark datasets - DocRED, ReDocRED, and DWIE.
The results indicate that our proposed method outperforms the state-of-the-art models and suggest that incorporating context-based Knowledge Graph link prediction techniques can enhance the performance of document-level relation extraction models. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Gradient Pruning: A Dual Realization for Defending against Gradient Attacks b/data/2024/aaai/Revisiting Gradient Pruning: A Dual Realization for Defending against Gradient Attacks new file mode 100644 index 0000000000..0f87d78529 --- /dev/null +++ b/data/2024/aaai/Revisiting Gradient Pruning: A Dual Realization for Defending against Gradient Attacks @@ -0,0 +1 @@ +Collaborative learning (CL) is a distributed learning framework that aims to protect user privacy by allowing users to jointly train a model by sharing their gradient updates only. However, gradient inversion attacks (GIAs), which recover users' training data from shared gradients, pose severe privacy threats to CL. Existing defense methods adopt different techniques, e.g., differential privacy, cryptography, and perturbation defenses, to defend against GIAs. Nevertheless, all current defense methods suffer from a poor trade-off between privacy, utility, and efficiency. To mitigate the weaknesses of existing solutions, we propose a novel defense method, Dual Gradient Pruning (DGP), based on gradient pruning, which can improve communication efficiency while preserving the utility and privacy of CL. Specifically, DGP slightly modifies gradient pruning to provide a stronger privacy guarantee. DGP also significantly improves communication efficiency, and we provide a theoretical analysis of its convergence and generalization. Our extensive experiments show that DGP can effectively defend against the most powerful GIAs and reduce the communication cost without sacrificing the model's utility. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Graph-Based Fraud Detection in Sight of Heterophily and Spectrum b/data/2024/aaai/Revisiting Graph-Based Fraud Detection in Sight of Heterophily and Spectrum new file mode 100644 index 0000000000..2bdac404fc --- /dev/null +++ b/data/2024/aaai/Revisiting Graph-Based Fraud Detection in Sight of Heterophily and Spectrum @@ -0,0 +1 @@ +Graph-based fraud detection (GFD) can be regarded as a challenging semi-supervised node binary classification task. In recent years, Graph Neural Networks (GNN) have been widely applied to GFD, characterizing the anomalous possibility of a node by aggregating neighbor information. However, fraud graphs are inherently heterophilic, thus most GNNs perform poorly due to their assumption of homophily. In addition, due to the existence of the heterophily and class imbalance problems, the existing models do not fully utilize the precious node label information. To address the above issues, this paper proposes a semi-supervised GNN-based fraud detector, SEC-GFD. This detector includes a hybrid filtering module and a local environmental constraint module; the two modules are utilized to solve the heterophily and label utilization problems, respectively. The first module starts from the perspective of the spectral domain, and solves the heterophily problem to a certain extent. Specifically, it divides the spectrum into various mixed-frequency bands based on the correlation between spectrum energy distribution and heterophily. Then in order to make full use of the node label information, a local environmental constraint module is adaptively designed.
The comprehensive experimental results on four real-world fraud detection datasets demonstrate that SEC-GFD outperforms other competitive graph-based fraud detectors. We release our code at https://github.com/Sunxkissed/SEC-GFD. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting Open-Set Panoptic Segmentation b/data/2024/aaai/Revisiting Open-Set Panoptic Segmentation new file mode 100644 index 0000000000..b5344fd889 --- /dev/null +++ b/data/2024/aaai/Revisiting Open-Set Panoptic Segmentation @@ -0,0 +1 @@ +In this paper, we focus on the open-set panoptic segmentation (OPS) task to circumvent the data explosion problem. Different from the closed-set setting, OPS aims to detect both known and unknown categories, where the latter are not annotated during training. Different from existing work that only selects a few common categories as unknown ones, we move forward to the real-world scenario by considering the various tail categories (~1k). To this end, we first build a new dataset with a long-tail distribution for the OPS task. Based on this dataset, we additionally add a new class type for unknown classes and re-define the training annotations to make the OPS definition more complete and reasonable. Moreover, we analyze the influence of several significant factors in the OPS task and explore the upper bound of performance on unknown classes with different settings. Furthermore, based on the analyses, we design an effective two-phase framework for the OPS task, including thing-agnostic map generation and unknown segment mining. We further adopt semi-supervised learning to improve the OPS performance. Experimental results on different datasets validate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/Revisiting the Information Capacity of Neural Network Watermarks: Upper Bound Estimation and Beyond b/data/2024/aaai/Revisiting the Information Capacity of Neural Network Watermarks: Upper Bound Estimation and Beyond new file mode 100644 index 0000000000..977d7c00f1 --- /dev/null +++ b/data/2024/aaai/Revisiting the Information Capacity of Neural Network Watermarks: Upper Bound Estimation and Beyond @@ -0,0 +1,7 @@ +To trace the copyright of deep neural networks, an owner can embed its identity information into its model as a watermark. +The capacity of the watermark quantifies the maximal volume of information that can be verified from the watermarked model. +Current studies on capacity focus on the ownership verification accuracy under ordinary removal attacks and fail to capture the relationship between robustness and fidelity. +This paper studies the capacity of deep neural network watermarks from an information-theoretic perspective. +We propose a new definition of deep neural network watermark capacity analogous to channel capacity, analyze its properties, and design an algorithm that yields a tight estimation of its upper bound under adversarial overwriting. +We also propose a universal non-invasive method to secure the transmission of the identity message beyond capacity by multiple rounds of ownership verification. +Our observations provide evidence for neural network owners and defenders who are curious about the tradeoff between the integrity of their ownership and the performance degradation of their products.
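To make the channel-capacity analogy above concrete, here is a toy sketch (not the paper's construction) that treats one round of ownership verification as a binary symmetric channel: an attack-induced bit-flip rate p caps the reliably verifiable information at C = 1 - H(p) bits per embedded bit, and running several verification rounds, as the multi-round scheme suggests, is one way to push a longer identity message through. The flip rate, watermark length, and message length below are made-up numbers.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_rate: float) -> float:
    """Capacity of a binary symmetric channel, C = 1 - H(p)."""
    return 1.0 - binary_entropy(flip_rate)

def verification_rounds(message_bits: int, watermark_bits: int, flip_rate: float) -> int:
    """Rough number of verification rounds needed to convey `message_bits`
    when each round carries `watermark_bits` raw bits through a BSC(flip_rate)."""
    per_round = watermark_bits * bsc_capacity(flip_rate)
    return math.ceil(message_bits / per_round)

# Hypothetical numbers: a 256-bit identity message, a 128-bit watermark,
# and an 11% bit-flip rate after adversarial overwriting.
print(bsc_capacity(0.11))                    # ~0.5 bit per embedded bit
print(verification_rounds(256, 128, 0.11))   # ~4 rounds
```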
\ No newline at end of file diff --git a/data/2024/aaai/Revitalizing Bahnaric Language through Neural Machine Translation: Challenges, Strategies, and Promising Outcomes b/data/2024/aaai/Revitalizing Bahnaric Language through Neural Machine Translation: Challenges, Strategies, and Promising Outcomes new file mode 100644 index 0000000000..c2708656bf --- /dev/null +++ b/data/2024/aaai/Revitalizing Bahnaric Language through Neural Machine Translation: Challenges, Strategies, and Promising Outcomes @@ -0,0 +1,3 @@ +The Bahnar, a minority ethnic group in Vietnam with ancient roots, hold a language of deep cultural and historical significance. The government is prioritizing the preservation and dissemination of the Bahnar language through online availability and cross-generational communication. Recent AI advances, including Neural Machine Translation (NMT), have transformed translation with improved accuracy and fluency, fostering language revitalization through learning, communication, and documentation. In particular, NMT enhances accessibility for Bahnar language speakers, making information and content more available. + +However, translating Vietnamese to Bahnar faces practical hurdles due to resource limitations, since Bahnar is an extremely low-resource language. These challenges encompass data scarcity, vocabulary constraints, and a lack of fine-tuning data. To address these, we propose transfer learning from selected pre-trained models to optimize translation quality and computational efficiency, capitalizing on linguistic similarities between Vietnamese and Bahnar. Concurrently, we apply tailored augmentation strategies to adapt machine translation for the Vietnamese-Bahnar language context. Our approach is validated through superior results on bilingual Vietnamese-Bahnar language datasets when compared to baseline models. By tackling translation challenges, we help revitalize the Bahnar language, ensuring information flows freely and the language thrives. \ No newline at end of file diff --git a/data/2024/aaai/Revolutionizing Education through AI-Powered Inclusive Learning Systems b/data/2024/aaai/Revolutionizing Education through AI-Powered Inclusive Learning Systems new file mode 100644 index 0000000000..3691006798 --- /dev/null +++ b/data/2024/aaai/Revolutionizing Education through AI-Powered Inclusive Learning Systems @@ -0,0 +1,5 @@ +This proposal introduces an innovative AI-powered learning system designed to address educational disparities worldwide. Focused on developing countries, the system seamlessly translates educational content between English and native languages, breaking down language barriers. Leveraging advanced natural language processing and machine learning techniques, including transformer models like BERT and GPT-3, the system ensures inclusivity, effectiveness, and engagement. + +Built on prior research demonstrating AI's efficacy in language translation and personalized learning, the proposed system draws inspiration from successful projects like the Duolingo Language Incubator. By providing inclusive and accessible learning experiences, it empowers individuals to overcome language barriers, fostering global participation. + +The potential impact is significant, with the system poised to accelerate learning, enhance literacy rates, and create a more skilled workforce in developing countries.
This research reflects a commitment to revolutionizing education through technology, aiming for lasting and transformative contributions to global society. Through AI-driven education, a brighter, more inclusive future is envisioned. \ No newline at end of file diff --git a/data/2024/aaai/Reward (Mis)design for Autonomous Driving (Abstract Reprint) b/data/2024/aaai/Reward (Mis)design for Autonomous Driving (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Reward Certification for Policy Smoothed Reinforcement Learning b/data/2024/aaai/Reward Certification for Policy Smoothed Reinforcement Learning new file mode 100644 index 0000000000..b05b05f05b --- /dev/null +++ b/data/2024/aaai/Reward Certification for Policy Smoothed Reinforcement Learning @@ -0,0 +1 @@ +Reinforcement Learning (RL) has achieved remarkable success in safety-critical areas, but it can be weakened by adversarial attacks. Recent studies have introduced "smoothed policies" to enhance its robustness. Yet, it is still challenging to establish a provable guarantee to certify the bound of its total reward. Prior methods relied primarily on computing bounds using Lipschitz continuity or calculating the probability of cumulative reward being above specific thresholds. However, these techniques are only suited for continuous perturbations on the RL agent's observations and are restricted to perturbations bounded by the l2-norm. To address these limitations, this paper proposes a general black-box certification method, called ReCePS, which is capable of directly certifying the cumulative reward of the smoothed policy under various lp-norm bounded perturbations. Furthermore, we extend our methodology to certify perturbations on action spaces. Our approach leverages f-divergence to measure the distinction between the original distribution and the perturbed distribution, subsequently determining the certification bound by solving a convex optimisation problem. We provide a comprehensive theoretical analysis and run experiments in multiple environments. Our results show that our method not only improves the tightness of the certified lower bound of the mean cumulative reward but also demonstrates better efficiency than state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Reward Penalties on Augmented States for Solving Richly Constrained RL Effectively b/data/2024/aaai/Reward Penalties on Augmented States for Solving Richly Constrained RL Effectively new file mode 100644 index 0000000000..700274c431 --- /dev/null +++ b/data/2024/aaai/Reward Penalties on Augmented States for Solving Richly Constrained RL Effectively @@ -0,0 +1 @@ +Constrained Reinforcement Learning employs trajectory-based cost constraints (such as expected cost, Value at Risk, or Conditional VaR cost) to compute safe policies. The challenge lies in handling these constraints effectively while optimizing expected reward. Existing methods convert such trajectory-based constraints into local cost constraints, but they rely on cost estimates, leading to either aggressive or conservative solutions with regard to cost. We propose an unconstrained formulation that employs reward penalties over states augmented with costs to compute safe policies. Unlike standard primal-dual methods, our approach penalizes only infeasible trajectories through state augmentation. This ensures that increasing the penalty parameter always guarantees a feasible policy, a feature lacking in primal-dual methods.
Our approach exhibits strong empirical performance and theoretical properties, offering a fresh paradigm for solving complex Constrained RL problems, including rich constraints like expected cost, Value at Risk, and Conditional Value at Risk. Our experimental results demonstrate superior performance compared to leading approaches across various constraint types on multiple benchmark problems. \ No newline at end of file diff --git a/data/2024/aaai/Reward-Respecting Subtasks for Model-Based Reinforcement Learning (Abstract Reprint) b/data/2024/aaai/Reward-Respecting Subtasks for Model-Based Reinforcement Learning (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/Rider Posture-Based Continuous Authentication with Few-Shot Learning for Mobility Scooters (Student Abstract) b/data/2024/aaai/Rider Posture-Based Continuous Authentication with Few-Shot Learning for Mobility Scooters (Student Abstract) new file mode 100644 index 0000000000..98b36a6269 --- /dev/null +++ b/data/2024/aaai/Rider Posture-Based Continuous Authentication with Few-Shot Learning for Mobility Scooters (Student Abstract) @@ -0,0 +1 @@ +The current practice of authenticating mobility scooter users with physical keys and traditional password-based one-time security mechanisms cannot meet the needs of many mobility scooter riders, especially senior citizens with memory recall difficulties. Seamless authentication approaches are therefore needed to provide ongoing protection for mobility scooters against takeovers and unauthorized access. Existing continuous authentication techniques do not work well in a mobility scooter setting due to issues such as user comfort, deployment cost and enrollment time, among others. In that direction, our contributions in this research effort are two-fold: (i) we propose a novel system that incorporates advances in few-shot learning, hierarchical processing, and contextual embedding to establish continuous authentication for mobility scooter riders using only posture data. This security system, trained on data collected from real mobility scooter riders, demonstrates quick enrollment and easy deployability, while successfully serving as an unobtrusive first layer of security. (ii) we provide to the research community the largest publicly available repository of mobility scooter riders' body key-points data to enable further research in this direction. \ No newline at end of file diff --git a/data/2024/aaai/Risk Management in Image Generative Models through Model Fingerprinting b/data/2024/aaai/Risk Management in Image Generative Models through Model Fingerprinting new file mode 100644 index 0000000000..207e63a158 --- /dev/null +++ b/data/2024/aaai/Risk Management in Image Generative Models through Model Fingerprinting @@ -0,0 +1 @@ +My doctoral research delves into the realm of generative model fingerprinting, aiming to assign responsibility for the generated images. I introduce frameworks that modify generative models to incorporate each user's distinct digital fingerprint. This ensures that every piece of generated content carries a traceable identifier linked to its originator. The primary objective of my research is to achieve optimal attribution accuracy while ensuring minimal compromise on the model's performance. Additionally, I present strategies designed to enhance robustness against common adversarial manipulations, which malicious users might employ to obscure or remove these fingerprints.
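As a purely illustrative sketch of the attribution step described in the fingerprinting abstract above (not the dissertation's framework), one can register a random binary fingerprint per user and attribute a possibly corrupted extracted fingerprint to the nearest registered user in Hamming distance. The user count, fingerprint length, and corruption rate below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def issue_fingerprints(num_users: int, length: int = 64) -> np.ndarray:
    """Register one random binary fingerprint per user."""
    return rng.integers(0, 2, size=(num_users, length), dtype=np.uint8)

def attribute(extracted: np.ndarray, registry: np.ndarray) -> tuple:
    """Return (user_id, Hamming distance) of the closest registered fingerprint."""
    dists = (registry ^ extracted).sum(axis=1)
    user = int(dists.argmin())
    return user, int(dists[user])

registry = issue_fingerprints(num_users=100)
true_user = 42
# Simulate an adversary flipping ~10% of the embedded bits before extraction.
flips = (rng.random(registry.shape[1]) < 0.10).astype(np.uint8)
extracted = registry[true_user] ^ flips
print(attribute(extracted, registry))  # should recover user 42 with a small distance
```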
\ No newline at end of file diff --git a/data/2024/aaai/Risk-Aware Continuous Control with Neural Contextual Bandits b/data/2024/aaai/Risk-Aware Continuous Control with Neural Contextual Bandits new file mode 100644 index 0000000000..955119d4ee --- /dev/null +++ b/data/2024/aaai/Risk-Aware Continuous Control with Neural Contextual Bandits @@ -0,0 +1 @@ +Recent advances in learning techniques have garnered attention for their applicability to a diverse range of real-world sequential decision-making problems. Yet, many practical applications have critical constraints for operation in real environments. Most learning solutions neglect the risk of failing to meet these constraints, hindering their implementation in real-world contexts. In this paper, we propose a risk-aware decision-making framework for contextual bandit problems, accommodating constraints and continuous action spaces. Our approach employs an actor multi-critic architecture, with each critic characterizing the distribution of performance and constraint metrics. Our framework is designed to cater to various risk levels, effectively balancing constraint satisfaction against performance. To demonstrate the effectiveness of our approach, we first compare it against state-of-the-art baseline methods in a synthetic environment, highlighting the impact of intrinsic environmental noise across different risk configurations. Finally, we evaluate our framework in a real-world use case involving a 5G mobile network where only our approach consistently satisfies the system constraint (a signal processing reliability target) with a small performance toll (8.5% increase in power consumption). \ No newline at end of file diff --git a/data/2024/aaai/Risk-Conditioned Reinforcement Learning: A Generalized Approach for Adapting to Varying Risk Measures b/data/2024/aaai/Risk-Conditioned Reinforcement Learning: A Generalized Approach for Adapting to Varying Risk Measures new file mode 100644 index 0000000000..b32785952f --- /dev/null +++ b/data/2024/aaai/Risk-Conditioned Reinforcement Learning: A Generalized Approach for Adapting to Varying Risk Measures @@ -0,0 +1 @@ +In application domains requiring mission-critical decision making, such as finance and robotics, the optimal policy derived by reinforcement learning (RL) often hinges on a preference for risk management. Yet, the dynamic nature of risk measures poses considerable challenges to achieving generalization and adaptation of risk-sensitive policies in the context of RL. In this paper, we propose a risk-conditioned RL model that enables rapid policy adaptation to varying risk measures via a unified risk representation, the Weighted Value-at-Risk (WV@R). To sample risk measures that avoid undue optimism, we construct a risk proposal network employing a conditional adversarial auto-encoder and a normalizing flow. This network establishes coherent representations for risk measures, preserving the continuity in terms of the Wasserstein distance on the risk measures. The normalizing flow is used to support non-crossing quantile regression that obtains valid samples for risk measures, and it is also applied to the agent’s critic to ascertain the preservation of monotonicity in quantile estimations. Through experiments with locomotion, finance, and self-driving scenarios, we show that our model is capable of adapting to a range of risk measures, achieving comparable performance to the baseline models individually trained for each measure.
Our model often outperforms the baselines, especially in cases where exploration is required during training but risk-aversion is favored during evaluation. \ No newline at end of file diff --git a/data/2024/aaai/Robust 3D Tracking with Quality-Aware Shape Completion b/data/2024/aaai/Robust 3D Tracking with Quality-Aware Shape Completion new file mode 100644 index 0000000000..860841800e --- /dev/null +++ b/data/2024/aaai/Robust 3D Tracking with Quality-Aware Shape Completion @@ -0,0 +1 @@ +3D single object tracking remains a challenging problem due to the sparsity and incompleteness of the point clouds. Existing algorithms attempt to address these challenges with two strategies. The first strategy is to learn dense geometric features based on the captured sparse point cloud. Nevertheless, it is quite a formidable task since the learned dense geometric features carry high uncertainty when depicting the shape of the target object. The other strategy is to aggregate the sparse geometric features of multiple templates to enrich the shape information, which is a routine solution in 2D tracking. However, aggregating the coarse shape representations can hardly yield a precise shape representation. Different from 2D pixels, 3D points of different frames can be directly fused by coordinate transform, i.e., shape completion. Considering that, we propose to construct a synthetic target representation composed of dense and complete point clouds depicting the target shape precisely by shape completion for robust 3D tracking. Specifically, we design a voxelized 3D tracking framework with shape completion, in which we propose a quality-aware shape completion mechanism to alleviate the adverse effect of noisy historical predictions. It enables us to effectively construct and leverage the synthetic target representation. Besides, we also develop a voxelized relation modeling module and box refinement module to improve tracking performance. Favorable performance against state-of-the-art algorithms on three benchmarks demonstrates the effectiveness and generalization ability of our method. \ No newline at end of file diff --git a/data/2024/aaai/Robust Active Measuring under Model Uncertainty b/data/2024/aaai/Robust Active Measuring under Model Uncertainty new file mode 100644 index 0000000000..c6666bfd02 --- /dev/null +++ b/data/2024/aaai/Robust Active Measuring under Model Uncertainty @@ -0,0 +1 @@ +Partial observability and uncertainty are common problems in sequential decision-making that particularly impede the use of formal models such as Markov decision processes (MDPs). However, in practice, agents may be able to employ costly sensors to measure their environment and resolve partial observability by gathering information. Moreover, imprecise transition functions can capture model uncertainty. We combine these concepts and extend MDPs to robust active-measuring MDPs (RAM-MDPs). We present an active-measure heuristic to solve RAM-MDPs efficiently and show that model uncertainty can, counterintuitively, let agents take fewer measurements. We propose a method to counteract this behavior while only incurring a bounded additional cost. We empirically compare our methods to several baselines and show their superior scalability and performance.
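The idea of paying for a measurement only when it is worth it can be illustrated with a tiny value-of-information calculation (a generic sketch, not the RAM-MDP heuristic from the abstract): given a belief over hidden states and a known reward table, the agent measures only if the expected gain from acting on the revealed state exceeds the sensor cost. The two-state reward table and cost below are arbitrary.

```python
import numpy as np

def value_without_measuring(belief: np.ndarray, rewards: np.ndarray) -> float:
    """Best single action under the current belief over hidden states."""
    return float((belief @ rewards).max())

def value_with_measuring(belief: np.ndarray, rewards: np.ndarray, cost: float) -> float:
    """A (perfect) measurement reveals the state, so the best action can be
    chosen per state; the sensor cost is paid up front."""
    return float((belief * rewards.max(axis=1)).sum() - cost)

rewards = np.array([[10.0, 0.0],   # rewards[state, action]
                    [0.0, 10.0]])
belief = np.array([0.5, 0.5])      # maximal uncertainty about the hidden state
cost = 2.0
print(value_without_measuring(belief, rewards))     # 5.0
print(value_with_measuring(belief, rewards, cost))  # 8.0 -> measuring pays off here
```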
\ No newline at end of file diff --git a/data/2024/aaai/Robust Beamforming for Downlink Multi-Cell Systems: A Bilevel Optimization Perspective b/data/2024/aaai/Robust Beamforming for Downlink Multi-Cell Systems: A Bilevel Optimization Perspective new file mode 100644 index 0000000000..892af454a9 --- /dev/null +++ b/data/2024/aaai/Robust Beamforming for Downlink Multi-Cell Systems: A Bilevel Optimization Perspective @@ -0,0 +1 @@ +Utilization of inter-base station cooperation for information processing has shown great potential in enhancing the overall quality of communication services (QoS) in wireless communication networks. Nevertheless, such cooperation requires channel state information (CSI) at base stations (BSs), which is assumed to be perfectly known. However, CSI errors are inevitable in practice, which necessitates beamforming techniques that can achieve robust performance in the presence of channel estimation errors. Existing approaches relax the robust beamforming design problems into semidefinite programming (SDP), which can only achieve a solution that is far from being optimal. To this end, this paper views robust beamforming design problems from a bilevel optimization perspective. In particular, we focus on maximizing the worst-case weighted sum-rate (WSR) in the downlink multi-cell multi-user multiple-input single-output (MISO) system considering bounded CSI errors. We first reformulate this problem into a bilevel optimization problem and then develop an efficient algorithm based on the cutting plane method. A distributed optimization algorithm has also been developed to facilitate parallel processing in practical settings. Numerical results are provided to confirm the effectiveness of the proposed algorithm in terms of performance and complexity, particularly in the presence of CSI uncertainties. \ No newline at end of file diff --git a/data/2024/aaai/Robust Blind Text Image Deblurring via Maximum Consensus Framework b/data/2024/aaai/Robust Blind Text Image Deblurring via Maximum Consensus Framework new file mode 100644 index 0000000000..056770a729 --- /dev/null +++ b/data/2024/aaai/Robust Blind Text Image Deblurring via Maximum Consensus Framework @@ -0,0 +1 @@ +The blind text image deblurring problem presents a formidable challenge, requiring the recovery of a clean and sharp text image from a blurry version with an unknown blur kernel. Sparsity-based strategies have demonstrated their efficacy by emphasizing the sparse priors of the latent image and kernel. However, these existing strategies have largely neglected the influence of additional noise, imposing limitations on their performance. To overcome this limitation, we propose a novel framework designed to effectively mitigate the impact of extensive noise prevalent in blurred images. Our approach centers around a robust Maximum Consensus Framework, wherein we optimize the quantity of interest from the noisy blurry image based on the maximum consensus criterion. Furthermore, we propose the integration of the Alternating Direction Method of Multipliers (ADMM) and the Half-Quadratic Splitting (HQS) method to address the computationally intractable L0 norm problem. This innovative strategy enables improvements in the deblurring performance of blurry text images with additional synthetic noise. Experimental evaluations conducted on various noisy blurry text images demonstrate the superiority of the proposed approach over existing methods.
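The maximum-consensus criterion itself is easy to illustrate in one dimension (this is only a toy, and deliberately ignores the ADMM/HQS machinery the paper uses to optimize it): a candidate blur kernel is scored by how many pixels it explains to within a tolerance, so a handful of grossly corrupted pixels cannot dominate the fit the way they would in a least-squares score. The signal, kernels, and outlier pattern are synthetic.

```python
import numpy as np

def consensus_score(blurred: np.ndarray, sharp: np.ndarray,
                    kernel: np.ndarray, tol: float = 1e-6) -> int:
    """Count pixels whose re-blurring residual lies within `tol` (the consensus set)."""
    reblurred = np.convolve(sharp, kernel, mode="same")
    return int((np.abs(reblurred - blurred) <= tol).sum())

rng = np.random.default_rng(1)
sharp = (rng.random(200) > 0.7).astype(float)      # toy binary "text-like" signal
true_kernel = np.array([0.25, 0.5, 0.25])
blurred = np.convolve(sharp, true_kernel, mode="same")
blurred[::17] += 5.0                                # sparse, large outliers (heavy noise)

candidates = {
    "identity": np.array([1.0]),
    "box": np.array([1/3, 1/3, 1/3]),
    "true": true_kernel,
}
print({name: consensus_score(blurred, sharp, k) for name, k in candidates.items()})
# The true kernel attains the largest consensus despite the corrupted pixels.
```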
\ No newline at end of file diff --git a/data/2024/aaai/Robust Communicative Multi-Agent Reinforcement Learning with Active Defense b/data/2024/aaai/Robust Communicative Multi-Agent Reinforcement Learning with Active Defense new file mode 100644 index 0000000000..ce07c02136 --- /dev/null +++ b/data/2024/aaai/Robust Communicative Multi-Agent Reinforcement Learning with Active Defense @@ -0,0 +1 @@ +Communication in multi-agent reinforcement learning (MARL) has recently been shown to effectively promote cooperation among agents. Since communication in real-world scenarios is vulnerable to noise and adversarial attacks, it is crucial to develop robust communicative MARL techniques. However, existing research in this domain has predominantly focused on passive defense strategies, where agents receive all messages equally, making it hard to balance performance and robustness. We propose an active defense strategy, where agents automatically reduce the impact of potentially harmful messages on the final decision. Implementing this strategy raises two challenges: defining unreliable messages and properly adjusting their impact on the final decision. To address them, we design an Active Defense Multi-Agent Communication framework (ADMAC), which estimates the reliability of received messages and adjusts their impact on the final decision accordingly with the help of a decomposable decision structure. The superiority of ADMAC over existing methods is validated by experiments in three communication-critical tasks under four types of attacks. \ No newline at end of file diff --git a/data/2024/aaai/Robust Distributed Gradient Aggregation Using Projections onto Gradient Manifolds b/data/2024/aaai/Robust Distributed Gradient Aggregation Using Projections onto Gradient Manifolds new file mode 100644 index 0000000000..85c69d821a --- /dev/null +++ b/data/2024/aaai/Robust Distributed Gradient Aggregation Using Projections onto Gradient Manifolds @@ -0,0 +1 @@ +We study the distributed gradient aggregation problem where individual clients contribute to learning a central model by sharing parameter gradients constructed from local losses. However, errors in some gradients, caused by low-quality data or adversaries, can degrade the learning process when naively combined. Existing robust gradient aggregation approaches assume that local data represent the global data-generating distribution, which may not always apply to heterogeneous (non-i.i.d.) client data. We propose a new algorithm that can robustly aggregate gradients from potentially heterogeneous clients. Our approach leverages the manifold structure inherent in heterogeneous client gradients and evaluates gradient anomaly degrees by projecting them onto this manifold. This algorithm is implemented as a simple and efficient method that accumulates random projections within the subspace defined by the nearest neighbors within a gradient cloud. Our experiments demonstrate consistent performance improvements over state-of-the-art robust aggregation algorithms.
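A simplified (and deterministic) variant of the projection idea sketched in the aggregation abstract above can be written in a few lines: each client gradient is projected onto the affine subspace spanned by its nearest neighbours, the residual norm serves as its anomaly degree, and gradients are down-weighted accordingly before averaging. This is an illustrative reading of the abstract, not the authors' algorithm; the client count, dimensionality, and corruption below are fabricated.

```python
import numpy as np

def anomaly_scores(grads: np.ndarray, k: int = 5) -> np.ndarray:
    """Residual of each gradient after projection onto the affine span of its
    k nearest neighbours; large residuals flag off-manifold (suspicious) updates."""
    n = grads.shape[0]
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    scores = np.empty(n)
    for i in range(n):
        nbrs = np.argsort(dists[i])[1:k + 1]            # skip the gradient itself
        centre = grads[nbrs].mean(axis=0)
        basis, _, _ = np.linalg.svd((grads[nbrs] - centre).T, full_matrices=False)
        dev = grads[i] - centre
        scores[i] = np.linalg.norm(dev - basis @ (basis.T @ dev))
    return scores

def robust_aggregate(grads: np.ndarray, k: int = 5) -> np.ndarray:
    """Down-weight high-residual gradients before averaging."""
    w = 1.0 / (1.0 + anomaly_scores(grads, k))
    return (w[:, None] * grads).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
grads = 1.0 + 0.1 * rng.normal(size=(20, 50))   # 20 benign clients, gradients near 1
grads[0] = -25.0                                 # one corrupted client
print(robust_aggregate(grads).mean())            # stays close to 1 despite the outlier
```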
\ No newline at end of file diff --git a/data/2024/aaai/Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models b/data/2024/aaai/Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models new file mode 100644 index 0000000000..5e152096f4 --- /dev/null +++ b/data/2024/aaai/Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models @@ -0,0 +1 @@ +Many evaluation measures are used to evaluate social biases in masked language models (MLMs). However, we find that these previously proposed evaluation measures lack robustness in scenarios with limited datasets. This is because these measures are obtained by comparing the pseudo-log-likelihood (PLL) scores of the stereotypical and anti-stereotypical samples using an indicator function. The disadvantage is that such comparisons mine the PLL score sets only superficially, without capturing their distributional information. In this paper, we represent a PLL score set as a Gaussian distribution and use Kullback-Leibler (KL) divergence and Jensen–Shannon (JS) divergence to construct evaluation measures for the distributions of stereotypical and anti-stereotypical PLL scores. Experimental results on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures are significantly more robust and interpretable than those proposed previously. \ No newline at end of file diff --git a/data/2024/aaai/Robust Few-Shot Named Entity Recognition with Boundary Discrimination and Correlation Purification b/data/2024/aaai/Robust Few-Shot Named Entity Recognition with Boundary Discrimination and Correlation Purification new file mode 100644 index 0000000000..2ba6b87dc1 --- /dev/null +++ b/data/2024/aaai/Robust Few-Shot Named Entity Recognition with Boundary Discrimination and Correlation Purification @@ -0,0 +1 @@ +Few-shot named entity recognition (NER) aims to recognize novel named entities in low-resource domains utilizing existing knowledge. However, the present few-shot NER models assume that the labeled data are all clean without noise or outliers, and there are few works focusing on the robustness of the cross-domain transfer learning ability to textual adversarial attacks in few-shot NER. In this work, we comprehensively explore and assess the robustness of few-shot NER models under a textual adversarial attack scenario, and find that existing few-shot NER models are vulnerable. Furthermore, we propose a robust two-stage few-shot NER method with Boundary Discrimination and Correlation Purification (BDCP). Specifically, in the span detection stage, the entity boundary discriminative module is introduced to provide a highly distinguishing boundary representation space to detect entity spans. In the entity typing stage, the correlations between entities and contexts are purified by minimizing the interference information and facilitating correlation generalization to alleviate the perturbations caused by textual adversarial attacks. In addition, we construct adversarial examples for few-shot NER based on the public datasets Few-NERD and Cross-Dataset. Comprehensive evaluations on those two groups of few-shot NER datasets containing adversarial examples demonstrate the robustness and superiority of the proposed method.
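The Gaussian-based comparison described in the masked-language-model bias abstract above can be sketched in a few lines, assuming each PLL score set is summarised by a fitted Gaussian: the KL divergence between two Gaussians has a closed form, while JS has none and is estimated numerically here. The PLL arrays below are synthetic placeholders, not StereoSet or CrowS-Pairs scores, and this is an illustrative reading rather than the authors' released code.

```python
import numpy as np

def gaussian_kl(mu1, var1, mu2, var2):
    """Closed-form KL( N(mu1, var1) || N(mu2, var2) ), in nats."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def gaussian_js(mu1, var1, mu2, var2, num=20001):
    """JS divergence between two Gaussians, estimated on a fine grid."""
    sd1, sd2 = np.sqrt(var1), np.sqrt(var2)
    x = np.linspace(min(mu1 - 8 * sd1, mu2 - 8 * sd2),
                    max(mu1 + 8 * sd1, mu2 + 8 * sd2), num)
    p = np.exp(-(x - mu1) ** 2 / (2 * var1)) / (sd1 * np.sqrt(2 * np.pi))
    q = np.exp(-(x - mu2) ** 2 / (2 * var2)) / (sd2 * np.sqrt(2 * np.pi))
    p, q = np.clip(p, 1e-300, None), np.clip(q, 1e-300, None)
    m = 0.5 * (p + q)
    dx = x[1] - x[0]
    return 0.5 * dx * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

# Synthetic pseudo-log-likelihood scores for stereotypical vs. anti-stereotypical samples.
rng = np.random.default_rng(0)
stereo_pll = rng.normal(-42.0, 6.0, 300)
anti_pll = rng.normal(-45.0, 7.0, 300)
m1, v1, m2, v2 = stereo_pll.mean(), stereo_pll.var(), anti_pll.mean(), anti_pll.var()
print(gaussian_kl(m1, v1, m2, v2), gaussian_js(m1, v1, m2, v2))
```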
\ No newline at end of file diff --git a/data/2024/aaai/Robust Loss Functions for Training Decision Trees with Noisy Labels b/data/2024/aaai/Robust Loss Functions for Training Decision Trees with Noisy Labels new file mode 100644 index 0000000000..b3a50bdfe0 --- /dev/null +++ b/data/2024/aaai/Robust Loss Functions for Training Decision Trees with Noisy Labels @@ -0,0 +1 @@ +We consider training decision trees using noisily labeled data, focusing on loss functions that can lead to robust learning algorithms. Our contributions are threefold. First, we offer novel theoretical insights on the robustness of many existing loss functions in the context of decision tree learning. We show that some of the losses belong to a class of what we call conservative losses, and the conservative losses lead to an early stopping behavior during training and noise-tolerant predictions during testing. Second, we introduce a framework for constructing robust loss functions, called distribution losses. These losses apply percentile-based penalties based on an assumed margin distribution, and they naturally allow adapting to different noise rates via a robustness parameter. In particular, we introduce a new loss called the negative exponential loss, which leads to an efficient greedy impurity-reduction learning algorithm. Lastly, our experiments on multiple datasets and noise settings validate our theoretical insight and the effectiveness of our adaptive negative exponential loss. \ No newline at end of file diff --git a/data/2024/aaai/Robust Node Classification on Graph Data with Graph and Label Noise b/data/2024/aaai/Robust Node Classification on Graph Data with Graph and Label Noise new file mode 100644 index 0000000000..7f58c3e87a --- /dev/null +++ b/data/2024/aaai/Robust Node Classification on Graph Data with Graph and Label Noise @@ -0,0 +1 @@ +Current research on node classification focuses on dealing with either graph noise or label noise, but few studies consider both of them. In this paper, we propose a new robust node classification method to simultaneously deal with graph noise and label noise. To do this, we design a graph contrastive loss to conduct local graph learning and employ self-attention to conduct global graph learning. They enable us to improve the expressiveness of node representation by using comprehensive information among nodes. We also utilize pseudo graphs and pseudo labels to deal with graph noise and label noise, respectively. Furthermore, we numerically validate the superiority of our method in terms of robust node classification compared with all competing methods. \ No newline at end of file diff --git a/data/2024/aaai/Robust Nonparametric Regression under Poisoning Attack b/data/2024/aaai/Robust Nonparametric Regression under Poisoning Attack new file mode 100644 index 0000000000..7f52f048ae --- /dev/null +++ b/data/2024/aaai/Robust Nonparametric Regression under Poisoning Attack @@ -0,0 +1 @@ +This paper studies robust nonparametric regression, in which an adversarial attacker can modify the values of up to q samples from a training dataset of size N. Our initial solution is an M-estimator based on Huber loss minimization. Compared with simple kernel regression, i.e., the Nadaraya-Watson estimator, this method can significantly weaken the impact of malicious samples on the regression performance. We provide the convergence rate as well as the corresponding minimax lower bound. The result shows that, with proper bandwidth selection, the supremum error is minimax optimal.
The L2 error is optimal with relatively small q, but is suboptimal with larger q. The reason is that this estimator is vulnerable if there are many attacked samples concentrated in a small region. To address this issue, we propose a correction method by projecting the initial estimate onto the space of Lipschitz functions. The final estimate is nearly minimax optimal for arbitrary q, up to a logarithmic factor. \ No newline at end of file diff --git a/data/2024/aaai/Robust Policy Learning via Offline Skill Diffusion b/data/2024/aaai/Robust Policy Learning via Offline Skill Diffusion new file mode 100644 index 0000000000..e224763d2d --- /dev/null +++ b/data/2024/aaai/Robust Policy Learning via Offline Skill Diffusion @@ -0,0 +1,2 @@ +Skill-based reinforcement learning (RL) approaches have shown considerable promise, especially in solving long-horizon tasks via hierarchical structures. These skills, learned task-agnostically from offline datasets, can accelerate the policy learning process for new tasks. Yet, the application of these skills in different domains remains restricted due to their inherent dependency on the datasets, which poses a challenge when attempting to learn a skill-based policy via RL for a target domain different from the datasets' domains. In this paper, we present a novel offline skill learning framework, DuSkill, which employs a guided Diffusion model to generate versatile skills extended from the limited skills in datasets, thereby enhancing the robustness of policy learning for tasks in different domains. Specifically, we devise a guided diffusion-based skill decoder in conjunction with the hierarchical encoding to disentangle the skill embedding space into two distinct representations, one for encapsulating domain-invariant behaviors and the other for delineating the factors that induce domain variations in the behaviors. Our DuSkill framework enhances the diversity of skills learned offline, thus accelerating the learning procedure of high-level policies for different domains. +Through experiments, we show that DuSkill outperforms other skill-based imitation learning and RL algorithms for several long-horizon tasks, demonstrating its benefits in few-shot imitation and online RL. \ No newline at end of file diff --git a/data/2024/aaai/Robust Stochastic Graph Generator for Counterfactual Explanations b/data/2024/aaai/Robust Stochastic Graph Generator for Counterfactual Explanations new file mode 100644 index 0000000000..75ae62ced3 --- /dev/null +++ b/data/2024/aaai/Robust Stochastic Graph Generator for Counterfactual Explanations @@ -0,0 +1 @@ +Counterfactual Explanation (CE) techniques have garnered attention as a means to provide insights to users engaging with AI systems. While extensively researched in domains such as medical imaging and autonomous vehicles, Graph Counterfactual Explanation (GCE) methods have been comparatively under-explored. GCEs generate a new graph similar to the original one, with a different outcome grounded on the underlying predictive model. Among these GCE techniques, those rooted in generative mechanisms have received relatively limited investigation despite demonstrating impressive accomplishments in other domains, such as artistic styles and natural language modelling. The preference for generative explainers stems from their capacity to generate counterfactual instances during inference, leveraging autonomously acquired perturbations of the input graph.
Motivated by the rationales above, our study introduces RSGG-CE, a novel Robust Stochastic Graph Generator for Counterfactual Explanations able to produce counterfactual examples from the learned latent space considering a partially ordered generation sequence. Furthermore, we undertake quantitative and qualitative analyses to compare RSGG-CE's performance against SoA generative explainers, highlighting its increased ability to engender plausible counterfactual candidates. \ No newline at end of file diff --git a/data/2024/aaai/Robust Test-Time Adaptation for Zero-Shot Prompt Tuning b/data/2024/aaai/Robust Test-Time Adaptation for Zero-Shot Prompt Tuning new file mode 100644 index 0000000000..78f1d76be0 --- /dev/null +++ b/data/2024/aaai/Robust Test-Time Adaptation for Zero-Shot Prompt Tuning @@ -0,0 +1 @@ +CLIP has demonstrated remarkable generalization across diverse downstream tasks. By aligning images and texts in a shared feature space, it enables zero-shot classification via hand-crafted prompts. However, recent studies have shown that hand-crafted prompts may be unsuitable in practical applications. Specifically, choosing an appropriate prompt for a given task requires accurate data and knowledge, which may not be obtainable in practical situations. An inappropriate prompt can result in poor performance. Moreover, if there is no training data, tuning prompts arbitrarily through unlabeled test data may lead to serious performance degradation relative to the given hand-crafted prompts. Our study reveals that the aforementioned problems are mainly due to the biases in testing data (Data Bias) and the pre-trained CLIP model (Model Bias). The Data Bias makes it challenging to choose an appropriate prompt, while Model Bias renders some predictions inaccurate and biased, which leads to error accumulation. To address these biases, we propose robust test-time Adaptation for zeroshot Prompt tuning (ADAPROMPT). Specifically, we ensemble multiple prompts to avoid the worst-case results and dynamically tune prompts to adapt to Data Bias during testing. Furthermore, we adopt a confidence-aware buffer to store balanced and confident unlabeled test data to tune prompts in order to overcome Model Bias. Our extensive experiments on several benchmarks demonstrate that ADAPROMPT alleviates model bias, adapts to data bias and mostly outperforms the state-of-the-art methods at a small time cost. Moreover, our experimental results reveal that ADAPROMPT hardly encounters any performance degradation on these datasets.
We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP, yielding robust prediction sets/intervals. Through comprehensive experiments, we show that MC-CP delivers significant improvements over comparable UQ methods, like MC dropout, RAPS and CQR, both in classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple. The MC-CP code and replication package are available at https://github.com/team-daniel/MC-CP. \ No newline at end of file diff --git a/data/2024/aaai/Robust Visual Imitation Learning with Inverse Dynamics Representations b/data/2024/aaai/Robust Visual Imitation Learning with Inverse Dynamics Representations new file mode 100644 index 0000000000..59a4fff7c8 --- /dev/null +++ b/data/2024/aaai/Robust Visual Imitation Learning with Inverse Dynamics Representations @@ -0,0 +1 @@ +Imitation learning (IL) has achieved considerable success in solving complex sequential decision-making problems. However, current IL methods mainly assume that the environment for learning policies is the same as the environment for collecting expert datasets. Therefore, these methods may fail to work when there are slight differences between the learning and expert environments, especially for challenging problems with high-dimensional image observations. Unfortunately, in real-world scenarios, it is rare to have the chance to collect expert trajectories precisely in the target learning environment. To address this challenge, we propose a novel robust imitation learning approach, where we develop an inverse dynamics state representation learning objective to align the expert environment and the learning environment. With the abstract state representation, we design an effective reward function, which thoroughly measures the similarity between behavior data and expert data not only element-wise, but also at the trajectory level. We conduct extensive experiments to evaluate the proposed approach under various visual perturbations and in diverse visual control tasks. Our approach can achieve near-expert performance in most environments, and significantly outperforms the state-of-the-art visual IL methods and robust IL methods. \ No newline at end of file diff --git a/data/2024/aaai/Robust Visual Recognition with Class-Imbalanced Open-World Noisy Data b/data/2024/aaai/Robust Visual Recognition with Class-Imbalanced Open-World Noisy Data new file mode 100644 index 0000000000..613279df81 --- /dev/null +++ b/data/2024/aaai/Robust Visual Recognition with Class-Imbalanced Open-World Noisy Data @@ -0,0 +1 @@ +Learning from open-world noisy data, where both closed-set and open-set noise co-exist in the dataset, is a realistic but underexplored setting. Only recently, several efforts have been initiated to tackle this problem. However, these works assume the classes are balanced when dealing with open-world noisy data. This assumption often violates the nature of real-world large-scale datasets, where the label distributions are generally long-tailed, i.e., class-imbalanced. In this paper, we study the problem of robust visual recognition with class-imbalanced open-world noisy data.
We propose a probabilistic graphical model-based approach, iMRF, which achieves label noise correction that is robust to class imbalance via an efficient iterative inference of a Markov Random Field (MRF) in each training mini-batch. Furthermore, we design an agreement-based thresholding strategy to adaptively collect clean samples from all classes, including corrected closed-set noisy samples, while rejecting open-set noisy samples. We also introduce a noise-aware balanced cross-entropy loss to explicitly eliminate the bias caused by class-imbalanced data. Extensive experiments on several benchmark datasets including synthetic and real-world noisy datasets demonstrate the superior performance and robustness of our method over existing methods. Our code is available at https://github.com/Na-Z/LIOND. \ No newline at end of file diff --git a/data/2024/aaai/Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach b/data/2024/aaai/Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach new file mode 100644 index 0000000000..2750135c27 --- /dev/null +++ b/data/2024/aaai/Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach @@ -0,0 +1,3 @@ +This paper studies bandit problems where an agent has access to offline data that might be utilized to improve the estimation of each arm’s reward distribution. A major obstacle in this setting is the existence of compound biases from the observational data. Ignoring these biases and blindly fitting a model with the biased data could even negatively affect the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract the causal bound for each arm that is robust towards compound biases from biased observational data. The derived bounds contain the +ground truth mean reward and can effectively guide the bandit agent to learn a nearly-optimal decision policy. We also conduct regret analysis in both contextual and non-contextual bandit settings and show that prior causal bounds could help +consistently reduce the asymptotic regret. \ No newline at end of file diff --git a/data/2024/aaai/Robustly Train Normalizing Flows via KL Divergence Regularization b/data/2024/aaai/Robustly Train Normalizing Flows via KL Divergence Regularization new file mode 100644 index 0000000000..3034ed69ae --- /dev/null +++ b/data/2024/aaai/Robustly Train Normalizing Flows via KL Divergence Regularization @@ -0,0 +1 @@ +In this paper, we find that the training of Normalizing Flows (NFs) is easily affected by outliers and by a small number (or high dimensionality) of training samples. To solve this problem, we propose a Kullback–Leibler (KL) divergence regularization on the Jacobian matrix of NFs. We prove that such regularization is equivalent to adding a set of samples whose covariance matrix is the identity matrix to the training set. Thus, it reduces the negative influence of the outliers and the small sample number on the estimation of the covariance matrix, simultaneously. Therefore, our regularization makes the training of NFs robust. Ultimately, we evaluate the performance of NFs on out-of-distribution (OoD) detection tasks. The excellent results obtained demonstrate the effectiveness of the proposed regularization term.
For example, with the help of the proposed regularization, the OoD detection score increases by up to 30% compared with the model trained without the regularization. \ No newline at end of file diff --git a/data/2024/aaai/Robustness Verification of Multi-Class Tree Ensembles b/data/2024/aaai/Robustness Verification of Multi-Class Tree Ensembles new file mode 100644 index 0000000000..055c177f78 --- /dev/null +++ b/data/2024/aaai/Robustness Verification of Multi-Class Tree Ensembles @@ -0,0 +1,3 @@ +Tree ensembles are one of the most widely used model classes. +However, these models are susceptible to adversarial examples, which are slightly perturbed examples that elicit a misprediction. +There has been significant research on designing approaches to verify the robustness of tree ensembles to such attacks. However, existing verification algorithms for tree ensembles are only able to analyze binary classifiers and hence address multiclass problems by reducing them to binary ones using a one-versus-other strategy. In this paper, we show that naively applying this strategy can yield incorrect results in certain situations. We address this shortcoming by proposing a novel approximate heuristic approach to verification for multiclass tree ensembles. Our approach is based on a novel generalization of the verification task, which we show emits other relevant verification queries. \ No newline at end of file diff --git a/data/2024/aaai/Robustness and Visual Explanation for Black Box Image, Video, and ECG Signal Classification with Reinforcement Learning b/data/2024/aaai/Robustness and Visual Explanation for Black Box Image, Video, and ECG Signal Classification with Reinforcement Learning new file mode 100644 index 0000000000..70cb3ccbc4 --- /dev/null +++ b/data/2024/aaai/Robustness and Visual Explanation for Black Box Image, Video, and ECG Signal Classification with Reinforcement Learning @@ -0,0 +1 @@ +We present a generic Reinforcement Learning (RL) framework optimized for crafting adversarial attacks on different model types spanning ECG signal analysis (1D), image classification (2D), and video classification (3D). The framework focuses on identifying sensitive regions and inducing misclassifications with minimal distortions and various distortion types. The novel RL method outperforms state-of-the-art methods for all three applications, proving its efficiency. Our RL approach produces superior localization masks, enhancing interpretability for image classification and ECG analysis models. For applications such as ECG analysis, our platform highlights critical ECG segments for clinicians while ensuring resilience against prevalent distortions. This comprehensive tool aims to bolster both resilience, through adversarial training, and transparency across varied applications and data types. \ No newline at end of file diff --git a/data/2024/aaai/Robustness-Guided Image Synthesis for Data-Free Quantization b/data/2024/aaai/Robustness-Guided Image Synthesis for Data-Free Quantization new file mode 100644 index 0000000000..b24b87ea05 --- /dev/null +++ b/data/2024/aaai/Robustness-Guided Image Synthesis for Data-Free Quantization @@ -0,0 +1 @@ +Quantization has emerged as a promising direction for model compression. Recently, data-free quantization has been widely studied as a promising method to avoid privacy concerns, which synthesizes images as an alternative to real training data. Existing methods use classification loss to ensure the reliability of the synthesized images.
Unfortunately, even if these images are well-classified by the pre-trained model, they still suffer from low semantics and homogenization issues. Intuitively, these low-semantic images are sensitive to perturbations, and the pre-trained model tends to have inconsistent output when the generator synthesizes an image with low semantics. To this end, we propose Robustness-Guided Image Synthesis (RIS), a simple but effective method to enrich the semantics of synthetic images and improve image diversity, further boosting the performance of data-free compression tasks. Concretely, we first introduce perturbations on the input and model weights, then define the inconsistency metrics at the feature and prediction levels before and after perturbations. On the basis of the inconsistency at these two levels, we design a robustness optimization objective to eliminate low-semantic images. Moreover, we also make our approach diversity-aware by forcing the generator to synthesize images with small correlations. With RIS, we achieve state-of-the-art performance for various settings on data-free quantization, and our method can be extended to other data-free compression tasks. \ No newline at end of file diff --git a/data/2024/aaai/Roll with the Punches: Expansion and Shrinkage of Soft Label Selection for Semi-supervised Fine-Grained Learning b/data/2024/aaai/Roll with the Punches: Expansion and Shrinkage of Soft Label Selection for Semi-supervised Fine-Grained Learning new file mode 100644 index 0000000000..5238a7ce2f --- /dev/null +++ b/data/2024/aaai/Roll with the Punches: Expansion and Shrinkage of Soft Label Selection for Semi-supervised Fine-Grained Learning @@ -0,0 +1 @@ +While semi-supervised learning (SSL) has yielded promising results, the more realistic SSL scenario remains to be explored, in which the unlabeled data exhibits extremely high recognition difficulty, e.g., fine-grained visual classification in the context of SSL (SS-FGVC). The increased recognition difficulty on fine-grained unlabeled data spells disaster for pseudo-labeling accuracy, resulting in poor performance of the SSL model. To tackle this challenge, we propose Soft Label Selection with Confidence-Aware Clustering based on Class Transition Tracking (SoC), which reconstructs the pseudo-label selection process by jointly optimizing an Expansion Objective and a Shrinkage Objective in a soft-label manner. The former objective encourages soft labels to absorb more candidate classes to ensure the attendance of the ground-truth class, while the latter encourages soft labels to reject more noisy classes, which is theoretically proved to be equivalent to entropy minimization. In comparisons with various state-of-the-art methods, our approach demonstrates its superior performance in SS-FGVC. Checkpoints and source code are available at https://github.com/NJUyued/SoC4SS-FGVC. \ No newline at end of file diff --git a/data/2024/aaai/Rolling-Unet: Revitalizing MLP's Ability to Efficiently Extract Long-Distance Dependencies for Medical Image Segmentation b/data/2024/aaai/Rolling-Unet: Revitalizing MLP's Ability to Efficiently Extract Long-Distance Dependencies for Medical Image Segmentation new file mode 100644 index 0000000000..f6850e7a81 --- /dev/null +++ b/data/2024/aaai/Rolling-Unet: Revitalizing MLP's Ability to Efficiently Extract Long-Distance Dependencies for Medical Image Segmentation @@ -0,0 +1 @@ +Medical image segmentation methods based on deep learning networks are mainly divided into CNN-based and Transformer-based approaches.
However, CNNs struggle to capture long-distance dependencies, while Transformers suffer from high computational complexity and poor local feature learning. To efficiently extract and fuse local features and long-range dependencies, this paper proposes Rolling-Unet, which is a CNN model combined with MLP. Specifically, we propose the core R-MLP module, which is responsible for learning the long-distance dependency in a single direction of the whole image. By controlling and combining R-MLP modules in different directions, OR-MLP and DOR-MLP modules are formed to capture long-distance dependencies in multiple directions. Further, the Lo2 block is proposed to encode both local context information and long-distance dependencies without excessive computational burden. The Lo2 block has the same parameter size and computational complexity as a 3×3 convolution. The experimental results on four public datasets show that Rolling-Unet achieves superior performance compared to the state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Root Cause Explanation of Outliers under Noisy Mechanisms b/data/2024/aaai/Root Cause Explanation of Outliers under Noisy Mechanisms new file mode 100644 index 0000000000..40b67fd768 --- /dev/null +++ b/data/2024/aaai/Root Cause Explanation of Outliers under Noisy Mechanisms @@ -0,0 +1 @@ +Identifying root causes of anomalies in causal processes is vital across disciplines. Once identified, one can isolate the root causes and implement necessary measures to restore the normal operation. Causal processes are often modelled as graphs with entities being nodes and their paths/interconnections as edges. Existing work only considers the contribution of nodes in the generative process, and thus cannot attribute the outlier score to the edges of the mechanism if the anomaly occurs in the connections. In this paper, we consider both the individual edges and nodes of each mechanism when identifying the root causes. We introduce a noisy functional causal model for this purpose. Then, we employ Bayesian learning and inference methods to infer the noises of the nodes and edges. We then represent the functional form of a target outlier leaf as a function of the node and edge noises. Finally, we propose an efficient gradient-based attribution method to compute the anomaly attribution scores, which scales linearly with the number of nodes and edges. Experiments on simulated datasets and two real-world scenario datasets show better anomaly attribution performance of the proposed method compared to the baselines. Our method scales to larger graphs with more nodes and edges. \ No newline at end of file diff --git "a/data/2024/aaai/Runtime Analysis of the (\316\274 + 1) GA: Provable Speed-Ups from Strong Drift towards Diverse Populations" "b/data/2024/aaai/Runtime Analysis of the (\316\274 + 1) GA: Provable Speed-Ups from Strong Drift towards Diverse Populations" new file mode 100644 index 0000000000..bbba4cb3db --- /dev/null +++ "b/data/2024/aaai/Runtime Analysis of the (\316\274 + 1) GA: Provable Speed-Ups from Strong Drift towards Diverse Populations" @@ -0,0 +1,3 @@ +Most evolutionary algorithms used in practice heavily employ crossover. In contrast, the rigorous understanding of how crossover is beneficial is largely lagging behind. In this work, we make a considerable step forward by analyzing the population dynamics of the (µ+1) genetic algorithm when optimizing the Jump benchmark.
We observe (and prove via mathematical means) that once the population contains two different individuals on the local optimum, the diversity in the population increases in expectation. From this drift towards more diverse states, we show that a diversity suitable for crossover to be effective is reached quickly and, more importantly, then persists for a time that is at least exponential in the population size µ. This drastically improves over the previously best known guarantee, which is only quadratic in µ. + +Our new understanding of the population dynamics easily gives stronger performance guarantees. In particular, we derive that population sizes logarithmic in the problem size n suffice to gain an Ω(n)-factor runtime improvement from crossover (previous works achieved comparable bounds only with µ = Θ(n) or a non-standard mutation rate). \ No newline at end of file diff --git a/data/2024/aaai/Runtime Analysis of the SMS-EMOA for Many-Objective Optimization b/data/2024/aaai/Runtime Analysis of the SMS-EMOA for Many-Objective Optimization new file mode 100644 index 0000000000..11543b7e3b --- /dev/null +++ b/data/2024/aaai/Runtime Analysis of the SMS-EMOA for Many-Objective Optimization @@ -0,0 +1,5 @@ +The widely used multiobjective optimizer NSGA-II was recently proven to have considerable difficulties in many-objective optimization. In contrast, experimental results in the literature show a good performance of the SMS-EMOA, which can be seen as a steady-state NSGA-II that uses the hypervolume contribution instead of the crowding distance as the second selection criterion. + +This paper conducts the first rigorous runtime analysis of the SMS-EMOA for many-objective optimization. To this aim, we first propose a many-objective counterpart, the m-objective mOJZJ problem, of the bi-objective OJZJ benchmark, which is the first many-objective multimodal benchmark used in a mathematical runtime analysis. We prove that SMS-EMOA computes the full Pareto front of this benchmark in an expected number of O(M^2 n^k) iterations, where n denotes the problem size (length of the bit-string representation), k the gap size (a difficulty parameter of the problem), and M=(2n/m-2k+3)^(m/2) the size of the Pareto front. This result together with the existing negative result on the original NSGA-II shows that in principle, the general approach of the NSGA-II is suitable for many-objective optimization, but the crowding distance as tie-breaker has deficiencies. + +We obtain three additional insights on the SMS-EMOA. Different from a recent result for the bi-objective OJZJ benchmark, the stochastic population update often does not help for mOJZJ. It results in a 1/Θ(min(Mk^(1/2)/2^(k/2),1)) speed-up, which is Θ(1) for large m such as m>k. On the positive side, we prove that heavy-tailed mutation still results in a speed-up of order k^(0.5+k-β). Finally, we conduct the first runtime analyses of the SMS-EMOA on the bi-objective OneMinMax and LOTZ benchmarks and show that it has a performance comparable to the GSEMO and the NSGA-II. \ No newline at end of file diff --git a/data/2024/aaai/Runtime vs. Extracted Proof Size: An Exponential Gap for CDCL on QBFs b/data/2024/aaai/Runtime vs. Extracted Proof Size: An Exponential Gap for CDCL on QBFs new file mode 100644 index 0000000000..c772f72128 --- /dev/null +++ b/data/2024/aaai/Runtime vs. 
Extracted Proof Size: An Exponential Gap for CDCL on QBFs @@ -0,0 +1,3 @@ +Conflict-driven clause learning (CDCL) is the dominating algorithmic paradigm for SAT solving and hugely successful in practice. In its lifted version QCDCL, it is one of the main approaches for solving quantified Boolean formulas (QBF). + +In both SAT and QBF, proofs can be efficiently extracted from runs of (Q)CDCL solvers. While for CDCL, it is known that the proof size in the underlying proof system propositional resolution matches the CDCL runtime up to a polynomial factor, we show that in QBF there is an exponential gap between QCDCL runtime and the size of the extracted proofs in QBF resolution systems. We demonstrate that this is not just a gap between QCDCL runtime and the size of any QBF resolution proof, but even the extracted proofs are exponentially smaller for some instances. Hence searching for a small proof via QCDCL (even with non-deterministic decision policies) will provably incur an exponential overhead for some instances. \ No newline at end of file diff --git a/data/2024/aaai/S2CycleDiff: Spatial-Spectral-Bilateral Cycle-Diffusion Framework for Hyperspectral Image Super-resolution b/data/2024/aaai/S2CycleDiff: Spatial-Spectral-Bilateral Cycle-Diffusion Framework for Hyperspectral Image Super-resolution new file mode 100644 index 0000000000..8e44522d09 --- /dev/null +++ b/data/2024/aaai/S2CycleDiff: Spatial-Spectral-Bilateral Cycle-Diffusion Framework for Hyperspectral Image Super-resolution @@ -0,0 +1 @@ +Hyperspectral image super-resolution (HISR) is a technique that can break through the limitations of the imaging mechanism to obtain a hyperspectral image (HSI) with high spatial resolution. Although some progress has been achieved by existing methods, most of them directly learn the spatial-spectral joint mapping between the observed images and the target high-resolution HSI (HrHSI), failing to fully preserve the spectral distribution of the low-resolution HSI (LrHSI) and the spatial distribution of the high-resolution multispectral imagery (HrMSI). To this end, we propose a spatial-spectral-bilateral cycle-diffusion framework (S2CycleDiff) for HISR, which can step-wise generate the HrHSI with high spatial-spectral fidelity by learning the conditional distribution of the spatial and spectral super-resolution processes bilaterally. Specifically, a customized conditional cycle-diffusion framework is designed as the backbone to achieve the spatial-spectral-bilateral super-resolution by repeated refinement, wherein the spatial/spectral guided pyramid denoising (SGPD) module separately takes HrMSI and LrHSI as the guiding factors to achieve spatial detail injection and spectral correction. The outputs of the conditional cycle-diffusion framework are fed into a complementary fusion block to integrate the spatial and spectral details to generate the desired HrHSI. Experiments have been conducted on three widely used datasets to demonstrate the superiority of the proposed method over state-of-the-art HISR methods. The code is available at https://github.com/Jiahuiqu/S2CycleDiff.
\ No newline at end of file diff --git a/data/2024/aaai/S2WAT: Image Style Transfer via Hierarchical Vision Transformer Using Strips Window Attention b/data/2024/aaai/S2WAT: Image Style Transfer via Hierarchical Vision Transformer Using Strips Window Attention new file mode 100644 index 0000000000..6069dc624c --- /dev/null +++ b/data/2024/aaai/S2WAT: Image Style Transfer via Hierarchical Vision Transformer Using Strips Window Attention @@ -0,0 +1 @@ +Transformer's recent integration into style transfer leverages its proficiency in establishing long-range dependencies, albeit at the expense of attenuated local modeling. This paper introduces Strips Window Attention Transformer (S2WAT), a novel hierarchical vision transformer designed for style transfer. S2WAT employs attention computation in diverse window shapes to capture both short- and long-range dependencies. The merged dependencies utilize the "Attn Merge" strategy, which adaptively determines spatial weights based on their relevance to the target. Extensive experiments on representative datasets show the proposed method's effectiveness compared to state-of-the-art (SOTA) transformer-based and other approaches. The code and pre-trained models are available at https://github.com/AlienZhang1996/S2WAT. \ No newline at end of file diff --git a/data/2024/aaai/S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment b/data/2024/aaai/S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment new file mode 100644 index 0000000000..b496b31d22 --- /dev/null +++ b/data/2024/aaai/S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment @@ -0,0 +1 @@ +Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite the success, most traditional VLMs-based methods are restricted by the assumption of partial source supervision or ideal target vocabularies, which rarely satisfy the open-world scenario. In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. To address the new problem, we propose the Self Structural Semantic Alignment (S3A) framework, which extracts the structural semantic information from unlabeled data while simultaneously self-learning. Our S3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. Our CVPR algorithm includes iterative clustering on images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to discern confusing candidates, and realigning images and the vocabulary as structural semantic alignment. Finally, we propose to self-train the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that the S3A method substantially improves over existing VLMs-based approaches, achieving a more than 15% accuracy improvement over CLIP on average. Our codes, models, and prompts are publicly released at https://github.com/sheng-eatamath/S3A. 
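To make the cluster-and-vote step of the CVPR loop described above concrete, here is a minimal sketch under simplifying assumptions: it presumes precomputed, L2-normalized CLIP image and vocabulary text embeddings, and uses plain k-means plus majority voting. All names are illustrative; this is not the authors' released implementation (see their repository for that).

```python
# Minimal cluster-then-vote sketch (illustrative, not the S3A code).
# Assumes `image_feats` (N x D) and `vocab_feats` (V x D) are L2-normalized
# CLIP image / text embeddings computed elsewhere.
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_vote(image_feats, vocab_feats, n_clusters):
    """Group images and let each cluster vote for a candidate class from the vocabulary."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(image_feats)
    candidates = {}
    for c in range(n_clusters):
        members = image_feats[labels == c]
        if len(members) == 0:
            continue
        # Each member votes for its most similar vocabulary entry (cosine similarity).
        votes = (members @ vocab_feats.T).argmax(axis=1)
        candidates[c] = int(np.bincount(votes).argmax())  # majority-voted class index
    return labels, candidates
```

In the full framework these voted candidates would then be refined with LLM-generated prompts and realigned with the images, which the sketch does not attempt to reproduce.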
\ No newline at end of file diff --git a/data/2024/aaai/SALSA: Semantically-Aware Latent Space Autoencoder b/data/2024/aaai/SALSA: Semantically-Aware Latent Space Autoencoder new file mode 100644 index 0000000000..18f9dc75b6 --- /dev/null +++ b/data/2024/aaai/SALSA: Semantically-Aware Latent Space Autoencoder @@ -0,0 +1 @@ +In deep learning for drug discovery, molecular representations are often based on sequences, known as SMILES, which allow for straightforward implementation of natural language processing methodologies, one being the sequence-to-sequence autoencoder. However, we observe that training an autoencoder solely on SMILES is insufficient to learn molecular representations that are semantically meaningful, where semantics are specified by the structural (graph-to-graph) similarities between molecules. We demonstrate by example that SMILES-based autoencoders may map structurally similar molecules to distant codes, resulting in an incoherent latent space that does not necessarily respect the semantic similarities between molecules. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA) for molecular representations: a SMILES-based transformer autoencoder modified with a contrastive task aimed at learning graph-to-graph similarities between molecules. To accomplish this, we develop a novel dataset comprising sets of structurally similar molecules and opt for a supervised contrastive loss that is able to incorporate full sets of positive samples. We evaluate the semantic awareness of SALSA representations by comparing to its ablated counterparts, and show empirically that SALSA learns representations that maintain 1) structural awareness, 2) physicochemical awareness, 3) biological awareness, and 4) semantic continuity. \ No newline at end of file diff --git a/data/2024/aaai/SAM-PARSER: Fine-Tuning SAM Efficiently by Parameter Space Reconstruction b/data/2024/aaai/SAM-PARSER: Fine-Tuning SAM Efficiently by Parameter Space Reconstruction new file mode 100644 index 0000000000..cadfb8d365 --- /dev/null +++ b/data/2024/aaai/SAM-PARSER: Fine-Tuning SAM Efficiently by Parameter Space Reconstruction @@ -0,0 +1 @@ +The Segment Anything Model (SAM) has received remarkable attention as it offers a powerful and versatile solution for object segmentation in images. However, fine-tuning SAM for downstream segmentation tasks under different scenarios remains a challenge, as the varied characteristics of different scenarios naturally require diverse model parameter spaces. Most existing fine-tuning methods attempt to bridge the gaps among different scenarios by introducing a set of new parameters to modify SAM's original parameter space. Unlike these works, in this paper, we propose fine-tuning SAM efficiently by parameter space reconstruction (SAM-PARSER), which introduces nearly zero trainable parameters during fine-tuning. In SAM-PARSER, we assume that SAM's original parameter space is relatively complete, so that its bases are able to reconstruct the parameter space of a new scenario. We obtain the bases by matrix decomposition and fine-tune the coefficients to reconstruct the parameter space tailored to the new scenario through an optimal linear combination of the bases. Experimental results show that SAM-PARSER exhibits superior segmentation performance across various scenarios, while reducing the number of trainable parameters by approximately 290 times compared with current parameter-efficient fine-tuning methods.
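As a rough illustration of the decompose-and-recombine idea described above (obtain bases by matrix decomposition, then fine-tune only the combination coefficients), the following is a hedged PyTorch sketch on a single linear layer. The class name and the choice of SVD as the decomposition are assumptions made for illustration, not the paper's exact procedure.

```python
# Illustrative sketch only (not SAM-PARSER's released code): factor a frozen weight
# matrix with SVD, keep the bases U and Vh fixed, and fine-tune only the coefficients
# that recombine them, so almost no new trainable parameters are introduced.
import torch
import torch.nn as nn

class CoefficientTunedLinear(nn.Module):
    def __init__(self, weight, bias=None):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)           # frozen left basis
        self.register_buffer("Vh", Vh)         # frozen right basis
        self.coeff = nn.Parameter(S.clone())   # the only trainable parameters
        self.register_buffer("bias", bias if bias is not None else torch.zeros(weight.shape[0]))

    def forward(self, x):
        # Reconstruct the weight as a linear combination of the fixed bases.
        w = self.U @ torch.diag(self.coeff) @ self.Vh
        return x @ w.T + self.bias

# Usage: wrap an existing layer's weight and train only `coeff`.
layer = CoefficientTunedLinear(torch.randn(256, 768))
out = layer(torch.randn(4, 768))               # -> shape (4, 256)
```

The number of trainable values here equals the number of singular values, which is how a decomposition-based reconstruction can stay close to "nearly zero" added parameters.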
\ No newline at end of file diff --git a/data/2024/aaai/SAME: Sample Reconstruction against Model Extraction Attacks b/data/2024/aaai/SAME: Sample Reconstruction against Model Extraction Attacks new file mode 100644 index 0000000000..c09b64bcab --- /dev/null +++ b/data/2024/aaai/SAME: Sample Reconstruction against Model Extraction Attacks @@ -0,0 +1 @@ +While deep learning models have shown significant performance across various domains, their deployment needs extensive resources and advanced computing infrastructure. As a solution, Machine Learning as a Service (MLaaS) has emerged, lowering the barriers for users to release or productize their deep learning models. However, previous studies have highlighted potential privacy and security concerns associated with MLaaS, and one primary threat is model extraction attacks. To address this, many defense solutions have been proposed, but they suffer from unrealistic assumptions and generalization issues, making them less practical for reliable protection. Driven by these limitations, we introduce a novel defense mechanism, SAME, based on the concept of sample reconstruction. This strategy imposes minimal prerequisites on the defender's capabilities, eliminating the need for auxiliary Out-of-Distribution (OOD) datasets, user query history, white-box model access, and additional intervention during model training. It is compatible with existing active defense methods. Our extensive experiments corroborate the superior efficacy of SAME over state-of-the-art solutions. Our code is available at https://github.com/xythink/SAME. \ No newline at end of file diff --git a/data/2024/aaai/SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model b/data/2024/aaai/SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model new file mode 100644 index 0000000000..52699cb8ce --- /dev/null +++ b/data/2024/aaai/SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model @@ -0,0 +1 @@ +Optical Flow Estimation aims to find the 2D dense motion field between two frames. Due to the limitations of model structures and training datasets, existing methods often rely too much on local clues and ignore the integrity of objects, resulting in fragmented motion estimation. Through theoretical analysis, we find that pre-trained large vision models are helpful in optical flow estimation, and we notice that the recent Segment Anything Model (SAM) demonstrates a strong ability to segment complete objects, which is suitable for solving the fragmentation problem. We thus propose a solution to embed the frozen SAM image encoder into FlowFormer to enhance object perception. To address the challenge of deeply utilizing SAM in non-segmentation tasks like optical flow estimation, we propose an Optical Flow Task-Specific Adaption scheme, including a Context Fusion Module to fuse the SAM encoder with the optical flow context encoder, and a Context Adaption Module to adapt the SAM features for the optical flow task with a Learned Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10 clean/final EPE and 3.55/12.32 EPE/F1-all on the Sintel and KITTI-15 training sets, surpassing FlowFormer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks, ranking #1 among all two-frame methods on the Sintel clean pass.
\ No newline at end of file diff --git a/data/2024/aaai/SAT-Based Algorithms for Regular Graph Pattern Matching b/data/2024/aaai/SAT-Based Algorithms for Regular Graph Pattern Matching new file mode 100644 index 0000000000..432761a817 --- /dev/null +++ b/data/2024/aaai/SAT-Based Algorithms for Regular Graph Pattern Matching @@ -0,0 +1,3 @@ +Graph matching is a fundamental problem in pattern recognition, with many applications such as software analysis and computational biology. One well-known type of graph matching problem is graph isomorphism, which consists of deciding if two graphs are identical. Despite its usefulness, the properties that one may check using graph isomorphism are rather limited, since it only allows strict equality checks between two graphs. For example, it does not allow one to check complex structural properties such as if the target graph is an arbitrary length sequence followed by an arbitrary size loop. + +We propose a generalization of graph isomorphism that allows one to check such properties through a declarative specification. This specification is given in the form of a Regular Graph Pattern (ReGaP), a special type of graph, inspired by regular expressions, that may contain wildcard nodes that represent arbitrary structures such as variable-sized sequences or subgraphs. We propose a SAT-based algorithm for checking if a target graph matches a given ReGaP. We also propose a preprocessing technique for improving the performance of the algorithm and evaluate it through an extensive experimental evaluation on benchmarks from the CodeSearchNet dataset. \ No newline at end of file diff --git a/data/2024/aaai/SAT-Based Techniques for Lexicographically Smallest Finite Models b/data/2024/aaai/SAT-Based Techniques for Lexicographically Smallest Finite Models new file mode 100644 index 0000000000..f3d1712738 --- /dev/null +++ b/data/2024/aaai/SAT-Based Techniques for Lexicographically Smallest Finite Models @@ -0,0 +1,3 @@ +This paper proposes SAT-based techniques to calculate a specific normal form of a given finite mathematical structure (model). The normal form is obtained by permuting the domain elements so that the representation of the structure is lexicographically smallest possible. Such a normal form is of interest to mathematicians as it enables easy cataloging of algebraic structures. In particular, two structures are isomorphic precisely when their normal forms are the same. This form is also natural to inspect as mathematicians have been using it routinely for many decades. + +We develop a novel approach where a SAT solver is used in a black-box fashion to compute the smallest representative. The approach constructs the representative gradually and searches the space of possible isomorphisms, requiring a small number of variables. However, the approach may lead to a large number of SAT calls and therefore we devise propagation techniques to reduce this number. The paper focuses on finite structures with a single binary operation (encompassing groups, semigroups, etc.). However, the approach is generalizable to arbitrary finite structures. We provide an implementation of the proposed algorithm and evaluate it on a variety of algebraic structures. 
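To make the notion of a lexicographically smallest representative concrete, the following is a naive brute-force sketch for a single binary operation on a tiny domain. It enumerates all permutations, which is exactly the blow-up that the paper's SAT-based search avoids; the function name and the small semigroup example are illustrative only.

```python
# Naive illustration of the "lexicographically smallest" normal form of a finite
# structure with one binary operation: try every permutation of the domain and keep
# the smallest flattened operation table. Feasible only for tiny domains; the paper
# replaces this exhaustive search with carefully guided SAT calls.
from itertools import permutations

def smallest_representative(table):
    """table[i][j] is the result of i * j over the domain {0, ..., n-1}."""
    n = len(table)
    best = None
    for perm in permutations(range(n)):
        inv = [0] * n
        for new, old in enumerate(perm):
            inv[old] = new
        # Relabel the operation table under the permutation old -> inv[old].
        relabeled = tuple(inv[table[perm[i]][perm[j]]] for i in range(n) for j in range(n))
        if best is None or relabeled < best:
            best = relabeled
    return best

# Two isomorphic 2-element semigroups (logical AND and OR tables) share one normal form.
assert smallest_representative([[0, 0], [0, 1]]) == smallest_representative([[0, 1], [1, 1]])
```

Because two structures are isomorphic exactly when their normal forms coincide, this form doubles as an isomorphism test, which is what makes it useful for cataloging.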
\ No newline at end of file diff --git a/data/2024/aaai/SAT-Based Tree Decomposition with Iterative Cascading Policy Selection b/data/2024/aaai/SAT-Based Tree Decomposition with Iterative Cascading Policy Selection new file mode 100644 index 0000000000..b8363f5839 --- /dev/null +++ b/data/2024/aaai/SAT-Based Tree Decomposition with Iterative Cascading Policy Selection @@ -0,0 +1,5 @@ +Solvers for propositional satisfiability (SAT) effectively tackle hard optimization problems. However, translating to SAT can cause a significant size increase, restricting its use to smaller instances. To mitigate this, frameworks using multiple local SAT calls for gradually improving a heuristic solution have been proposed. The performance of such algorithmic frameworks heavily relies on critical parameters, including the size of selected local instances and the time allocated per SAT call. + +This paper examines the automated configuration of the treewidth SAT-based local improvement method (TW-SLIM) framework, which uses multiple SAT calls for computing tree decompositions of small width, a fundamental problem in combinatorial optimization. We explore various TW-SLIM configuration methods, including offline learning and real-time adjustments, significantly outperforming default settings in multi-SAT scenarios with changing problems. + +Building upon insights gained from offline training and real-time configurations for TW-SLIM, we propose the iterative cascading policy, a novel hybrid technique that uniquely combines both. The iterative cascading policy employs a pool of 30 configurations obtained through clustering-based offline methods, deploying them in dynamic cascades across multiple rounds. In each round, the 30 configurations are tested according to the cascading ordering, and the best tree decomposition is retained for further improvement, with the option to adjust the following ordering of cascades. This iterative approach significantly enhances the performance of TW-SLIM beyond baseline results, even within varying global timeouts. This highlights the effectiveness of the proposed iterative cascading policy in enhancing the efficiency and efficacy of complex algorithmic frameworks like TW-SLIM. \ No newline at end of file diff --git a/data/2024/aaai/SAUI: Scale-Aware Unseen Imagineer for Zero-Shot Object Detection b/data/2024/aaai/SAUI: Scale-Aware Unseen Imagineer for Zero-Shot Object Detection new file mode 100644 index 0000000000..6ee8211579 --- /dev/null +++ b/data/2024/aaai/SAUI: Scale-Aware Unseen Imagineer for Zero-Shot Object Detection @@ -0,0 +1 @@ +Zero-shot object detection (ZSD) aims to localize and classify unseen objects without access to their training annotations. As a prevailing solution to ZSD, generation-based methods synthesize unseen visual features by taking seen features as references and class semantic embeddings as guidelines. Although previous works continuously improve the synthesis quality, they fail to consider the scale-varying nature of unseen objects. The generation process is performed over a single scale of object features and thus lacks scale-diversity among synthesized features. In this paper, we reveal the scale-varying challenge in ZSD and propose a Scale-Aware Unseen Imagineer (SAUI) to lead the way of a novel scale-aware ZSD paradigm. To obtain multi-scale features of seen-class objects, we design a specialized coarse-to-fine extractor to capture features through multiple scale-views.
To generate unseen features scale by scale, we introduce a Series-GAN synthesizer along with three scale-aware contrastive components to imagine separable, diverse and robust scale-wise unseen features. Extensive experiments on the PASCAL VOC, COCO and DIOR datasets demonstrate SAUI's better performance in different scenarios, especially for scale-varying and small objects. Notably, SAUI achieves new state-of-the-art performance on COCO and DIOR. \ No newline at end of file diff --git a/data/2024/aaai/SAVSR: Arbitrary-Scale Video Super-Resolution via a Learned Scale-Adaptive Network b/data/2024/aaai/SAVSR: Arbitrary-Scale Video Super-Resolution via a Learned Scale-Adaptive Network new file mode 100644 index 0000000000..0202be7743 --- /dev/null +++ b/data/2024/aaai/SAVSR: Arbitrary-Scale Video Super-Resolution via a Learned Scale-Adaptive Network @@ -0,0 +1 @@ +Deep learning-based video super-resolution (VSR) networks have gained significant performance improvements in recent years. However, existing VSR networks can only support a fixed integer-scale super-resolution task, and when we want to perform VSR at multiple scales, we need to train several models. This implementation certainly increases the consumption of computational and storage resources, which limits the application scenarios of VSR techniques. In this paper, we propose a novel Scale-adaptive Arbitrary-scale Video Super-Resolution network (SAVSR), which is the first work focusing on spatial VSR at arbitrary scales including both non-integer and asymmetric scales. We also present an omni-dimensional scale-attention convolution, which dynamically adapts according to the scale of the input to extract inter-frame features with stronger representational power. Moreover, the proposed spatio-temporal adaptive arbitrary-scale upsampling performs VSR tasks using both temporal features and scale information. In addition, we design an iterative bi-directional architecture for implicit feature alignment. Experiments at various scales on the benchmark datasets show that the proposed SAVSR outperforms state-of-the-art (SOTA) methods at non-integer and asymmetric scales. The source code is available at https://github.com/Weepingchestnut/SAVSR. \ No newline at end of file diff --git "a/data/2024/aaai/SA\302\262VP: Spatially Aligned-and-Adapted Visual Prompt" "b/data/2024/aaai/SA\302\262VP: Spatially Aligned-and-Adapted Visual Prompt" new file mode 100644 index 0000000000..e9e52a659f --- /dev/null +++ "b/data/2024/aaai/SA\302\262VP: Spatially Aligned-and-Adapted Visual Prompt" @@ -0,0 +1 @@ +As a prominent parameter-efficient fine-tuning technique in NLP, prompt tuning is now being explored for its potential in computer vision. Typical methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP, which represents an input image as a flattened sequence of token embeddings and then learns a set of unordered parameterized tokens prefixed to the sequence representation as the visual prompts for task adaptation of large vision models. While such a sequential modeling paradigm of visual prompts has shown great promise, there are two potential limitations. First, the learned visual prompts cannot model the underlying spatial relations in the input image, which is crucial for image encoding. Second, since all prompt tokens play the same role of prompting for all image tokens without distinction, it lacks the fine-grained prompting capability, i.e., individual prompting for different image tokens.
In this work, we propose the Spatially Aligned-and-Adapted Visual Prompt model (SA^2VP), which learns a two-dimensional prompt token map with a size equal (or proportionally scaled) to that of the image token map, thereby being able to spatially align with the image map. Each prompt token is designated to prompt knowledge only for the spatially corresponding image tokens. As a result, our model can conduct individual prompting for different image tokens in a fine-grained manner. Moreover, benefiting from the capability of preserving the spatial structure by the learned prompt token map, our SA^2VP is able to model the spatial relations in the input image, leading to more effective prompting. Extensive experiments on three challenging benchmarks for image classification demonstrate the superiority of our model over other state-of-the-art methods for visual prompt tuning. Code is available at https://github.com/tommy-xq/SA2VP. \ No newline at end of file diff --git a/data/2024/aaai/SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views b/data/2024/aaai/SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views new file mode 100644 index 0000000000..41389a34d1 --- /dev/null +++ b/data/2024/aaai/SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views @@ -0,0 +1 @@ +Recent neural surface reconstruction approaches using volume rendering have made much progress by achieving impressive surface reconstruction quality, but they are still limited to dense and highly accurate posed views. To overcome such drawbacks, this paper pays special attention to consistent surface reconstruction from sparse views with noisy camera poses. Unlike previous approaches, the key idea of this paper is to exploit the multi-view constraints directly from the explicit geometry of the neural surface, which can be used as effective regularization to jointly learn the neural surface and refine the camera poses. To build effective multi-view constraints, we introduce a fast differentiable on-surface intersection to generate on-surface points, and propose view-consistent losses on such differentiable points to regularize the neural surface learning. Based on this, we propose a joint learning strategy, named SC-NeuS, to perform geometry-consistent surface reconstruction in an end-to-end manner. With extensive evaluation on public datasets, our SC-NeuS achieves consistently better surface reconstruction results with fine-grained details than previous approaches, especially from sparse and noisy camera views. The source code is available at https://github.com/zouzx/sc-neus.git. \ No newline at end of file diff --git a/data/2024/aaai/SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition b/data/2024/aaai/SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition new file mode 100644 index 0000000000..22053342a7 --- /dev/null +++ b/data/2024/aaai/SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition @@ -0,0 +1 @@ +Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. Instead, this paper introduces a novel contrastive learning framework, namely the Spatiotemporal Clues Disentanglement Network (SCD-Net).
Specifically, we integrate the decoupling module with a feature extractor to derive explicit clues from spatial and temporal domains respectively. As for the training of SCD-Net, with a constructed global anchor, we encourage the interaction between the anchor and extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen the contextual associations, leveraging the latest development from masked image modelling into the proposed SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which outperforms the existing state-of-the-art (SOTA) approaches significantly. Our code and supplementary material can be found at https://github.com/cong-wu/SCD-Net. \ No newline at end of file diff --git a/data/2024/aaai/SCP: Spherical-Coordinate-Based Learned Point Cloud Compression b/data/2024/aaai/SCP: Spherical-Coordinate-Based Learned Point Cloud Compression new file mode 100644 index 0000000000..2e085ed20f --- /dev/null +++ b/data/2024/aaai/SCP: Spherical-Coordinate-Based Learned Point Cloud Compression @@ -0,0 +1 @@ +In recent years, the task of learned point cloud compression has gained prominence. An important type of point cloud, LiDAR point cloud, is generated by spinning LiDAR on vehicles. This process results in numerous circular shapes and azimuthal angle invariance features within the point clouds. However, these two features have been largely overlooked by previous methodologies. In this paper, we introduce a model-agnostic method called Spherical-Coordinate-based learned Point cloud compression (SCP), designed to fully leverage the features of circular shapes and azimuthal angle invariance. Additionally, we propose a multi-level Octree for SCP to mitigate the reconstruction error for distant areas within the Spherical-coordinate-based Octree. SCP exhibits excellent universality, making it applicable to various learned point cloud compression techniques. Experimental results demonstrate that SCP surpasses previous state-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate. \ No newline at end of file diff --git a/data/2024/aaai/SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation b/data/2024/aaai/SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation new file mode 100644 index 0000000000..48b0fb7363 --- /dev/null +++ b/data/2024/aaai/SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation @@ -0,0 +1 @@ +Recent real-time semantic segmentation methods usually adopt an additional semantic branch to pursue rich long-range context. However, the additional branch incurs undesirable computational overhead and slows inference speed. To eliminate this dilemma, we propose SCTNet, a single branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of lightweight single branch CNN. SCTNet utilizes a transformer as the training-only semantic branch considering its superb ability to extract long-range context. 
With the help of the proposed transformer-like CNN block CFBlock and the semantic information alignment module, SCTNet can capture rich semantic information from the transformer branch during training. During inference, only the single-branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves new state-of-the-art performance. The code and model are available at https://github.com/xzz777/SCTNet. \ No newline at end of file diff --git a/data/2024/aaai/SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM Optimization b/data/2024/aaai/SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM Optimization new file mode 100644 index 0000000000..2a53ba60ca --- /dev/null +++ b/data/2024/aaai/SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM Optimization @@ -0,0 +1 @@ +In this paper, we introduce Segmentation-Driven Deformation Multi-View Stereo (SD-MVS), a method that can effectively tackle challenges in the 3D reconstruction of textureless areas. We are the first to adopt the Segment Anything Model (SAM) to distinguish semantic instances in scenes and further leverage these constraints for pixelwise patch deformation on both matching cost and propagation. Concurrently, we propose a unique refinement strategy that combines spherical coordinates and gradient descent on normals and a pixelwise search interval on depths, significantly improving the completeness of the reconstructed 3D model. Furthermore, we adopt the Expectation-Maximization (EM) algorithm to alternately optimize the aggregate matching cost and hyperparameters, effectively mitigating the problem of parameters being excessively dependent on empirical tuning. Evaluations on the ETH3D high-resolution multi-view stereo benchmark and the Tanks and Temples dataset demonstrate that our method achieves state-of-the-art results with less time consumption. \ No newline at end of file diff --git a/data/2024/aaai/SDAC: A Multimodal Synthetic Dataset for Anomaly and Corner Case Detection in Autonomous Driving b/data/2024/aaai/SDAC: A Multimodal Synthetic Dataset for Anomaly and Corner Case Detection in Autonomous Driving new file mode 100644 index 0000000000..01d7246ccb --- /dev/null +++ b/data/2024/aaai/SDAC: A Multimodal Synthetic Dataset for Anomaly and Corner Case Detection in Autonomous Driving @@ -0,0 +1 @@ +Nowadays, closed-set perception methods for autonomous driving perform well on datasets containing normal scenes. However, they still struggle to handle anomalies in the real world, such as unknown objects that have never been seen during training. The lack of public datasets for evaluating model performance on anomalies and corner cases has hindered the development of reliable autonomous driving systems. Therefore, we propose a multimodal Synthetic Dataset for Anomaly and Corner case detection, called SDAC, which encompasses anomalies captured from multi-view cameras and the LiDAR sensor, providing a rich set of annotations for multiple mainstream perception tasks. SDAC is the first public dataset for autonomous driving that categorizes anomalies into object, scene, and scenario levels, allowing evaluation under different anomalous conditions. Experiments show that closed-set models suffer significant performance drops on anomaly subsets in SDAC.
Existing anomaly detection methods fail to achieve satisfactory performance, suggesting that anomaly detection remains a challenging problem. We anticipate that our SDAC dataset could foster the development of safe and reliable systems for autonomous driving. \ No newline at end of file diff --git a/data/2024/aaai/SDGAN: Disentangling Semantic Manipulation for Facial Attribute Editing b/data/2024/aaai/SDGAN: Disentangling Semantic Manipulation for Facial Attribute Editing new file mode 100644 index 0000000000..cd24a9e490 --- /dev/null +++ b/data/2024/aaai/SDGAN: Disentangling Semantic Manipulation for Facial Attribute Editing @@ -0,0 +1 @@ +Facial attribute editing has garnered significant attention, yet prevailing methods struggle with achieving precise attribute manipulation while preserving irrelevant details and controlling attribute styles. This challenge primarily arises from the strong correlations between different attributes and the interplay between attributes and identity. In this paper, we propose Semantic Disentangled GAN (SDGAN), a novel method addressing this challenge. SDGAN introduces two key concepts: a semantic disentanglement generator that assigns facial representations to distinct attribute-specific editing modules, enabling the decoupling of the facial attribute editing process, and a semantic mask alignment strategy that confines attribute editing to appropriate regions, thereby avoiding undesired modifications. Leveraging these concepts, SDGAN demonstrates accurate attribute editing and achieves high-quality attribute style manipulation through both latent-guided and reference-guided manners. We extensively evaluate our method on the CelebA-HQ database, providing both qualitative and quantitative analyses. Our results establish that SDGAN significantly outperforms state-of-the-art techniques, showcasing the effectiveness of our approach. To foster reproducibility and further research, we will provide the code for our method. \ No newline at end of file diff --git a/data/2024/aaai/SDGMNet: Statistic-Based Dynamic Gradient Modulation for Local Descriptor Learning b/data/2024/aaai/SDGMNet: Statistic-Based Dynamic Gradient Modulation for Local Descriptor Learning new file mode 100644 index 0000000000..b2e212ab0d --- /dev/null +++ b/data/2024/aaai/SDGMNet: Statistic-Based Dynamic Gradient Modulation for Local Descriptor Learning @@ -0,0 +1 @@ +Rescaling the backpropagated gradient of contrastive loss has made significant progress in descriptor learning. However, current gradient modulation strategies have no regard for the varying distribution of global gradients, so they would suffer from changes in training phases or datasets. In this paper, we propose a dynamic gradient modulation, named SDGMNet, for contrastive local descriptor learning. The core of our method is formulating modulation functions with dynamically estimated statistical characteristics. Firstly, we introduce angle for distance measure after deep analysis on backpropagation of pair-wise loss. On this basis, auto-focus modulation is employed to moderate the impact of statistically uncommon individual pairs in stochastic gradient descent optimization; probabilistic margin cuts off the gradients of proportional triplets that have achieved enough optimization; power adjustment balances the total weights of negative pairs and positive pairs. 
Extensive experiments demonstrate that our novel descriptor surpasses previous state-of-the-art methods in several tasks including patch verification, retrieval, pose estimation, and 3D reconstruction. \ No newline at end of file diff --git a/data/2024/aaai/SEA-GWNN: Simple and Effective Adaptive Graph Wavelet Neural Network b/data/2024/aaai/SEA-GWNN: Simple and Effective Adaptive Graph Wavelet Neural Network new file mode 100644 index 0000000000..6fd2452cc6 --- /dev/null +++ b/data/2024/aaai/SEA-GWNN: Simple and Effective Adaptive Graph Wavelet Neural Network @@ -0,0 +1 @@ +The utilization of wavelet-based techniques in graph neural networks (GNNs) has gained considerable attention, particularly in the context of node classification. Although existing wavelet-based approaches have shown promise, they are constrained by their reliance on pre-defined wavelet filters, rendering them incapable of effectively adapting to signals that reside on graphs based on the task at hand. Recent research endeavors address this issue through the introduction of a wavelet lifting transform. However, this technique necessitates the use of bipartite graphs, causing a transformation of the original graph structure into a bipartite configuration. This alteration of graph topology results in the generation of undesirable wavelet filters, thereby undermining the effectiveness of the method. In response to these challenges, we propose a novel, simple, and effective adaptive graph wavelet neural network (SEA-GWNN) class that employs the lifting scheme on arbitrary graph structures while upholding the original graph topology by leveraging multi-hop computation trees. A noteworthy aspect of the approach is the focus on local substructures represented as acyclic trees, wherein the lifting strategy is applied in a localized manner. This locally defined lifting scheme effectively combines high-pass and low-pass frequency information to enhance node representations. Furthermore, to reduce computing costs, we propose to decouple the higher-order lifting operators and induce them from the lower-order structures. Finally, we benchmark our model on several real-world datasets spanning four distinct categories, including citation networks, webpages, the film industry, and large-scale graphs, and the experimental results showcase the efficacy of the proposed SEA-GWNN. \ No newline at end of file diff --git a/data/2024/aaai/SEC: More Accurate Clustering Algorithm via Structural Entropy b/data/2024/aaai/SEC: More Accurate Clustering Algorithm via Structural Entropy new file mode 100644 index 0000000000..4fa9a31149 --- /dev/null +++ b/data/2024/aaai/SEC: More Accurate Clustering Algorithm via Structural Entropy @@ -0,0 +1 @@ +As one of the most popular machine learning tools in the field of unsupervised learning, clustering has been widely used in various practical applications. While numerous methods have been proposed for clustering, a commonly encountered issue is that the existing clustering methods rely heavily on local neighborhood information during the optimization process, which leads to suboptimal performance on real-world datasets. Besides, most existing clustering methods use Euclidean distances or densities to measure the similarity between data points. This could constrain the effectiveness of the algorithms for handling datasets with irregular patterns. Thus, a key challenge is how to effectively capture the global structural information in clustering instances to improve the clustering quality.
In this paper, we propose a new clustering algorithm, called SEC. This algorithm uses the global structural information extracted from an encoding tree to guide the clustering optimization process. Based on the relation between data points in the instance, a sparse graph of the clustering instance can be constructed. By leveraging the sparse graph constructed, we propose an iterative encoding tree method, where hierarchical abstractions of the encoding tree are iteratively extracted as new clustering features to obtain better clustering results. To avoid the influence of easily misclustered data points located on the boundaries of the clustering partitions, which we call "fringe points", we propose an iterative pre-deletion and reassignment technique such that the algorithm can delete and reassign the "fringe points" to obtain more resilient and precise clustering results. Empirical experiments on both synthetic and real-world datasets demonstrate that our proposed algorithm outperforms state-of-the-art clustering methods and achieves better clustering performances. On average, the clustering accuracy (ACC) is increased by 1.7% and the normalized mutual information (NMI) by 7.9% compared with the current state-of-the-art (SOTA) algorithm on synthetic datasets. On real-world datasets, our method outperforms other clustering methods with an average increase of 12.3% in ACC and 5.2% in NMI, respectively. \ No newline at end of file diff --git a/data/2024/aaai/SECap: Speech Emotion Captioning with Large Language Model b/data/2024/aaai/SECap: Speech Emotion Captioning with Large Language Model new file mode 100644 index 0000000000..aa1239724c --- /dev/null +++ b/data/2024/aaai/SECap: Speech Emotion Captioning with Large Language Model @@ -0,0 +1 @@ +Speech emotions are crucial in human communication and are extensively used in fields like speech synthesis and natural language understanding. Most prior studies, such as speech emotion recognition, have categorized speech emotions into a fixed set of classes. Yet, emotions expressed in human speech are often complex, and categorizing them into predefined groups can be insufficient to adequately represent speech emotions. On the contrary, describing speech emotions directly by means of natural language may be a more effective approach. Regrettably, there are not many studies available that have focused on this direction. Therefore, this paper proposes a speech emotion captioning framework named SECap, aiming at effectively describing speech emotions using natural language. Owing to the impressive capabilities of large language models in language comprehension and text generation, SECap employs LLaMA as the text decoder to allow the production of coherent speech emotion captions. In addition, SECap leverages HuBERT as the audio encoder to extract general speech features and Q-Former as the Bridge-Net to provide LLaMA with emotion-related speech features. To accomplish this, Q-Former utilizes mutual information learning to disentangle emotion-related speech features and speech contents, while implementing contrastive learning to extract more emotion-related speech features. The results of objective and subjective evaluations demonstrate that: 1) the SECap framework outperforms the HTSAT-BART baseline in all objective evaluations; 2) SECap can generate high-quality speech emotion captions that attain performance on par with human annotators in subjective mean opinion score tests. 
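As one plausible form of the contrastive component mentioned above, here is a generic InfoNCE-style loss that pulls each speech clip's emotion embedding toward its paired caption embedding. This is a standard recipe shown only for illustration; it makes no claim about SECap's actual objective, encoders, or hyperparameters, and the function and variable names are hypothetical.

```python
# Generic InfoNCE-style contrastive loss (illustrative only, not SECap's implementation).
import torch
import torch.nn.functional as F

def info_nce(speech_emb, caption_emb, temperature=0.07):
    """speech_emb, caption_emb: (B, D) paired embeddings; matched pairs are positives."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = speech_emb @ caption_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(speech_emb.size(0), device=speech_emb.device)
    # Symmetric cross-entropy: each speech clip should match its own caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```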
\ No newline at end of file diff --git a/data/2024/aaai/SEER: Backdoor Detection for Vision-Language Models through Searching Target Text and Image Trigger Jointly b/data/2024/aaai/SEER: Backdoor Detection for Vision-Language Models through Searching Target Text and Image Trigger Jointly new file mode 100644 index 0000000000..48ad6139b7 --- /dev/null +++ b/data/2024/aaai/SEER: Backdoor Detection for Vision-Language Models through Searching Target Text and Image Trigger Jointly @@ -0,0 +1 @@ +This paper proposes SEER, a novel backdoor detection algorithm for vision-language models, addressing the gap in the literature on multi-modal backdoor detection. While backdoor detection in single-modal models has been well studied, the investigation of such defenses in multi-modal models remains limited. Existing backdoor defense mechanisms cannot be directly applied to multi-modal settings due to their increased complexity and search space explosion. In this paper, we propose to detect backdoors in vision-language models by jointly searching for image triggers and malicious target texts in the feature space shared by the vision and language modalities. Our extensive experiments demonstrate that SEER can achieve a detection rate of over 92% on backdoor detection in vision-language models in various settings without accessing training data or knowledge of downstream tasks. \ No newline at end of file diff --git a/data/2024/aaai/SEIT: Structural Enhancement for Unsupervised Image Translation in Frequency Domain b/data/2024/aaai/SEIT: Structural Enhancement for Unsupervised Image Translation in Frequency Domain new file mode 100644 index 0000000000..0a281c9fed --- /dev/null +++ b/data/2024/aaai/SEIT: Structural Enhancement for Unsupervised Image Translation in Frequency Domain @@ -0,0 +1 @@ +For the task of unsupervised image translation, transforming the image style while preserving its original structure remains challenging. In this paper, we propose an unsupervised image translation method with structural enhancement in the frequency domain, named SEIT. Specifically, a frequency dynamic adaptive (FDA) module is designed for image style transformation that can effectively transfer the image style while maintaining the overall structure by decoupling the image content and style in the frequency domain. Moreover, a wavelet-based structure enhancement (WSE) module is proposed to improve the intermediate translation results by matching the high-frequency information, thus enriching the structural details. Furthermore, a multi-scale network architecture is designed to extract the domain-specific information using image-independent encoders for both the source and target domains. The extensive experimental results demonstrate the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/SENCR: A Span Enhanced Two-Stage Network with Counterfactual Rethinking for Chinese NER b/data/2024/aaai/SENCR: A Span Enhanced Two-Stage Network with Counterfactual Rethinking for Chinese NER new file mode 100644 index 0000000000..508a9f163d --- /dev/null +++ b/data/2024/aaai/SENCR: A Span Enhanced Two-Stage Network with Counterfactual Rethinking for Chinese NER @@ -0,0 +1 @@ +Recently, many works that incorporate external lexicon information into character-level Chinese named entity recognition (NER) to overcome the lack of natural word delimiters have achieved strong performance. However, obtaining and maintaining high-quality lexicons is costly, especially in specialized domains.
In addition, the entity boundary bias caused by high mention coverage in some boundary characters poses a significant challenge to the generalization of NER models but receives little attention in the existing literature. To address these issues, we propose SENCR, a Span Enhanced Two-Stage Network with Counterfactual Rethinking for Chinese NER, which contains a boundary detector for boundary supervision, a convolution-based type classifier for better span representation, and a counterfactual rethinking (CR) strategy for debiased boundary detection during inference. The proposed boundary detector and type classifier are jointly trained with the same contextual encoder, and then the trained boundary detector is debiased by our proposed CR strategy without modifying any model parameters in the inference stage. Extensive experiments on four Chinese NER datasets show the effectiveness of our proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation b/data/2024/aaai/SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation new file mode 100644 index 0000000000..9ee7d43df9 --- /dev/null +++ b/data/2024/aaai/SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation @@ -0,0 +1 @@ +Image-level weakly supervised semantic segmentation has received increasing attention due to its low annotation cost. Existing methods mainly rely on Class Activation Mapping (CAM) to obtain pseudo-labels for training semantic segmentation models. In this work, we are the first to demonstrate that a long-tailed distribution in the training data can cause the CAM calculated through classifier weights to be over-activated for head classes and under-activated for tail classes, due to the shared features among head and tail classes. This degrades pseudo-label quality and further influences final semantic segmentation performance. To address this issue, we propose a Shared Feature Calibration (SFC) method for CAM generation. Specifically, we leverage the class prototypes, which carry positive shared features, and propose a Multi-Scaled Distribution-Weighted (MSDW) consistency loss for narrowing the gap between the CAMs generated through classifier weights and class prototypes during training. The MSDW loss counterbalances over-activation and under-activation by calibrating the shared features in head-/tail-class classifier weights. Experimental results show that our SFC significantly improves CAM boundaries and achieves new state-of-the-art performances. The project is available at https://github.com/Barrett-python/SFC. \ No newline at end of file diff --git a/data/2024/aaai/SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation b/data/2024/aaai/SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation new file mode 100644 index 0000000000..33988e88b1 --- /dev/null +++ b/data/2024/aaai/SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation @@ -0,0 +1 @@ +In this paper, we propose a novel model called SGFormer, a Semantic Graph TransFormer for point cloud-based 3D scene graph generation. The task aims to parse a point cloud-based scene into a semantic structural graph, with the core challenge of modeling the complex global structure. Existing methods based on graph convolutional networks (GCNs) suffer from the over-smoothing dilemma and can only propagate information from limited neighboring nodes.
In contrast, SGFormer uses Transformer layers as the base building block to allow global information passing, with two types of newly-designed layers tailored for the 3D scene graph generation task. Specifically, we introduce the graph embedding layer to best utilize the global information in graph edges while maintaining comparable computation costs. Furthermore, we propose the semantic injection layer to leverage linguistic knowledge from large-scale language model (i.e., ChatGPT), to enhance objects' visual features. We benchmark our SGFormer on the established 3DSSG dataset and achieve a 40.94% absolute improvement in relationship prediction's R@50 and an 88.36% boost on the subset with complex scenes over the state-of-the-art. Our analyses further show SGFormer's superiority in the long-tail and zero-shot scenarios. Our source code is available at https://github.com/Andy20178/SGFormer. \ No newline at end of file diff --git a/data/2024/aaai/SGNet: Structure Guided Network via Gradient-Frequency Awareness for Depth Map Super-resolution b/data/2024/aaai/SGNet: Structure Guided Network via Gradient-Frequency Awareness for Depth Map Super-resolution new file mode 100644 index 0000000000..f4726e3301 --- /dev/null +++ b/data/2024/aaai/SGNet: Structure Guided Network via Gradient-Frequency Awareness for Depth Map Super-resolution @@ -0,0 +1 @@ +Depth super-resolution (DSR) aims to restore high-resolution (HR) depth from low-resolution (LR) one, where RGB image is often used to promote this task. Recent image guided DSR approaches mainly focus on spatial domain to rebuild depth structure. However, since the structure of LR depth is usually blurry, only considering spatial domain is not very sufficient to acquire satisfactory results. In this paper, we propose structure guided network (SGNet), a method that pays more attention to gradient and frequency domains, both of which have the inherent ability to capture high-frequency structure. Specifically, we first introduce the gradient calibration module (GCM), which employs the accurate gradient prior of RGB to sharpen the LR depth structure. Then we present the Frequency Awareness Module (FAM) that recursively conducts multiple spectrum differencing blocks (SDB), each of which propagates the precise high-frequency components of RGB into the LR depth. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our SGNet, reaching the state-of-the-art (see Fig. 1). Codes and pre-trained models are available at https://github.com/yanzq95/SGNet. \ No newline at end of file diff --git a/data/2024/aaai/SHAP@k: Efficient and Probably Approximately Correct (PAC) Identification of Top-K Features b/data/2024/aaai/SHAP@k: Efficient and Probably Approximately Correct (PAC) Identification of Top-K Features new file mode 100644 index 0000000000..449021cae3 --- /dev/null +++ b/data/2024/aaai/SHAP@k: Efficient and Probably Approximately Correct (PAC) Identification of Top-K Features @@ -0,0 +1 @@ +The SHAP framework provides a principled method to explain the predictions of a model by computing feature importance. Motivated by applications in finance, we introduce the Top-k Identification Problem (TkIP) (and its ordered variant TkIP- O), where the objective is to identify the subset (or ordered subset for TkIP-O) of k features corresponding to the highest SHAP values with PAC guarantees. 
While any sampling-based method that estimates SHAP values (such as KernelSHAP and SamplingSHAP) can be trivially adapted to solve TkIP, doing so is highly sample inefficient. Instead, we leverage the connection between SHAP values and multi-armed bandits (MAB) to show that both TkIP and TkIP-O can be reduced to variants of problems in MAB literature. This reduction allows us to use insights from the MAB literature to develop sample-efficient variants of KernelSHAP and SamplingSHAP. We propose KernelSHAP@k and SamplingSHAP@k for solving TkIP; along with KernelSHAP-O and SamplingSHAP-O to solve the ordering problem in TkIP-O. We perform extensive experiments using several credit-related datasets to show that our methods offer significant improvements of up to 40× in sample efficiency and 39× in runtime. \ No newline at end of file diff --git a/data/2024/aaai/SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation b/data/2024/aaai/SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation new file mode 100644 index 0000000000..b5e26dfaee --- /dev/null +++ b/data/2024/aaai/SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation @@ -0,0 +1 @@ +High-resolution representation is essential for achieving good performance in human pose estimation models. To obtain such features, existing works utilize high-resolution input images or fine-grained image tokens. However, this dense high-resolution representation brings a significant computational burden. In this paper, we address the following question: "Only sparse human keypoint locations are detected for human pose estimation, is it really necessary to describe the whole image in a dense, high-resolution manner?" Based on dynamic transformer models, we propose a framework that only uses Sparse High-resolution Representations for human Pose estimation (SHaRPose). In detail, SHaRPose consists of two stages. At the coarse stage, the relations between image regions and keypoints are dynamically mined while a coarse estimation is generated. Then, a quality predictor is applied to decide whether the coarse estimation results should be refined. At the fine stage, SHaRPose builds sparse high-resolution representations only on the regions related to the keypoints and provides refined high-precision human pose estimations. Extensive experiments demonstrate the outstanding performance of the proposed method. Specifically, compared to the state-of-the-art method ViTPose, our model SHaRPose-Base achieves 77.4 AP (+0.5 AP) on the COCO validation set and 76.7 AP (+0.5 AP) on the COCO test-dev set, and infers at a speed of 1.4x faster than ViTPose-Base. Code is available at https://github.com/AnxQ/sharpose. \ No newline at end of file diff --git a/data/2024/aaai/SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking b/data/2024/aaai/SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking new file mode 100644 index 0000000000..4b7c23045d --- /dev/null +++ b/data/2024/aaai/SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking @@ -0,0 +1 @@ +Despite recent progress in Multiple Object Tracking (MOT), several obstacles such as occlusions, similar objects, and complex scenes remain an open challenge. Meanwhile, a systematic study of the cost-performance tradeoff for the popular tracking-by-detection paradigm is still lacking. 
This paper introduces SMILEtrack, an innovative object tracker that effectively addresses these challenges by integrating an efficient object detector with a Siamese network-based Similarity Learning Module (SLM). The technical contributions of SMILEtrack are twofold. First, we propose an SLM that calculates the appearance similarity between two objects, overcoming the limitations of feature descriptors in Separate Detection and Embedding (SDE) models. The SLM incorporates a Patch Self-Attention (PSA) block inspired by the vision Transformer, which generates reliable features for accurate similarity matching. Second, we develop a Similarity Matching Cascade (SMC) module with a novel GATE function for robust object matching across consecutive video frames, further enhancing MOT performance. Together, these innovations help SMILEtrack achieve an improved trade-off between cost (e.g., running speed) and performance (e.g., tracking accuracy) over several existing state-of-the-art trackers, including the popular BYTETrack method. SMILEtrack outperforms BYTETrack by 0.4-0.8 MOTA and 2.1-2.2 HOTA points on the MOT17 and MOT20 datasets. Code is available at http://github.com/pingyang1117/SMILEtrack_official. \ No newline at end of file diff --git a/data/2024/aaai/SNN-PDE: Learning Dynamic PDEs from Data with Simplicial Neural Networks b/data/2024/aaai/SNN-PDE: Learning Dynamic PDEs from Data with Simplicial Neural Networks new file mode 100644 index 0000000000..85650b07bd --- /dev/null +++ b/data/2024/aaai/SNN-PDE: Learning Dynamic PDEs from Data with Simplicial Neural Networks @@ -0,0 +1 @@ +Dynamics of many complex systems, from weather and climate to the spread of infectious diseases, can be described by partial differential equations (PDEs). Such PDEs involve unknown function(s), partial derivatives, and typically multiple independent variables. Traditional numerical methods for solving PDEs assume that the data are observed on a regular grid. However, in many applications, for example, weather and air pollution monitoring delivered by arbitrarily located weather stations of the National Weather Service, data records are irregularly spaced. Furthermore, in problems involving predictive analytics such as forecasting wildfire smoke plumes, the primary focus may be on a set of irregular locations associated with urban development. In recent years, deep learning (DL) methods and, in particular, graph neural networks (GNNs) have emerged as a promising new tool that can complement traditional PDE solvers in scenarios with irregularly spaced data, contributing to the growing research trend of physics-informed machine learning (PIML). However, most existing PIML methods tend to be limited in their ability to describe the higher-dimensional structural properties exhibited by real-world phenomena, especially ones that live on manifolds. To address this fundamental challenge, we bring elements of Hodge theory and, in particular, simplicial convolution defined on the Hodge Laplacian to the emerging nexus of DL and PDEs. In contrast to the conventional Laplacian and the associated convolution operation, simplicial convolution allows us to rigorously describe diffusion across higher-order structures and to better approximate the complex underlying topology and geometry of the data. The new approach, Simplicial Neural Networks for Partial Differential Equations (SNN-PDE), offers a computationally efficient yet effective solution for time-dependent PDEs.
Our studies of a broad range of synthetic data and wildfire processes demonstrate that SNN-PDE improves upon state-of-the-art baselines in handling unstructured grids and irregular time intervals of complex physical systems, and offers competitive forecasting capabilities for weather and air quality. \ No newline at end of file diff --git a/data/2024/aaai/SOCIALGYM 2.0: Simulator for Multi-Robot Learning and Navigation in Shared Human Spaces b/data/2024/aaai/SOCIALGYM 2.0: Simulator for Multi-Robot Learning and Navigation in Shared Human Spaces new file mode 100644 index 0000000000..1a896f4419 --- /dev/null +++ b/data/2024/aaai/SOCIALGYM 2.0: Simulator for Multi-Robot Learning and Navigation in Shared Human Spaces @@ -0,0 +1 @@ +We present Social Gym 2.0, a simulator for multi-agent navigation research. Our simulator enables navigation for multiple autonomous agents, replicating real-world dynamics in complex indoor environments, including doorways, hallways, intersections, and roundabouts. Unlike current simulators that concentrate on single robots in open spaces, Social Gym 2.0 employs multi-agent reinforcement learning (MARL) to develop optimal navigation policies for multiple robots with diverse, dynamic constraints in complex environments. Social Gym 2.0 also departs from accepted software design standards by employing a configuration-over-convention paradigm, providing the capability to benchmark different MARL algorithms as well as to customize observation and reward functions. Users can additionally create their own environments and evaluate various algorithms, based on both deep reinforcement learning and classical navigation, using a broad range of social navigation metrics. \ No newline at end of file diff --git a/data/2024/aaai/SOGDet: Semantic-Occupancy Guided Multi-View 3D Object Detection b/data/2024/aaai/SOGDet: Semantic-Occupancy Guided Multi-View 3D Object Detection new file mode 100644 index 0000000000..abf0504f5c --- /dev/null +++ b/data/2024/aaai/SOGDet: Semantic-Occupancy Guided Multi-View 3D Object Detection @@ -0,0 +1,10 @@ +In the field of autonomous driving, accurate and comprehensive perception of the 3D environment is crucial. +Bird's Eye View (BEV) based methods have emerged as a promising solution for 3D object detection using multi-view images as input. +However, existing 3D object detection methods often ignore the physical context in the environment, such as sidewalks and vegetation, resulting in sub-optimal performance. +In this paper, we propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection), which leverages a 3D semantic-occupancy branch to improve the accuracy of 3D object detection. +In particular, the physical context modeled by semantic occupancy helps the detector perceive scenes in a more holistic view. +Our SOGDet is flexible to use and can be seamlessly integrated with most existing BEV-based methods. +To evaluate its effectiveness, we apply this approach to several state-of-the-art baselines and conduct extensive experiments on the nuScenes dataset. +Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP). +This indicates that the combination of 3D object detection and 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby helping to build more robust autonomous driving systems.
+The code is available at: https://github.com/zhouqiu/SOGDet. \ No newline at end of file diff --git a/data/2024/aaai/SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space b/data/2024/aaai/SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space new file mode 100644 index 0000000000..d48d10cfa2 --- /dev/null +++ b/data/2024/aaai/SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space @@ -0,0 +1 @@ +Symmetric positive definite (SPD) matrices have shown important value and applications in statistics and machine learning, such as fMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on E(X|y), where y is a vector and X is an SPD matrix. However, these methods struggle to handle large-scale data. In this paper, inspired by the denoising diffusion probabilistic model (DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing a Gaussian distribution in the SPD space to estimate E(X|y). Moreover, our model can estimate p(X) unconditionally and flexibly without giving y. On the one hand, the model conditionally learns p(X|y) and utilizes the mean of samples to obtain E(X|y) as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data p(X) and generates samples that conform to this distribution. Furthermore, we propose a new SPD network that is much deeper than previous networks and allows for the inclusion of conditional factors. Experimental results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and conditionally. \ No newline at end of file diff --git a/data/2024/aaai/SPGroup3D: Superpoint Grouping Network for Indoor 3D Object Detection b/data/2024/aaai/SPGroup3D: Superpoint Grouping Network for Indoor 3D Object Detection new file mode 100644 index 0000000000..7987e39f13 --- /dev/null +++ b/data/2024/aaai/SPGroup3D: Superpoint Grouping Network for Indoor 3D Object Detection @@ -0,0 +1 @@ +Current 3D object detection methods for indoor scenes mainly follow the voting-and-grouping strategy to generate proposals. However, most methods utilize instance-agnostic groupings, such as ball query, leading to inconsistent semantic information and inaccurate regression of the proposals. To this end, we propose a novel superpoint grouping network for indoor anchor-free one-stage 3D object detection. Specifically, we first adopt an unsupervised approach to partition raw point clouds into superpoints, i.e., areas with semantic consistency and spatial similarity. Then, we design a geometry-aware voting module that adapts to the centerness in anchor-free detection by constraining the spatial relationship between superpoints and object centers. Next, we present a superpoint-based grouping module to explore the consistent representation within proposals. This module includes a superpoint attention layer to learn feature interaction between neighboring superpoints, and a superpoint-voxel fusion layer to propagate the superpoint-level information to the voxel level. Finally, we employ effective multiple matching to capitalize on the dynamic receptive fields of proposals based on superpoints during training. Experimental results demonstrate that our method achieves state-of-the-art performance on the ScanNet V2, SUN RGB-D, and S3DIS datasets for indoor one-stage 3D object detection.
Source code is available at https://github.com/zyrant/SPGroup3D. \ No newline at end of file diff --git a/data/2024/aaai/SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation b/data/2024/aaai/SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation new file mode 100644 index 0000000000..46e9578b07 --- /dev/null +++ b/data/2024/aaai/SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation @@ -0,0 +1 @@ +Recently, self-supervised monocular depth estimation has gained popularity with numerous applications in autonomous driving and robotics. However, existing solutions primarily seek to estimate depth from immediate visual features, and struggle to recover fine-grained scene details. In this paper, we introduce SQLdepth, a novel approach that can effectively learn fine-grained scene structure priors from ego-motion. In SQLdepth, we propose a novel Self Query Layer (SQL) to build a self-cost volume and infer depth from it, rather than inferring depth from feature maps. We show that, the self-cost volume is an effective inductive bias for geometry learning, which implicitly models the single-frame scene geometry, with each slice of it indicating a relative distance map between points and objects in a latent space. Experimental results on KITTI and Cityscapes show that our method attains remarkable state-of-the-art performance, and showcases computational efficiency, reduced training complexity, and the ability to recover fine-grained scene details. Moreover, the self-matching-oriented relative distance querying in SQL improves the robustness and zero-shot generalization capability of SQLdepth. Code is available at https://github.com/hisfog/SfMNeXt-Impl. \ No newline at end of file diff --git a/data/2024/aaai/SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression b/data/2024/aaai/SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression new file mode 100644 index 0000000000..379b36d9b2 --- /dev/null +++ b/data/2024/aaai/SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression @@ -0,0 +1 @@ +Existing techniques for text detection can be broadly classified into two primary groups: segmentation-based and regression-based methods. Segmentation models offer enhanced robustness to font variations but require intricate post-processing, leading to high computational overhead. Regression-based methods undertake instance-aware prediction but face limitations in robustness and data efficiency due to their reliance on high-level representations. In our academic pursuit, we propose SRFormer, a unified DETR-based model with amalgamated Segmentation and Regression, aiming at the synergistic harnessing of the inherent robustness in segmentation representations, along with the straightforward post-processing of instance-level regression. Our empirical analysis indicates that favorable segmentation predictions can be obtained at the initial decoder layers. In light of this, we constrain the incorporation of segmentation branches to the first few decoder layers and employ progressive regression refinement in subsequent layers, achieving performance gains while minimizing computational load from the mask. Furthermore, we propose a Mask-informed Query Enhancement module. We take the segmentation result as a natural soft-ROI to pool and extract robust pixel representations, which are then employed to enhance and diversify instance queries. 
Extensive experimentation across multiple benchmarks has yielded compelling findings, highlighting our method's exceptional robustness, superior training and data efficiency, as well as its state-of-the-art performance. Our code is available at https://github.com/retsuh-bqw/SRFormer-Text-Det. \ No newline at end of file diff --git a/data/2024/aaai/SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation b/data/2024/aaai/SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation new file mode 100644 index 0000000000..9b7e51c2d9 --- /dev/null +++ b/data/2024/aaai/SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation @@ -0,0 +1 @@ +Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, leading to insufficient spatial and semantic controllability of individual instances. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. Owing to rich spatial and semantic information encapsulated in well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former aims to model the relationships among multiple objects within scenes while the latter is designed to heighten the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability. \ No newline at end of file diff --git a/data/2024/aaai/STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering b/data/2024/aaai/STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering new file mode 100644 index 0000000000..39de3cd990 --- /dev/null +++ b/data/2024/aaai/STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering @@ -0,0 +1,4 @@ +Recently we have witnessed the rapid development of video question answering models. However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos. +To tackle this problem we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks, and a set of lightweight neural modules to complete each of these sub-tasks. +Though neural module networks are already widely studied on image-text tasks, applying them to videos is a non-trivial task, as reasoning on videos requires different abilities. 
In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. +Different from most prior works, modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes it easier to interpret and collaborate with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR \ No newline at end of file diff --git a/data/2024/aaai/STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models b/data/2024/aaai/STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models new file mode 100644 index 0000000000..8b72933ed7 --- /dev/null +++ b/data/2024/aaai/STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models @@ -0,0 +1 @@ +Information extraction tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies. They heavily rely on task-specific training data in the form of (passage, target structure) pairs to obtain reasonable performance. However, obtaining such data through human annotation is costly, leading to a pressing need for low-resource information extraction approaches that require minimal human labeling for real-world applications. Fine-tuning supervised models with synthesized training data would be a generalizable method, but the existing data generation methods either still rely on large-scale ground-truth data or cannot be applied to complicated IE tasks due to their poor performance. To address these challenges, we propose STAR, a data generation method that leverages Large Language Models (LLMs) to synthesize data instances given limited seed demonstrations, thereby boosting low-resource information extraction performance. Our approach involves generating target structures (Y) followed by generating passages (X), all accomplished with the aid of LLMs. We design fine-grained step-by-step instructions to obtain the initial data instances. We further reduce errors and improve data quality through self-reflection error identification and self-refinement with iterative revision. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data. Human assessment of the data quality shows STAR-generated data exhibit higher passage quality and better align with the task definitions compared with the human-curated data. 
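As an illustration of the structure-first generation order described in the STAR entry above, here is a minimal, hypothetical Python sketch: the call_llm helper, the prompt wording, and the single revision pass are assumptions made only for exposition, not the paper's actual pipeline or prompts.

import random

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    raise NotImplementedError

def generate_instance(seed_examples, event_types):
    """Sketch of structure-first (Y) then passage (X) generation, plus one revision pass."""
    # Step 1: synthesize a target structure Y, conditioned on seed demonstrations.
    y_prompt = (
        "Given these annotated examples:\n"
        + "\n".join(seed_examples)
        + f"\nPropose a new event record of type {random.choice(event_types)} as JSON."
    )
    structure = call_llm(y_prompt)
    # Step 2: synthesize a passage X that expresses exactly the structure Y.
    x_prompt = f"Write a short news passage that expresses this event record:\n{structure}"
    passage = call_llm(x_prompt)
    # Step 3: self-reflection, asking the model to flag mismatches and revise the passage.
    check_prompt = (
        f"Record:\n{structure}\nPassage:\n{passage}\n"
        "List any arguments missing or contradicted by the passage, then rewrite it."
    )
    passage = call_llm(check_prompt)
    return passage, structure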
\ No newline at end of file diff --git a/data/2024/aaai/STAS: Spatial-Temporal Return Decomposition for Solving Sparse Rewards Problems in Multi-agent Reinforcement Learning b/data/2024/aaai/STAS: Spatial-Temporal Return Decomposition for Solving Sparse Rewards Problems in Multi-agent Reinforcement Learning new file mode 100644 index 0000000000..cd42bbf85e --- /dev/null +++ b/data/2024/aaai/STAS: Spatial-Temporal Return Decomposition for Solving Sparse Rewards Problems in Multi-agent Reinforcement Learning @@ -0,0 +1 @@ +Centralized Training with Decentralized Execution (CTDE) has been proven to be an effective paradigm in cooperative multi-agent reinforcement learning (MARL). One of the major challenges is credit assignment, which aims to credit agents by their contributions. Existing methods lack the ability to model the complicated relations of the delayed global reward in the temporal dimension and suffer from inefficiencies. To tackle this, we introduce Spatial-Temporal Attention with Shapley (STAS), a novel method that learns credit assignment in both temporal and spatial dimensions. It first decomposes the global return back to each time step, then utilizes the Shapley Value to redistribute the individual payoff from the decomposed global reward. To mitigate the computational complexity of the Shapley Value, we introduce an approximation of the marginal contribution and utilize Monte Carlo sampling to estimate it. We evaluate our method on an Alice & Bob example and MPE environments across different scenarios. Our results demonstrate that our method effectively assigns spatial-temporal credit, outperforming all state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/STDiff: Spatio-Temporal Diffusion for Continuous Stochastic Video Prediction b/data/2024/aaai/STDiff: Spatio-Temporal Diffusion for Continuous Stochastic Video Prediction new file mode 100644 index 0000000000..35d4171c7b --- /dev/null +++ b/data/2024/aaai/STDiff: Spatio-Temporal Diffusion for Continuous Stochastic Video Prediction @@ -0,0 +1 @@ +Predicting future frames of a video is challenging because it is difficult to learn the uncertainty of the underlying factors influencing their contents. In this paper, we propose a novel video prediction model, which has infinite-dimensional latent variables over the spatio-temporal domain. Specifically, we first decompose the video motion and content information, then use a neural stochastic differential equation to predict the temporal motion information, and finally, an image diffusion model autoregressively generates the video frame by conditioning on the predicted motion feature and the previous frame. The better expressiveness and stronger stochasticity learning capability of our model lead to state-of-the-art video prediction performance. Moreover, our model is able to achieve temporally continuous prediction, i.e., predicting in an unsupervised way the future video frames with an arbitrarily high frame rate. Our code is available at https://github.com/XiYe20/STDiffProject.
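The STAS entry above relies on Monte Carlo sampling to approximate Shapley values. A generic permutation-sampling estimator of that kind is sketched below in Python; the coalition_value function is a hypothetical stand-in for whatever model supplies coalition returns, and this is not the paper's own approximation of the marginal contribution.

import random

def shapley_monte_carlo(agents, coalition_value, num_permutations=200):
    """Estimate each agent's Shapley value by averaging its marginal contribution
    over randomly sampled orderings of the agents."""
    estimates = {a: 0.0 for a in agents}
    for _ in range(num_permutations):
        order = random.sample(agents, len(agents))
        coalition = []
        prev_value = coalition_value(coalition)  # value of the empty coalition
        for agent in order:
            coalition.append(agent)
            value = coalition_value(coalition)
            estimates[agent] += value - prev_value  # marginal contribution of this agent
            prev_value = value
    return {a: v / num_permutations for a, v in estimates.items()}

# Toy usage: three agents whose joint value is simply the sum of fixed contributions.
true_contrib = {"a": 1.0, "b": 2.0, "c": 0.5}
print(shapley_monte_carlo(list(true_contrib),
                          lambda c: sum(true_contrib[a] for a in c)))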
\ No newline at end of file diff --git a/data/2024/aaai/STEM: Unleashing the Power of Embeddings for Multi-Task Recommendation b/data/2024/aaai/STEM: Unleashing the Power of Embeddings for Multi-Task Recommendation new file mode 100644 index 0000000000..1cf887d1aa --- /dev/null +++ b/data/2024/aaai/STEM: Unleashing the Power of Embeddings for Multi-Task Recommendation @@ -0,0 +1 @@ +Multi-task learning (MTL) has gained significant popularity in recommender systems as it enables simultaneous optimization of multiple objectives. A key challenge in MTL is negative transfer, but existing studies explored negative transfer on all samples, overlooking the inherent complexities within them. We split the samples according to the relative amount of positive feedback among tasks. Surprisingly, negative transfer still occurs in existing MTL methods on samples that receive comparable feedback across tasks. Existing work commonly employs a shared-embedding paradigm, limiting the ability of modeling diverse user preferences on different tasks. In this paper, we introduce a novel Shared and Task-specific EMbeddings (STEM) paradigm that aims to incorporate both shared and task-specific embeddings to effectively capture task-specific user preferences. Under this paradigm, we propose a simple model STEM-Net, which is equipped with an All Forward Task-specific Backward gating network to facilitate the learning of task-specific embeddings and direct knowledge transfer across tasks. Remarkably, STEM-Net demonstrates exceptional performance on comparable samples, achieving positive transfer. Comprehensive evaluation on three public MTL recommendation datasets demonstrates that STEM-Net outperforms state-of-the-art models by a substantial margin. Our code is released at https://github.com/LiangcaiSu/STEM. \ No newline at end of file diff --git a/data/2024/aaai/STViT: Improving Self-Supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization (Student Abstract) b/data/2024/aaai/STViT: Improving Self-Supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization (Student Abstract) new file mode 100644 index 0000000000..78707bc590 --- /dev/null +++ b/data/2024/aaai/STViT: Improving Self-Supervised Multi-Camera Depth Estimation with Spatial-Temporal Context and Adversarial Geometry Regularization (Student Abstract) @@ -0,0 +1 @@ +Multi-camera depth estimation has recently garnered significant attention due to its substantial practical implications in the realm of autonomous driving. In this paper, we delve into the task of self-supervised multi-camera depth estimation and propose an innovative framework, STViT, featuring several noteworthy enhancements: 1) we propose a Spatial-Temporal Transformer to comprehensively exploit both local connectivity and the global context of image features, meanwhile learning enriched spatial-temporal cross-view correlations to recover 3D geometry. 2) to alleviate the severe effect of adverse conditions, e.g., rainy weather and nighttime driving, we introduce a GAN-based Adversarial Geometry Regularization Module (AGR) to further constrain the depth estimation with unpaired normal-condition depth maps and prevent the model from being incorrectly trained. Experiments on challenging autonomous driving datasets Nuscenes and DDAD show that our method achieves state-of-the-art performance. 
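As a rough illustration of the shared-plus-task-specific embedding idea in the STEM entry above, the PyTorch sketch below gives every task tower access to all embeddings in the forward pass while letting gradients flow back only to the shared table and the tower's own task-specific table. This is one plausible reading offered for exposition, with invented class and argument names; it is not the STEM-Net architecture itself.

import torch
import torch.nn as nn

class SharedTaskEmbedding(nn.Module):
    """Illustrative sketch: one shared embedding table plus one table per task.
    Each task tower sees all embeddings, but its gradients only reach the shared
    table and its own task-specific table (the others are detached)."""
    def __init__(self, num_items, dim, num_tasks):
        super().__init__()
        self.shared = nn.Embedding(num_items, dim)
        self.task_specific = nn.ModuleList(
            nn.Embedding(num_items, dim) for _ in range(num_tasks))
        self.towers = nn.ModuleList(
            nn.Linear(dim * (num_tasks + 1), 1) for _ in range(num_tasks))

    def forward(self, item_ids):
        shared_e = self.shared(item_ids)
        task_es = [emb(item_ids) for emb in self.task_specific]
        logits = []
        for t, tower in enumerate(self.towers):
            feats = [shared_e] + [e if i == t else e.detach()
                                  for i, e in enumerate(task_es)]
            logits.append(tower(torch.cat(feats, dim=-1)))
        return torch.cat(logits, dim=-1)  # one logit per task

model = SharedTaskEmbedding(num_items=1000, dim=16, num_tasks=2)
print(model(torch.tensor([3, 7])).shape)  # torch.Size([2, 2])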
\ No newline at end of file diff --git a/data/2024/aaai/SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning b/data/2024/aaai/SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning new file mode 100644 index 0000000000..d3fd85305f --- /dev/null +++ b/data/2024/aaai/SUF: Stabilized Unconstrained Fine-Tuning for Offline-to-Online Reinforcement Learning @@ -0,0 +1 @@ +Offline-to-online reinforcement learning (RL) provides a promising solution to improving suboptimal offline pre-trained policies through online fine-tuning. However, one efficient method, unconstrained fine-tuning, often suffers from severe policy collapse due to excessive distribution shift. To ensure stability, existing methods retain offline constraints and employ additional techniques during fine-tuning, which hurts efficiency. In this work, we introduce a novel perspective: eliminating the policy collapse without imposing constraints. We observe that such policy collapse arises from the mismatch between unconstrained fine-tuning and the conventional RL training framework. To this end, we propose Stabilized Unconstrained Fine-tuning (SUF), a streamlined framework that benefits from the efficiency of unconstrained fine-tuning while ensuring stability by modifying the Update-To-Data ratio. With just a few lines of code adjustments, SUF demonstrates remarkable adaptability to diverse backbones and superior performance over state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/SURER: Structure-Adaptive Unified Graph Neural Network for Multi-View Clustering b/data/2024/aaai/SURER: Structure-Adaptive Unified Graph Neural Network for Multi-View Clustering new file mode 100644 index 0000000000..c55feb4435 --- /dev/null +++ b/data/2024/aaai/SURER: Structure-Adaptive Unified Graph Neural Network for Multi-View Clustering @@ -0,0 +1 @@ +Deep Multi-view Graph Clustering (DMGC) aims to partition instances into different groups using the graph information extracted from multi-view data. The mainstream framework of DMGC methods applies graph neural networks to embed structure information into the view-specific representations and fuse them for the consensus representation. However, on one hand, we find that the graph learned in advance is not ideal for clustering as it is constructed by original multi-view data and localized connecting. On the other hand, most existing methods learn the consensus representation in a late fusion manner, which fails to propagate the structure relations across multiple views. Inspired by the observations, we propose a Structure-adaptive Unified gRaph nEural network for multi-view clusteRing (SURER), which can jointly learn a heterogeneous multi-view unified graph and robust graph neural networks for multi-view clustering. Specifically, we first design a graph structure learning module to refine the original view-specific attribute graphs, which removes false edges and discovers the potential connection. According to the view-specific refined attribute graphs, we integrate them into a unified heterogeneous graph by linking the representations of the same sample from different views. Furthermore, we use the unified heterogeneous graph as the input of the graph neural network to learn the consensus representation for each instance, effectively integrating complementary information from various views. 
Extensive experiments on diverse datasets demonstrate the superior effectiveness of our method compared to other state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/Safe Abductive Learning in the Presence of Inaccurate Rules b/data/2024/aaai/Safe Abductive Learning in the Presence of Inaccurate Rules new file mode 100644 index 0000000000..e3de133930 --- /dev/null +++ b/data/2024/aaai/Safe Abductive Learning in the Presence of Inaccurate Rules @@ -0,0 +1,2 @@ +Integrating the complementary strengths of raw data and logical rules to improve learning generalization has recently been shown to be promising and effective; e.g., abductive learning is one generic framework that can learn the perception model from data and reason with logical rules simultaneously. However, performance decreases severely when inaccurate logical rules appear, and may even fall below that of baselines using only raw data. +Efforts on this issue are highly desired yet remain limited. This paper proposes a simple and effective safe abductive learning method to alleviate the harm caused by inaccurate rules. Unlike existing methods, which directly use all rules without correctness checks, it utilizes them selectively by constructing a graphical model with an adaptive reasoning process to prevent performance hazards. Theoretically, we show that induction and abduction are mutually beneficial, and can be rigorously justified from a classical maximum likelihood estimation perspective. Experiments on diverse tasks show that our method can tolerate at least twice as many inaccurate rules as accurate ones and achieve highly competitive performance while other methods cannot. Moreover, the proposal can refine inaccurate rules and works well in extended weakly supervised scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration b/data/2024/aaai/Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration new file mode 100644 index 0000000000..60c12f6288 --- /dev/null +++ b/data/2024/aaai/Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration @@ -0,0 +1 @@ +This paper studies safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints, where unsafe actions must be avoided at each step. Existing studies have considered safe RL with hard instantaneous constraints, but their approaches rely on several key assumptions: (i) the RL agent knows a safe action set for every state or knows a safe graph in which all the state-action-state triples are safe, and (ii) the constraint/cost functions are linear. In this paper, we consider safe RL with instantaneous hard constraints without assumption (i) and generalize (ii) to a Reproducing Kernel Hilbert Space (RKHS). Our proposed algorithm, LSVI-AE, achieves O(√{d³H⁴K}) regret and O(H √{dK}) hard constraint violation when the cost function is linear, and O(Hγₖ √{K}) hard constraint violation when the cost function belongs to an RKHS. Here K is the learning horizon, H is the length of each episode, and γₖ is the information gain w.r.t. the kernel used to approximate cost functions. Our results achieve the optimal dependency on the learning horizon K, matching the lower bound we provide in this paper and demonstrating the efficiency of LSVI-AE.
Notably, the design of our approach encourages aggressive policy exploration, providing a unique perspective on safe RL with general cost functions and no prior knowledge of safe actions, which may be of independent interest. \ No newline at end of file diff --git a/data/2024/aaai/SafeAR: Safe Algorithmic Recourse by Risk-Aware Policies b/data/2024/aaai/SafeAR: Safe Algorithmic Recourse by Risk-Aware Policies new file mode 100644 index 0000000000..234f719292 --- /dev/null +++ b/data/2024/aaai/SafeAR: Safe Algorithmic Recourse by Risk-Aware Policies @@ -0,0 +1 @@ +With the growing use of machine learning (ML) models in critical domains such as finance and healthcare, the need to offer recourse for those adversely affected by the decisions of ML models has become more important; individuals ought to be provided with recommendations on actions to take for improving their situation and thus receiving a favorable decision. Prior work on sequential algorithmic recourse---which recommends a series of changes---focuses on action feasibility and uses the proximity of feature changes to determine action costs. However, the uncertainties of feature changes and the risk of higher than average costs in recourse have not been considered. It is undesirable if a recourse could (with some probability) result in a worse situation from which recovery requires an extremely high cost. It is essential to incorporate risks when computing and evaluating recourse. We call the recourse computed with such risk considerations as Safe Algorithmic Recourse (SafeAR). The objective is to empower people to choose a recourse based on their risk tolerance. In this work, we discuss and show how existing recourse desiderata can fail to capture the risk of higher costs. We present a method to compute recourse policies that consider variability in cost and connect algorithmic recourse literature with risk-sensitive reinforcement learning. We also adopt measures "Value at Risk" and "Conditional Value at Risk" from the financial literature to summarize risk concisely. We apply our method to two real-world datasets and compare policies with different risk-aversion levels using risk measures and recourse desiderata (sparsity and proximity). \ No newline at end of file diff --git a/data/2024/aaai/Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis b/data/2024/aaai/Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis new file mode 100644 index 0000000000..62f022109c --- /dev/null +++ b/data/2024/aaai/Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis @@ -0,0 +1 @@ +This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that the safety constraint violations are bounded at any point during learning. As enforcing safety during training might severely limit the agent’s exploration, we propose here a new architecture that handles the trade-off between efficient progress and safety during exploration. As the exploration progresses, we update via Bayesian inference Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the environment dynamics. We then propose a way to approximate moments of belief about the risk associated to the action selection policy. 
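The SafeAR entry above summarizes recourse risk with Value at Risk and Conditional Value at Risk. The standard empirical versions of these two measures can be computed from a sample of recourse costs as in the short Python sketch below; these are generic textbook definitions, not the paper's implementation, and the cost sample is simulated.

import numpy as np

def value_at_risk(costs, alpha=0.9):
    """Empirical VaR: the alpha-quantile of the cost sample."""
    return float(np.quantile(costs, alpha))

def conditional_value_at_risk(costs, alpha=0.9):
    """Empirical CVaR: mean cost over the worst (1 - alpha) tail."""
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)
    return float(costs[costs >= var].mean())

# Toy usage: simulated costs of following one recourse policy.
rng = np.random.default_rng(0)
costs = rng.lognormal(mean=1.0, sigma=0.5, size=10_000)
print(value_at_risk(costs), conditional_value_at_risk(costs))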
We demonstrate that this approach can be easily interleaved with RL, and we present experimental results to showcase the performance of the overall architecture. \ No newline at end of file diff --git a/data/2024/aaai/Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge b/data/2024/aaai/Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge new file mode 100644 index 0000000000..1190feefd5 --- /dev/null +++ b/data/2024/aaai/Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge @@ -0,0 +1 @@ +The sample complexity of online reinforcement learning is often studied in the literature without taking into account any partial knowledge about the system dynamics that could potentially accelerate the learning process. In this paper, we study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently. We focus on systems that evolve according to an additive disturbance model of the form S_{h+1} = ƒ(S_h, A_h) + W_h, where ƒ represents the underlying system dynamics, and W_h are unknown disturbances independent of states and actions. In the setting of finite episodic Markov decision processes with S states, A actions, and episode length H, we present an optimistic Q-learning algorithm that achieves Õ(Poly(H)√T) regret under perfect knowledge of ƒ, where T is the total number of interactions with the system. This is in contrast to the typical Õ(Poly(H)√SAT) regret for existing Q-learning methods. Further, if only a noisy estimate ƒ_hat of ƒ is available, our method can learn an approximately optimal policy in a number of samples that is independent of the cardinalities of the state and action spaces. The sub-optimality gap depends on the approximation error ƒ_hat − ƒ, as well as the Lipschitz constant of the corresponding optimal value function. Our approach does not require modeling of the transition probabilities and enjoys the same memory complexity as model-free methods. \ No newline at end of file diff --git a/data/2024/aaai/Sample-Constrained Black Box Optimization for Audio Personalization b/data/2024/aaai/Sample-Constrained Black Box Optimization for Audio Personalization new file mode 100644 index 0000000000..433fa15085 --- /dev/null +++ b/data/2024/aaai/Sample-Constrained Black Box Optimization for Audio Personalization @@ -0,0 +1,5 @@ +We consider the problem of personalizing audio to maximize user experience. Briefly, we aim to find a filter h*, which, applied to any music or speech, will maximize the user’s satisfaction. This is a black-box optimization problem since the user’s satisfaction function is unknown. Substantive work has been done on this topic, where the key idea is to play audio samples to the user, each shaped by a different filter hᵢ, and query the user for their satisfaction scores f(hᵢ). A family of “surrogate” functions is then designed to fit these scores, and the optimization method gradually refines these functions to arrive at the filter ĥ* that maximizes satisfaction. + +In certain applications, we observe that a second type of querying is possible where users can tell us the individual elements h*[j] of the optimal filter h*. Consider an analogy from cooking where the goal is to cook a recipe that maximizes user satisfaction. A user can be asked to score various cooked recipes (e.g., tofu fried rice) or to score individual ingredients (say, salt, sugar, rice, chicken, etc.).
Given a budget of B queries, where a query can be of either type, our goal is to find the recipe that will maximize this user’s satisfaction. + +Our proposal builds on Sparse Gaussian Process Regression (GPR) and shows how a hybrid approach can outperform any one type of querying. Our results are validated through simulations and real world experiments, where volunteers gave feedback on music/speech audio and were able to achieve high satisfaction levels. We believe this idea of hybrid querying opens new problems in black-box optimization and solutions can benefit other applications beyond audio personalization. \ No newline at end of file diff --git a/data/2024/aaai/Sample-Level Cross-View Similarity Learning for Incomplete Multi-View Clustering b/data/2024/aaai/Sample-Level Cross-View Similarity Learning for Incomplete Multi-View Clustering new file mode 100644 index 0000000000..8d15e1769d --- /dev/null +++ b/data/2024/aaai/Sample-Level Cross-View Similarity Learning for Incomplete Multi-View Clustering @@ -0,0 +1 @@ +Incomplete multi-view clustering has attracted much attention due to its ability to handle partial multi-view data. Recently, similarity-based methods have been developed to explore the complete relationship among incomplete multi-view data. Although widely applied to partial scenarios, most of the existing approaches are still faced with two limitations. Firstly, fusing similarities constructed individually on each view fails to yield a complete unified similarity. Moreover, incomplete similarity generation may lead to anomalous similarity values with column sum constraints, affecting the final clustering results. To solve the above challenging issues, we propose a Sample-level Cross-view Similarity Learning (SCSL) method for Incomplete Multi-view Clustering. Specifically, we project all samples to the same dimension and simultaneously construct a complete similarity matrix across views based on the inter-view sample relationship and the intra-view sample relationship. In addition, a simultaneously learning consensus representation ensures the validity of the projection, which further enhances the quality of the similarity matrix through the graph Laplacian regularization. Experimental results on six benchmark datasets demonstrate the ability of SCSL in processing incomplete multi-view clustering tasks. Our code is publicly available at https://github.com/Tracesource/SCSL. \ No newline at end of file diff --git a/data/2024/aaai/Sample-and-Bound for Non-convex Optimization b/data/2024/aaai/Sample-and-Bound for Non-convex Optimization new file mode 100644 index 0000000000..7d2eaa3e19 --- /dev/null +++ b/data/2024/aaai/Sample-and-Bound for Non-convex Optimization @@ -0,0 +1 @@ +Standard approaches for global optimization of non-convex functions, such as branch-and-bound, maintain partition trees to systematically prune the domain. The tree size grows exponentially in the number of dimensions. We propose new sampling-based methods for non-convex optimization that adapts Monte Carlo Tree Search (MCTS) to improve efficiency. Instead of the standard use of visitation count in Upper Confidence Bounds, we utilize numerical overapproximations of the objective as an uncertainty metric, and also take into account of sampled estimates of first-order and second-order information. The Monte Carlo tree in our approach avoids the usual fixed combinatorial patterns in growing the tree, and aggressively zooms into the promising regions, while still balancing exploration and exploitation. 
We evaluate the proposed algorithms on high-dimensional non-convex optimization benchmarks against competitive baselines and analyze the effects of the hyper parameters. \ No newline at end of file diff --git a/data/2024/aaai/Sampling for Beyond-Worst-Case Online Ranking b/data/2024/aaai/Sampling for Beyond-Worst-Case Online Ranking new file mode 100644 index 0000000000..521fc06ac1 --- /dev/null +++ b/data/2024/aaai/Sampling for Beyond-Worst-Case Online Ranking @@ -0,0 +1,3 @@ +The feedback arc set problem is one of the most fundamental and well-studied ranking problems where n objects are to be ordered based on their pairwise comparison. The problem enjoys several efficient approximation algorithms in the offline setting. Unfortunately, online there are strong lower bounds on the competitive ratio establishing that no algorithm can perform well in the worst case. +This paper introduces a new beyond-worst-case model for online feedback arc set. In the model, a sample of the input is given to the algorithm offline before the remaining instance is revealed online. This models the case in practice where yesterday's data is available and is similar to today's online instance. This sample is drawn from a known distribution which may not be uniform. We design an online algorithm with strong theoretical guarantees. The algorithm has a small constant competitive ratio when the sample is uniform---if not, we show we can recover the same result by adding a provably minimal sample. +Empirical results validate the theory and show that such algorithms can be used on temporal data to obtain strong results. \ No newline at end of file diff --git a/data/2024/aaai/Sampling-Resilient Multi-Object Tracking b/data/2024/aaai/Sampling-Resilient Multi-Object Tracking new file mode 100644 index 0000000000..a65acc349c --- /dev/null +++ b/data/2024/aaai/Sampling-Resilient Multi-Object Tracking @@ -0,0 +1 @@ +Multi-Object Tracking (MOT) is a cornerstone operator for video surveillance applications. To enable real-time processing of large-scale live video streams, we study an interesting scenario called down-sampled MOT, which performs object tracking only on a small subset of video frames. The problem is challenging for state-of-the-art MOT methods, which exhibit significant performance degradation under high frame reduction ratios. In this paper, we devise a sampling-resilient tracker with a novel sparse-observation Kalman filter (SOKF). It integrates an LSTM network to capture non-linear and dynamic motion patterns caused by sparse observations. Since the LSTM-based state transition is not compatible with the original noise estimation mechanism, we propose new estimation strategies based on Bayesian neural networks and derive the optimal Kalman gain for SOKF. To associate the detected bounding boxes robustly, we also propose a comprehensive similarity metric that systematically integrates multiple spatial matching signals. Experiments on three benchmark datasets show that our proposed tracker achieves the best trade-off between efficiency and accuracy. With the same tracking accuracy, we reduce the total processing time of ByteTrack by 2× in MOT17 and 3× in DanceTrack. 
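For orientation on the Sampling-Resilient MOT entry above: a textbook constant-velocity Kalman filter predict/update step is sketched below in Python, with a frame gap dt > 1 standing in for down-sampling. SOKF replaces this linear transition with an LSTM and learns its noise terms, which the sketch does not attempt to reproduce; all parameter values here are illustrative.

import numpy as np

def kf_predict(x, P, F, Q):
    """Standard Kalman prediction: propagate the state mean and covariance."""
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    """Standard Kalman update with observation z."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Constant-velocity model for one coordinate, with a gap of dt frames between
# observations (down-sampling makes dt > 1; SOKF instead learns this transition).
dt = 5.0
F = np.array([[1.0, dt], [0.0, 1.0]])        # position-velocity transition
H = np.array([[1.0, 0.0]])                   # only the position is observed
Q = 0.01 * np.eye(2)
R = np.array([[1.0]])

x, P = np.array([0.0, 1.0]), np.eye(2)       # start at position 0, unit velocity
x, P = kf_predict(x, P, F, Q)
x, P = kf_update(x, P, z=np.array([4.6]), H=H, R=R)
print(x)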
\ No newline at end of file diff --git a/data/2024/aaai/SasWOT: Real-Time Semantic Segmentation Architecture Search WithOut Training b/data/2024/aaai/SasWOT: Real-Time Semantic Segmentation Architecture Search WithOut Training new file mode 100644 index 0000000000..72329977d2 --- /dev/null +++ b/data/2024/aaai/SasWOT: Real-Time Semantic Segmentation Architecture Search WithOut Training @@ -0,0 +1 @@ +In this paper, we present SasWOT, the first training-free Semantic segmentation Architecture Search (SAS) framework via an auto-discovery proxy. Semantic segmentation is widely used in many real-time applications. For fast inference and memory efficiency, previous SAS methods seek the optimal segmenter via differentiable or RL-based search. However, the significant computational costs of these training-based SAS methods limit their practical usage. To improve search efficiency, we explore the training-free route but empirically observe that the existing zero-cost proxies designed for the classification task are sub-optimal on segmentation benchmarks. To address this challenge, we develop a customized proxy search framework for SAS tasks to augment the proxy's predictive capability. Specifically, we design the proxy search space based on the following observations: (1) different inputs of segmenter statistics can be well combined; (2) some basic operators can effectively improve the correlation. Thus, we build computational graphs with multiple statistics as inputs and advanced basis arithmetic as the primary operations to represent candidate proxies. Then, we employ an evolutionary algorithm to cross over and mutate the superior candidates in the population based on correlation evaluation. Finally, based on the searched proxy, we perform the segmenter search without candidate training. In this way, SasWOT not only enables automated proxy optimization for SAS tasks but also achieves significant search acceleration before the retraining stage. Extensive experiments on the Cityscapes and CamVid datasets demonstrate that SasWOT achieves a superior trade-off between accuracy and speed over several state-of-the-art techniques. More remarkably, on the Cityscapes dataset, SasWOT achieves 71.3% mIoU at 162 FPS. \ No newline at end of file diff --git a/data/2024/aaai/Say Anything with Any Style b/data/2024/aaai/Say Anything with Any Style new file mode 100644 index 0000000000..a9adc5cfad --- /dev/null +++ b/data/2024/aaai/Say Anything with Any Style @@ -0,0 +1 @@ +Generating stylized talking heads with diverse head motions is crucial for achieving natural-looking videos but remains challenging. Previous works either adopt a regressive method to capture the speaking style, resulting in a coarse style that is averaged across all training data, or employ a universal network to synthesize videos with different styles, which causes suboptimal performance. To address these issues, we propose a novel dynamic-weight method, namely Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, along with the generative model, enhances the precision and robustness when extracting the speaking styles of the given style clips.
By utilizing the extracted style, a residual architecture comprising a canonical branch and style-specific branch is employed to predict the mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we steer clear of employing a universal network by exploring an elaborate HyperStyle to produce the style-specific weights offset for the style branch. Furthermore, we construct a pose generator and a pose codebook to store the quantized pose representation, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-the-art methods in terms of both lip-synchronization and stylized expression. Besides, we extend our SAAS to video-driven style editing field and achieve satisfactory performance as well. \ No newline at end of file diff --git a/data/2024/aaai/SayCanPay: Heuristic Planning with Large Language Models Using Learnable Domain Knowledge b/data/2024/aaai/SayCanPay: Heuristic Planning with Large Language Models Using Learnable Domain Knowledge new file mode 100644 index 0000000000..437e613e56 --- /dev/null +++ b/data/2024/aaai/SayCanPay: Heuristic Planning with Large Language Models Using Learnable Domain Knowledge @@ -0,0 +1 @@ +Large Language Models (LLMs) have demonstrated impressive planning abilities due to their vast "world knowledge". Yet, obtaining plans that are both feasible (grounded in affordances) and cost-effective (in plan length), remains a challenge, despite recent progress. This contrasts with heuristic planning methods that employ domain knowledge (formalized in action models such as PDDL) and heuristic search to generate feasible, optimal plans. Inspired by this, we propose to combine the power of LLMs and heuristic planning by leveraging the world knowledge of LLMs and the principles of heuristic search. Our approach, SayCanPay, employs LLMs to generate actions (Say) guided by learnable domain knowledge, that evaluates actions' feasibility (Can) and long-term reward/payoff (Pay), and heuristic search to select the best sequence of actions. Our contributions are (1) a novel framing of the LLM planning problem in the context of heuristic planning, (2) integrating grounding and cost-effective elements into the generated plans, and (3) using heuristic search over actions. Our extensive evaluations show that our model surpasses other LLM planning approaches. \ No newline at end of file diff --git a/data/2024/aaai/Scalable Enumeration of Trap Spaces in Boolean Networks via Answer Set Programming b/data/2024/aaai/Scalable Enumeration of Trap Spaces in Boolean Networks via Answer Set Programming new file mode 100644 index 0000000000..c4b470618d --- /dev/null +++ b/data/2024/aaai/Scalable Enumeration of Trap Spaces in Boolean Networks via Answer Set Programming @@ -0,0 +1 @@ +Boolean Networks (BNs) are widely used as a modeling formalism in several domains, notably systems biology and computer science. A fundamental problem in BN analysis is the enumeration of trap spaces, which are hypercubes in the state space that cannot be escaped once entered. Several methods have been proposed for enumerating trap spaces, however they often suffer from scalability and efficiency issues, particularly for large and complex models. 
To our knowledge, the most efficient and recent methods for the trap space enumeration all rely on Answer Set Programming (ASP), which has been widely applied to the analysis of BNs. Motivated by these considerations, our work proposes a new method for enumerating trap spaces in BNs using ASP. We evaluate the method on a mix of 250+ real-world and 400+ randomly generated BNs, showing that it enables analysis of models beyond the capabilities of existing tools (namely pyboolnet, mpbn, trappist, and trapmvn). \ No newline at end of file diff --git a/data/2024/aaai/Scalable Geometric Fracture Assembly via Co-creation Space among Assemblers b/data/2024/aaai/Scalable Geometric Fracture Assembly via Co-creation Space among Assemblers new file mode 100644 index 0000000000..d2e6c22a0d --- /dev/null +++ b/data/2024/aaai/Scalable Geometric Fracture Assembly via Co-creation Space among Assemblers @@ -0,0 +1 @@ +Geometric fracture assembly presents a challenging practical task in archaeology and 3D computer vision. Previous methods have focused solely on assembling fragments based on semantic information, which has limited the quantity of objects that can be effectively assembled. Therefore, there is a need to develop a scalable framework for geometric fracture assembly without relying on semantic information. To improve the effectiveness of assembling geometric fractures without semantic information, we propose a co-creation space comprising several assemblers capable of gradually and unambiguously assembling fractures. Additionally, we introduce a novel loss function, i.e., the geometric-based collision loss, to address collision issues during the fracture assembly process and enhance the results. Our framework exhibits better performance on both PartNet and Breaking Bad datasets compared to existing state-of-the-art frameworks. Extensive experiments and quantitative comparisons demonstrate the effectiveness of our proposed framework, which features linear computational complexity, enhanced abstraction, and improved generalization. Our code is publicly available at https://github.com/Ruiyuan-Zhang/CCS. \ No newline at end of file diff --git a/data/2024/aaai/Scalable Motion Style Transfer with Constrained Diffusion Generation b/data/2024/aaai/Scalable Motion Style Transfer with Constrained Diffusion Generation new file mode 100644 index 0000000000..8b0cd7b633 --- /dev/null +++ b/data/2024/aaai/Scalable Motion Style Transfer with Constrained Diffusion Generation @@ -0,0 +1 @@ +Current training of motion style transfer systems relies on consistency losses across style domains to preserve contents, hindering its scalable application to a large number of domains and private data. Recent image transfer works show the potential of independent training on each domain by leveraging implicit bridging between diffusion models, with the content preservation, however, limited to simple data patterns. We address this by imposing biased sampling in backward diffusion while maintaining the domain independence in the training stage. We construct the bias from the source domain keyframes and apply them as the gradient of content constraints, yielding a framework with keyframe manifold constraint gradients (KMCGs). Our validation demonstrates the success of training separate models to transfer between as many as ten dance motion styles. Comprehensive experiments find a significant improvement in preserving motion contents in comparison to baseline and ablative diffusion-based style transfer models. 
In addition, we perform a human study for a subjective assessment of the quality of generated dance motions. The results validate the competitiveness of KMCGs. \ No newline at end of file diff --git a/data/2024/aaai/Scale Optimization Using Evolutionary Reinforcement Learning for Object Detection on Drone Imagery b/data/2024/aaai/Scale Optimization Using Evolutionary Reinforcement Learning for Object Detection on Drone Imagery new file mode 100644 index 0000000000..25847afbd3 --- /dev/null +++ b/data/2024/aaai/Scale Optimization Using Evolutionary Reinforcement Learning for Object Detection on Drone Imagery @@ -0,0 +1 @@ +Object detection in aerial imagery presents a significant challenge due to large scale variations among objects. This paper proposes an evolutionary reinforcement learning agent, integrated within a coarse-to-fine object detection framework, to optimize the scale for more effective detection of objects in such images. Specifically, a set of patches potentially containing objects are first generated. A set of rewards measuring the localization accuracy, the accuracy of predicted labels, and the scale consistency among nearby patches are designed in the agent to guide the scale optimization. The proposed scale-consistency reward ensures similar scales for neighboring objects of the same category. Furthermore, a spatial-semantic attention mechanism is designed to exploit the spatial semantic relations between patches. The agent employs the proximal policy optimization strategy in conjunction with the evolutionary strategy, effectively utilizing both the current patch status and historical experience embedded in the agent. The proposed model is compared with state-of-the-art methods on two benchmark datasets for object detection on drone imagery. It significantly outperforms all the compared methods. Code is available at https://github.com/UNNC-CV/EvOD/. \ No newline at end of file diff --git a/data/2024/aaai/Scaling Few-Shot Learning for the Open World b/data/2024/aaai/Scaling Few-Shot Learning for the Open World new file mode 100644 index 0000000000..2f83ecf9cc --- /dev/null +++ b/data/2024/aaai/Scaling Few-Shot Learning for the Open World @@ -0,0 +1 @@ +Few-shot learning (FSL) aims to enable learning models with the ability to automatically adapt to novel (unseen) domains in open-world scenarios. Nonetheless, there exists a significant disparity between the vast number of new concepts encountered in the open world and the restricted available scale of existing FSL works, which primarily focus on a limited number of novel classes. Such a gap hinders the practical applicability of FSL in realistic scenarios. To bridge this gap, we propose a new problem named Few-Shot Learning with Many Novel Classes (FSL-MNC) by substantially enlarging the number of novel classes, exceeding the count in the traditional FSL setup by over 500-fold. This new problem exhibits two major challenges, including the increased computation overhead during meta-training and the degraded classification performance by the large number of classes during meta-testing. To overcome these challenges, we propose a Simple Hierarchy Pipeline (SHA-Pipeline). Due to the inefficiency of traditional protocols of EML, we re-design a lightweight training strategy to reduce the overhead brought by much more novel classes. To capture discriminative semantics across numerous novel classes, we effectively reconstruct and leverage the class hierarchy information during meta-testing. 
Experiments show that the proposed SHA-Pipeline significantly outperforms not only the ProtoNet baseline but also the state-of-the-art alternatives across different numbers of novel classes. \ No newline at end of file diff --git a/data/2024/aaai/Scaling Offline Evaluation of Reinforcement Learning Agents through Abstraction b/data/2024/aaai/Scaling Offline Evaluation of Reinforcement Learning Agents through Abstraction new file mode 100644 index 0000000000..1d95989d7d --- /dev/null +++ b/data/2024/aaai/Scaling Offline Evaluation of Reinforcement Learning Agents through Abstraction @@ -0,0 +1 @@ +A critical challenge for the widescale adoption of reinforcement learning (RL) is the need to give domain experts assurance that learned policies will improve decision-making -- and not lead to unacceptable behavior. To meet this challenge, my work aims to develop new methods for offline policy evaluation in real world RL domains. There has been much recent interest in offline evaluation and many advances. However, recent benchmarking efforts have also shown that there remains a substantial gap between current state-of-the-art methods and real world domains such as robotics. Towards scalable offline evaluation, my group is investigating the use of methods for abstraction and representation learning. In this new faculty highlight, I will present our recent results that show the promise of this direction for scaling offline evaluation in RL domains. I will then describe future directions in this line of that work which will further realize the promise of offline policy evaluation for increasing confidence in deployed RL. \ No newline at end of file diff --git a/data/2024/aaai/Scaling Up Pareto Optimization for Tree Structures with Affine Transformations: Evaluating Hybrid Floating Solar-Hydropower Systems in the Amazon b/data/2024/aaai/Scaling Up Pareto Optimization for Tree Structures with Affine Transformations: Evaluating Hybrid Floating Solar-Hydropower Systems in the Amazon new file mode 100644 index 0000000000..c3fed3516c --- /dev/null +++ b/data/2024/aaai/Scaling Up Pareto Optimization for Tree Structures with Affine Transformations: Evaluating Hybrid Floating Solar-Hydropower Systems in the Amazon @@ -0,0 +1 @@ +Sustainability challenges inherently involve the consideration of multiple competing objectives. The Pareto frontier – the set of all optimal solutions that cannot be improved with respect to one objective without negatively affecting another – is a crucial decision-making tool for navigating sustainability challenges as it highlights the inherent trade-offs among conflicting objectives. Our research is motivated by the strategic planning of hydropower in the Amazon basin, one of the earth’s largest and most biodiverse river systems, where the need to increase energy production coincides with the pressing requirement of minimizing detrimental environmental impacts. We investigate an innovative strategy that pairs hydropower with Floating Photovoltaic Solar Panels (FPV). We provide a new extended multi-tree network formulation, which enables the consideration of multiple dam configurations. To address the computational challenge of scaling up the Pareto optimization framework to tackle multiple objectives across the entire Amazon basin, we further enhance the state-of-the-art algorithm for Pareto frontiers in tree-structured networks with two improvements. 
We introduce affine transformations induced by the sub-frontiers to compute Pareto dominance and provide strategies for merging sub-trees, significantly increasing the pruning of dominated solutions. Our experiments demonstrate considerable speedups, in some cases by more than an order of magnitude, while maintaining optimality guarantees, thus allowing us to more effectively approximate the Pareto frontiers. Moreover, our findings suggest significant shifts towards higher energy values in the Pareto frontier when pairing hybrid hydropower with FPV solutions, potentially amplifying energy production while mitigating adverse impacts. \ No newline at end of file diff --git a/data/2024/aaai/Scaling Up Semi-supervised Learning with Unconstrained Unlabelled Data b/data/2024/aaai/Scaling Up Semi-supervised Learning with Unconstrained Unlabelled Data new file mode 100644 index 0000000000..06327204df --- /dev/null +++ b/data/2024/aaai/Scaling Up Semi-supervised Learning with Unconstrained Unlabelled Data @@ -0,0 +1 @@ +We propose UnMixMatch, a semi-supervised learning framework which can learn effective representations from unconstrained unlabelled data in order to scale up performance. Most existing semi-supervised methods rely on the assumption that labelled and unlabelled samples are drawn from the same distribution, which limits the potential for improvement through the use of free-living unlabeled data. Consequently, the generalizability and scalability of semi-supervised learning are often hindered by this assumption. Our method aims to overcome these constraints and effectively utilize unconstrained unlabelled data in semi-supervised learning. UnMixMatch consists of three main components: a supervised learner with hard augmentations that provides strong regularization, a contrastive consistency regularizer to learn underlying representations from the unlabelled data, and a self-supervised loss to enhance the representations that are learnt from the unlabelled data. We perform extensive experiments on 4 commonly used datasets and demonstrate superior performance over existing semi-supervised methods with a performance boost of 4.79%. Extensive ablation and sensitivity studies show the effectiveness and impact of each of the proposed components of our method. The code for our work is publicly available. \ No newline at end of file diff --git a/data/2024/aaai/Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment b/data/2024/aaai/Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment new file mode 100644 index 0000000000..4656a03ecf --- /dev/null +++ b/data/2024/aaai/Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment @@ -0,0 +1 @@ +Quality assessment of images and videos emphasizes both local details and global semantics, whereas general data sampling methods (e.g., resizing, cropping or grid-based fragment) fail to catch them simultaneously. To address the deficiency, current approaches have to adopt multi-branch models and take as input the multi-resolution data, which burdens the model complexity. In this work, instead of stacking up models, a more elegant data sampling method (named as SAMA, scaling and masking) is explored, which compacts both the local and global content in a regular input size. The basic idea is to scale the data into a pyramid first, and reduce the pyramid into a regular data dimension with a masking strategy. 
Benefiting from the spatial and temporal redundancy in images and videos, the processed data maintains the multi-scale characteristics with a regular input size, thus can be processed by a single-branch model. We verify the sampling method in image and video quality assessment. Experiments show that our sampling method can improve the performance of current single-branch models significantly, and achieves competitive performance to the multi-branch models without extra model complexity. The source code will be available at https://github.com/Sissuire/SAMA. \ No newline at end of file diff --git a/data/2024/aaai/ScanERU: Interactive 3D Visual Grounding Based on Embodied Reference Understanding b/data/2024/aaai/ScanERU: Interactive 3D Visual Grounding Based on Embodied Reference Understanding new file mode 100644 index 0000000000..df8225c185 --- /dev/null +++ b/data/2024/aaai/ScanERU: Interactive 3D Visual Grounding Based on Embodied Reference Understanding @@ -0,0 +1 @@ +Aiming to link natural language descriptions to specific regions in a 3D scene represented as 3D point clouds, 3D visual grounding is a very fundamental task for human-robot interaction. The recognition errors can significantly impact the overall accuracy and then degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from the difficulty of low recognition accuracy in cases of multiple adjacent objects with similar appearance. To address this issue, this work intuitively introduces the human-robot interaction as a cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. Then a new dataset called ScanERU is constructed to evaluate the effectiveness of this idea. Different from existing datasets, our ScanERU dataset is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to enlighten the research of ERU. Experimental results demonstrate the superiority of the proposed method, especially in the recognition of multiple identical objects. Our codes and dataset are available in the ScanERU repository. \ No newline at end of file diff --git a/data/2024/aaai/Scene Flow Prior Based Point Cloud Completion with Masked Transformer (Student Abstract) b/data/2024/aaai/Scene Flow Prior Based Point Cloud Completion with Masked Transformer (Student Abstract) new file mode 100644 index 0000000000..2aef9d6d65 --- /dev/null +++ b/data/2024/aaai/Scene Flow Prior Based Point Cloud Completion with Masked Transformer (Student Abstract) @@ -0,0 +1 @@ +It is necessary to explore an effective point cloud completion mechanism that is of great significance for real-world tasks such as autonomous driving, robotics applications, and multi-target tracking. In this paper, we propose a point cloud completion method using a self-supervised transformer model based on the contextual constraints of scene flow. Our method uses the multi-frame point cloud context relationship as a guide to generate a series of token proposals, this priori condition ensures the stability of the point cloud completion. The experimental results show that the method proposed in this paper achieves high accuracy and good stability. 
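A minimal sketch of the scaling-and-masking sampling idea described in the SAMA abstract above: the frame is first scaled into a pyramid, and a grid mask then keeps one pyramid level per spatial cell, so local detail and global context coexist in a single regular-sized input. The grid-cell masking, patch size, and average-pooling downscaling used here are illustrative assumptions, not the paper's exact strategy.

import numpy as np

def build_pyramid(img, num_levels=3):
    """Repeatedly downscale by 2x (simple average pooling) to form a pyramid."""
    levels = [img]
    for _ in range(num_levels - 1):
        h, w, c = levels[-1].shape
        h2, w2 = h // 2 * 2, w // 2 * 2
        x = levels[-1][:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2, c).mean(axis=(1, 3))
        levels.append(x)
    return levels

def scale_and_mask(img, out_size=224, patch=32, num_levels=3, seed=0):
    """Compose a fixed-size input by filling each grid cell with a patch taken
    from one randomly chosen pyramid level, so fine detail and coarse context
    end up in the same regular-sized tensor."""
    rng = np.random.default_rng(seed)
    pyramid = build_pyramid(img, num_levels)
    out = np.zeros((out_size, out_size, img.shape[2]), dtype=img.dtype)
    cells = out_size // patch
    for i in range(cells):
        for j in range(cells):
            lvl = pyramid[rng.integers(num_levels)]
            h, w, _ = lvl.shape
            # pick the patch location inside the chosen level (clipped to its bounds)
            y = min(int(i / cells * h), max(h - patch, 0))
            x = min(int(j / cells * w), max(w - patch, 0))
            tile = lvl[y:y + patch, x:x + patch]
            out[i * patch:i * patch + tile.shape[0], j * patch:j * patch + tile.shape[1]] = tile
    return out

frame = np.random.rand(720, 1280, 3).astype(np.float32)  # dummy video frame
sample = scale_and_mask(frame)
print(sample.shape)  # (224, 224, 3)

Because the result keeps a regular input size, it can be consumed by an ordinary single-branch quality-assessment backbone, which is the point the abstract emphasizes.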
\ No newline at end of file diff --git a/data/2024/aaai/SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research b/data/2024/aaai/SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research new file mode 100644 index 0000000000..14fd09d224 --- /dev/null +++ b/data/2024/aaai/SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research @@ -0,0 +1 @@ +Recently, there has been growing interest in using Large Language Models (LLMs) for scientific research. Numerous benchmarks have been proposed to evaluate the ability of LLMs for scientific research. However, current benchmarks are mostly based on pre-collected objective questions. This design suffers from data leakage problem and lacks the evaluation of subjective Q/A ability. In this paper, we propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues. Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability. In particular, we design a "dynamic" subset based on scientific principles to prevent evaluation from potential data leakage. Both objective and subjective questions are included in SciEval. These characteristics make SciEval a more effective benchmark for scientific research ability evaluation of LLMs. Comprehensive experiments on most advanced LLMs show that, although GPT-4 achieves SOTA performance compared to other LLMs, there is still substantial room for improvement, especially for dynamic questions. The codes and data are publicly available on https://github.com/OpenDFM/SciEval. \ No newline at end of file diff --git a/data/2024/aaai/SciSpace Copilot: Empowering Researchers through Intelligent Reading Assistance b/data/2024/aaai/SciSpace Copilot: Empowering Researchers through Intelligent Reading Assistance new file mode 100644 index 0000000000..db56772119 --- /dev/null +++ b/data/2024/aaai/SciSpace Copilot: Empowering Researchers through Intelligent Reading Assistance @@ -0,0 +1 @@ +We introduce SciSpace Copilot, an AI research assistant that helps in understanding and reading research papers faster by providing a plethora of features. Answering questions from a document has recently become popular using the Retrieval Augmented Generation (RAG) approach. Our tool uses an advanced question-answering pipeline to get accurate answers and also provide exact citations for the same. We provide many more valuable features on scientific text, including generating explanations, generating summaries, adding notes and highlights, and finding related papers from our 200 million corpus. Our tool supports 100+ languages, making research more accessible across language barriers. Thousands of users use SciSpace Copilot on a daily basis by uploading their articles to understand research faster and better. Our tool can be accessed at this link: https://typeset.io. \ No newline at end of file diff --git a/data/2024/aaai/Scores for Learning Discrete Causal Graphs with Unobserved Confounders b/data/2024/aaai/Scores for Learning Discrete Causal Graphs with Unobserved Confounders new file mode 100644 index 0000000000..ce055d6435 --- /dev/null +++ b/data/2024/aaai/Scores for Learning Discrete Causal Graphs with Unobserved Confounders @@ -0,0 +1 @@ +Structural learning is arguably one of the most challenging and pervasive tasks found throughout the data sciences. 
There exists a growing literature that studies structural learning in non-parametric settings where conditional independence constraints are taken to define the equivalence class. In the presence of unobserved confounders, it is understood that non-conditional independence constraints are imposed over the observational distribution, including certain equalities and inequalities between functionals of the joint distribution. In this paper, we develop structural learning methods that leverage additional constraints beyond conditional independences. Specifically, we first introduce a score for arbitrary graphs combining Watanabe's asymptotic expansion of the marginal likelihood and new bounds over the cardinality of the exogenous variables. Second, we show that the new score has desirable properties in terms of expressiveness and computability. In terms of expressiveness, we prove that the score captures distinct constraints imprinted in the data, including Verma's and inequalities'. In terms of computability, we show properties of score equivalence and decomposability, which allows, in principle, to break the problem of structural learning in smaller and more manageable pieces. Third, we implement this score using an MCMC sampling algorithm and test its properties in several simulation scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label b/data/2024/aaai/Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label new file mode 100644 index 0000000000..8040fa563e --- /dev/null +++ b/data/2024/aaai/Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label @@ -0,0 +1 @@ +Scribble-based weakly-supervised semantic segmentation using sparse scribble supervision is gaining traction as it reduces annotation costs when compared to fully annotated alternatives. Existing methods primarily generate pseudo-labels by diffusing labeled pixels to unlabeled ones with local cues for supervision. However, this diffusion process fails to exploit global semantics and class-specific cues, which are important for semantic segmentation. In this study, we propose a class-driven scribble promotion network, which utilizes both scribble annotations and pseudo-labels informed by image-level classes and global semantics for supervision. Directly adopting pseudo-labels might misguide the segmentation model, thus we design a localization rectification module to correct foreground representations in the feature space. To further combine the advantages of both supervisions, we also introduce a distance entropy loss for uncertainty reduction, which adapts per-pixel confidence weights according to the reliable region determined by the scribble and pseudo-label's boundary. Experiments on the ScribbleSup dataset with different qualities of scribble annotations outperform all the previous methods, demonstrating the superiority and robustness of our method. The code is available at https://github.com/Zxl19990529/Class-driven-Scribble-Promotion-Network. 
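A small illustration of the confidence-weighting idea behind the distance entropy loss in the scribble-promotion abstract above: pseudo-label supervision is trusted more near annotated scribble pixels and less far away from them. The exponential decay, the sigma value, and the helper name are assumptions made for this sketch, not the paper's exact formulation.

import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def distance_weighted_pseudo_loss(logits, pseudo_labels, scribble_mask, sigma=20.0):
    """Per-pixel cross-entropy against pseudo-labels, weighted by a confidence
    that decays with the Euclidean distance to the nearest scribble pixel."""
    # distance (in pixels) from every location to the closest annotated scribble pixel
    dist = distance_transform_edt(~scribble_mask.numpy())
    weight = torch.exp(-torch.from_numpy(dist).float() / sigma)   # 1 on scribbles, ->0 far away
    per_pixel = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (weight * per_pixel).sum() / weight.sum()

# toy example: one image, 5 classes, 64x64 predictions
logits = torch.randn(1, 5, 64, 64)
pseudo = torch.randint(0, 5, (1, 64, 64))
scribble = torch.zeros(1, 64, 64, dtype=torch.bool)
scribble[0, 30:34, 10:50] = True            # a single horizontal scribble stroke
print(distance_weighted_pseudo_loss(logits, pseudo, scribble).item())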
\ No newline at end of file diff --git a/data/2024/aaai/SeTformer Is What You Need for Vision and Language b/data/2024/aaai/SeTformer Is What You Need for Vision and Language new file mode 100644 index 0000000000..8ef845bf3c --- /dev/null +++ b/data/2024/aaai/SeTformer Is What You Need for Vision and Language @@ -0,0 +1 @@ +The dot product self-attention (DPSA) is a fundamental component of transformers. However, scaling it to long sequences, like documents or high-resolution images, becomes prohibitively expensive due to the quadratic time and memory complexities arising from the softmax operation. Kernel methods are employed to simplify computations by approximating softmax but often lead to performance drops compared to softmax attention. We propose SeTformer, a novel transformer where DPSA is purely replaced by Self-optimal Transport (SeT) for achieving better performance and computational efficiency. SeT is based on two essential softmax properties: maintaining a non-negative attention matrix and using a nonlinear reweighting mechanism to emphasize important tokens in input sequences. By introducing a kernel cost function for optimal transport, SeTformer effectively satisfies these properties. In particular, with small and base-sized models, SeTformer achieves impressive top-1 accuracies of 84.7% and 86.2% on ImageNet-1K. In object detection, SeTformer-base outperforms the FocalNet counterpart by +2.2 mAP, using 38% fewer parameters and 29% fewer FLOPs. In semantic segmentation, our base-size model surpasses NAT by +3.5 mIoU with 33% fewer parameters. SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark. These findings highlight SeTformer's applicability for vision and language tasks. \ No newline at end of file diff --git a/data/2024/aaai/Secure Distributed Sparse Gaussian Process Models Using Multi-Key Homomorphic Encryption b/data/2024/aaai/Secure Distributed Sparse Gaussian Process Models Using Multi-Key Homomorphic Encryption new file mode 100644 index 0000000000..e015d4242e --- /dev/null +++ b/data/2024/aaai/Secure Distributed Sparse Gaussian Process Models Using Multi-Key Homomorphic Encryption @@ -0,0 +1 @@ +Distributed sparse Gaussian process (dGP) models provide an ability to achieve accurate predictive performance using data from multiple devices in a time-efficient and scalable manner. The distributed computation of the model, however, risks exposure of privately owned data to public manipulation. In this paper, we propose a secure solution for dGP regression models using multi-key homomorphic encryption. Experimental results show that with a small sacrifice in terms of time complexity, we achieve a secure dGP model without deteriorating the predictive performance compared to traditional non-secure dGP models. We also present a practical implementation of the proposed model using several Nvidia Jetson Nano Developer Kit modules to simulate a real-world scenario. Thus, the secure dGP model addresses the data security issues of dGP and provides a secure and trustworthy solution for multiple devices to use privately owned data for model computation in a distributed environment, while retaining the speed, scalability, and robustness of dGP.
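The SeT operator itself is not spelled out in the SeTformer abstract above; the sketch below only illustrates the general flavour of replacing softmax attention with an optimal-transport-style normalisation (a non-negative Gibbs kernel rebalanced by a few Sinkhorn iterations), which keeps the two softmax properties the abstract mentions. It should not be read as the authors' exact formulation.

import torch

def sinkhorn_attention(q, k, v, n_iters=5, eps=0.1):
    """Attention where the softmax is replaced by Sinkhorn normalisation of a
    non-negative kernel matrix, jointly rebalancing rows and columns."""
    sim = torch.einsum("bnd,bmd->bnm", q, k) / q.shape[-1] ** 0.5
    sim = sim - sim.amax(dim=-1, keepdim=True)            # numerical stability
    kernel = torch.exp(sim / eps)                          # non-negative Gibbs kernel
    for _ in range(n_iters):                               # Sinkhorn iterations
        kernel = kernel / kernel.sum(dim=-1, keepdim=True)  # rows sum to 1
        kernel = kernel / kernel.sum(dim=-2, keepdim=True)  # columns sum to 1
    attn = kernel / kernel.sum(dim=-1, keepdim=True)       # final row normalisation
    return torch.einsum("bnm,bmd->bnd", attn, v)

q = torch.randn(2, 16, 32)   # (batch, tokens, dim)
k = torch.randn(2, 16, 32)
v = torch.randn(2, 16, 32)
print(sinkhorn_attention(q, k, v).shape)   # torch.Size([2, 16, 32])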
\ No newline at end of file diff --git a/data/2024/aaai/Securing Billion Bluetooth Devices Leveraging Learning-Based Techniques b/data/2024/aaai/Securing Billion Bluetooth Devices Leveraging Learning-Based Techniques new file mode 100644 index 0000000000..98f0a79d5c --- /dev/null +++ b/data/2024/aaai/Securing Billion Bluetooth Devices Leveraging Learning-Based Techniques @@ -0,0 +1 @@ +As the most popular low-power communication protocol, cybersecurity research on Bluetooth Low Energy (BLE) has garnered significant attention. Due to BLE’s inherent security limitations and firmware vulnerabilities, spoofing attacks can easily compromise BLE devices and tamper with privacy data. In this paper, we proposed BLEGuard, a hybrid detection mechanism combined cyber-physical features with learning-based techniques. We established a physical network testbed to conduct attack simulations and capture advertising packets. Four different network features were utilized to implement detection and classification algorithms. Preliminary results have verified the feasibility of our proposed methods. \ No newline at end of file diff --git a/data/2024/aaai/Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains b/data/2024/aaai/Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains new file mode 100644 index 0000000000..e19c93dc1a --- /dev/null +++ b/data/2024/aaai/Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains @@ -0,0 +1 @@ +Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, without mentioning the domain gaps between training and inference data if the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines. Code and data are available at: https://github.com/yuzhimanhua/SEType. \ No newline at end of file diff --git a/data/2024/aaai/Seeing Dark Videos via Self-Learned Bottleneck Neural Representation b/data/2024/aaai/Seeing Dark Videos via Self-Learned Bottleneck Neural Representation new file mode 100644 index 0000000000..167cd1bd65 --- /dev/null +++ b/data/2024/aaai/Seeing Dark Videos via Self-Learned Bottleneck Neural Representation @@ -0,0 +1 @@ +Enhancing low-light videos in a supervised style presents a set of challenges, including limited data diversity, misalignment, and the domain gap introduced through the dataset construction pipeline. 
Our paper tackles these challenges by constructing a self-learned enhancement approach that removes the reliance on any external training data. The challenge of self-supervised learning lies in fitting high-quality signal representations solely from input signals. Our work designs a bottleneck neural representation mechanism that extracts those signals. In more detail, we encode the frame-wise representation with a compact deep embedding and utilize a neural network to parameterize the video-level manifold consistently. Then, an entropy constraint is applied to the enhanced results based on the adjacent spatial-temporal context to filter out the degraded visual signals, e.g., noise and frame inconsistency. Last, a novel Chromatic Retinex decomposition is proposed to effectively align the reflectance distribution temporally. It benefits the entropy control on different components of each frame and facilitates noise-to-noise training, successfully suppressing the temporal flicker. Extensive experiments demonstrate the robustness and superior effectiveness of our proposed method. Our project is publicly available at: https://huangerbai.github.io/SLBNR/. \ No newline at end of file diff --git a/data/2024/aaai/Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation b/data/2024/aaai/Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation new file mode 100644 index 0000000000..1992eae3b4 --- /dev/null +++ b/data/2024/aaai/Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation @@ -0,0 +1 @@ +Augmented Reality (AR) devices, emerging as prominent mobile interaction platforms, face challenges in user safety, particularly concerning oncoming vehicles. While some solutions leverage onboard camera arrays, these cameras often have limited field-of-view (FoV) with front or downward perspectives. Addressing this, we propose a new out-of-view semantic segmentation task and Segment Beyond View (SBV), a novel audio-visual semantic segmentation method. SBV supplements the visual modality, which misses the information beyond the FoV, with the auditory information using a teacher-student distillation model (Omni2Ego). The model consists of a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes views with limited FoV and binaural audio as input and produces semantic segmentation for objects outside the FoV. SBV outperforms existing models in comparative evaluations and shows consistent performance across varying FoV ranges and in monaural audio settings. \ No newline at end of file diff --git a/data/2024/aaai/Select and Augment: Enhanced Dense Retrieval Knowledge Graph Augmentation (Abstract Reprint) b/data/2024/aaai/Select and Augment: Enhanced Dense Retrieval Knowledge Graph Augmentation (Abstract Reprint) new file mode 100644 index 0000000000..f5fb440b5b --- /dev/null +++ b/data/2024/aaai/Select and Augment: Enhanced Dense Retrieval Knowledge Graph Augmentation (Abstract Reprint) @@ -0,0 +1 @@ +Injecting textual information into knowledge graph (KG) entity representations has been a worthwhile expedition in terms of improving performance in KG-oriented tasks within the NLP community.
External knowledge often adopted to enhance KG embeddings ranges from semantically rich lexical dependency parsed features to a set of relevant key words to entire text descriptions supplied from an external corpus such as wikipedia and many more. Despite the gains this innovation (Text-enhanced KG embeddings) has made, the proposal in this work suggests that it can be improved even further. Instead of using a single text description (which would not sufficiently represent an entity because of the inherent lexical ambiguity of text), we propose a multi-task framework that jointly selects a set of text descriptions relevant to KG entities as well as align or augment KG embeddings with text descriptions. Different from prior work that plugs formal entity descriptions declared in knowledge bases, this framework leverages a retriever model to selectively identify richer or highly relevant text descriptions to use in augmenting entities. Furthermore, the framework treats the number of descriptions to use in augmentation process as a parameter, which allows the flexibility of enumerating across several numbers before identifying an appropriate number. Experiment results for Link Prediction demonstrate a 5.5% and 3.5% percentage increase in the Mean Reciprocal Rank (MRR) and Hits@10 scores respectively, in comparison to text-enhanced knowledge graph augmentation methods using traditional CNNs. \ No newline at end of file diff --git a/data/2024/aaai/Selective Deep Autoencoder for Unsupervised Feature Selection b/data/2024/aaai/Selective Deep Autoencoder for Unsupervised Feature Selection new file mode 100644 index 0000000000..483895853f --- /dev/null +++ b/data/2024/aaai/Selective Deep Autoencoder for Unsupervised Feature Selection @@ -0,0 +1 @@ +In light of the advances in big data, high-dimensional datasets are often encountered. Incorporating them into data-driven models can enhance performance; however, this comes at the cost of high computation and the risk of overfitting, particularly due to abundant redundant features. Identifying an informative subset of the features helps in reducing the dimensionality and enhancing model interpretability. In this paper, we propose a novel framework for unsupervised feature selection, called Selective Deep Auto-Encoder (SDAE). It aims to reduce the number of features used in unlabeled datasets without compromising the quality of information obtained. It achieves this by selecting sufficient features - from the original feature set - capable of representing the entire feature space and reconstructing them. Architecturally, it leverages the use of highly nonlinear latent representations in deep Autoencoders and intrinsically learns, in an unsupervised fashion, the relevant and globally representative subset of features through a customized Selective Layer. Extensive experimental results on three high-dimensional public datasets have shown promising feature selection performance by SDAE in comparison to other existing state-of-the-art unsupervised feature selection methods. 
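A generic sketch of the "selective layer" idea from the SDAE abstract above: a learnable per-feature gate multiplies the input before the autoencoder, and a sparsity penalty drives most gates towards zero so that the surviving features form the selected subset. The gating function, penalty, and network sizes are assumptions made for illustration, not the paper's exact architecture.

import torch
import torch.nn as nn

class SelectiveAutoencoder(nn.Module):
    """Autoencoder with a learnable per-feature gate ('selective layer'):
    inputs are scaled by sigmoid gates before encoding, and a sparsity
    penalty keeps only a small set of informative features active."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_features))
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x):
        gates = torch.sigmoid(self.gate_logits)
        return self.decoder(self.encoder(x * gates)), gates

def train_step(model, x, optimizer, sparsity_weight=1e-2):
    recon, gates = model(x)
    loss = ((recon - x) ** 2).mean() + sparsity_weight * gates.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

x = torch.randn(256, 100)                       # unlabeled data with 100 features
model = SelectiveAutoencoder(n_features=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    train_step(model, x, opt)
selected = torch.topk(torch.sigmoid(model.gate_logits), k=10).indices
print(selected)                                 # indices of the 10 highest-weighted features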
\ No newline at end of file diff --git a/data/2024/aaai/Selective Focus: Investigating Semantics Sensitivity in Post-training Quantization for Lane Detection b/data/2024/aaai/Selective Focus: Investigating Semantics Sensitivity in Post-training Quantization for Lane Detection new file mode 100644 index 0000000000..4181ad4df9 --- /dev/null +++ b/data/2024/aaai/Selective Focus: Investigating Semantics Sensitivity in Post-training Quantization for Lane Detection @@ -0,0 +1 @@ +Lane detection (LD) plays a crucial role in enhancing the L2+ capabilities of autonomous driving, capturing widespread attention. The Post-Processing Quantization (PTQ) could facilitate the practical application of LD models, enabling fast speeds and limited memories without labeled data. However, prior PTQ methods do not consider the complex LD outputs that contain physical semantics, such as offsets, locations, etc., and thus cannot be directly applied to LD models. In this paper, we pioneeringly investigate semantic sensitivity to post-processing for lane detection with a novel Lane Distortion Score. Moreover, we identify two main factors impacting the LD performance after quantization, namely intra-head sensitivity and inter-head sensitivity, where a small quantization error in specific semantics can cause significant lane distortion. Thus, we propose a Selective Focus framework deployed with Semantic Guided Focus and Sensitivity Aware Selection modules, to incorporate post-processing information into PTQ reconstruction. Based on the observed intra-head sensitivity, Semantic Guided Focus is introduced to prioritize foreground-related semantics using a practical proxy. For inter-head sensitivity, we present Sensitivity Aware Selection, efficiently recognizing influential prediction heads and refining the optimization objectives at runtime. Extensive experiments have been done on a wide variety of models including keypoint-, anchor-, curve-, and segmentation-based ones. Our method produces quantized models in minutes on a single GPU and can achieve 6.4\% F1 Score improvement on the CULane dataset. Code and supplementary statement can be found at https://github.com/PannenetsF/SelectiveFocus. \ No newline at end of file diff --git a/data/2024/aaai/Selective and Orthogonal Feature Activation for Pedestrian Attribute Recognition b/data/2024/aaai/Selective and Orthogonal Feature Activation for Pedestrian Attribute Recognition new file mode 100644 index 0000000000..7b6cfdfb88 --- /dev/null +++ b/data/2024/aaai/Selective and Orthogonal Feature Activation for Pedestrian Attribute Recognition @@ -0,0 +1 @@ +Pedestrian Attribute Recognition (PAR) involves identifying the attributes of individuals in person images. Existing PAR methods typically rely on CNNs as the backbone network to extract pedestrian features. However, CNNs process only one adjacent region at a time, leading to the loss of long-range inter-relations between different attribute-specific regions. To address this limitation, we leverage the Vision Transformer (ViT) instead of CNNs as the backbone for PAR, aiming to model long-range relations and extract more robust features. However, PAR suffers from an inherent attribute imbalance issue, causing ViT to naturally focus more on attributes that appear frequently in the training set and ignore some pedestrian attributes that appear less. The native features extracted by ViT are not able to tolerate the imbalance attribute distribution issue. 
To tackle this issue, we propose two novel components: the Selective Feature Activation Method (SFAM) and the Orthogonal Feature Activation Loss. SFAM smartly suppresses the more informative attribute-specific features, compelling the PAR model to capture discriminative features from regions that are easily overlooked. The proposed loss enforces an orthogonal constraint on the original feature extracted by ViT and the suppressed features from SFAM, promoting the complementarity of features in space. We conduct experiments on several benchmark PAR datasets, including PETA, PA100K, RAPv1, and RAPv2, demonstrating the effectiveness of our method. Specifically, our method outperforms existing state-of-the-art approaches, including GRL, IAA-Caps, ALM, and SSC, in terms of mA on the four datasets. \ No newline at end of file diff --git a/data/2024/aaai/Self-Distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach b/data/2024/aaai/Self-Distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach new file mode 100644 index 0000000000..6b6a7e4d5e --- /dev/null +++ b/data/2024/aaai/Self-Distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach @@ -0,0 +1 @@ +Text recognition methods are developing rapidly. Some advanced techniques, e.g., powerful modules, language models, and un- and semi-supervised learning schemes, consecutively push the performance on public benchmarks forward. However, the problem of how to better optimize a text recognition model from the perspective of loss functions is largely overlooked. CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with accuracy degradation. This is because CTC loss emphasizes the optimization of the entire sequence target while neglecting to learn individual characters. We propose a self-distillation scheme for CTC-based models to address this issue. It incorporates a framewise regularization term in the CTC loss to emphasize individual supervision, and leverages maximum-a-posteriori estimation of the latent alignment to solve the inconsistency problem that arises in distillation between CTC-based models. We refer to the regularized CTC loss as Distillation Connectionist Temporal Classification (DCTC) loss. DCTC loss is module-free, requiring no extra parameters, longer inference lag, or additional training data or phases. Extensive experiments on public benchmarks demonstrate that DCTC can boost text recognition model accuracy by up to 2.6%, without any of these drawbacks. \ No newline at end of file diff --git a/data/2024/aaai/Self-Interpretable Graph Learning with Sufficient and Necessary Explanations b/data/2024/aaai/Self-Interpretable Graph Learning with Sufficient and Necessary Explanations new file mode 100644 index 0000000000..a50a6d038c --- /dev/null +++ b/data/2024/aaai/Self-Interpretable Graph Learning with Sufficient and Necessary Explanations @@ -0,0 +1 @@ +Self-interpretable graph learning methods provide insights to unveil the black-box nature of GNNs by providing predictions with built-in explanations. However, current works suffer from performance degradation compared to GNNs trained without built-in explanations.
We argue the main reason is that they fail to generate explanations satisfying both sufficiency and necessity, and the biased explanations further hurt GNNs' performance. In this work, we propose a novel framework for generating SUfficient aNd NecessarY explanations (SUNNY-GNN for short) that benefit GNNs' predictions. The key idea is to conduct augmentations by structurally perturbing given explanations and employ a contrastive loss to guide the learning of explanations toward sufficiency and necessity directions. SUNNY-GNN introduces two coefficients to generate hard and reliable contrastive samples. We further extend SUNNY-GNN to heterogeneous graphs. Empirical results on various GNNs and real-world graphs show that SUNNY-GNN yields accurate predictions and faithful explanations, outperforming the state-of-the-art methods by improving 3.5% prediction accuracy and 13.1% explainability fidelity on average. Our code and data are available at https://github.com/SJTU-Quant/SUNNY-GNN. \ No newline at end of file diff --git a/data/2024/aaai/Self-Paced Unified Representation Learning for Hierarchical Multi-Label Classification b/data/2024/aaai/Self-Paced Unified Representation Learning for Hierarchical Multi-Label Classification new file mode 100644 index 0000000000..4fbcbe6fb7 --- /dev/null +++ b/data/2024/aaai/Self-Paced Unified Representation Learning for Hierarchical Multi-Label Classification @@ -0,0 +1 @@ +Hierarchical Multi-Label Classification (HMLC) is a well-established problem that aims at assigning data instances to multiple classes stored in a hierarchical structure. Despite its importance, existing approaches often face two key limitations: (i) They employ dense networks to solely explore the class hierarchy as hard criterion for maintaining taxonomic consistency among predicted classes, yet without leveraging rich semantic relationships between instances and classes; (ii) They struggle to generalize in settings with deep class levels, since the mini-batches uniformly sampled from different levels ignore the varying complexities of data and result in a non-smooth model adaptation to sparse data. To mitigate these issues, we present a Self-Paced Unified Representation (SPUR) learning framework, which focuses on the interplay between instance and classes to flexibly organize the training process of HMLC algorithms. Our framework consists of two lightweight encoders designed to capture the semantics of input features and the topological information of the class hierarchy. These encoders generate unified embeddings of instances and class hierarchy, which enable SPUR to exploit semantic dependencies between them and produce predictions in line with taxonomic constraints. Furthermore, we introduce a dynamic hardness measurement strategy that considers both class hierarchy and instance features to estimate the learning difficulty of each instance. This strategy is achieved by incorporating the propagation loss obtained at each hierarchical level, allowing for a more comprehensive assessment of learning complexity. Extensive experiments on several empirical benchmarks demonstrate the effectiveness and efficiency of SPUR compared to state-of-the-art methods, especially in scenarios with missing features. 
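A toy sketch of the self-paced idea in the SPUR abstract above: per-instance losses accumulated over the hierarchy levels act as a hardness score, and a threshold that grows during training gradually admits harder instances. The quantile schedule and sigmoid gate below are assumptions, not the paper's exact dynamic hardness measure.

import torch

def self_paced_weights(level_losses, epoch, max_epochs, slope=5.0):
    """Generic self-paced weighting: an instance's hardness is its summed loss
    across hierarchy levels; a growing threshold softly admits harder instances."""
    hardness = level_losses.sum(dim=1)                       # (num_instances,)
    q = min(0.2 + 0.8 * epoch / max_epochs, 1.0)             # threshold quantile grows with training
    threshold = torch.quantile(hardness, q)
    return torch.sigmoid(slope * (threshold - hardness))     # ~1 for easy, ~0 for hard

# toy example: 8 instances with losses from a 3-level class hierarchy
level_losses = torch.rand(8, 3)
for epoch in (0, 5, 10):
    w = self_paced_weights(level_losses, epoch, max_epochs=10)
    weighted_loss = (w * level_losses.sum(dim=1)).sum() / w.sum()
    print(epoch, weighted_loss.item())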
\ No newline at end of file diff --git a/data/2024/aaai/Self-Prompt Mechanism for Few-Shot Image Recognition b/data/2024/aaai/Self-Prompt Mechanism for Few-Shot Image Recognition new file mode 100644 index 0000000000..d60f645fe4 --- /dev/null +++ b/data/2024/aaai/Self-Prompt Mechanism for Few-Shot Image Recognition @@ -0,0 +1 @@ +Few-shot learning poses a formidable challenge as it necessitates effective recognition of novel classes based on a limited set of examples. Recent studies have sought to address the challenge of rare samples by tuning visual features through the utilization of external text prompts. However, the performance of these methods is constrained due to the inherent modality gap between the prompt text and image features. Instead of naively utilizing the external semantic information generated from text to guide the training of the image encoder, we propose a novel self-prompt mechanism (SPM) to adaptively adjust the neural network according to unseen data. Specifically, SPM involves a systematic selection of intrinsic semantic features generated by the image encoder across spatial and channel dimensions, thereby engendering self-prompt information. Subsequently, upon backpropagation of this self-prompt information to the deeper layers of the neural network, it effectively steers the network toward the learning and adaptation of new samples. Meanwhile, we propose a novel parameter-efficient tuning method that exclusively fine-tunes the parameters relevant to self-prompt (prompts are no more than 2% of the total parameters), and the incorporation of additional learnable parameters as self-prompt ensures the retention of prior knowledge through frozen encoder weights. Therefore, our method is highly suited for few-shot recognition tasks that require both information retention and adaptive adjustment of network parameters with limited labeling data constraints. Extensive experiments demonstrate the effectiveness of the proposed SPM in both 5-way 1-shot and 5-way 5-shot settings for standard single-domain and cross-domain few-shot recognition datasets, respectively. Our code is available at https://github.com/codeshop715/SPM. \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised 3D Human Mesh Recovery from a Single Image with Uncertainty-Aware Learning b/data/2024/aaai/Self-Supervised 3D Human Mesh Recovery from a Single Image with Uncertainty-Aware Learning new file mode 100644 index 0000000000..90d0eaaee9 --- /dev/null +++ b/data/2024/aaai/Self-Supervised 3D Human Mesh Recovery from a Single Image with Uncertainty-Aware Learning @@ -0,0 +1 @@ +Despite achieving impressive improvement in accuracy, most existing monocular 3D human mesh reconstruction methods require large-scale 2D/3D ground-truths for supervision, which limits their applications on unlabeled in-the-wild data that is ubiquitous. To alleviate the reliance on 2D/3D ground-truths, we present a self-supervised 3D human pose and shape reconstruction framework that relies only on self-consistency between intermediate representations of images and projected 2D predictions. Specifically, we extract 2D joints and depth maps from monocular images as proxy inputs, which provides complementary clues to infer accurate 3D human meshes. 
Furthermore, to reduce the impacts from noisy and ambiguous inputs while better concentrate on the high-quality information, we design an uncertainty-aware module to automatically learn the reliability of the inputs at body-joint level based on the consistency between 2D joints and depth map. Experiments on benchmark datasets show that our approach outperforms other state-of-the-art methods at similar supervision levels. \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals b/data/2024/aaai/Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals new file mode 100644 index 0000000000..f57423f46b --- /dev/null +++ b/data/2024/aaai/Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals @@ -0,0 +1 @@ +Learning the dense bird's eye view (BEV) motion flow in a self-supervised manner is an emerging research for robotics and autonomous driving. Current self-supervised methods mainly rely on point correspondences between point clouds, which may introduce the problems of fake flow and inconsistency, hindering the model’s ability to learn accurate and realistic motion. In this paper, we introduce a novel cross-modality self-supervised training framework that effectively addresses these issues by leveraging multi-modality data to obtain supervision signals. We design three innovative supervision signals to preserve the inherent properties of scene motion, including the masked Chamfer distance loss, the piecewise rigidity loss, and the temporal consistency loss. Through extensive experiments, we demonstrate that our proposed self-supervised framework outperforms all previous self-supervision methods for the motion prediction task. \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction b/data/2024/aaai/Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction new file mode 100644 index 0000000000..40679065be --- /dev/null +++ b/data/2024/aaai/Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction @@ -0,0 +1 @@ +Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information. However, in the task of target speech extraction, certain elements of global and local semantic information in the reference speech, which are irrelevant to speaker identity, can lead to speaker confusion within the speech extraction network. To overcome this challenge, we propose a self-supervised disentangled representation learning method. Our approach tackles this issue through a two-phase process, utilizing a reference speech encoding network and a global information disentanglement network to gradually disentangle the speaker identity information from other irrelevant factors. We exclusively employ the disentangled speaker identity information to guide the speech extraction network. Moreover, we introduce the adaptive modulation Transformer to ensure that the acoustic representation of the mixed signal remains undisturbed by the speaker embeddings. This component incorporates speaker embeddings as conditional information, facilitating natural and efficient guidance for the speech extraction network. Experimental results substantiate the effectiveness of our meticulously crafted approach, showcasing a substantial reduction in the likelihood of speaker confusion. 
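A minimal illustration of the masked Chamfer distance supervision named in the cross-modality BEV motion-prediction abstract above: source points warped by the predicted flow are pulled towards their nearest neighbours in the next sweep, with a mask deciding which points contribute. The 2-D toy data and optimisation loop are assumptions; the paper's full objective also includes the rigidity and temporal-consistency terms.

import torch

def masked_chamfer_loss(src_points, pred_flow, tgt_points, src_mask):
    """Chamfer-style self-supervision: warp source points by the predicted flow
    and penalise distances to nearest neighbours in the next frame (both ways),
    counting only points kept by the mask."""
    warped = (src_points + pred_flow)[src_mask]               # (M, 2) BEV coordinates
    d = torch.cdist(warped, tgt_points)                       # pairwise distances
    forward = d.min(dim=1).values.mean()                      # warped -> target
    backward = d.min(dim=0).values.mean()                     # target -> warped
    return forward + backward

# toy example: 2-D BEV points from two consecutive sweeps
src = torch.rand(500, 2) * 50.0
tgt = src + torch.tensor([0.5, 0.0])                          # scene shifted by 0.5 m
flow = torch.zeros_like(src, requires_grad=True)
mask = torch.ones(500, dtype=torch.bool)                      # e.g. drop unreliable points here
opt = torch.optim.Adam([flow], lr=0.05)
for _ in range(100):
    loss = masked_chamfer_loss(src, flow, tgt, mask)
    opt.zero_grad(); loss.backward(); opt.step()
print(flow.mean(dim=0))                                       # should drift towards roughly (0.5, 0.0)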
\ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Framework Based on Subject-Wise Clustering for Human Subject Time Series Data b/data/2024/aaai/Self-Supervised Framework Based on Subject-Wise Clustering for Human Subject Time Series Data new file mode 100644 index 0000000000..22354a8b01 --- /dev/null +++ b/data/2024/aaai/Self-Supervised Framework Based on Subject-Wise Clustering for Human Subject Time Series Data @@ -0,0 +1 @@ +With the widespread adoption of IoT, wearable devices, and sensors, time series data from human subjects are significantly increasing in the healthcare domain. Due to the laborious nature of manual annotation in time series data and the requirement for human experts, self-supervised learning methods are attempted to alleviate the limited label situations. While existing self-supervised methods have been successful to achieve comparable performance to the fully supervised methods, there are still some limitations that need to be addressed, considering the nature of time series data from human subjects: In real-world clinical settings, data labels (e.g., sleep stages) are usually annotated by subject-level, and there is a substantial variation in patterns between subjects. Thus, a model should be designed to deal with not only the label scarcity but also subject-wise nature of data to ensure high performance in real-world scenarios. To mitigate these issues, we propose a novel self-supervised learning framework for human subject time series data: Subject-Aware Time Series Clustering (SA-TSC). In the unsupervised representation learning phase, SA-TSC adopts a subject-wise learning strategy rather than instance-wise learning which randomly samples data instances from different subjects within the batch during training. Specifically, we generate subject-graphs with our graph construction method based on Gumbel-Softmax and perform graph spectral clustering on each subject-graph. In addition, we utilize graph neural networks to capture dependencies between channels and design our own graph learning module motivated from self-supervised loss. Experimental results show the outstanding performance of our SA-TSC with the limited & subject-wise label setting, leading to its high applicability to the healthcare industry. The code is available at: https://github.com/DILAB-HYU/SA-TSC \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Likelihood Estimation with Energy Guidance for Anomaly Segmentation in Urban Scenes b/data/2024/aaai/Self-Supervised Likelihood Estimation with Energy Guidance for Anomaly Segmentation in Urban Scenes new file mode 100644 index 0000000000..bf1adbbed6 --- /dev/null +++ b/data/2024/aaai/Self-Supervised Likelihood Estimation with Energy Guidance for Anomaly Segmentation in Urban Scenes @@ -0,0 +1 @@ +Robust autonomous driving requires agents to accurately identify unexpected areas (anomalies) in urban scenes. To this end, some critical issues remain open: how to design advisable metric to measure anomalies, and how to properly generate training samples of anomaly data? Classical effort in anomaly detection usually resorts to pixel-wise uncertainty or sample synthesis, which ignores the contextual information and sometimes requires auxiliary data with fine-grained annotations. 
On the contrary, in this paper, we exploit the strong context-dependent nature of the segmentation task and design an energy-guided self-supervised framework for anomaly segmentation, which optimizes an anomaly head by maximizing the likelihood of self-generated anomaly pixels. For this purpose, we design two estimators to model the anomaly likelihood: one is a task-agnostic binary estimator, and the other depicts the likelihood as the residual of a task-oriented joint energy. Based on the proposed estimators, we devise an adaptive self-supervised training framework, which exploits the contextual reliance and estimated likelihood to refine mask annotations in anomaly areas. We conduct extensive experiments on the challenging Fishyscapes and Road Anomaly benchmarks, demonstrating that without any auxiliary data or synthetic models, our method still achieves comparable performance to supervised competitors. Code is available at https://github.com/yuanpengtu/SLEEG. \ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search b/data/2024/aaai/Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search new file mode 100644 index 0000000000..90f97229f1 --- /dev/null +++ b/data/2024/aaai/Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search @@ -0,0 +1 @@ +Deep cross-modal hashing technology provides an effective and efficient cross-modal unified representation learning solution for cross-modal search. However, the existing methods neglect the implicit fine-grained multimodal knowledge relations between these modalities, such as when the image contains information that is not directly described in the text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, in order to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multimodal knowledge relations between the image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason on the multi-modal knowledge graph to sufficiently learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses the global coarse-grained and local fine-grained embeddings by a multi-head attention mechanism for inter-modal and intra-modal contrastive learning, so as to enhance the cross-modal unified representations with stronger discriminativeness and semantic consistency preserving power. With the joint training of intra-modal and inter-modal contrast, the invariant and modal-specific information of different modalities can be maintained in the final unified cross-modal hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods.
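As a point of reference for the inter-modal contrastive learning described in the abstract above, the snippet below is a generic symmetric InfoNCE-style contrastive loss between paired image and text embeddings. It is only a sketch of the general technique; the CMGCH objective, its hashing layers, and its attention-based fusion are not reproduced here, and the array shapes are illustrative assumptions.

import numpy as np

def cross_modal_info_nce(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) arrays; row i of both matrices comes from the
    # same image-text pair, so the diagonal holds positives and every
    # off-diagonal entry acts as a negative.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) scaled cosine similarities

    def nll_of_diagonal(m):
        # Cross-entropy of each row against its matching (diagonal) column.
        log_softmax = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_softmax))

    # Symmetric loss: image-to-text retrieval plus text-to-image retrieval.
    return 0.5 * (nll_of_diagonal(logits) + nll_of_diagonal(logits.T))

rng = np.random.default_rng(0)
print(cross_modal_info_nce(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))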
\ No newline at end of file diff --git a/data/2024/aaai/Self-Supervised Representation Learning with Meta Comprehensive Regularization b/data/2024/aaai/Self-Supervised Representation Learning with Meta Comprehensive Regularization new file mode 100644 index 0000000000..9db4d78b70 --- /dev/null +++ b/data/2024/aaai/Self-Supervised Representation Learning with Meta Comprehensive Regularization @@ -0,0 +1 @@ +Self-Supervised Learning (SSL) methods harness the concept of semantic invariance by utilizing data augmentation strategies to produce similar representations for different deformations of the same input. Essentially, the model captures the shared information among multiple augmented views of samples, while disregarding the non-shared information that may be beneficial for downstream tasks. To address this issue, we introduce a module called CompMod with Meta Comprehensive Regularization (MCR), embedded into existing self-supervised frameworks, to make the learned representations more comprehensive. Specifically, we update our proposed model through a bi-level optimization mechanism, enabling it to capture comprehensive features. Additionally, guided by the constrained extraction of features using maximum entropy coding, the self-supervised learning model learns more comprehensive features on top of learning consistent features. In addition, we provide theoretical support for our proposed method from information-theoretic and causal counterfactual perspectives. Experimental results show that our method achieves significant improvement in classification, object detection and semantic segmentation tasks on multiple benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Self-Training Based Few-Shot Node Classification by Knowledge Distillation b/data/2024/aaai/Self-Training Based Few-Shot Node Classification by Knowledge Distillation new file mode 100644 index 0000000000..9305818ced --- /dev/null +++ b/data/2024/aaai/Self-Training Based Few-Shot Node Classification by Knowledge Distillation @@ -0,0 +1,2 @@ +Self-training based few-shot node classification (FSNC) methods have shown excellent performance in real applications, but they cannot make full use of the information in the base set and are easily affected by the quality of pseudo-labels. To address these issues, this paper proposes a new self-training FSNC method that involves representation distillation and pseudo-label distillation. Specifically, the representation distillation includes two knowledge distillation methods (i.e., the local representation distillation and the global representation distillation) to transfer the information in the base set to the novel set. The pseudo-label distillation is designed to conduct knowledge distillation on the pseudo-labels to improve their quality. +Experimental results show that our method achieves superior performance compared with state-of-the-art methods. Our code and a comprehensive theoretical version are available at https://github.com/zongqianwu/KD-FSNC. \ No newline at end of file diff --git a/data/2024/aaai/SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency b/data/2024/aaai/SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency new file mode 100644 index 0000000000..4d72bb2849 --- /dev/null +++ b/data/2024/aaai/SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency @@ -0,0 +1 @@ +This work presents an effective depth-consistency Self-Prompt Transformer, termed SelfPromer, for image dehazing.
It is motivated by an observation that the estimated depths of an image with haze residuals and its clear counterpart vary. Enforcing the depth consistency of dehazed images with clear ones, therefore, is essential for dehazing. For this purpose, we develop a prompt based on the features of depth differences between the hazy input images and corresponding clear counterparts that can guide dehazing models for better restoration. Specifically, we first apply deep features extracted from the input images to the depth difference features for generating the prompt that contains the haze residual information in the input. Then we propose a prompt embedding module that is designed to perceive the haze residuals, by linearly adding the prompt to the deep features. Further, we develop an effective prompt attention module to pay more attention to haze residuals for better removal. By incorporating the prompt, prompt embedding, and prompt attention into an encoder-decoder network based on VQGAN, we can achieve better perception quality. As the depths of clear images are not available at inference, and the dehazed images with one-time feed-forward execution may still contain a portion of haze residuals, we propose a new continuous self-prompt inference that can iteratively correct the dehazing model towards better haze-free image generation. Extensive experiments show that our SelfPromer performs favorably against the state-of-the-art approaches on both synthetic and real-world datasets in terms of perception metrics including NIQE, PI, and PIQE. The source codes will be made available at https://github.com/supersupercong/SelfPromer. \ No newline at end of file diff --git a/data/2024/aaai/SemLa: A Visual Analysis System for Fine-Grained Text Classification b/data/2024/aaai/SemLa: A Visual Analysis System for Fine-Grained Text Classification new file mode 100644 index 0000000000..77957e0671 --- /dev/null +++ b/data/2024/aaai/SemLa: A Visual Analysis System for Fine-Grained Text Classification @@ -0,0 +1 @@ +Fine-grained text classification requires models to distinguish between many fine-grained classes that are hard to tell apart. However, despite the increased risk of models relying on confounding features and predictions being especially difficult to interpret in this context, existing work on the interpretability of fine-grained text classification is severely limited. Therefore, we introduce our visual analysis system, SemLa, which incorporates novel visualization techniques that are tailored to this challenge. Our evaluation based on case studies and expert feedback shows that SemLa can be a powerful tool for identifying model weaknesses, making decisions about data annotation, and understanding the root cause of errors. \ No newline at end of file diff --git a/data/2024/aaai/SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation b/data/2024/aaai/SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation new file mode 100644 index 0000000000..a841540cd7 --- /dev/null +++ b/data/2024/aaai/SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation @@ -0,0 +1 @@ +This work explores the zero-shot adaptation capability of semantic skills, semantically interpretable experts' behavior patterns, in cross-domain settings, where a user input in interleaved multi-modal snippets can prompt a new long-horizon task for different domains. 
In these cross-domain settings, we present a semantic skill translator framework, SemTra, which utilizes a set of multi-modal models to extract skills from the snippets, and leverages the reasoning capabilities of a pretrained language model to adapt these extracted skills to the target domain. The framework employs a two-level hierarchy for adaptation: task adaptation and skill adaptation. During task adaptation, seq-to-seq translation by the language model transforms the extracted skills into a semantic skill sequence, which is tailored to fit the cross-domain contexts. Skill adaptation focuses on optimizing each semantic skill for the target domain context, through parametric instantiations that are facilitated by language prompting and contrastive learning-based context inferences. This hierarchical adaptation empowers the framework to not only infer a complex task specification in one shot from the interleaved multi-modal snippets, but also adapt it to new domains with zero-shot learning abilities. We evaluate our framework with Meta-World, Franka Kitchen, RLBench, and CARLA environments. The results demonstrate the framework's superiority in performing long-horizon tasks and adapting to different domains, showing its broad applicability in practical use cases, such as cognitive robots interpreting abstract instructions and autonomous vehicles operating under varied configurations. \ No newline at end of file diff --git a/data/2024/aaai/Semantic Complete Scene Forecasting from a 4D Dynamic Point Cloud Sequence b/data/2024/aaai/Semantic Complete Scene Forecasting from a 4D Dynamic Point Cloud Sequence new file mode 100644 index 0000000000..e94e7a3d8b --- /dev/null +++ b/data/2024/aaai/Semantic Complete Scene Forecasting from a 4D Dynamic Point Cloud Sequence @@ -0,0 +1 @@ +We study a new problem of semantic complete scene forecasting (SCSF) in this work. Given a 4D dynamic point cloud sequence, our goal is to forecast the complete scene corresponding to the next future frame along with its semantic labels. To tackle this challenging problem, we properly model the synergetic relationship between future forecasting and semantic scene completion through a novel network named SCSFNet. SCSFNet leverages a hybrid geometric representation for high-resolution complete scene forecasting. To leverage multi-frame observation as well as the understanding of scene dynamics to ease the completion task, SCSFNet introduces an attention-based skip connection scheme. To ease the need to model occlusion variations and to better focus on the occluded part, SCSFNet utilizes auxiliary visibility grids to guide the forecasting task. To evaluate the effectiveness of SCSFNet, we conduct experiments on various benchmarks including two large-scale indoor benchmarks we contributed and the outdoor SemanticKITTI benchmark. Extensive experiments show SCSFNet outperforms baseline methods on multiple metrics by a large margin, and also prove the synergy between future forecasting and semantic scene completion. The project page with code is available at scsfnet.github.io.
\ No newline at end of file diff --git a/data/2024/aaai/Semantic Lens: Instance-Centric Semantic Alignment for Video Super-resolution b/data/2024/aaai/Semantic Lens: Instance-Centric Semantic Alignment for Video Super-resolution new file mode 100644 index 0000000000..4024ec5325 --- /dev/null +++ b/data/2024/aaai/Semantic Lens: Instance-Centric Semantic Alignment for Video Super-resolution @@ -0,0 +1 @@ +As a critical clue of video super-resolution (VSR), inter-frame alignment significantly impacts overall performance. However, accurate pixel-level alignment is a challenging task due to the intricate motion interweaving in the video. In response to this issue, we introduce a novel paradigm for VSR named Semantic Lens, predicated on semantic priors drawn from degraded videos. Specifically, video is modeled as instances, events, and scenes via a Semantic Extractor. Those semantics assist the Pixel Enhancer in understanding the recovered contents and generating more realistic visual results. The distilled global semantics embody the scene information of each frame, while the instance-specific semantics assemble the spatial-temporal contexts related to each instance. Furthermore, we devise a Semantics-Powered Attention Cross-Embedding (SPACE) block to bridge the pixel-level features with semantic knowledge, composed of a Global Perspective Shifter (GPS) and an Instance-Specific Semantic Embedding Encoder (ISEE). Concretely, the GPS module generates pairs of affine transformation parameters for pixel-level feature modulation conditioned on global semantics. After that the ISEE module harnesses the attention mechanism to align the adjacent frames in the instance-centric semantic space. In addition, we incorporate a simple yet effective pre-alignment module to alleviate the difficulty of model training. Extensive experiments demonstrate the superiority of our model over existing state-of-the-art VSR methods. \ No newline at end of file diff --git a/data/2024/aaai/Semantic Segmentation in Multiple Adverse Weather Conditions with Domain Knowledge Retention b/data/2024/aaai/Semantic Segmentation in Multiple Adverse Weather Conditions with Domain Knowledge Retention new file mode 100644 index 0000000000..6d37ad4149 --- /dev/null +++ b/data/2024/aaai/Semantic Segmentation in Multiple Adverse Weather Conditions with Domain Knowledge Retention @@ -0,0 +1 @@ +Semantic segmentation's performance is often compromised when applied to unlabeled adverse weather conditions. Unsupervised domain adaptation is a potential approach to enhancing the model's adaptability and robustness to adverse weather. However, existing methods encounter difficulties when sequentially adapting the model to multiple unlabeled adverse weather conditions. They struggle to acquire new knowledge while also retaining previously learned knowledge. To address these problems, we propose a semantic segmentation method for multiple adverse weather conditions that incorporates adaptive knowledge acquisition, pseudo-label blending, and weather composition replay. Our adaptive knowledge acquisition enables the model to avoid learning from extreme images that could potentially cause the model to forget. In our approach of blending pseudo-labels, we not only utilize the current model but also integrate the previously learned model into the ongoing learning process. This collaboration between the current teacher and the previous model enhances the robustness of the pseudo-labels for the current target. 
Our weather composition replay mechanism allows the model to continuously refine its previously learned weather information while simultaneously learning from the new target domain. Our method consistently outperforms the state-of-the-art methods, and obtains the best performance with an averaged mIoU (%) of 65.7 and the lowest forgetting (%) of 3.6, against 60.1 and 11.3, on the ACDC datasets for a four-target continual multi-target domain adaptation. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning b/data/2024/aaai/Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning new file mode 100644 index 0000000000..ef699010cb --- /dev/null +++ b/data/2024/aaai/Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning @@ -0,0 +1 @@ +The development of autoregressive modeling (AM) in computer vision lags behind natural language processing (NLP) in self-supervised pre-training. This is mainly caused by the challenge that images are not sequential signals and lack a natural order when applying autoregressive modeling. In this study, inspired by human beings’ way of grasping an image, i.e., focusing on the main object first, we present a semantic-aware autoregressive image modeling (SemAIM) method to tackle this challenge. The key insight of SemAIM is to autoregressively model images from the semantic patches to the less semantic patches. To this end, we first calculate a semantic-aware permutation of patches according to their feature similarities and then perform the autoregression procedure based on the permutation. In addition, considering that the raw pixels of patches are low-level signals and are not ideal prediction targets for learning high-level semantic representation, we also explore utilizing the patch features as the prediction targets. Extensive experiments are conducted on a broad range of downstream tasks, including image classification, object detection, and instance/semantic segmentation, to evaluate the performance of SemAIM. The results demonstrate SemAIM achieves state-of-the-art performance compared with other self-supervised methods. Specifically, with ViT-B, SemAIM achieves 84.1% top-1 accuracy for fine-tuning on ImageNet, 51.3% AP and 45.4% AP for object detection and instance segmentation on COCO, which outperforms the vanilla MAE by 0.5%, 1.0%, and 0.5%, respectively. Code is available at https://github.com/skyoux/SemAIM. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Aware Data Augmentation for Text-to-Image Synthesis b/data/2024/aaai/Semantic-Aware Data Augmentation for Text-to-Image Synthesis new file mode 100644 index 0000000000..851c1a9da6 --- /dev/null +++ b/data/2024/aaai/Semantic-Aware Data Augmentation for Text-to-Image Synthesis @@ -0,0 +1 @@ +Data augmentation has been recently leveraged as an effective regularizer in various vision-language deep neural networks. However, in text-to-image synthesis (T2Isyn), current augmentation wisdom still suffers from the semantic mismatch between augmented paired data. Even worse, semantic collapse may occur when generated images are less semantically constrained. In this paper, we develop a novel Semantic-aware Data Augmentation (SADA) framework dedicated to T2Isyn.
In particular, we propose to augment texts in the semantic space via an Implicit Textual Semantic Preserving Augmentation, in conjunction with a specifically designed Image Semantic Regularization Loss as Generated Image Semantic Conservation, to cope well with semantic mismatch and collapse. As one major contribution, we theoretically show that the Implicit Textual Semantic Preserving Augmentation certifies better text-image consistency, while the Image Semantic Regularization Loss, by regularizing the semantics of generated images, avoids semantic collapse and enhances image quality. Extensive experiments validate that SADA enhances text-image consistency and improves image quality significantly in T2Isyn models across various backbones. Notably, incorporating SADA during the tuning process of Stable Diffusion models also yields performance improvements. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Aware Transformation-Invariant RoI Align b/data/2024/aaai/Semantic-Aware Transformation-Invariant RoI Align new file mode 100644 index 0000000000..7789f29ceb --- /dev/null +++ b/data/2024/aaai/Semantic-Aware Transformation-Invariant RoI Align @@ -0,0 +1 @@ +Great progress has been made in learning-based object detection methods in the last decade. Two-stage detectors often have higher detection accuracy than one-stage detectors, due to the use of region of interest (RoI) feature extractors which extract transformation-invariant RoI features for different RoI proposals, making refinement of bounding boxes and prediction of object categories more robust and accurate. However, previous RoI feature extractors can only extract invariant features under limited transformations. In this paper, we propose a novel RoI feature extractor, termed Semantic RoI Align (SRA), which is capable of extracting invariant RoI features under a variety of transformations for two-stage detectors. Specifically, we propose a semantic attention module to adaptively determine different sampling areas by leveraging the global and local semantic relationship within the RoI. We also propose a Dynamic Feature Sampler which dynamically samples features based on the RoI aspect ratio to enhance the efficiency of SRA, and a new position embedding, i.e., Area Embedding, to provide more accurate position information for SRA through an improved sampling area representation. Experiments show that our model significantly outperforms baseline models with slight computational overhead. In addition, it shows excellent generalization ability and can be used to improve performance with various state-of-the-art backbones and detection methods. The code is available at https://github.com/cxjyxxme/SemanticRoIAlign. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification b/data/2024/aaai/Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification new file mode 100644 index 0000000000..6c14163179 --- /dev/null +++ b/data/2024/aaai/Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification @@ -0,0 +1 @@ +Existing image augmentation methods consist of two categories: perturbation-based methods and generative methods. Perturbation-based methods apply pre-defined perturbations to augment an original image, but only locally vary the image, thus lacking image diversity.
In contrast, generative methods bring more image diversity in the augmented images but may not preserve semantic consistency, thus may incorrectly change the essential semantics of the original image. To balance image diversity and semantic consistency in augmented images, we propose SGID, a Semantic-guided Generative Image augmentation method with Diffusion models for image classification. Specifically, SGID employs diffusion models to generate augmented images with good image diversity. More importantly, SGID takes image labels and captions as guidance to maintain semantic consistency between the augmented and original images. Experimental results show that SGID outperforms the best augmentation baseline by 1.72% on ResNet-50 (from scratch), 0.33% on ViT (ImageNet-21k), and 0.14% on CLIP-ViT (LAION-2B). Moreover, SGID can be combined with other image augmentation baselines and further improves the overall performance. We demonstrate the semantic consistency and image diversity of SGID through quantitative human and automated evaluations, as well as qualitative case studies. \ No newline at end of file diff --git a/data/2024/aaai/Semantic-Guided Novel Category Discovery b/data/2024/aaai/Semantic-Guided Novel Category Discovery new file mode 100644 index 0000000000..68c5f64dac --- /dev/null +++ b/data/2024/aaai/Semantic-Guided Novel Category Discovery @@ -0,0 +1 @@ +The Novel Category Discovery problem aims to cluster an unlabeled set with the help of a labeled set consisting of disjoint but related classes. However, existing models treat class names as discrete one-hot labels and ignore the semantic understanding of these classes. In this paper, we propose a new setting named Semantic-guided Novel Category Discovery (SNCD), which requires the model to not only cluster the unlabeled images but also semantically recognize these images based on a set of their class names. The first challenge we confront pertains to effectively leveraging the class names of unlabeled images, given the inherent gap between the visual and linguistic domains. To address this issue, we incorporate a semantic-aware recognition mechanism. This is achieved by constructing dynamic class-wise visual prototypes as well as a semantic similarity matrix that enables the projection of visual features into the semantic space. The second challenge originates from the granularity disparity between the classification and clustering tasks. To deal with this, we develop a semantic-aware clustering process to facilitate the exchange of knowledge between the two tasks. Through extensive experiments, we demonstrate the mutual benefits of the recognition and clustering tasks, which can be jointly optimized. Experimental results on multiple datasets confirm the effectiveness of our proposed method. Our code is available at https://github.com/wang-weishuai/Semantic-guided-NCD. \ No newline at end of file diff --git a/data/2024/aaai/Semi-Supervised Blind Image Quality Assessment through Knowledge Distillation and Incremental Learning b/data/2024/aaai/Semi-Supervised Blind Image Quality Assessment through Knowledge Distillation and Incremental Learning new file mode 100644 index 0000000000..76276bdd14 --- /dev/null +++ b/data/2024/aaai/Semi-Supervised Blind Image Quality Assessment through Knowledge Distillation and Incremental Learning @@ -0,0 +1 @@ +Blind Image Quality Assessment (BIQA) aims to simulate human assessment of image quality. It has a great demand for labeled data, which is often insufficient in practice. 
Some researchers employ unsupervised methods to address this issue, but it is challenging for such methods to emulate the human subjective system. To this end, we introduce a unified framework that combines semi-supervised and incremental learning to address the mentioned issue. Specifically, when training data is limited, semi-supervised learning is necessary to leverage extensive unlabeled data. To facilitate semi-supervised learning, we use knowledge distillation to assign pseudo-labels to unlabeled data, preserving analytical capability. To gradually improve the quality of pseudo labels, we introduce incremental learning. However, incremental learning can lead to catastrophic forgetting. We employ Experience Replay by selecting representative samples during multiple rounds of semi-supervised learning, to alleviate forgetting and ensure model stability. Experimental results show that the proposed approach achieves state-of-the-art performance across various benchmark datasets. After being trained on the LIVE dataset, our method can be directly transferred to the CSIQ dataset. Compared with other methods, it significantly outperforms unsupervised methods on the CSIQ dataset with a marginal performance drop (-0.002) on the LIVE dataset. In conclusion, our proposed method demonstrates its potential to tackle the challenges in real-world production processes. \ No newline at end of file diff --git a/data/2024/aaai/Semi-factual Explanations in AI b/data/2024/aaai/Semi-factual Explanations in AI new file mode 100644 index 0000000000..ca084be114 --- /dev/null +++ b/data/2024/aaai/Semi-factual Explanations in AI @@ -0,0 +1 @@ +Most of the recent works on post-hoc example-based eXplainable AI (XAI) methods revolve around employing counterfactual explanations to provide justification for the predictions made by AI systems. Counterfactuals show what changes to the input features change the output decision. However, a lesser-known special case of the counterfactual is the semi-factual, which provides explanations about what changes to the input features do not change the output decision. Semi-factuals are potentially as useful as counterfactuals but have received little attention in the XAI literature. My doctoral research aims to establish a comprehensive framework for the use of semi-factuals in XAI by developing novel methods for their computation, supported by user tests. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised 3D Object Detection with PatchTeacher and PillarMix b/data/2024/aaai/Semi-supervised 3D Object Detection with PatchTeacher and PillarMix new file mode 100644 index 0000000000..b31d7e6329 --- /dev/null +++ b/data/2024/aaai/Semi-supervised 3D Object Detection with PatchTeacher and PillarMix @@ -0,0 +1 @@ +Semi-supervised learning aims to leverage abundant unlabeled data to improve model performance. Current semi-supervised 3D object detection methods typically use a teacher to generate pseudo labels for a student, and the quality of the pseudo labels is essential for the final performance. In this paper, we propose PatchTeacher, which focuses on partial scene 3D object detection to provide high-quality pseudo labels for the student. Specifically, we divide a complete scene into a series of patches and feed them to our PatchTeacher sequentially.
PatchTeacher leverages the low memory consumption advantage of partial scene detection to process point clouds with a high-resolution voxelization, which can minimize the information loss of quantization and extract more fine-grained features. However, it is non-trivial to train a detector on fractions of the scene. Therefore, we introduce three key techniques, i.e., Patch Normalizer, Quadrant Align, and Fovea Selection, to improve the performance of PatchTeacher. Moreover, we devise PillarMix, a strong data augmentation strategy that mixes truncated pillars from different LiDAR scans to generate diverse training samples and thus help the model learn a more general representation. Extensive experiments conducted on the Waymo and ONCE datasets verify the effectiveness and superiority of our method and we achieve new state-of-the-art results, surpassing existing methods by a large margin. Codes are available at https://github.com/LittlePey/PTPM. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised Active Learning for Video Action Detection b/data/2024/aaai/Semi-supervised Active Learning for Video Action Detection new file mode 100644 index 0000000000..d49c9d626b --- /dev/null +++ b/data/2024/aaai/Semi-supervised Active Learning for Video Action Detection @@ -0,0 +1 @@ +In this work, we focus on label efficient learning for video action detection. We develop a novel semi-supervised active learning approach which utilizes both labeled as well as unlabeled data along with informative sample selection for action detection. Video action detection requires spatio-temporal localization along with classification, which poses several challenges for both active learning (informative sample selection) as well as semi-supervised learning (pseudo label generation). First, we propose NoiseAug, a simple augmentation strategy which effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering which enables effective utilization of pseudo labels for SSL in video action detection by emphasizing relevant activity regions within a video. We evaluate the proposed approach on three different benchmark datasets, UCF-101-24, JHMDB-21, and Youtube-VOS. First, we demonstrate its effectiveness on video action detection, where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning along with several baseline approaches on both UCF101-24 and JHMDB-21. Next, we also show its effectiveness on Youtube-VOS for video object segmentation, demonstrating its generalization capability for other dense prediction tasks in videos. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix b/data/2024/aaai/Semi-supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix new file mode 100644 index 0000000000..85b8d02daa --- /dev/null +++ b/data/2024/aaai/Semi-supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix @@ -0,0 +1 @@ +Class-agnostic motion prediction methods aim to comprehend motion within open-world scenarios, holding significance for autonomous driving systems. However, training a high-performance model in a fully-supervised manner always requires substantial amounts of manually annotated data, which can be both expensive and time-consuming to obtain.
To address this challenge, our study explores the potential of semi-supervised learning (SSL) for class-agnostic motion prediction. Our SSL framework adopts a consistency-based self-training paradigm, enabling the model to learn from unlabeled data by generating pseudo labels through test-time inference. To improve the quality of pseudo labels, we propose a novel motion selection and re-generation module. This module effectively selects reliable pseudo labels and re-generates unreliable ones. Furthermore, we propose two data augmentation strategies: temporal sampling and BEVMix. These strategies facilitate consistency regularization in SSL. Experiments conducted on nuScenes demonstrate that our SSL method can surpass the self-supervised approach by a large margin by utilizing only a tiny fraction of labeled data. Furthermore, our method exhibits comparable performance to weakly supervised and some fully supervised methods. These results highlight the ability of our method to strike a favorable balance between annotation costs and performance. Code will be available at https://github.com/kwwcv/SSMP. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised Learning of Dynamical Systems with Neural Ordinary Differential Equations: A Teacher-Student Model Approach b/data/2024/aaai/Semi-supervised Learning of Dynamical Systems with Neural Ordinary Differential Equations: A Teacher-Student Model Approach new file mode 100644 index 0000000000..b87bb24050 --- /dev/null +++ b/data/2024/aaai/Semi-supervised Learning of Dynamical Systems with Neural Ordinary Differential Equations: A Teacher-Student Model Approach @@ -0,0 +1,3 @@ +Modeling dynamical systems is crucial for a wide range of tasks, but it remains challenging due to complex nonlinear dynamics, limited observations, or lack of prior knowledge. Recently, data-driven approaches such as Neural Ordinary Differential Equations (NODE) have shown promising results by leveraging the expressive power of neural networks to model unknown dynamics. However, these approaches often suffer from limited labeled training data, leading to poor generalization and suboptimal predictions. On the other hand, semi-supervised algorithms can utilize abundant unlabeled data and have demonstrated good performance in classification and regression tasks. +We propose TS-NODE, the first semi-supervised approach to modeling dynamical systems with NODE. TS-NODE explores cheaply generated synthetic pseudo rollouts to broaden exploration in the state space and to tackle the challenges brought by the lack of ground-truth system data under a teacher-student model. TS-NODE employs a unified optimization framework that corrects the teacher model based on the student's feedback while mitigating the potential false system dynamics present in pseudo rollouts. +TS-NODE demonstrates significant performance improvements over a baseline Neural ODE model on multiple dynamical system modeling tasks. \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised Open-World Object Detection b/data/2024/aaai/Semi-supervised Open-World Object Detection new file mode 100644 index 0000000000..88a66405a7 --- /dev/null +++ b/data/2024/aaai/Semi-supervised Open-World Object Detection @@ -0,0 +1 @@ +The conventional open-world object detection (OWOD) problem setting first distinguishes known and unknown classes and then incrementally learns the unknown objects when they are introduced with labels in the subsequent tasks.
However, the current OWOD formulation heavily relies on the external human oracle for knowledge input during the incremental learning stages. Such run-time reliance makes this formulation less realistic in a real-world deployment. To address this, we introduce a more realistic formulation, named semi-supervised open-world detection (SS-OWOD), that reduces the annotation cost by casting the incremental learning stages of OWOD in a semi-supervised manner. We demonstrate that the performance of the state-of-the-art OWOD detector dramatically deteriorates in the proposed SS-OWOD setting. Therefore, we introduce a novel SS-OWOD detector, named SS-OWFormer, that utilizes a feature-alignment scheme to better align the object query representations between the original and augmented images to leverage the large unlabeled and few labeled data. We further introduce a pseudo-labeling scheme for unknown detection that exploits the inherent capability of decoder object queries to capture object-specific information. On the COCO dataset, our SS-OWFormer using only 50% of the labeled data achieves detection performance that is on par with the state-of-the-art (SOTA) OWOD detector using 100% of the labeled data. Further, our SS-OWFormer achieves an absolute gain of 4.8% in unknown recall over the SOTA OWOD detector. Lastly, we demonstrate the effectiveness of our SS-OWOD problem setting and approach for remote sensing object detection, proposing carefully curated splits and baseline performance evaluations. Our experiments on 4 datasets including MS COCO, PASCAL, Objects365 and DOTA demonstrate the effectiveness of our approach. Our source code, models and splits are available at https://github.com/sahalshajim/SS-OWFormer \ No newline at end of file diff --git a/data/2024/aaai/Semi-supervised TEE Segmentation via Interacting with SAM Equipped with Noise-Resilient Prompting b/data/2024/aaai/Semi-supervised TEE Segmentation via Interacting with SAM Equipped with Noise-Resilient Prompting new file mode 100644 index 0000000000..f941426efc --- /dev/null +++ b/data/2024/aaai/Semi-supervised TEE Segmentation via Interacting with SAM Equipped with Noise-Resilient Prompting @@ -0,0 +1 @@ +Semi-supervised learning (SSL) is a powerful tool to address the challenge of insufficient annotated data in medical segmentation problems. However, existing semi-supervised methods mainly rely on internal knowledge for pseudo labeling, which is biased due to the distribution mismatch between the highly imbalanced labeled and unlabeled data. Segmenting the left atrial appendage (LAA) from transesophageal echocardiogram (TEE) images is a typical medical image segmentation task characterized by a scarcity of professional annotations and diverse data distributions, for which existing SSL models cannot achieve satisfactory performance. In this paper, we propose a novel strategy to mitigate the inherent challenge of distribution mismatch in SSL by, for the first time, incorporating a large foundation model (i.e., SAM in our implementation) into an SSL model to improve the quality of pseudo labels. We further propose a new self-reconstruction mechanism to generate both noise-resilient prompts to improve SAM’s generalization capability over TEE images and self-perturbations to stabilize the training process and reduce the impact of noisy labels. We conduct extensive experiments on an in-house TEE dataset; experimental results demonstrate that our method achieves better performance than state-of-the-art SSL models.
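To make the SAM-based pseudo-labeling idea in the abstract above more concrete, here is a minimal sketch of prompting the publicly released Segment Anything model with point prompts so that rough foreground hints become a pseudo mask. It only illustrates generic SAM prompting; the paper's noise-resilient prompting and self-reconstruction mechanisms are not reproduced, the checkpoint path and the hard-coded prompt point are placeholders, and the official ViT-B checkpoint is assumed to have been downloaded.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor  # pip install segment-anything

# Load the (assumed) ViT-B SAM checkpoint and wrap it in a predictor.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def sam_pseudo_label(image_rgb, point_prompts):
    # image_rgb:     (H, W, 3) uint8 image, e.g. a TEE frame converted to RGB.
    # point_prompts: (K, 2) array of (x, y) foreground clicks; in practice these
    #                would come from an SSL model's confident predictions
    #                (hypothetical source), not be hand-picked.
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        point_coords=point_prompts,
        point_labels=np.ones(len(point_prompts), dtype=np.int64),  # 1 = foreground
        multimask_output=True,
    )
    # Keep the candidate mask SAM itself scores highest as the pseudo label.
    return masks[int(np.argmax(scores))]

pseudo_mask = sam_pseudo_label(np.zeros((256, 256, 3), dtype=np.uint8),
                               np.array([[128, 128]]))
print(pseudo_mask.shape)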
\ No newline at end of file diff --git a/data/2024/aaai/SentinelLMs: Encrypted Input Adaptation and Fine-Tuning of Language Models for Private and Secure Inference b/data/2024/aaai/SentinelLMs: Encrypted Input Adaptation and Fine-Tuning of Language Models for Private and Secure Inference new file mode 100644 index 0000000000..c78b6efd24 --- /dev/null +++ b/data/2024/aaai/SentinelLMs: Encrypted Input Adaptation and Fine-Tuning of Language Models for Private and Secure Inference @@ -0,0 +1 @@ +This paper addresses the privacy and security concerns associated with deep neural language models, which serve as crucial components in various modern AI-based applications. These models are often used after being pre-trained and fine-tuned for specific tasks, with deployment on servers accessed through the internet. However, this introduces two fundamental risks: (a) the transmission of user inputs to the server via the network gives rise to interception vulnerabilities, and (b) privacy concerns emerge as organizations that deploy such models store user data with restricted context. To address this, we propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text. The original pre-trained language model first undergoes a quick adaptation (without any further pre-training) with a series of irreversible transformations applied to the tokenizer and token embeddings. This enables the model to perform inference on encrypted inputs while preventing reverse engineering of text from model parameters and intermediate outputs. After adaptation, models are fine-tuned on encrypted versions of existing training datasets. Experimental evaluation employing adapted versions of renowned models (e.g., BERT, RoBERTa) across established benchmark English and multilingual datasets for text classification and sequence labeling shows that encrypted models achieve performance parity with their original counterparts. This serves to safeguard performance, privacy, and security cohesively. \ No newline at end of file diff --git a/data/2024/aaai/Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation b/data/2024/aaai/Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation new file mode 100644 index 0000000000..1231ec9337 --- /dev/null +++ b/data/2024/aaai/Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation @@ -0,0 +1 @@ +Large language models (LLMs) have been widely used in various applications but are known to suffer from issues related to untruthfulness and toxicity. While parameter-efficient modules (PEMs) have demonstrated their effectiveness in equipping models with new skills, leveraging PEMs for deficiency unlearning remains underexplored. In this work, we propose a PEM operation approach, namely Extraction-before-Subtraction (Ext-Sub), to enhance the truthfulness and detoxification of LLMs through the integration of an "expert" PEM and an "anti-expert" PEM. Remarkably, even anti-expert PEMs possess valuable capabilities due to their proficiency in generating fabricated content, which necessitates language modeling and logical narrative competence. Rather than merely negating the parameters, our approach involves extracting and eliminating solely the deficiency capability within the anti-expert PEM while preserving the general capabilities.
To evaluate the effectiveness of our approach in terms of truthfulness and detoxification, we conduct extensive experiments on LLMs, encompassing additional abilities such as language modelling and mathematical reasoning. Our empirical results demonstrate that our approach effectively improves truthfulness and detoxification, while largely preserving the fundamental abilities of LLMs. \ No newline at end of file diff --git a/data/2024/aaai/SeqGPT: An Out-of-the-Box Large Language Model for Open Domain Sequence Understanding b/data/2024/aaai/SeqGPT: An Out-of-the-Box Large Language Model for Open Domain Sequence Understanding new file mode 100644 index 0000000000..a87a340a84 --- /dev/null +++ b/data/2024/aaai/SeqGPT: An Out-of-the-Box Large Language Model for Open Domain Sequence Understanding @@ -0,0 +1 @@ +Large language models (LLMs) have shown impressive abilities for open-domain NLP tasks. However, LLMs are sometimes too footloose for natural language understanding (NLU) tasks which always have restricted output and input format. Their performances on NLU tasks are highly related to prompts or demonstrations and are shown to be poor at performing several representative NLU tasks, such as event extraction and entity typing. To this end, we present SeqGPT, a bilingual (i.e., English and Chinese) open-source autoregressive model specially enhanced for open-domain natural language understanding. We express all NLU tasks with two atomic tasks, which define fixed instructions to restrict the input and output format but still ``open'' for arbitrarily varied label sets. The model is first instruction-tuned with extremely fine-grained labeled data synthesized by ChatGPT and then further fine-tuned by 233 different atomic tasks from 152 datasets across various domains. The experimental results show that SeqGPT has decent classification and extraction ability, and is capable of performing language understanding tasks on unseen domains. We also conduct empirical studies on the scaling of data and model size as well as on the transfer across tasks. Our models are accessible at https://github.com/Alibaba-NLP/SeqGPT. \ No newline at end of file diff --git a/data/2024/aaai/SeqRank: Sequential Ranking of Salient Objects b/data/2024/aaai/SeqRank: Sequential Ranking of Salient Objects new file mode 100644 index 0000000000..99cd736e08 --- /dev/null +++ b/data/2024/aaai/SeqRank: Sequential Ranking of Salient Objects @@ -0,0 +1 @@ +Salient Object Ranking (SOR) is the process of predicting the order of an observer's attention to objects when viewing a complex scene. Existing SOR methods primarily focus on ranking various scene objects simultaneously by exploring their spatial and semantic properties. However, their solutions of simultaneously ranking all salient objects do not align with human viewing behavior, and may result in incorrect attention shift predictions. We observe that humans view a scene through a sequential and continuous process involving a cycle of foveating to objects of interest with our foveal vision while using peripheral vision to prepare for the next fixation location. For instance, when we see a flying kite, our foveal vision captures the kite itself, while our peripheral vision can help us locate the person controlling it such that we can smoothly divert our attention to it next. By repeatedly carrying out this cycle, we can gain a thorough understanding of the entire scene. 
Based on this observation, we propose to model the dynamic interplay between foveal and peripheral vision to predict human attention shifts sequentially. To this end, we propose a novel SOR model, SeqRank, which reproduces foveal vision to extract high-acuity visual features for accurate salient instance segmentation while also modeling peripheral vision to select the object that is likely to grab the viewer’s attention next. By incorporating both types of vision, our model can mimic human viewing behavior better and provide a more faithful ranking among various scene objects. Most notably, our model improves the SA-SOR/MAE scores by +6.1%/-13.0% on IRSR, compared with the state-of-the-art. Extensive experiments show the superior performance of our model on the SOR benchmarks. Code is available at https://github.com/guanhuankang/SeqRank. \ No newline at end of file diff --git a/data/2024/aaai/Sequential Fusion Based Multi-Granularity Consistency for Space-Time Transformer Tracking b/data/2024/aaai/Sequential Fusion Based Multi-Granularity Consistency for Space-Time Transformer Tracking new file mode 100644 index 0000000000..7961f0eb30 --- /dev/null +++ b/data/2024/aaai/Sequential Fusion Based Multi-Granularity Consistency for Space-Time Transformer Tracking @@ -0,0 +1 @@ +Regarded as a template-matching task for a long time, visual object tracking has witnessed significant progress in space-wise exploration. However, since tracking is performed on videos with substantial time-wise information, it is important to simultaneously mine the temporal contexts which have not yet been deeply explored. Previous supervised works mostly consider template reform as the breakthrough point, but they are often limited by additional computational burdens or the quality of chosen templates. To address this issue, we propose a Space-Time Consistent Transformer Tracker (STCFormer), which uses a sequential fusion framework with multi-granularity consistency constraints to learn spatiotemporal context information. We design a sequential fusion framework that recombines template and search images based on tracking results from chronological frames, fusing updated tracking states in training. To further overcome the over-reliance on the fixed template without increasing computational complexity, we design three space-time consistent constraints: Label Consistency Loss (LCL) for label-level consistency, Attention Consistency Loss (ACL) for patch-level ROI consistency, and Semantic Consistency Loss (SCL) for feature-level semantic consistency. Specifically, in ACL and SCL, the label information is used to constrain the attention and feature consistency of the target and the background, respectively, to avoid mutual interference. Extensive experiments have shown that our STCFormer outperforms many of the best-performing trackers on several popular benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Sequential Model-Based Diagnosis by Systematic Search (Abstract Reprint) b/data/2024/aaai/Sequential Model-Based Diagnosis by Systematic Search (Abstract Reprint) new file mode 100644 index 0000000000..590c3c9b50 --- /dev/null +++ b/data/2024/aaai/Sequential Model-Based Diagnosis by Systematic Search (Abstract Reprint) @@ -0,0 +1,9 @@ +Model-based diagnosis aims at identifying the real cause of a system's malfunction based on a formal system model and observations of the system behavior. 
To discriminate between multiple fault hypotheses (diagnoses), sequential diagnosis approaches iteratively pose queries to an oracle to acquire additional knowledge about the diagnosed system. Depending on the system type, queries can capture, e.g., system tests, probes, measurements, or expert questions. + +As the determination of optimal queries is NP-hard, state-of-the-art sequential diagnosis methods rely on a myopic one-step-lookahead analysis which has proven to constitute a particularly favorable trade-off between computational efficiency and diagnostic effectivity. Yet, this solves only a part of the problem, as various sources of complexity, such as the reliance on costly reasoning services and large numbers of or not explicitly given query candidates, remain. + +To deal with such issues, existing approaches often make assumptions about the (i) type of diagnosed system, (ii) formalism to describe the system, (iii) inference engine, (iv) type of query to be of interest, (v) query quality criterion to be adopted, or (vi) diagnosis computation algorithm to be employed. Moreover, they (vii) often cannot deal with large or implicit query spaces or with expressive logics, or (viii) require inputs that cannot always be provided. + +As a remedy, we propose a novel one-step lookahead query computation technique for sequential diagnosis that overcomes the said issues of existing methods. Our approach (1) is based on a solid theory, (2) involves a systematic search for optimal queries, (3) can operate on implicit and huge query spaces, (4) allows for a two-stage optimization of queries (wrt. their number and cost), (5) is designed to reduce expensive logical inferences to a minimum, and (6) is generally applicable. The latter means that it can deal with any type of diagnosis problem as per Reiter's theory, is applicable with any monotonic knowledge representation language, can interact with a multitude of diagnosis engines and logical reasoners, and allows for a quality optimization of queries based on any of the common criteria in the literature. + +We extensively study the performance of the novel technique using a benchmark of real-world diagnosis problems. Our findings are that our approach enables the computation of optimal queries with hardly any delay, independently of the size and complexity of the considered benchmark problem. Moreover, it proves to be highly scalable, and it outperforms the state-of-the-art method in the domain of our benchmarks by orders of magnitude in terms of computation time while always returning a qualitatively as good or better query. \ No newline at end of file diff --git a/data/2024/aaai/Sequential Modeling of Complex Marine Navigation: Case Study on a Passenger Vessel (Student Abstract) b/data/2024/aaai/Sequential Modeling of Complex Marine Navigation: Case Study on a Passenger Vessel (Student Abstract) new file mode 100644 index 0000000000..c69debbf76 --- /dev/null +++ b/data/2024/aaai/Sequential Modeling of Complex Marine Navigation: Case Study on a Passenger Vessel (Student Abstract) @@ -0,0 +1 @@ +The maritime industry's continuous commitment to sustainability has led to a dedicated exploration of methods to reduce vessel fuel consumption. This paper undertakes this challenge through a machine learning approach, leveraging a real-world dataset spanning two years of a passenger vessel in west coast Canada. Our focus centers on the creation of a time series forecasting model given the dynamic and static states, actions, and disturbances. 
This model is designed to predict dynamic states based on the actions provided, subsequently serving as an evaluative tool to assess the proficiency of the vessel's operation under the captain's guidance. Additionally, it lays the foundation for future optimization algorithms, providing valuable feedback on decision-making processes. To facilitate future studies, our code is available at https://github.com/pagand/model_optimze_vessel/tree/AAAI. \ No newline at end of file diff --git a/data/2024/aaai/Set Prediction Guided by Semantic Concepts for Diverse Video Captioning b/data/2024/aaai/Set Prediction Guided by Semantic Concepts for Diverse Video Captioning new file mode 100644 index 0000000000..787f429097 --- /dev/null +++ b/data/2024/aaai/Set Prediction Guided by Semantic Concepts for Diverse Video Captioning @@ -0,0 +1 @@ +Diverse video captioning aims to generate a set of sentences to describe the given video in various aspects. Mainstream methods are trained with independent pairs of a video and a caption from its ground-truth set without exploiting the intra-set relationship, resulting in low diversity of generated captions. Different from them, we formulate diverse captioning into a semantic-concept-guided set prediction (SCG-SP) problem by fitting the predicted caption set to the ground-truth set, where the set-level relationship is fully captured. Specifically, our set prediction consists of two synergistic tasks, i.e., caption generation and an auxiliary task of concept combination prediction providing extra semantic supervision. Each caption in the set is attached to a concept combination indicating the primary semantic content of the caption and facilitating element alignment in set prediction. Furthermore, we apply a diversity regularization term on concepts to encourage the model to generate semantically diverse captions with various concept combinations. These two tasks share multiple semantics-specific encodings as input, which are obtained by iterative interaction between visual features and conceptual queries. The correspondence between the generated captions and specific concept combinations further guarantees the interpretability of our model. Extensive experiments on benchmark datasets show that the proposed SCG-SP achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics. \ No newline at end of file diff --git a/data/2024/aaai/Settling Decentralized Multi-Agent Coordinated Exploration by Novelty Sharing b/data/2024/aaai/Settling Decentralized Multi-Agent Coordinated Exploration by Novelty Sharing new file mode 100644 index 0000000000..1f2f8cae6e --- /dev/null +++ b/data/2024/aaai/Settling Decentralized Multi-Agent Coordinated Exploration by Novelty Sharing @@ -0,0 +1 @@ +Exploration in decentralized cooperative multi-agent reinforcement learning faces two challenges. One is that the novelty of global states is unavailable, while the novelty of local observations is biased. The other is how agents can explore in a coordinated way. To address these challenges, we propose MACE, a simple yet effective multi-agent coordinated exploration method. By communicating only local novelty, agents can take into account other agents' local novelty to approximate the global novelty. Further, we newly introduce weighted mutual information to measure the influence of one agent's action on other agents' accumulated novelty. 
We convert it as an intrinsic reward in hindsight to encourage agents to exert more influence on other agents' exploration and boost coordinated exploration. Empirically, we show that MACE achieves superior performance in three multi-agent environments with sparse rewards. \ No newline at end of file diff --git a/data/2024/aaai/Several Stories about High-Multiplicity EFx Allocation (Student Abstract) b/data/2024/aaai/Several Stories about High-Multiplicity EFx Allocation (Student Abstract) new file mode 100644 index 0000000000..8c8d6ca534 --- /dev/null +++ b/data/2024/aaai/Several Stories about High-Multiplicity EFx Allocation (Student Abstract) @@ -0,0 +1 @@ +Fair division is a topic that has significant social and industrial value. In this work, we study allocations that simultaneously satisfy definitions of fairness and efficiency: EFx and PO. First, we prove that the problem of finding such allocations is NP-hard for two agents. Then, we propose a concept for an ILP-based solving algorithm, the running time of which depends on the number of EFx allocations. We generate input data and analyze algorithm's running time based on the results obtained. \ No newline at end of file diff --git a/data/2024/aaai/Shadow Generation with Decomposed Mask Prediction and Attentive Shadow Filling b/data/2024/aaai/Shadow Generation with Decomposed Mask Prediction and Attentive Shadow Filling new file mode 100644 index 0000000000..98322a6fed --- /dev/null +++ b/data/2024/aaai/Shadow Generation with Decomposed Mask Prediction and Attentive Shadow Filling @@ -0,0 +1 @@ +Image composition refers to inserting a foreground object into a background image to obtain a composite image. In this work, we focus on generating plausible shadows for the inserted foreground object to make the composite image more realistic. To supplement the existing small-scale dataset, we create a large-scale dataset called RdSOBA with rendering techniques. Moreover, we design a two-stage network named DMASNet with decomposed mask prediction and attentive shadow filling. Specifically, in the first stage, we decompose shadow mask prediction into box prediction and shape prediction. In the second stage, we attend to reference background shadow pixels to fill the foreground shadow. Abundant experiments prove that our DMASNet achieves better visual effects and generalizes well to real composite images. \ No newline at end of file diff --git a/data/2024/aaai/Shallow Diffusion for Fast Speech Enhancement (Student Abstract) b/data/2024/aaai/Shallow Diffusion for Fast Speech Enhancement (Student Abstract) new file mode 100644 index 0000000000..61cd9d4737 --- /dev/null +++ b/data/2024/aaai/Shallow Diffusion for Fast Speech Enhancement (Student Abstract) @@ -0,0 +1 @@ +Recently, the field of Speech Enhancement has witnessed the success of diffusion-based generative models. However, these diffusion-based methods used to take multiple iterations to generate high-quality samples, leading to high computational costs and inefficiency. In this paper, we propose SDFEN (Shallow Diffusion for Fast spEech eNhancement), a novel approach for addressing the inefficiency problem while enhancing the quality of generated samples by reducing the iterative steps in the reverse process of diffusion method. Specifically, we introduce the shallow diffusion strategy initiating the reverse process with an adaptive time step to accelerate inference. In addition, a dedicated noisy predictor is further proposed to guide the adaptive selection of time step. 
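The shallow-diffusion strategy above can be pictured with a small sketch: rather than starting the reverse process from pure noise, a coarse enhanced signal is diffused forward to an adaptively chosen intermediate step and denoised from there, so only a fraction of the reverse steps is executed. The DDPM update, the placeholder denoiser, and the fixed t_star below are generic assumptions, not SDFEN's actual components.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoiser(x_t, t):
    return np.zeros_like(x_t)                    # placeholder for the learned noise predictor

def shallow_enhance(x_coarse, t_star):
    """Start the reverse diffusion at t_star from a coarse estimate instead of pure noise."""
    eps = np.random.randn(*x_coarse.shape)
    x = np.sqrt(alpha_bar[t_star]) * x_coarse + np.sqrt(1.0 - alpha_bar[t_star]) * eps
    for t in range(t_star, -1, -1):              # t_star + 1 reverse steps instead of T
        eps_hat = denoiser(x, t)
        x = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * np.random.randn(*x.shape)
    return x

enhanced = shallow_enhance(np.zeros(16000), t_star=40)   # t_star would come from the predictor
```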
Experiment results demonstrate the superiority of the proposed SDFEN in effectiveness and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/ShapeBoost: Boosting Human Shape Estimation with Part-Based Parameterization and Clothing-Preserving Augmentation b/data/2024/aaai/ShapeBoost: Boosting Human Shape Estimation with Part-Based Parameterization and Clothing-Preserving Augmentation new file mode 100644 index 0000000000..e2529afc64 --- /dev/null +++ b/data/2024/aaai/ShapeBoost: Boosting Human Shape Estimation with Part-Based Parameterization and Clothing-Preserving Augmentation @@ -0,0 +1 @@ +Accurate human shape recovery from a monocular RGB image is a challenging task because humans come in different shapes and sizes and wear different clothes. In this paper, we propose ShapeBoost, a new human shape recovery framework that achieves pixel-level alignment even for rare body shapes and high accuracy for people wearing different types of clothes. Unlike previous approaches that rely on the use of PCA-based shape coefficients, we adopt a new human shape parameterization that decomposes the human shape into bone lengths and the mean width of each part slice. This part-based parameterization technique achieves a balance between flexibility and validity using a semi-analytical shape reconstruction algorithm. Based on this new parameterization, a clothing-preserving data augmentation module is proposed to generate realistic images with diverse body shapes and accurate annotations. Experimental results show that our method outperforms other state-of-the-art methods in diverse body shape situations as well as in varied clothing situations. \ No newline at end of file diff --git a/data/2024/aaai/Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor Selection b/data/2024/aaai/Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor Selection new file mode 100644 index 0000000000..f73c532b05 --- /dev/null +++ b/data/2024/aaai/Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor Selection @@ -0,0 +1 @@ +Machine learning techniques, such as deep learning and ensemble methods, are widely used in various domains due to their ability to handle complex real-world tasks. However, their black-box nature has raised multiple concerns about the fairness, trustworthiness, and transparency of computer-assisted decision-making. This has led to the emergence of local post-hoc explainability methods, which offer explanations for individual decisions made by black-box algorithms. Among these methods, Kernel SHAP is widely used due to its model-agnostic nature and its well-founded theoretical framework. Despite these strengths, Kernel SHAP suffers from high instability: different executions of the method with the same inputs can lead to significantly different explanations, which diminishes the relevance of the explanations. The contribution of this paper is two-fold. On the one hand, we show that Kernel SHAP's instability is caused by its stochastic neighbor selection procedure, which we adapt to achieve full stability without compromising explanation fidelity. On the other hand, we show that by restricting the neighbors generation to perturbations of size 1 -- which we call the coalitions of Layer 1 -- we obtain a novel feature-attribution method that is fully stable, computationally efficient, and still meaningful. 
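As a rough picture of what restricting neighbors to Layer-1 coalitions buys, the sketch below scores each feature with a single deterministic size-1 perturbation toward a fixed reference, so repeated runs return identical attributions; the coalition weighting actually used in the paper may differ.

```python
import numpy as np

def layer1_attributions(model, x, reference):
    """Deterministic size-1 perturbations: no sampled neighbors, hence no run-to-run instability."""
    base = model(x[None])[0]
    scores = np.zeros(len(x))
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = reference[i]                 # toggle exactly one feature to its reference value
        scores[i] = base - model(x_pert[None])[0]
    return scores

black_box = lambda X: X.sum(axis=1)              # stand-in for any trained model
phi = layer1_attributions(black_box, np.array([1.0, 2.0, 3.0]), np.zeros(3))   # [1., 2., 3.]
```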
\ No newline at end of file diff --git a/data/2024/aaai/ShareBERT: Embeddings Are Capable of Learning Hidden Layers b/data/2024/aaai/ShareBERT: Embeddings Are Capable of Learning Hidden Layers new file mode 100644 index 0000000000..cc1a7fbed1 --- /dev/null +++ b/data/2024/aaai/ShareBERT: Embeddings Are Capable of Learning Hidden Layers @@ -0,0 +1,4 @@ +The deployment of Pre-trained Language Models in memory-limited devices is hindered by their massive number of parameters, which motivated the interest in developing smaller architectures. +Established works in the model compression literature showcased that small models often present a noticeable performance degradation and need to be paired with transfer learning methods, such as Knowledge Distillation. +In this work, we propose a parameter-sharing method that consists of sharing parameters between embeddings and the hidden layers, enabling the design of near-zero parameter encoders. To demonstrate its effectiveness, we present an architecture design called ShareBERT, which can preserve up to 95.5% +of BERT Base performances, using only 5M parameters (21.9× fewer parameters) without the help of Knowledge Distillation. We demonstrate empirically that our proposal does not negatively affect the model learning capabilities and that it is even beneficial for representation learning. Code will be available at https://github.com/jchenghu/sharebert. \ No newline at end of file diff --git a/data/2024/aaai/Sharpness-Aware Model-Agnostic Long-Tailed Domain Generalization b/data/2024/aaai/Sharpness-Aware Model-Agnostic Long-Tailed Domain Generalization new file mode 100644 index 0000000000..54512c4b57 --- /dev/null +++ b/data/2024/aaai/Sharpness-Aware Model-Agnostic Long-Tailed Domain Generalization @@ -0,0 +1 @@ +Domain Generalization (DG) aims to improve the generalization ability of models trained on a specific group of source domains, enabling them to perform well on new, unseen target domains. Recent studies have shown that methods that converge to smooth optima can enhance the generalization performance of supervised learning tasks such as classification. In this study, we examine the impact of smoothness-enhancing formulations on domain adversarial training, which combines task loss and adversarial loss objectives. Our approach leverages the fact that converging to a smooth minimum with respect to task loss can stabilize the task loss and lead to better performance on unseen domains. Furthermore, we recognize that the distribution of objects in the real world often follows a long-tailed class distribution, resulting in a mismatch between machine learning models and our expectations of their performance on all classes of datasets with long-tailed class distributions. To address this issue, we consider the domain generalization problem from the perspective of the long-tail distribution and propose using the maximum square loss to balance different classes which can improve model generalizability. Our method's effectiveness is demonstrated through comparisons with state-of-the-art methods on various domain generalization datasets. Code: https://github.com/bamboosir920/SAMALTDG. 
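For the class-balancing term mentioned in the long-tailed domain generalization abstract above, a minimal sketch of the maximum square loss on softmax outputs is given below; averaging it per sample and adding it to the task and adversarial losses is an assumed integration, not necessarily the paper's exact objective.

```python
import torch

def max_square_loss(logits):
    """Maximum square loss: -0.5 * mean_n sum_c p_nc^2. Its gradient grows only linearly
    in the predicted probability, so confident head classes dominate the update less than
    with entropy-style confidence objectives."""
    probs = torch.softmax(logits, dim=1)
    return -(probs ** 2).sum(dim=1).mean() / 2.0

balance_term = max_square_loss(torch.randn(8, 10))   # add to the task + adversarial losses
```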
\ No newline at end of file diff --git a/data/2024/aaai/Shrinking Your TimeStep: Towards Low-Latency Neuromorphic Object Recognition with Spiking Neural Networks b/data/2024/aaai/Shrinking Your TimeStep: Towards Low-Latency Neuromorphic Object Recognition with Spiking Neural Networks new file mode 100644 index 0000000000..268f5e0e3a --- /dev/null +++ b/data/2024/aaai/Shrinking Your TimeStep: Towards Low-Latency Neuromorphic Object Recognition with Spiking Neural Networks @@ -0,0 +1 @@ +Neuromorphic object recognition with spiking neural networks (SNNs) is the cornerstone of low-power neuromorphic computing. However, existing SNNs suffer from significant latency, utilizing 10 to 40 timesteps or more, to recognize neuromorphic objects. At low latencies, the performance of existing SNNs is drastically degraded. In this work, we propose the Shrinking SNN (SSNN) to achieve low-latency neuromorphic object recognition without reducing performance. Concretely, we alleviate the temporal redundancy in SNNs by dividing SNNs into multiple stages with progressively shrinking timesteps, which significantly reduces the inference latency. During timestep shrinkage, the temporal transformer smoothly transforms the temporal scale and preserves the information maximally. Moreover, we add multiple early classifiers to the SNN during training to mitigate the mismatch between the surrogate gradient and the true gradient, as well as the gradient vanishing/exploding, thus eliminating the performance degradation at low latency. Extensive experiments on neuromorphic datasets, CIFAR10-DVS, N-Caltech101, and DVS-Gesture have revealed that SSNN is able to improve the baseline accuracy by 6.55% ~ 21.41%. With only 5 average timesteps and without any data augmentation, SSNN is able to achieve an accuracy of 73.63% on CIFAR10-DVS. This work presents a heterogeneous temporal scale SNN and provides valuable insights into the development of high-performance, low-latency SNNs. \ No newline at end of file diff --git a/data/2024/aaai/Shuffled Deep Regression b/data/2024/aaai/Shuffled Deep Regression new file mode 100644 index 0000000000..a54f7aa985 --- /dev/null +++ b/data/2024/aaai/Shuffled Deep Regression @@ -0,0 +1 @@ +Shuffled regression is the problem of learning regression models from shuffled data that consists of a set of input features and a set of target outputs where the correspondence between the input and output is unknown. This study proposes a new deep learning method for shuffled regression called Shuffled Deep Regression (SDR). We derive the sparse and stochastic variant of the Expectation-Maximization algorithm for SDR that iteratively updates discrete latent variables and the parameters of neural networks. The effectiveness of the proposal is confirmed by benchmark data experiments. \ No newline at end of file diff --git a/data/2024/aaai/Signed Graph Neural Ordinary Differential Equation for Modeling Continuous-Time Dynamics b/data/2024/aaai/Signed Graph Neural Ordinary Differential Equation for Modeling Continuous-Time Dynamics new file mode 100644 index 0000000000..68c72ef504 --- /dev/null +++ b/data/2024/aaai/Signed Graph Neural Ordinary Differential Equation for Modeling Continuous-Time Dynamics @@ -0,0 +1 @@ +Modeling continuous-time dynamics constitutes a foundational challenge, and uncovering inter-component correlations within complex systems holds promise for enhancing the efficacy of dynamic modeling. 
The prevailing approach of integrating graph neural networks with ordinary differential equations has demonstrated promising performance. However, they disregard the crucial signed information potential on graphs, impeding their capacity to accurately capture real-world phenomena and leading to subpar outcomes. In response, we introduce a novel approach: a signed graph neural ordinary differential equation, adeptly addressing the limitations of miscapturing signed information. Our proposed solution boasts both flexibility and efficiency. To substantiate its effectiveness, we seamlessly integrate our devised strategies into three preeminent graph-based dynamic modeling frameworks: graph neural ordinary differential equations, graph neural controlled differential equations, and graph recurrent neural networks. Rigorous assessments encompass three intricate dynamic scenarios from physics and biology, as well as scrutiny across four authentic real-world traffic datasets. Remarkably outperforming the trio of baselines, empirical results underscore the substantial performance enhancements facilitated by our proposed approach. Our code can be found at https://github.com/beautyonce/SGODE. \ No newline at end of file diff --git a/data/2024/aaai/Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees (Abstract Reprint) b/data/2024/aaai/Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/SimCS: Simulation for Domain Incremental Online Continual Segmentation b/data/2024/aaai/SimCS: Simulation for Domain Incremental Online Continual Segmentation new file mode 100644 index 0000000000..a45b832a0a --- /dev/null +++ b/data/2024/aaai/SimCS: Simulation for Domain Incremental Online Continual Segmentation @@ -0,0 +1 @@ +Continual Learning is a step towards lifelong intelligence where models continuously learn from recently collected data without forgetting previous knowledge. Existing continual learning approaches mostly focus on image classification in the class-incremental setup with clear task boundaries and unlimited computational budget. This work explores the problem of Online Domain-Incremental Continual Segmentation (ODICS), where the model is continually trained over batches of densely labeled images from different domains, with limited computation and no information about the task boundaries. ODICS arises in many practical applications. In autonomous driving, this may correspond to the realistic scenario of training a segmentation model over time on a sequence of cities. We analyze several existing continual learning methods and show that they perform poorly in this setting despite working well in class-incremental segmentation. We propose SimCS, a parameter-free method complementary to existing ones that uses simulated data to regularize continual learning. Experiments show that SimCS provides consistent improvements when combined with different CL methods. 
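The SimCS abstract above does not spell out how the simulated data enters training, so the sketch below simply mixes a simulated batch into every online update as a regularizer; the loss weighting and the source of simulated frames are assumptions made for illustration only.

```python
import torch

def odics_update(model, optimizer, criterion, real_batch, sim_batch, sim_weight=0.5):
    """One online continual-segmentation step regularized with simulated data."""
    x_real, y_real = real_batch     # current densely labeled frames from the non-stationary stream
    x_sim, y_sim = sim_batch        # frames rendered by a simulator, available in any amount
    loss = criterion(model(x_real), y_real) + sim_weight * criterion(model(x_sim), y_sim)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```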
\ No newline at end of file diff --git a/data/2024/aaai/SimCalib: Graph Neural Network Calibration Based on Similarity between Nodes b/data/2024/aaai/SimCalib: Graph Neural Network Calibration Based on Similarity between Nodes new file mode 100644 index 0000000000..d38b396d1d --- /dev/null +++ b/data/2024/aaai/SimCalib: Graph Neural Network Calibration Based on Similarity between Nodes @@ -0,0 +1 @@ +Graph neural networks (GNNs) have exhibited impressive performance in modeling graph data as exemplified in various applications. Recently, the GNN calibration problem has attracted increasing attention, especially in cost-sensitive scenarios. Previous work has gained empirical insights on the issue, and devised effective approaches for it, but theoretical supports still fall short. In this work, we shed light on the relationship between GNN calibration and nodewise similarity via theoretical analysis. A novel calibration framework, named SimCalib, is accordingly proposed to consider similarity between nodes at global and local levels. At the global level, the Mahalanobis distance between the current node and class prototypes is integrated to implicitly consider similarity between the current node and all nodes in the same class. At the local level, the similarity of node representation movement dynamics, quantified by nodewise homophily and relative degree, is considered. Informed about the application of nodewise movement patterns in analyzing nodewise behavior on the over-smoothing problem, we empirically present a possible relationship between over-smoothing and GNN calibration problem. Experimentally, we discover a correlation between nodewise similarity and model calibration improvement, in alignment with our theoretical results. Additionally, we conduct extensive experiments investigating different design factors and demonstrate the effectiveness of our proposed SimCalib framework for GNN calibration by achieving state-of-the-art performance on 14 out of 16 benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection b/data/2024/aaai/SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection new file mode 100644 index 0000000000..1008c912ee --- /dev/null +++ b/data/2024/aaai/SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection @@ -0,0 +1 @@ +Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging and may lead to inferior performance. Although distilling precise 3D geometry knowledge from LiDAR data could help tackle this challenge, the benefits of LiDAR information could be greatly hindered by the significant modality gap between different sensory modalities. To address this issue, we propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy. Specifically, we devise multi-modal architectures for both teacher and student models, including a LiDAR-camera fusion-based teacher and a simulated fusion-based student. Owing to the ``identical'' architecture design, the student can mimic the teacher to generate multi-modal features with merely multi-view images as input, where a geometry compensation module is introduced to bridge the modality gap. 
Furthermore, we propose a comprehensive multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal fusion distillation simultaneously in the Bird's-eye-view space. Incorporating them together, our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment. Extensive experiments validate the effectiveness and superiority of SimDistill over state-of-the-art methods, achieving an improvement of 4.8% mAP and 4.1% NDS over the baseline detector. The source code will be released at https://github.com/ViTAE-Transformer/SimDistill. \ No newline at end of file diff --git a/data/2024/aaai/SimFair: Physics-Guided Fairness-Aware Learning with Simulation Models b/data/2024/aaai/SimFair: Physics-Guided Fairness-Aware Learning with Simulation Models new file mode 100644 index 0000000000..c5f950eb2c --- /dev/null +++ b/data/2024/aaai/SimFair: Physics-Guided Fairness-Aware Learning with Simulation Models @@ -0,0 +1 @@ +Fairness-awareness has emerged as an essential building block for the responsible use of artificial intelligence in real applications. In many cases, inequity in performance is due to the change in distribution over different regions. While techniques have been developed to improve the transferability of fairness, a solution to the problem is not always feasible with no samples from the new regions, which is a bottleneck for pure data-driven attempts. Fortunately, physics-based mechanistic models have been studied for many problems with major social impacts. We propose SimFair, a physics-guided fairness-aware learning framework, which bridges the data limitation by integrating physical-rule-based simulation and inverse modeling into the training design. Using temperature prediction as an example, we demonstrate the effectiveness of the proposed SimFair in fairness preservation. \ No newline at end of file diff --git a/data/2024/aaai/SimPSI: A Simple Strategy to Preserve Spectral Information in Time Series Data Augmentation b/data/2024/aaai/SimPSI: A Simple Strategy to Preserve Spectral Information in Time Series Data Augmentation new file mode 100644 index 0000000000..6a76254f91 --- /dev/null +++ b/data/2024/aaai/SimPSI: A Simple Strategy to Preserve Spectral Information in Time Series Data Augmentation @@ -0,0 +1 @@ +Data augmentation is a crucial component in training neural networks to overcome the limitation imposed by data size, and several techniques have been studied for time series. Although these techniques are effective in certain tasks, they have yet to be generalized to time series benchmarks. We find that current data augmentation techniques ruin the core information contained within the frequency domain. To address this issue, we propose a simple strategy to preserve spectral information (SimPSI) in time series data augmentation. SimPSI preserves the spectral information by mixing the original and augmented input spectrum weighted by a preservation map, which indicates the importance score of each frequency. Specifically, our experimental contributions are to build three distinct preservation maps: magnitude spectrum, saliency map, and spectrum-preservative map. We apply SimPSI to various time series data augmentations and evaluate its effectiveness across a wide range of time series benchmarks. Our experimental results support that SimPSI considerably enhances the performance of time series data augmentations by preserving core spectral information. 
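A minimal sketch of the mixing rule described above, using the magnitude spectrum as the preservation map (one of the three maps mentioned); the max-normalization of the map is an illustrative choice.

```python
import numpy as np

def simpsi_mix(x, x_aug):
    """Blend original and augmented spectra so that important frequencies survive augmentation."""
    X, X_aug = np.fft.rfft(x), np.fft.rfft(x_aug)
    m = np.abs(X) / (np.abs(X).max() + 1e-8)     # preservation map in [0, 1] from the magnitude spectrum
    X_mixed = m * X + (1.0 - m) * X_aug          # keep original content where the map is high
    return np.fft.irfft(X_mixed, n=len(x))

t = np.linspace(0.0, 1.0, 256, endpoint=False)
x = np.sin(2 * np.pi * 5 * t)                    # dominant 5 Hz component should be preserved
x_aug = x + 0.3 * np.random.randn(256)           # stand-in for an arbitrary augmentation
x_mixed = simpsi_mix(x, x_aug)
```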
The source code used in the paper is available at https://github.com/Hyun-Ryu/simpsi. \ No newline at end of file diff --git a/data/2024/aaai/Simple Image-Level Classification Improves Open-Vocabulary Object Detection b/data/2024/aaai/Simple Image-Level Classification Improves Open-Vocabulary Object Detection new file mode 100644 index 0000000000..2583cbf39a --- /dev/null +++ b/data/2024/aaai/Simple Image-Level Classification Improves Open-Vocabulary Object Detection @@ -0,0 +1 @@ +Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained. Recent OVOD methods focus on adapting the image-level pre-trained vision-language models (VLMs), such as CLIP, to a region-level object detection task via, eg., region-level knowledge distillation, regional prompt learning, or region-text pre-training, to expand the detection vocabulary. These methods have demonstrated remarkable performance in recognizing regional visual concepts, but they are weak in exploiting the VLMs' powerful global scene understanding ability learned from the billion-scale image-level text descriptions. This limits their capability in detecting hard objects of small, blurred, or occluded appearance from novel/base categories, whose detection heavily relies on contextual information. To address this, we propose a novel approach, namely Simple Image-level Classification for Context-Aware Detection Scoring (SIC-CADS), to leverage the superior global knowledge yielded from CLIP for complementing the current OVOD models from a global perspective. The core of SIC-CADS is a multi-modal multi-label recognition (MLR) module that learns the object co-occurrence-based contextual information from CLIP to recognize all possible object categories in the scene. These image-level MLR scores can then be utilized to refine the instance-level detection scores of the current OVOD models in detecting those hard objects. This is verified by extensive empirical results on two popular benchmarks, OV-LVIS and OV-COCO, which show that SIC-CADS achieves significant and consistent improvement when combined with different types of OVOD models. Further, SIC-CADS also improves the cross-dataset generalization ability on Objects365 and OpenImages. Code is available at https://github.com/mala-lab/SIC-CADS. \ No newline at end of file diff --git a/data/2024/aaai/Simple Orthogonal Graph Representation Learning (Student Abstract) b/data/2024/aaai/Simple Orthogonal Graph Representation Learning (Student Abstract) new file mode 100644 index 0000000000..9c5e0f4cc3 --- /dev/null +++ b/data/2024/aaai/Simple Orthogonal Graph Representation Learning (Student Abstract) @@ -0,0 +1 @@ +Graph neural networks (GNNs) have attracted significant interest recently since they can effectively process and analyze graph-structured data commonly found in real-world applications. However, the predicament that GNNs are difficult to train becomes worse as the layers increase. The essence of this problem is that stacking layers will reduce the stability of forward propagation and gradient back-propagation. And as the increasing scale of models (measured by the number of parameters), how to efficiently and effectively adapt it to particular downstream tasks becomes an intriguing research issue. 
In this work, motivated by the effect of orthogonality constraints, we propose a simple orthogonal training framework to impose the orthogonality constraints on GNNs, which can help models find a solution vector in a specific low dimensional subspace and stabilize the signaling processes at both the forward and backward directions. Specifically, we propose a novel polar decomposition-based orthogonal initialization (PDOI-R) algorithm, which can identify the low intrinsic dimension within the Stiefel Manifold and stabilize the training process. Extensive experiments demonstrate the effectiveness of the proposed method in multiple downstream tasks, showcasing its generality. The simple method can help existing state-of-the-art models achieve better performance. \ No newline at end of file diff --git a/data/2024/aaai/Simple Weak Coresets for Non-decomposable Classification Measures b/data/2024/aaai/Simple Weak Coresets for Non-decomposable Classification Measures new file mode 100644 index 0000000000..9e11ba87c2 --- /dev/null +++ b/data/2024/aaai/Simple Weak Coresets for Non-decomposable Classification Measures @@ -0,0 +1 @@ +While coresets have been growing in terms of their application, barring few exceptions, they have mostly been limited to unsupervised settings. We consider supervised classification problems, and non-decomposable evaluation measures in such settings. We show that stratified uniform sampling based coresets have excellent empirical performance that are backed by theoretical guarantees too. We focus on the F1 score and Matthews Correlation Coefficient, two widely used non-decomposable objective functions that are nontrivial to optimize for and show that uniform coresets attain a lower bound for coreset size, and have good empirical performance, comparable with ``smarter'' coreset construction strategies. \ No newline at end of file diff --git a/data/2024/aaai/Simplicity Bias in Overparameterized Machine Learning b/data/2024/aaai/Simplicity Bias in Overparameterized Machine Learning new file mode 100644 index 0000000000..1f4f73557c --- /dev/null +++ b/data/2024/aaai/Simplicity Bias in Overparameterized Machine Learning @@ -0,0 +1 @@ +A thorough theoretical understanding of the surprising generalization ability of deep networks (and other overparameterized models) is still lacking. Here we demonstrate that simplicity bias is a major phenomenon to be reckoned with in overparameterized machine learning. In addition to explaining the outcome of simplicity bias, we also study its source: following concrete rigorous examples, we argue that (i) simplicity bias can explain generalization in overparameterized learning models such as neural networks; (ii) simplicity bias and excellent generalization are optimizer-independent, as our example shows, and although the optimizer affects training, it is not the driving force behind simplicity bias; (iii) simplicity bias in pre-training models, and subsequent posteriors, is universal and stems from the subtle fact that uniformly-at-random constructed priors are not uniformly-at-random sampled ; and (iv) in neural network models, the biasing mechanism in wide (and shallow) networks is different from the biasing mechanism in deep (and narrow) networks. 
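For the polar-decomposition-based orthogonal initialization (PDOI-R) proposed in the orthogonal graph learning abstract above, a minimal sketch is to keep only the orthogonal polar factor of a random weight matrix; the restart mechanism and the identification of a low intrinsic dimension are beyond this sketch.

```python
import numpy as np

def polar_orthogonal_init(shape, seed=0):
    """Initialize weights with the orthogonal factor of their polar decomposition.
    For W = U S Vt (SVD), U @ Vt is the closest matrix with orthonormal columns to W,
    which keeps signal norms stable in both the forward and backward passes."""
    W = np.random.default_rng(seed).standard_normal(shape)
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

W0 = polar_orthogonal_init((64, 32))
assert np.allclose(W0.T @ W0, np.eye(32), atol=1e-6)   # columns are orthonormal
```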
\ No newline at end of file diff --git a/data/2024/aaai/Simplifying Complex Observation Models in Continuous POMDP Planning with Probabilistic Guarantees and Practice b/data/2024/aaai/Simplifying Complex Observation Models in Continuous POMDP Planning with Probabilistic Guarantees and Practice new file mode 100644 index 0000000000..f3549b03ce --- /dev/null +++ b/data/2024/aaai/Simplifying Complex Observation Models in Continuous POMDP Planning with Probabilistic Guarantees and Practice @@ -0,0 +1 @@ +Solving partially observable Markov decision processes (POMDPs) with high dimensional and continuous observations, such as camera images, is required for many real life robotics and planning problems. Recent researches suggested machine learned probabilistic models as observation models, but their use is currently too computationally expensive for online deployment. We deal with the question of what would be the implication of using simplified observation models for planning, while retaining formal guarantees on the quality of the solution. Our main contribution is a novel probabilistic bound based on a statistical total variation distance of the simplified model. We show that it bounds the theoretical POMDP value w.r.t. original model, from the empirical planned value with the simplified model, by generalizing recent results of particle-belief MDP concentration bounds. Our calculations can be separated into offline and online parts, and we arrive at formal guarantees without having to access the costly model at all during planning, which is also a novel result. Finally, we demonstrate in simulation how to integrate the bound into the routine of an existing continuous online POMDP solver. \ No newline at end of file diff --git a/data/2024/aaai/Simultaneous Optimization of Bid Shading and Internal Auction for Demand-Side Platforms b/data/2024/aaai/Simultaneous Optimization of Bid Shading and Internal Auction for Demand-Side Platforms new file mode 100644 index 0000000000..956b7eb0c9 --- /dev/null +++ b/data/2024/aaai/Simultaneous Optimization of Bid Shading and Internal Auction for Demand-Side Platforms @@ -0,0 +1 @@ +Online advertising has been one of the most important sources for industry's growth, where the demand-side platforms (DSP) play an important role via bidding to the ad exchanges on behalf of their advertiser clients. Since more and more ad exchanges have shifted from second to first price auctions, it is challenging for DSPs to adjust bidding strategy in the volatile environment. Recent studies on bid shading in first-price auctions may have limited performance due to relatively strong hypotheses about winning probability distribution. Moreover, these studies do not consider the incentive of advertiser clients, which can be crucial for a reliable advertising platform. In this work, we consider both the optimization of bid shading technique and the design of internal auction which is ex-post incentive compatible (IC) for the management of a DSP. Firstly, we prove that the joint design of bid shading and ex-post IC auction can be reduced to choosing one monotone bid function for each advertiser without loss of optimality. Then we propose a parameterized neural network to implement the monotone bid functions. With well-designed surrogate loss, the objective can be optimized in an end-to-end manner. Finally, our experimental results demonstrate the effectiveness and superiority of our algorithm. 
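The bid-shading abstract above reduces the joint design to one monotone bid function per advertiser, implemented by a parameterized neural network. One common way to hard-wire monotonicity is to constrain the network weights to be non-negative, as in the sketch below; this particular construction is an assumption and not necessarily the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneBidNet(nn.Module):
    """Maps a non-negative advertiser value to a shaded bid that is nondecreasing in the value."""
    def __init__(self, hidden=32):
        super().__init__()
        self.w1 = nn.Parameter(0.1 * torch.randn(hidden, 1))
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(0.1 * torch.randn(1, hidden))
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, value):
        h = torch.tanh(F.linear(value, F.softplus(self.w1), self.b1))      # non-negative weights
        ratio = torch.sigmoid(F.linear(h, F.softplus(self.w2), self.b2))   # shading ratio in (0, 1)
        return ratio * value                # the shaded first-price bid never exceeds the value

bids = MonotoneBidNet()(torch.tensor([[1.0], [2.0], [5.0]]))
```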
\ No newline at end of file diff --git a/data/2024/aaai/Situation-Dependent Causal Influence-Based Cooperative Multi-Agent Reinforcement Learning b/data/2024/aaai/Situation-Dependent Causal Influence-Based Cooperative Multi-Agent Reinforcement Learning new file mode 100644 index 0000000000..55b122eb70 --- /dev/null +++ b/data/2024/aaai/Situation-Dependent Causal Influence-Based Cooperative Multi-Agent Reinforcement Learning @@ -0,0 +1 @@ +Learning to collaborate has witnessed significant progress in multi-agent reinforcement learning (MARL). However, promoting coordination among agents and enhancing exploration capabilities remain challenges. In multi-agent environments, interactions between agents are limited in specific situations. Effective collaboration between agents thus requires a nuanced understanding of when and how agents' actions influence others. To this end, in this paper, we propose a novel MARL algorithm named Situation-Dependent Causal Influence-Based Cooperative Multi-agent Reinforcement Learning (SCIC), which incorporates a novel intrinsic reward mechanism based on a new cooperation criterion measured by situation-dependent causal influence among agents. Our approach aims to detect inter-agent causal influences in specific situations based on the criterion using causal intervention and conditional mutual information. This effectively assists agents in exploring states that can positively impact other agents, thus promoting cooperation between agents. The resulting update links coordinated exploration and intrinsic reward distribution, which enhances overall collaboration and performance. Experimental results on various MARL benchmarks demonstrate the superiority of our method compared to state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/aaai/SkeletonGait: Gait Recognition Using Skeleton Maps b/data/2024/aaai/SkeletonGait: Gait Recognition Using Skeleton Maps new file mode 100644 index 0000000000..798571184e --- /dev/null +++ b/data/2024/aaai/SkeletonGait: Gait Recognition Using Skeleton Maps @@ -0,0 +1 @@ +The choice of representation is essential for deep gait recognition methods. The binary silhouettes and skeletal coordinates are two dominant representations in recent literature, achieving remarkable advances in many scenarios. However, inherent challenges remain: silhouettes are not always guaranteed in unconstrained scenes, and structural cues from skeletons have not been fully utilized. In this paper, we introduce a novel skeletal gait representation named skeleton map, together with SkeletonGait, a skeleton-based method to exploit structural information from human skeleton maps. Specifically, the skeleton map represents the coordinates of human joints as a heatmap with Gaussian approximation, exhibiting a silhouette-like image devoid of exact body structure. Beyond achieving state-of-the-art performances over five popular gait datasets, more importantly, SkeletonGait uncovers novel insights about how important structural features are in describing gait and when they play a role. Furthermore, we propose a multi-branch architecture, named SkeletonGait++, to make use of complementary features from both skeletons and silhouettes. Experiments indicate that SkeletonGait++ outperforms existing state-of-the-art methods by a significant margin in various scenarios. For instance, it achieves an impressive rank-1 accuracy of over 85% on the challenging GREW dataset. The source code is available at https://github.com/ShiqiYu/OpenGait.
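A minimal sketch of the skeleton-map construction described above: each 2D joint is rendered as an isotropic Gaussian and the per-pixel maximum is kept, producing a silhouette-like image without exact body structure; the resolution, sigma, and max-composition are illustrative choices.

```python
import numpy as np

def skeleton_map(joints, size=64, sigma=2.0):
    """Render 2D joint coordinates (already scaled to pixel space) as a Gaussian heatmap."""
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float32)
    heat = np.zeros((size, size), dtype=np.float32)
    for x, y in joints:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)               # keep the strongest joint response per pixel
    return heat

frame = skeleton_map([(32, 10), (30, 25), (34, 25), (32, 40)])   # a few toy joints
```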
\ No newline at end of file diff --git a/data/2024/aaai/Sketched Newton Value Iteration for Large-Scale Markov Decision Processes b/data/2024/aaai/Sketched Newton Value Iteration for Large-Scale Markov Decision Processes new file mode 100644 index 0000000000..30b6e43e48 --- /dev/null +++ b/data/2024/aaai/Sketched Newton Value Iteration for Large-Scale Markov Decision Processes @@ -0,0 +1 @@ +Value Iteration (VI) is one of the most classic algorithms for solving Markov Decision Processes (MDPs), which lays the foundations for various more advanced reinforcement learning algorithms, such as Q-learning. VI may take a large number of iterations to converge as it is a first-order method. In this paper, we introduce the Newton Value Iteration (NVI) algorithm, which eliminates the impact of action space dimension compared to some previous second-order methods. Consequently, NVI can efficiently handle MDPs with large action spaces. Building upon NVI, we propose a novel approach called Sketched Newton Value Iteration (SNVI) to tackle MDPs with both large state and action spaces. SNVI not only inherits the stability and fast convergence advantages of second-order algorithms, but also significantly reduces computational complexity, making it highly scalable. Extensive experiments demonstrate the superiority of our algorithms over traditional VI and previously proposed second-order VI algorithms. \ No newline at end of file diff --git a/data/2024/aaai/SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract) b/data/2024/aaai/SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract) new file mode 100644 index 0000000000..5ed906abc5 --- /dev/null +++ b/data/2024/aaai/SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract) @@ -0,0 +1 @@ +When humans are posed with a difficult problem, they often approach it by identifying key skills, honing them, and finally effectively combining them. We propose a novel method and apply it for the VizWiz VQA task to predict the visual skills needed to answer a question, and leverage expert modules to produce intermediary outputs and fuse them in a skill-aware manner. Unlike prior works in visual question-answering (VQA) that use intermediate outputs such as detected objects and Optical Character Recognition (OCR), our approach explicitly guides the model with a skill embedding on what to focus on. While our results show that using skill-aware fusion outperforms skill-unaware models for only a subset of questions, we believe our results provide interesting directions for future work. We also release our code, model, and illustrative demonstrations for future research purposes. \ No newline at end of file diff --git a/data/2024/aaai/Skip-GANomaly++: Skip Connections and Residual Blocks for Anomaly Detection (Student Abstract) b/data/2024/aaai/Skip-GANomaly++: Skip Connections and Residual Blocks for Anomaly Detection (Student Abstract) new file mode 100644 index 0000000000..069fca2648 --- /dev/null +++ b/data/2024/aaai/Skip-GANomaly++: Skip Connections and Residual Blocks for Anomaly Detection (Student Abstract) @@ -0,0 +1 @@ +Anomaly detection is a critical task across various domains. Fundamentally, anomaly detection models offer methods to identify unusual patterns that do not align with expected behaviors. Notably, in the medical field, detecting anomalies in medical imagery or biometrics can facilitate early diagnosis of diseases. 
Consequently, we propose the Skip-GANomaly++ model, an enhanced and more efficient version of the conventional anomaly detection models. The proposed model's performance was evaluated through comparative experiments. Experimental results demonstrated superior performance across most classes compared to the previous models. \ No newline at end of file diff --git a/data/2024/aaai/SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution b/data/2024/aaai/SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution new file mode 100644 index 0000000000..5d6d3798e3 --- /dev/null +++ b/data/2024/aaai/SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution @@ -0,0 +1,2 @@ +It is well-known that image quality assessment usually meets with the problem of perception-distortion (p-d) tradeoff. The existing deep image super-resolution (SR) methods either focus on high fidelity with pixel-level objectives or high perception with generative models. The emergence of diffusion model paves a fresh way for image restoration, which has the potential to offer a brand-new solution for p-d trade-off. We experimentally observed that the perceptual quality and distortion change in an opposite direction with the increase of sampling steps. In light of this property, we propose an adaptive skip diffusion model (SkipDiff), which aims to achieve +high-fidelity perceptual image SR with fewer sampling steps. Specifically, it decouples the sampling procedure into coarse skip approximation and fine skip refinement stages. A coarse-grained skip diffusion is first performed as a high-fidelity prior to obtaining a latent approximation of the full diffusion. Then, a fine-grained skip diffusion is followed to further refine the latent sample for promoting perception, where the fine time steps are adaptively learned by deep reinforcement learning. Meanwhile, this approach also enables faster sampling of diffusion model through skipping the intermediate denoising process to shorten the effective steps of the computation. Extensive experimental results show that our SkipDiff achieves superior perceptual quality with plausible reconstruction accuracy and a faster sampling speed. \ No newline at end of file diff --git a/data/2024/aaai/SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing b/data/2024/aaai/SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing new file mode 100644 index 0000000000..30f0b597ee --- /dev/null +++ b/data/2024/aaai/SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing @@ -0,0 +1,2 @@ +Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs). A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images. Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale. 
In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags. +With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets. It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval. We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis. \ No newline at end of file diff --git a/data/2024/aaai/Sleep-Like Unsupervised Replay Improves Performance When Data Are Limited or Unbalanced (Student Abstract) b/data/2024/aaai/Sleep-Like Unsupervised Replay Improves Performance When Data Are Limited or Unbalanced (Student Abstract) new file mode 100644 index 0000000000..8fafe053a8 --- /dev/null +++ b/data/2024/aaai/Sleep-Like Unsupervised Replay Improves Performance When Data Are Limited or Unbalanced (Student Abstract) @@ -0,0 +1 @@ +The performance of artificial neural networks (ANNs) degrades when training data are limited or imbalanced. In contrast, the human brain can learn quickly from just a few examples. Here, we investigated the role of sleep in improving the performance of ANNs trained with limited data on the MNIST and Fashion MNIST datasets. Sleep was implemented as an unsupervised phase with local Hebbian-type learning rules. We found a significant boost in accuracy after the sleep phase for models trained with limited data in the range of 0.5-10% of the total MNIST or Fashion MNIST datasets. When more than 10% of the total data was used, sleep alone had a slight negative impact on performance, but this was remedied by fine-tuning on the original data. This study sheds light on a potential synaptic weight dynamics strategy employed by the brain during sleep to enhance memory performance when training data are limited or imbalanced. \ No newline at end of file diff --git a/data/2024/aaai/SlowTrack: Increasing the Latency of Camera-Based Perception in Autonomous Driving Using Adversarial Examples b/data/2024/aaai/SlowTrack: Increasing the Latency of Camera-Based Perception in Autonomous Driving Using Adversarial Examples new file mode 100644 index 0000000000..6fdcb4d6aa --- /dev/null +++ b/data/2024/aaai/SlowTrack: Increasing the Latency of Camera-Based Perception in Autonomous Driving Using Adversarial Examples @@ -0,0 +1 @@ +In Autonomous Driving (AD), real-time perception is a critical component responsible for detecting surrounding objects to ensure safe driving. While researchers have extensively explored the integrity of AD perception due to its safety and security implications, the aspect of availability (real-time performance) or latency has received limited attention. Existing works on latency-based attacks have focused mainly on object detection, i.e., a component in camera-based AD perception, overlooking the entire camera-based AD perception pipeline, which hinders them from achieving effective system-level effects, such as vehicle crashes.
In this paper, we propose SlowTrack, a novel framework for generating adversarial attacks to increase the execution time of camera-based AD perception. We propose a novel two-stage attack strategy along with the three new loss function designs. Our evaluation is conducted on four popular camera-based AD perception pipelines, and the results demonstrate that SlowTrack significantly outperforms existing latency-based attacks while maintaining comparable imperceptibility levels. Furthermore, we perform the evaluation on Baidu Apollo, an industry-grade full-stack AD system, and LGSVL, a production-grade AD simulator, with two scenarios to compare the system-level effects of SlowTrack and existing attacks. Our evaluation results show that the system-level effects can be significantly improved, i.e., the vehicle crash rate of SlowTrack is around 95% on average while existing works only have around 30%. \ No newline at end of file diff --git a/data/2024/aaai/Small Language Model Can Self-Correct b/data/2024/aaai/Small Language Model Can Self-Correct new file mode 100644 index 0000000000..b5f9af1c6c --- /dev/null +++ b/data/2024/aaai/Small Language Model Can Self-Correct @@ -0,0 +1 @@ +Generative Language Models (LMs) such as ChatGPT have exhibited remarkable performance across various downstream tasks. Nevertheless, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. Previous studies have devised sophisticated pipelines and prompts to induce large LMs to exhibit the capability for self-correction. However, large LMs are explicitly prompted to verify and modify their answers separately rather than completing all steps spontaneously like humans. Moreover, these complex prompts are extremely challenging for small LMs to follow. In this paper, we introduce the Intrinsic Self-Correction (ISC) in generative language models, aiming to correct the initial output of LMs in a self-triggered manner, even for those small LMs with 6 billion parameters. Specifically, we devise a pipeline for constructing self-correction data and propose Partial Answer Masking (PAM), aiming to endow the model with the capability for intrinsic self-correction through fine-tuning. We conduct experiments using LMs with parameters sizes ranging from 6 billion to 13 billion in two tasks, including commonsense reasoning and factual knowledge reasoning. Our experiments demonstrate that the outputs generated using ISC outperform those generated without self-correction. We believe that the output quality of even small LMs can be further improved by empowering them with the ability to intrinsic self-correct. \ No newline at end of file diff --git a/data/2024/aaai/Social Physics Informed Diffusion Model for Crowd Simulation b/data/2024/aaai/Social Physics Informed Diffusion Model for Crowd Simulation new file mode 100644 index 0000000000..3b169951c5 --- /dev/null +++ b/data/2024/aaai/Social Physics Informed Diffusion Model for Crowd Simulation @@ -0,0 +1 @@ +Crowd simulation holds crucial applications in various domains, such as urban planning, architectural design, and traffic arrangement. In recent years, physics-informed machine learning methods have achieved state-of-the-art performance in crowd simulation but fail to model the heterogeneity and multi-modality of human movement comprehensively. In this paper, we propose a social physics-informed diffusion model named SPDiff to mitigate the above gap. 
SPDiff takes both the interactive and historical information of crowds in the current timeframe to reverse the diffusion process, thereby generating the distribution of pedestrian movement in the subsequent timeframe. Inspired by the well-known social physics model, i.e., Social Force, regarding crowd dynamics, we design a crowd interaction encoder to guide the denoising process and further enhance this module with the equivariant properties of crowd interactions. To mitigate error accumulation in long-term simulations, we propose a multi-frame rollout training algorithm for diffusion modeling. Experiments conducted on two real-world datasets demonstrate the superior performance of SPDiff in terms of both macroscopic and microscopic evaluation metrics. Code and appendix are available at https://github.com/tsinghua-fib-lab/SPDiff. \ No newline at end of file diff --git a/data/2024/aaai/Social-Aware Group Display Configuration in VR Conference b/data/2024/aaai/Social-Aware Group Display Configuration in VR Conference new file mode 100644 index 0000000000..d817c1d837 --- /dev/null +++ b/data/2024/aaai/Social-Aware Group Display Configuration in VR Conference @@ -0,0 +1 @@ +Virtual Reality (VR) has emerged due to advancements in hardware and computer graphics. During the pandemic, conferences and exhibitions leveraging VR have gained attention. However, large-scale VR conferences, face a significant problem not yet studied in the literature -- displaying too many irrelevant users on the screen which may negatively impact the user experience. To address this issue, we formulate a new research problem, Social-Aware VR Conference Group Display Configuration (SVGD). Accordingly, we design the Social Utility-Aware VR Conference Group Formation (SVC) algorithm, which is a 2-approximation algorithm to SVGD. SVC iteratively selects either the P-Configuration or S-Configuration based on their effective ratios. This ensures that in each iteration, SVC identifies and chooses the solution with the highest current effectiveness. Experiments on real metaverse datasets show that the proposed SVC outperforms 11 baselines by 75% in terms of solution quality. \ No newline at end of file diff --git a/data/2024/aaai/SocialCVAE: Predicting Pedestrian Trajectory via Interaction Conditioned Latents b/data/2024/aaai/SocialCVAE: Predicting Pedestrian Trajectory via Interaction Conditioned Latents new file mode 100644 index 0000000000..d537002f1e --- /dev/null +++ b/data/2024/aaai/SocialCVAE: Predicting Pedestrian Trajectory via Interaction Conditioned Latents @@ -0,0 +1 @@ +Pedestrian trajectory prediction is the key technology in many applications for providing insights into human behavior and anticipating human future motions. Most existing empirical models are explicitly formulated by observed human behaviors using explicable mathematical terms with deterministic nature, while recent work has focused on developing hybrid models combined with learning-based techniques for powerful expressiveness while maintaining explainability. However, the deterministic nature of the learned steering behaviors from the empirical models limits the models' practical performance. To address this issue, this work proposes the social conditional variational autoencoder (SocialCVAE) for predicting pedestrian trajectories, which employs a CVAE to explore behavioral uncertainty in human motion decisions. 
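Both SPDiff above, which is explicitly inspired by the Social Force model, and the energy-based interaction map used by SocialCVAE below build on the intuition that nearby pedestrians repel each other. A minimal sketch of the classic Social Force repulsion term follows; the constants are illustrative, and neither paper's learned interaction module reduces to this formula.

```python
import numpy as np

def repulsion_force(p_i, p_j, A=2.0, B=0.5, radius=0.6):
    """Helbing-style pedestrian repulsion: magnitude decays exponentially with the gap."""
    diff = np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)
    dist = np.linalg.norm(diff) + 1e-8
    n_ij = diff / dist                           # unit vector pushing pedestrian i away from j
    return A * np.exp((radius - dist) / B) * n_ij

f = repulsion_force([0.0, 0.0], [0.4, 0.3])      # strong push, since the gap is below `radius`
```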
SocialCVAE learns socially reasonable motion randomness by utilizing a socially explainable interaction energy map as the CVAE's condition, which illustrates the future occupancy of each pedestrian's local neighborhood area. The energy map is generated using an energy-based interaction model, which anticipates the energy cost (i.e., repulsion intensity) of pedestrians' interactions with neighbors. Experimental results on two public benchmarks including 25 scenes demonstrate that SocialCVAE significantly improves prediction accuracy compared with the state-of-the-art methods, with up to 16.85% improvement in Average Displacement Error (ADE) and 69.18% improvement in Final Displacement Error (FDE). Code is available at: https://github.com/ViviXiang/SocialCVAE. \ No newline at end of file diff --git a/data/2024/aaai/SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models b/data/2024/aaai/SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models new file mode 100644 index 0000000000..6538de64ad --- /dev/null +++ b/data/2024/aaai/SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models @@ -0,0 +1,3 @@ +Current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. In this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. Taking inspiration from social science research, we start with a documented list of 93 US-centric stigmas and curate a question-answering (QA) dataset which involves simple social situations. Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. We present results for SocialStigmaQA with two open source generative language models and we find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. We demonstrate that the deliberate design of the templates in our benchmark (e.g., adding biasing text to the prompt or using different verbs that change the answer that indicates bias) impacts the model tendencies to generate socially biased output. Additionally, through manual evaluation, we discover problematic patterns in the generated chain-of-thought output that range from subtle bias to lack of reasoning. + +Warning: This paper contains examples of text which are toxic, biased, and potentially harmful. \ No newline at end of file diff --git a/data/2024/aaai/SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger b/data/2024/aaai/SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger new file mode 100644 index 0000000000..21aeebe63b --- /dev/null +++ b/data/2024/aaai/SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger @@ -0,0 +1 @@ +During the preceding biennium, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, where the pairs are entirely exclusive of each other, remains a challenging task, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target, which is generated from the fine-grained intra-modal self-similarity. 
The intra-modal guidance is indicative, enabling two pairs to have some local similarities and allowing the model to capture many-to-many relationships between the two modalities. Besides, since the positive pair still dominates the softened target distribution, we disentangle the negatives in the distribution to further boost relation alignment with the negatives in cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on the ImageNet zero-shot classification task, using CC3M/CC12M as the pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline. \ No newline at end of file diff --git a/data/2024/aaai/Solar Power Generation Forecasting via Multimodal Feature Fusion (Student Abstract) b/data/2024/aaai/Solar Power Generation Forecasting via Multimodal Feature Fusion (Student Abstract) new file mode 100644 index 0000000000..d7196e807c --- /dev/null +++ b/data/2024/aaai/Solar Power Generation Forecasting via Multimodal Feature Fusion (Student Abstract) @@ -0,0 +1,2 @@ +Solar power generation has recently been in the spotlight as global warming continues to worsen. However, two significant problems may hinder solar power generation, considering that solar panels are installed outside. The first is soiling, which accumulates on solar panels, and the second is a decrease in sunlight owing to bad weather. +In this paper, we demonstrate that solar power generation forecasting accuracy can increase when considering soiling and sunlight information. We first introduce a dataset containing images of clean and soiled solar panels, sky images, and weather information. For accurate solar power generation forecasting, we propose a new multimodal model that aggregates various features related to weather, soiling, and sunlight. The experimental results demonstrate the high accuracy of our proposed multimodal model. \ No newline at end of file diff --git a/data/2024/aaai/Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization b/data/2024/aaai/Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization new file mode 100644 index 0000000000..fc0a6fd1be --- /dev/null +++ b/data/2024/aaai/Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization @@ -0,0 +1,2 @@ +In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. +In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an alpha-radius from a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDPs and applying policy visitation frequency regularization. We introduce a policy-gradient method and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.
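The reward-robust MDP abstract above connects non-rectangular reward uncertainty with policy visitation frequency regularization. A sketch of the kind of identity this alludes to, assuming a norm ball of radius alpha around a nominal reward r_0 and writing d_pi for the discounted state-action visitation frequencies of policy pi, is

\[
\min_{\|r - r_0\| \le \alpha} \langle d_\pi, r \rangle \;=\; \langle d_\pi, r_0 \rangle \;-\; \alpha \,\| d_\pi \|_{*},
\]

so maximizing the worst-case return over the reward ball amounts to maximizing the nominal return penalized by the dual norm of the visitation frequencies. The exact norm and formulation used in the paper may differ; this is only the standard duality step behind such connections.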
\ No newline at end of file diff --git a/data/2024/aaai/Solving Satisfiability Modulo Counting for Symbolic and Statistical AI Integration with Provable Guarantees b/data/2024/aaai/Solving Satisfiability Modulo Counting for Symbolic and Statistical AI Integration with Provable Guarantees new file mode 100644 index 0000000000..87ee00774b --- /dev/null +++ b/data/2024/aaai/Solving Satisfiability Modulo Counting for Symbolic and Statistical AI Integration with Provable Guarantees @@ -0,0 +1 @@ +Satisfiability Modulo Counting (SMC) encompasses problems that require both symbolic decision-making and statistical reasoning. Its general formulation captures many real-world problems at the intersection of symbolic and statistical AI. SMC searches for policy interventions to control probabilistic outcomes. Solving SMC is challenging because of its highly intractable nature (NP^PP-complete), incorporating statistical inference and symbolic reasoning. Previous research on SMC solving lacks provable guarantees and/or suffers from suboptimal empirical performance, especially when combinatorial constraints are present. We propose XOR-SMC, a polynomial algorithm with access to NP-oracles, to solve highly intractable SMC problems with constant approximation guarantees. XOR-SMC transforms the highly intractable SMC into satisfiability problems by replacing the model counting in SMC with SAT formulae subject to randomized XOR constraints. Experiments on solving important SMC problems in AI for social good demonstrate that XOR-SMC outperforms several baselines both in solution quality and running time. \ No newline at end of file diff --git a/data/2024/aaai/Solving Spectrum Unmixing as a Multi-Task Bayesian Inverse Problem with Latent Factors for Endmember Variability b/data/2024/aaai/Solving Spectrum Unmixing as a Multi-Task Bayesian Inverse Problem with Latent Factors for Endmember Variability new file mode 100644 index 0000000000..944111724b --- /dev/null +++ b/data/2024/aaai/Solving Spectrum Unmixing as a Multi-Task Bayesian Inverse Problem with Latent Factors for Endmember Variability @@ -0,0 +1,5 @@ +With the increasing customization of spectrometers, spectral unmixing has become a widely used technique in fields such as remote sensing, textiles, and environmental protection. +However, endmember variability is a common issue for unmixing, where changes in lighting, atmospheric, temporal conditions, or the intrinsic spectral characteristics of materials, can all result in variations in the measured spectrum. +Recent studies have employed deep neural networks to tackle endmember variability. However, these approaches rely on generic networks to implicitly resolve the issue, which struggles with the ill-posed nature and lack of effective convergence constraints for endmember variability. This paper proposes a streamlined multi-task learning model to rectify this problem, incorporating abundance regression and multi-label classification with Unmixing as a Bayesian Inverse Problem, denoted as BIPU. +To address the issue of the ill-posed nature, the uncertainty of unmixing is quantified and minimized through the Laplace approximation in a Bayesian inverse solver. In addition, to improve convergence under the influence of endmember variability, the paper introduces two types of constraints. 
The first separates background factors of variants from the initial factors for each endmember, while the second identifies and eliminates the influence of non-existent endmembers via multi-label classification during convergence. +The effectiveness of this model is demonstrated not only on a self-collected near-infrared spectral textile dataset (FENIR), but also on three commonly used remote sensing hyperspectral image datasets, where it achieves state-of-the-art unmixing performance and exhibits strong generalization capabilities. \ No newline at end of file diff --git a/data/2024/aaai/Some Like It Small: Czech Semantic Embedding Models for Industry Applications b/data/2024/aaai/Some Like It Small: Czech Semantic Embedding Models for Industry Applications new file mode 100644 index 0000000000..61d0978cbf --- /dev/null +++ b/data/2024/aaai/Some Like It Small: Czech Semantic Embedding Models for Industry Applications @@ -0,0 +1 @@ +This article focuses on the development and evaluation of Small-sized Czech sentence embedding models. Small models are important components for real-time industry applications in resource-constrained environments. Given the limited availability of labeled Czech data, alternative approaches, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, are investigated. Comprehensive intrinsic and extrinsic analyses are conducted, showcasing the competitive performance of our models compared to significantly larger counterparts, with approximately 8 times smaller size and 5 times faster speed than conventional Base-sized models. To promote cooperation and reproducibility, both the models and the evaluation pipeline are made publicly accessible. Ultimately, this article presents practical applications of the developed sentence embedding models in Seznam.cz, the Czech search engine. These models have effectively replaced previous counterparts, enhancing the overall search experience, for instance in organic search, featured snippets, and image search. This transition has yielded improved performance. \ No newline at end of file diff --git a/data/2024/aaai/SpFormer: Spatio-Temporal Modeling for Scanpaths with Transformer b/data/2024/aaai/SpFormer: Spatio-Temporal Modeling for Scanpaths with Transformer new file mode 100644 index 0000000000..7acadf84a0 --- /dev/null +++ b/data/2024/aaai/SpFormer: Spatio-Temporal Modeling for Scanpaths with Transformer @@ -0,0 +1 @@ +Saccadic scanpath, a data representation of human visual behavior, has received broad interest in multiple domains. Scanpath is a complex eye-tracking data modality that includes the sequences of fixation positions and fixation duration, coupled with image information. However, previous methods usually face the spatial misalignment problem of fixation features and loss of critical temporal data (including temporal correlation and fixation duration). In this study, we propose a Transformer-based scanpath model, SpFormer, to alleviate these problems. First, we propose a fixation-centric paradigm to extract the aligned spatial fixation features and tokenize the scanpaths. Then, according to the visual working memory mechanism, we design a local meta attention to reduce the semantic redundancy of fixations and guide the model to focus on the meta scanpath. Finally, we progressively integrate the duration information and fuse it with the fixation features to resolve location ambiguity as the number of Transformer blocks increases.
We conduct extensive experiments on four databases under three tasks. The SpFormer establishes new state-of-the-art results in distinct settings, verifying its flexibility and versatility in practical applications. The code can be obtained from https://github.com/wenqizhong/SpFormer. \ No newline at end of file diff --git a/data/2024/aaai/SpaceGTN: A Time-Agnostic Graph Transformer Network for Handwritten Diagram Recognition and Segmentation b/data/2024/aaai/SpaceGTN: A Time-Agnostic Graph Transformer Network for Handwritten Diagram Recognition and Segmentation new file mode 100644 index 0000000000..334d7818fa --- /dev/null +++ b/data/2024/aaai/SpaceGTN: A Time-Agnostic Graph Transformer Network for Handwritten Diagram Recognition and Segmentation @@ -0,0 +1 @@ +Online handwriting recognition is pivotal in domains like note-taking, education, healthcare, and office tasks. Existing diagram recognition algorithms mainly rely on the temporal information of strokes, resulting in a decline in recognition performance when dealing with notes that have been modified or have no temporal information. The current datasets are drawn based on templates and cannot reflect the real free-drawing situation. To address these challenges, we present SpaceGTN, a time-agnostic Graph Transformer Network, leveraging spatial integration and removing the need for temporal data. Extensive experiments on multiple datasets have demonstrated that our method consistently outperforms existing methods and achieves state-of-the-art performance. We also propose a pipeline that seamlessly connects offline and online handwritten diagrams. By integrating a stroke restoration technique with SpaceGTN, it enables intelligent editing of previously uneditable offline diagrams at the stroke level. In addition, we have also launched the first online handwritten diagram dataset, OHSD, which is collected using a free-drawing method and comes with modification annotations. \ No newline at end of file diff --git a/data/2024/aaai/Span Graph Transformer for Document-Level Named Entity Recognition b/data/2024/aaai/Span Graph Transformer for Document-Level Named Entity Recognition new file mode 100644 index 0000000000..3f6cbd67be --- /dev/null +++ b/data/2024/aaai/Span Graph Transformer for Document-Level Named Entity Recognition @@ -0,0 +1 @@ +Named Entity Recognition (NER), which aims to identify the span and category of entities within text, is a fundamental task in natural language processing. Recent NER approaches have featured pre-trained transformer-based models (e.g., BERT) as a crucial encoding component to achieve state-of-the-art performance. However, due to the length limit for input text, these models typically consider text at the sentence-level and cannot capture the long-range contextual dependency within a document. To address this issue, we propose a novel Span Graph Transformer (SGT) method for document-level NER, which constructs long-range contextual dependencies at both the token and span levels. Specifically, we first retrieve relevant contextual sentences in the document for each target sentence, and jointly encode them by BERT to capture token-level dependencies. Then, our proposed model extracts candidate spans from each sentence and integrates these spans into a document-level span graph, where nested spans within sentences and identical spans across sentences are connected. 
By leveraging the power of Graph Transformer and well-designed position encoding, our span graph can fully exploit span-level dependencies within the document. Extensive experiments on both resource-rich nested and flat NER datasets, as well as low-resource distantly supervised NER datasets, demonstrate that the proposed SGT model achieves better performance than previous state-of-the-art models. \ No newline at end of file diff --git a/data/2024/aaai/Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales b/data/2024/aaai/Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales new file mode 100644 index 0000000000..c28e5a1b4e --- /dev/null +++ b/data/2024/aaai/Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales @@ -0,0 +1 @@ +With the alarming rise of hate speech in online communities, the demand for effective NLP models to identify instances of offensive language has reached a critical point. However, the development of such models heavily relies on the availability of annotated datasets, which are scarce, particularly for less-studied languages. To bridge this gap for the Persian language, we present a novel dataset specifically tailored to multi-label hate speech detection. Our dataset, called Phate, consists of an extensive collection of over seven thousand manually-annotated Persian tweets, offering a rich resource for training and evaluating hate speech detection models in this language. Notably, each annotation in our dataset specifies the targeted group of hate speech and includes a span of the tweet which elucidates the rationale behind the assigned label. The incorporation of this information expands the potential applications of our dataset, facilitating the detection of targeted online harm and allowing the benchmark to support research on the interpretability of hate speech detection models. The dataset, annotation guidelines, and all associated code are accessible at https://github.com/Zahra-D/Phate. \ No newline at end of file diff --git a/data/2024/aaai/Sparse Bayesian Deep Learning for Cross Domain Medical Image Reconstruction b/data/2024/aaai/Sparse Bayesian Deep Learning for Cross Domain Medical Image Reconstruction new file mode 100644 index 0000000000..de7380fbb1 --- /dev/null +++ b/data/2024/aaai/Sparse Bayesian Deep Learning for Cross Domain Medical Image Reconstruction @@ -0,0 +1 @@ +Cross domain medical image reconstruction aims to address the issue that deep learning models trained solely on one source dataset might not generalize effectively to unseen target datasets from different hospitals. Some recent methods achieve satisfactory reconstruction performance, but often at the expense of extensive parameters and time consumption. To strike a balance between cross domain image reconstruction quality and model computational efficiency, we propose a lightweight sparse Bayesian deep learning method. Notably, we apply a fixed-form variational Bayes (FFVB) approach to quantify pixel-wise uncertainty priors derived from the degradation distribution of the source domain. Furthermore, by integrating the uncertainty prior into the posterior sampled through stochastic gradient Langevin dynamics (SGLD), we develop a training strategy that dynamically generates and optimizes the prior distribution on the network weights for each unseen domain.
This strategy enhances generalizability and ensures robust reconstruction performance. When evaluated on medical image reconstruction tasks, our proposed approach demonstrates impressive performance across various previously unseen domains. \ No newline at end of file diff --git a/data/2024/aaai/Sparse Enhanced Network: An Adversarial Generation Method for Robust Augmentation in Sequential Recommendation b/data/2024/aaai/Sparse Enhanced Network: An Adversarial Generation Method for Robust Augmentation in Sequential Recommendation new file mode 100644 index 0000000000..43bd23f462 --- /dev/null +++ b/data/2024/aaai/Sparse Enhanced Network: An Adversarial Generation Method for Robust Augmentation in Sequential Recommendation @@ -0,0 +1,3 @@ +Sequential Recommendation plays a significant role in daily recommendation systems, such as e-commerce platforms like Amazon and Taobao. However, even with the advent of large models, these platforms often face sparsity issues in the historical browsing records of individual users due to new users joining or the introduction of new products. As a result, existing sequence recommendation algorithms may not perform well. To address this, sequence-based data augmentation methods have garnered attention. + +Existing sequence enhancement methods typically rely on augmenting existing data, employing techniques like cropping, masking prediction, random reordering, and random replacement of the original sequence. While these methods have shown improvements, they often overlook the exploration of the deep embedding space of the sequence. To tackle these challenges, we propose a Sparse Enhanced Network (SparseEnNet), which is a robust adversarial generation method. SparseEnNet aims to fully explore the hidden space in sequence recommendation, generating more robust enhanced items. Additionally, we adopt an adversarial generation method, allowing the model to differentiate between data augmentation categories and achieve better prediction performance for the next item in the sequence. Experiments have demonstrated that our method achieves a remarkable 4-14% improvement over existing methods when evaluated on real-world datasets. (https://github.com/junyachen/SparseEnNet) \ No newline at end of file diff --git a/data/2024/aaai/Sparse Variational Student-t Processes b/data/2024/aaai/Sparse Variational Student-t Processes new file mode 100644 index 0000000000..ff36a49975 --- /dev/null +++ b/data/2024/aaai/Sparse Variational Student-t Processes @@ -0,0 +1 @@ +The theory of Bayesian learning incorporates the use of Student-t Processes to model heavy-tailed distributions and datasets with outliers. However, despite Student-t Processes having a computational complexity similar to that of Gaussian Processes, there has been limited emphasis on the sparse representation of this model. This is mainly due to the increased difficulty in modeling and computation compared to previous sparse Gaussian Processes. Our motivation is to address the need for a sparse representation framework that reduces computational complexity, allowing Student-t Processes to be more flexible for real-world datasets. To achieve this, we leverage the conditional distribution of Student-t Processes to introduce sparse inducing points. Bayesian methods and variational inference are then utilized to derive a well-defined lower bound, facilitating more efficient optimization of our model through stochastic gradient descent.
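The sparse Bayesian reconstruction abstract above samples network weights with stochastic gradient Langevin dynamics (SGLD). A minimal, generic SGLD update on a toy quadratic energy is sketched below; the energy function and step size are illustrative stand-ins for the network loss plus prior used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_energy(theta):
    # toy energy U(theta) = 0.5 * ||theta||^2, i.e. a standard Gaussian target
    return theta

def sgld_step(theta, step=1e-2):
    """One SGLD update: half a gradient step on U plus Gaussian noise of variance `step`."""
    noise = rng.normal(size=theta.shape) * np.sqrt(step)
    return theta - 0.5 * step * grad_energy(theta) + noise

theta = np.ones(4)
for _ in range(1000):
    theta = sgld_step(theta)
print(theta)  # iterates drift toward, and then sample around, the high-probability region
```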
We propose two methods for computing the variational lower bound, one utilizing Monte Carlo sampling and the other employing Jensen's inequality to compute the KL regularization term in the loss function. We propose adopting these approaches as viable alternatives to Gaussian processes when the data might contain outliers or exhibit heavy-tailed behavior, and we provide specific recommendations for their applicability. We evaluate the two proposed approaches on various synthetic and real-world datasets from UCI and Kaggle, demonstrating their effectiveness compared to baseline methods in terms of computational complexity and accuracy, as well as their robustness to outliers. \ No newline at end of file diff --git a/data/2024/aaai/Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views b/data/2024/aaai/Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views new file mode 100644 index 0000000000..2280b189b1 --- /dev/null +++ b/data/2024/aaai/Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views @@ -0,0 +1 @@ +Reconstructing 3D objects from extremely sparse views is a long-standing and challenging problem. While recent techniques employ image diffusion models for generating plausible images at novel viewpoints or for distilling pre-trained diffusion priors into 3D representations using score distillation sampling (SDS), these methods often struggle to simultaneously achieve high-quality, consistent, and detailed results for both novel-view synthesis (NVS) and geometry. In this work, we present Sparse3D, a novel 3D reconstruction method tailored for sparse view inputs. Our approach distills robust priors from a multiview-consistent diffusion model to refine a neural radiance field. Specifically, we employ a controller that harnesses epipolar features from input views, guiding a pre-trained diffusion model, such as Stable Diffusion, to produce novel-view images that maintain 3D consistency with the input. By tapping into 2D priors from powerful image diffusion models, our integrated model consistently delivers high-quality results, even when faced with open-world objects. To address the blurriness introduced by conventional SDS, we introduce the category-score distillation sampling (C-SDS) to enhance detail. We conduct experiments on CO3DV2 which is a multi-view dataset of real-world objects. Both quantitative and qualitative evaluations demonstrate that our approach outperforms previous state-of-the-art works on the metrics regarding NVS and geometry reconstruction. \ No newline at end of file diff --git a/data/2024/aaai/SparseGNV: Generating Novel Views of Indoor Scenes with Sparse RGB-D Images b/data/2024/aaai/SparseGNV: Generating Novel Views of Indoor Scenes with Sparse RGB-D Images new file mode 100644 index 0000000000..7b2beac5e8 --- /dev/null +++ b/data/2024/aaai/SparseGNV: Generating Novel Views of Indoor Scenes with Sparse RGB-D Images @@ -0,0 +1 @@ +We study to generate novel views of indoor scenes given sparse input views. The challenge is to achieve both photorealism and view consistency. We present SparseGNV: a learning framework that incorporates 3D structures and image generative models to generate novel views with three modules. The first module builds a neural point cloud as underlying geometry, providing scene context and guidance for the target novel view. 
The second module utilizes a transformer-based network to map the scene context and the guidance into a shared latent space and autoregressively decodes the target view in the form of discrete image tokens. The third module reconstructs the tokens back to the image of the target view. SparseGNV is trained across a large-scale indoor scene dataset to learn generalizable priors. Once trained, it can efficiently generate novel views of an unseen indoor scene in a feed-forward manner. We evaluate SparseGNV on real-world indoor scenes and demonstrate that it outperforms state-of-the-art methods based on either neural radiance fields or conditional image generation. \ No newline at end of file diff --git a/data/2024/aaai/Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention b/data/2024/aaai/Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention new file mode 100644 index 0000000000..335e22f17f --- /dev/null +++ b/data/2024/aaai/Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention @@ -0,0 +1 @@ +Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. However, the enigmatic ``black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. While past approaches, such as attention visualization, pivotal subnetwork extraction, and concept-based analyses, offer some insight, they often focus on either local or global explanations within a single dimension, occasionally falling short in providing comprehensive clarity. In response, we propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs. Our framework, termed SparseCBM, innovatively integrates sparsity to elucidate three intertwined layers of interpretation: input, subnetwork, and concept levels. In addition, the newly introduced dimension of interpretable inference-time intervention facilitates dynamic adjustments to the model during deployment. Through rigorous empirical evaluations on real-world datasets, we demonstrate that SparseCBM delivers a profound understanding of LLM behaviors, setting it apart in both interpreting and ameliorating model inaccuracies. Codes are provided in supplements. \ No newline at end of file diff --git a/data/2024/aaai/Spatial Transform Decoupling for Oriented Object Detection b/data/2024/aaai/Spatial Transform Decoupling for Oriented Object Detection new file mode 100644 index 0000000000..cf13181da2 --- /dev/null +++ b/data/2024/aaai/Spatial Transform Decoupling for Oriented Object Detection @@ -0,0 +1 @@ +Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, and this limitation may be inherently attributed to the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD utilizes separate network branches to predict the position, size, and angle of bounding boxes, effectively harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. 
Moreover, by aggregating cascaded activation masks (CAMs) computed upon the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on the benchmark datasets including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), which demonstrates the effectiveness of the proposed method. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling. \ No newline at end of file diff --git a/data/2024/aaai/Spatial Voting with Incomplete Voter Information b/data/2024/aaai/Spatial Voting with Incomplete Voter Information new file mode 100644 index 0000000000..ac2b56fc96 --- /dev/null +++ b/data/2024/aaai/Spatial Voting with Incomplete Voter Information @@ -0,0 +1,2 @@ +We consider spatial voting where candidates are located in the Euclidean d-dimensional space, and each voter ranks candidates based on their distance from the voter's ideal point. We explore the case where information about the location of voters' ideal points is incomplete: for each dimension, we are given an interval of possible values. We study the computational complexity of finding the possible and necessary winners for positional scoring rules. Our results show that we retain tractable cases of the classic model where voters have partial-order preferences. Moreover, we show that there are positional scoring rules under which the possible-winner problem is intractable for partial orders, but tractable in the one-dimensional spatial setting. +We also consider approval voting in this setting. We show that for up to two dimensions, the necessary-winner problem is tractable, while the possible-winner problem is hard for any number of dimensions. \ No newline at end of file diff --git a/data/2024/aaai/Spatial-Contextual Discrepancy Information Compensation for GAN Inversion b/data/2024/aaai/Spatial-Contextual Discrepancy Information Compensation for GAN Inversion new file mode 100644 index 0000000000..117dd7b04d --- /dev/null +++ b/data/2024/aaai/Spatial-Contextual Discrepancy Information Compensation for GAN Inversion @@ -0,0 +1 @@ +Most existing GAN inversion methods either achieve accurate reconstruction but lack editability or offer strong editability at the cost of fidelity. Hence, how to balance the distortion-editability trade-off is a significant challenge for GAN inversion. To address this challenge, we introduce a novel spatial-contextual discrepancy information compensation-based GAN-inversion method (SDIC), which consists of a discrepancy information prediction network (DIPN) and a discrepancy information compensation network (DICN). SDIC follows a ``compensate-and-edit'' paradigm and successfully bridges the gap in image details between the original image and the reconstructed/edited image. On the one hand, DIPN encodes the multi-level spatial-contextual information of the original and initial reconstructed images and then predicts a spatial-contextual guided discrepancy map with two hourglass modules. In this way, a reliable discrepancy map that models the contextual relationship and captures fine-grained image details is learned. On the other hand, DICN incorporates the predicted discrepancy information into both the latent code and the GAN generator with different transformations, generating high-quality reconstructed/edited images. This effectively compensates for the loss of image details during GAN inversion. 
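The spatial voting abstract above asks whether a candidate can possibly win when each voter's ideal point is only known to lie in an interval. The sketch below is a naive, discretized check for one-dimensional plurality (a positional scoring rule): it enumerates a grid of ideal-point placements inside every voter's interval and reports whether some combination makes the target candidate a winner. This brute force only illustrates the problem statement, not the paper's algorithms, and a finite grid can miss witnesses.

```python
from itertools import product
import numpy as np

def possible_winner_naive(cands, intervals, target, grid=5):
    """Brute-force possible-winner check for 1D plurality over a discretized grid."""
    cands = np.asarray(cands, dtype=float)
    choices = [np.linspace(lo, hi, grid) for lo, hi in intervals]
    for ideal_pts in product(*choices):                  # one scenario of ideal points
        votes = np.zeros(len(cands), dtype=int)
        for x in ideal_pts:
            votes[np.argmin(np.abs(cands - x))] += 1     # each voter picks the closest candidate
        if votes[target] == votes.max():                 # target (co-)wins this scenario
            return True
    return False

# candidates at 0, 1, 2; two voters with uncertain ideal points
print(possible_winner_naive([0.0, 1.0, 2.0], [(0.4, 1.6), (1.2, 2.0)], target=0))
```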
Both quantitative and qualitative experiments demonstrate that our proposed method achieves the excellent distortion-editability trade-off at a fast inference speed for both image inversion and editing tasks. Our code is available at https://github.com/ZzqLKED/SDIC. \ No newline at end of file diff --git a/data/2024/aaai/Spatial-Logic-Aware Weakly Supervised Learning for Flood Mapping on Earth Imagery b/data/2024/aaai/Spatial-Logic-Aware Weakly Supervised Learning for Flood Mapping on Earth Imagery new file mode 100644 index 0000000000..fbf8e2f0a7 --- /dev/null +++ b/data/2024/aaai/Spatial-Logic-Aware Weakly Supervised Learning for Flood Mapping on Earth Imagery @@ -0,0 +1 @@ +Flood mapping on Earth imagery is crucial for disaster management, but its efficacy is hampered by the lack of high-quality training labels. Given high-resolution Earth imagery with coarse and noisy training labels, a base deep neural network model, and a spatial knowledge base with label constraints, our problem is to infer the true high-resolution labels while training neural network parameters. Traditional methods are largely based on specific physical properties and thus fall short of capturing the rich domain constraints expressed by symbolic logic. Neural-symbolic models can capture rich domain knowledge, but existing methods do not address the unique spatial challenges inherent in flood mapping on high-resolution imagery. To fill this gap, we propose a spatial-logic-aware weakly supervised learning framework. Our framework integrates symbolic spatial logic inference into probabilistic learning in a weakly supervised setting. To reduce the time costs of logic inference on vast high-resolution pixels, we propose a multi-resolution spatial reasoning algorithm to infer true labels while training neural network parameters. Evaluations of real-world flood datasets show that our model outperforms several baselines in prediction accuracy. The code is available at https://github.com/spatialdatasciencegroup/SLWSL. \ No newline at end of file diff --git a/data/2024/aaai/Spatial-Temporal Augmentation for Crime Prediction (Student Abstract) b/data/2024/aaai/Spatial-Temporal Augmentation for Crime Prediction (Student Abstract) new file mode 100644 index 0000000000..9c27c6dd55 --- /dev/null +++ b/data/2024/aaai/Spatial-Temporal Augmentation for Crime Prediction (Student Abstract) @@ -0,0 +1 @@ +Crime prediction stands as a pivotal concern within the realm of urban management due to its potential threats to public safety. While prior research has predominantly focused on unraveling the intricate dependencies among urban regions and temporal dynamics, the challenges posed by the scarcity and uncertainty of historical crime data have not been thoroughly investigated. This study introduces an innovative spatial-temporal augmented learning framework for crime prediction, namely STAug. In STAug, we devise a CrimeMix to improve the ability of generalization. Furthermore, we harness a spatial-temporal aggregation to capture and incorporate multiple correlations covering the temporal, spatial, and crime-type aspects. Experiments on two real-world datasets underscore the superiority of STAug over several baselines. 
\ No newline at end of file diff --git a/data/2024/aaai/Spatial-Temporal Interplay in Human Mobility: A Hierarchical Reinforcement Learning Approach with Hypergraph Representation b/data/2024/aaai/Spatial-Temporal Interplay in Human Mobility: A Hierarchical Reinforcement Learning Approach with Hypergraph Representation new file mode 100644 index 0000000000..a3e0373985 --- /dev/null +++ b/data/2024/aaai/Spatial-Temporal Interplay in Human Mobility: A Hierarchical Reinforcement Learning Approach with Hypergraph Representation @@ -0,0 +1 @@ +In the realm of human mobility, the decision-making process for selecting the next-visit location is intricately influenced by a trade-off between spatial and temporal constraints, which are reflective of individual needs and preferences. This trade-off, however, varies across individuals, making the modeling of these spatial-temporal dynamics a formidable challenge. To address the problem, in this work, we introduce the "Spatial-temporal Induced Hierarchical Reinforcement Learning" (STI-HRL) framework, for capturing the interplay between spatial and temporal factors in human mobility decision-making. Specifically, STI-HRL employs a two-tiered decision-making process: the low-level focuses on disentangling spatial and temporal preferences using dedicated agents, while the high-level integrates these considerations to finalize the decision. To complement the hierarchical decision setting, we construct a hypergraph to organize historical data, encapsulating the multi-aspect semantics of human mobility. We propose a cross-channel hypergraph embedding module to learn the representations as the states to facilitate the decision-making cycle. Our extensive experiments on two real-world datasets validate the superiority of STI-HRL over state-of-the-art methods in predicting users' next visits across various performance metrics. \ No newline at end of file diff --git a/data/2024/aaai/Spatio-Temporal Fusion for Human Action Recognition via Joint Trajectory Graph b/data/2024/aaai/Spatio-Temporal Fusion for Human Action Recognition via Joint Trajectory Graph new file mode 100644 index 0000000000..d9cba4f9c9 --- /dev/null +++ b/data/2024/aaai/Spatio-Temporal Fusion for Human Action Recognition via Joint Trajectory Graph @@ -0,0 +1 @@ +Graph Convolutional Networks (GCNs) and Transformers have been widely applied to skeleton-based human action recognition, with each offering unique advantages in capturing spatial relationships and long-range dependencies. However, for most GCN methods, the construction of topological structures relies solely on the spatial information of human joints, limiting their ability to directly capture richer spatio-temporal dependencies. Additionally, the self-attention modules of many Transformer methods lack topological structure information, restricting the robustness and generalization of the models. To address these issues, we propose a Joint Trajectory Graph (JTG) that integrates spatio-temporal information into a uniform graph structure. We also present a Joint Trajectory GraphFormer (JT-GraphFormer), which directly captures the spatio-temporal relationships among all joint trajectories for human action recognition. To better integrate topological information into spatio-temporal relationships, we introduce a Spatio-Temporal Dijkstra Attention (STDA) mechanism to calculate relationship scores for all the joints in JTG. 
Furthermore, we incorporate the Koopman operator into the classification stage to enhance the model's representation ability and classification performance. Experiments demonstrate that JT-GraphFormer achieves outstanding performance in human action recognition tasks, outperforming state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and N-UCLA datasets. \ No newline at end of file diff --git a/data/2024/aaai/Spatio-Temporal Pivotal Graph Neural Networks for Traffic Flow Forecasting b/data/2024/aaai/Spatio-Temporal Pivotal Graph Neural Networks for Traffic Flow Forecasting new file mode 100644 index 0000000000..f7f86a0468 --- /dev/null +++ b/data/2024/aaai/Spatio-Temporal Pivotal Graph Neural Networks for Traffic Flow Forecasting @@ -0,0 +1 @@ +Traffic flow forecasting is a classical spatio-temporal data mining problem with many real-world applications. Recently, various methods based on Graph Neural Networks (GNN) have been proposed for the problem and achieved impressive prediction performance. However, we argue that the majority of existing methods disregard the importance of certain nodes (referred to as pivotal nodes) that naturally exhibit extensive connections with multiple other nodes. Predicting on pivotal nodes poses a challenge due to their complex spatio-temporal dependencies compared to other nodes. In this paper, we propose a novel GNN-based method called Spatio-Temporal Pivotal Graph Neural Networks (STPGNN) to address the above limitation. We introduce a pivotal node identification module for identifying pivotal nodes. We propose a novel pivotal graph convolution module, enabling precise capture of spatio-temporal dependencies centered around pivotal nodes. Moreover, we propose a parallel framework capable of extracting spatio-temporal traffic features on both pivotal and non-pivotal nodes. Experiments on seven real-world traffic datasets verify our proposed method's effectiveness and efficiency compared to state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/aaai/Spear and Shield: Adversarial Attacks and Defense Methods for Model-Based Link Prediction on Continuous-Time Dynamic Graphs b/data/2024/aaai/Spear and Shield: Adversarial Attacks and Defense Methods for Model-Based Link Prediction on Continuous-Time Dynamic Graphs new file mode 100644 index 0000000000..869ef89129 --- /dev/null +++ b/data/2024/aaai/Spear and Shield: Adversarial Attacks and Defense Methods for Model-Based Link Prediction on Continuous-Time Dynamic Graphs @@ -0,0 +1,10 @@ +Real-world graphs are dynamic, constantly evolving with new interactions, such as financial transactions in financial networks. +Temporal Graph Neural Networks (TGNNs) have been developed to effectively capture the evolving patterns in dynamic graphs. +While these models have demonstrated their superiority, being widely adopted in various important fields, their vulnerabilities against adversarial attacks remain largely unexplored. +In this paper, we propose T-SPEAR, a simple and effective adversarial attack method for link prediction on continuous-time dynamic graphs, focusing on investigating the vulnerabilities of TGNNs. +Specifically, before the training procedure of a victim model, which is a TGNN for link prediction, we inject edge perturbations into the data that are unnoticeable in terms of the four constraints we propose, and yet effective enough to cause malfunction of the victim model. +Moreover, we propose a robust training approach T-SHIELD to mitigate the impact of adversarial attacks.
+By using edge filtering and enforcing temporal smoothness to node embeddings, we enhance the robustness of the victim model. +Our experimental study shows that T-SPEAR significantly degrades the victim model's performance on link prediction tasks, and even more, our attacks are transferable to other TGNNs, which differ from the victim model assumed by the attacker. +Moreover, we demonstrate that T-SHIELD effectively filters out adversarial edges and exhibits robustness against adversarial attacks, surpassing the link prediction performance of the naive TGNN by up to 11.2% under T-SPEAR. +The code and datasets are available at https://github.com/wooner49/T-spear-shield \ No newline at end of file diff --git a/data/2024/aaai/Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation b/data/2024/aaai/Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation new file mode 100644 index 0000000000..db828eb273 --- /dev/null +++ b/data/2024/aaai/Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation @@ -0,0 +1,6 @@ +Recently, CLIP has found practical utility in the domain of pixel-level zero-shot segmentation tasks. +The present landscape features two-stage methodologies beset by issues such as intricate pipelines and elevated computational costs. While current one-stage approaches alleviate these concerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's generalization capacity, they still fall short in fully harnessing CLIP's potential for pixel-level unseen class demarcation and precise pixel predictions. +To further stimulate CLIP's zero-shot dense prediction capability, we propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from image to pixel. +Specifically, we initially introduce Spectral Prompt Tuning (SPT), incorporating spectral prompts into the CLIP visual encoder's shallow layers to capture structural intricacies of images, thereby enhancing comprehension of unseen classes. +Subsequently, we introduce the Spectral Guided Decoder (SGD), utilizing both high and low-frequency information to steer the network's spatial focus towards more prominent classification features, enabling precise pixel-level prediction outcomes. +Through extensive experiments on two public datasets, we demonstrate the superiority of our method over state-of-the-art approaches, performing well across all classes and particularly excelling in handling unseen classes. \ No newline at end of file diff --git a/data/2024/aaai/Spectral-Based Graph Neural Networks for Complementary Item Recommendation b/data/2024/aaai/Spectral-Based Graph Neural Networks for Complementary Item Recommendation new file mode 100644 index 0000000000..e9b27aa353 --- /dev/null +++ b/data/2024/aaai/Spectral-Based Graph Neural Networks for Complementary Item Recommendation @@ -0,0 +1,2 @@ +Modeling complementary relationships greatly helps recommender systems to accurately and promptly recommend the subsequent items when one item is purchased. Unlike traditional similar relationships, items with complementary relationships may be purchased successively (such as iPhone and Airpods Pro), and they not only share relevance but also exhibit dissimilarity. Since the two attributes are opposites, modeling complementary relationships is challenging. 
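The T-SPEAR/T-SHIELD abstract above mentions two defenses: filtering suspicious edges and enforcing temporal smoothness on node embeddings. A generic sketch of both ideas follows; the cosine-similarity threshold and the L2 smoothness penalty are illustrative choices, not necessarily the paper's exact criteria.

```python
import numpy as np

def filter_edges(src_emb, dst_emb, edges, threshold=0.0):
    """Keep only edges whose endpoint embeddings are sufficiently similar (cosine)."""
    kept = []
    for u, v in edges:
        a, b = src_emb[u], dst_emb[v]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos >= threshold:
            kept.append((u, v))
    return kept

def temporal_smoothness_loss(emb_prev, emb_curr):
    """Penalize abrupt changes of node embeddings between consecutive time steps."""
    return float(np.mean(np.sum((emb_curr - emb_prev) ** 2, axis=1)))

emb_t0 = np.random.default_rng(0).normal(size=(5, 8))
emb_t1 = emb_t0 + 0.1 * np.random.default_rng(1).normal(size=(5, 8))
print(filter_edges(emb_t1, emb_t1, [(0, 1), (2, 3)]))
print(temporal_smoothness_loss(emb_t0, emb_t1))
```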
Previous attempts to exploit these relationships have either ignored or oversimplified the dissimilarity attribute, resulting in ineffective modeling and an inability to balance the two attributes. Since Graph Neural Networks (GNNs) can capture the relevance and dissimilarity between nodes in the spectral domain, we can leverage spectral-based GNNs to effectively understand and model complementary relationships. +In this study, we present a novel approach called Spectral-based Complementary Graph Neural Networks (SComGNN) that utilizes the spectral properties of complementary item graphs. We make the first observation that complementary relationships consist of low-frequency and mid-frequency components, corresponding to the relevance and dissimilarity attributes, respectively. Based on this spectral observation, we design spectral graph convolutional networks with low-pass and mid-pass filters to capture the low-frequency and mid-frequency components. Additionally, we propose a two-stage attention mechanism to adaptively integrate and balance the two attributes. Experimental results on four e-commerce datasets demonstrate the effectiveness of our model, with SComGNN significantly outperforming existing baseline models. \ No newline at end of file diff --git a/data/2024/aaai/SpectralNeRF: Physically Based Spectral Rendering with Neural Radiance Field b/data/2024/aaai/SpectralNeRF: Physically Based Spectral Rendering with Neural Radiance Field new file mode 100644 index 0000000000..9312a7aa89 --- /dev/null +++ b/data/2024/aaai/SpectralNeRF: Physically Based Spectral Rendering with Neural Radiance Field @@ -0,0 +1 @@ +In this paper, we propose SpectralNeRF, an end-to-end Neural Radiance Field (NeRF)-based architecture for high-quality physically based rendering from a novel spectral perspective. We modify the classical spectral rendering into two main steps, 1) the generation of a series of spectrum maps spanning different wavelengths, 2) the combination of these spectrum maps for the RGB output. Our SpectralNeRF follows these two steps through the proposed multi-layer perceptron (MLP)-based architecture (SpectralMLP) and Spectrum Attention UNet (SAUNet). Given the ray origin and the ray direction, the SpectralMLP constructs the spectral radiance field to obtain spectrum maps of novel views, which are then sent to the SAUNet to produce RGB images of white-light illumination. Applying NeRF to build up the spectral rendering is a more physically-based way from the perspective of ray-tracing. Further, the spectral radiance fields decompose difficult scenes and improve the performance of NeRF-based methods. Comprehensive experimental results demonstrate the proposed SpectralNeRF is superior to recent NeRF-based methods when synthesizing new views on synthetic and real datasets. The codes and datasets are available at https://github.com/liru0126/SpectralNeRF. \ No newline at end of file diff --git a/data/2024/aaai/Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile b/data/2024/aaai/Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile new file mode 100644 index 0000000000..54de31445c --- /dev/null +++ b/data/2024/aaai/Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile @@ -0,0 +1 @@ +Currently, image generation and synthesis have remarkably progressed with generative models. 
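The SComGNN abstract above builds graph convolutions with a low-pass filter (capturing relevance) and a mid-pass filter (capturing dissimilarity) on the item graph's spectrum. Below is a minimal sketch using simple polynomial frequency responses on the normalized Laplacian, h_low(lambda) = 1 - lambda/2 and h_mid(lambda) = 1 - (lambda - 1)^2, which emphasize low and mid frequencies respectively; the paper's actual filter designs may differ.

```python
import numpy as np

def normalized_laplacian(A):
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    return np.eye(len(A)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def low_pass(L):
    # h(lambda) = 1 - lambda/2: emphasizes smooth (low-frequency) signal components
    return np.eye(len(L)) - 0.5 * L

def mid_pass(L):
    # h(lambda) = 1 - (lambda - 1)^2: peaks at lambda = 1, suppresses the spectrum's extremes
    I = np.eye(len(L))
    return I - (L - I) @ (L - I)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))   # node (item) features
L = normalized_laplacian(A)
print(low_pass(L) @ X)   # relevance-oriented propagation
print(mid_pass(L) @ X)   # dissimilarity-oriented propagation
```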
Despite photo-realistic results, intrinsic discrepancies are still observed in the frequency domain. The spectral discrepancy appears not only in generative adversarial networks but also in diffusion models. In this study, we propose a framework to effectively mitigate the frequency-domain disparity of the generated images and improve the generative performance of both GAN and diffusion models. This is realized by spectrum translation for the refinement of image generation (STIG) based on contrastive learning. We draw on the theoretical analysis of frequency components in various generative networks. The key idea here is to refine the spectrum of the generated image via the concept of image-to-image translation and contrastive learning in terms of digital signal processing. We evaluate our framework across eight fake image datasets and various cutting-edge models to demonstrate the effectiveness of STIG. Our framework outperforms other cutting-edge methods, showing significant decreases in FID and in the log frequency distance of the spectrum. We further emphasize that STIG improves image quality by reducing spectral anomalies. Additionally, validation results show that a frequency-based deepfake detector is more easily confused when fake spectra are manipulated by STIG. \ No newline at end of file diff --git a/data/2024/aaai/SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model b/data/2024/aaai/SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model new file mode 100644 index 0000000000..d74ea8d9d9 --- /dev/null +++ b/data/2024/aaai/SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model @@ -0,0 +1 @@ +Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce a novel framework of SphereDiffusion to address these unique challenges, for better generating high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of the planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these specific techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and relatively reduces FID by around 35% on average.
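The STIG abstract above reports improvements in FID and in the log frequency distance of the spectrum. One plausible reading of that metric, sketched below under the assumption that it compares azimuthally averaged log-amplitude spectra of real and generated images, is:

```python
import numpy as np

def radial_profile(img):
    """Azimuthally averaged amplitude spectrum of a grayscale image."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices(img.shape)
    r = np.sqrt((y - cy) ** 2 + (x - cx) ** 2).astype(int)
    sums = np.bincount(r.ravel(), weights=spec.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

def log_frequency_distance(img_a, img_b, eps=1e-8):
    """RMSE between log radial spectra (an assumed form of the metric, for illustration)."""
    pa, pb = radial_profile(img_a), radial_profile(img_b)
    n = min(len(pa), len(pb))
    return float(np.sqrt(np.mean((np.log(pa[:n] + eps) - np.log(pb[:n] + eps)) ** 2)))

rng = np.random.default_rng(0)
real, fake = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
print(log_frequency_distance(real, fake))
```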
\ No newline at end of file diff --git a/data/2024/aaai/Spherical Pseudo-Cylindrical Representation for Omnidirectional Image Super-resolution b/data/2024/aaai/Spherical Pseudo-Cylindrical Representation for Omnidirectional Image Super-resolution new file mode 100644 index 0000000000..2a5a8e4719 --- /dev/null +++ b/data/2024/aaai/Spherical Pseudo-Cylindrical Representation for Omnidirectional Image Super-resolution @@ -0,0 +1 @@ +Omnidirectional images have attracted significant attention in recent years due to the rapid development of virtual reality technologies. Equirectangular projection (ERP), a naive form to store and transfer omnidirectional images, however, is challenging for existing two-dimensional (2D) image super-resolution (SR) methods due to its inhomogeneously distributed sampling density and distortion across latitude. In this paper, we make one of the first attempts to design a spherical pseudo-cylindrical representation, which not only allows pixels at different latitudes to adaptively adopt the best distinct sampling density but also is model-agnostic to most off-the-shelf SR methods, enhancing their performances. Specifically, we start by upsampling each latitude of the input ERP image and design a computationally tractable optimization algorithm to adaptively obtain a (sub)-optimal sampling density for each latitude of the ERP image. Addressing the distortion of ERP, we introduce a new viewport-based training loss based on the original 3D sphere format of the omnidirectional image, which inherently lacks distortion. Finally, we present a simple yet effective recursive progressive omnidirectional SR network to showcase the feasibility of our idea. The experimental results on public datasets demonstrate the effectiveness of the proposed method as well as the consistently superior performance of our method over most state-of-the-art methods both quantitatively and qualitatively. \ No newline at end of file diff --git a/data/2024/aaai/Spiking NeRF: Representing the Real-World Geometry by a Discontinuous Representation b/data/2024/aaai/Spiking NeRF: Representing the Real-World Geometry by a Discontinuous Representation new file mode 100644 index 0000000000..9c26278142 --- /dev/null +++ b/data/2024/aaai/Spiking NeRF: Representing the Real-World Geometry by a Discontinuous Representation @@ -0,0 +1,2 @@ +A crucial reason for the success of existing NeRF-based methods is to build a neural density field for the geometry representation via multi-layer perceptrons (MLPs). +MLPs are continuous functions; however, the real geometry or density field is frequently discontinuous at the interface between air and surface. This contradiction leads to unfaithful geometry representation. To this end, this paper proposes spiking NeRF, which leverages spiking neurons and a hybrid Artificial Neural Network (ANN)-Spiking Neural Network (SNN) framework to build a discontinuous density field for faithful geometry representation. Specifically, we first demonstrate why continuous density fields introduce inaccuracy. Then, we propose to use spiking neurons to build a discontinuous density field. We conduct a comprehensive analysis of the problems with existing spiking neuron models and then derive the numerical relationship between the spiking neuron's parameter and the theoretical accuracy of the geometry. Based on this, we propose a bounded spiking neuron to build the discontinuous density field. Our method achieves SOTA performance.
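The omnidirectional SR abstract above assigns each ERP latitude its own sampling density to counteract oversampling near the poles. A minimal sketch of one such pseudo-cylindrical allocation, simply making each row's sample count proportional to cos(latitude) (whereas the paper chooses densities via an adaptive optimization), is:

```python
import numpy as np

def pseudo_cylindrical_widths(erp_height, erp_width, min_samples=8):
    """Samples per latitude row, proportional to cos(latitude) on the sphere."""
    # row centres mapped to latitudes in (-pi/2, pi/2)
    lat = (np.arange(erp_height) + 0.5) / erp_height * np.pi - np.pi / 2
    widths = np.round(erp_width * np.cos(lat)).astype(int)
    return np.maximum(widths, min_samples)

widths = pseudo_cylindrical_widths(erp_height=16, erp_width=64)
print(widths)                                   # few samples near the poles, full width at the equator
print(int(widths.sum()), "samples instead of", 16 * 64)
```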
The source code and the supplementary material are available at https://github.com/liaozhanfeng/Spiking-NeRF. \ No newline at end of file diff --git a/data/2024/aaai/SpikingBERT: Distilling BERT to Train Spiking Language Models Using Implicit Differentiation b/data/2024/aaai/SpikingBERT: Distilling BERT to Train Spiking Language Models Using Implicit Differentiation new file mode 100644 index 0000000000..858a3b24fe --- /dev/null +++ b/data/2024/aaai/SpikingBERT: Distilling BERT to Train Spiking Language Models Using Implicit Differentiation @@ -0,0 +1 @@ +Large Language Models (LLMs), though growing exceedingly powerful, comprise orders of magnitude fewer neurons and synapses than the human brain. However, they require significantly more power/energy to operate. In this work, we propose a novel bio-inspired spiking language model (LM) which aims to reduce the computational cost of conventional LMs by drawing motivation from the synaptic information flow in the brain. In this paper, we demonstrate a framework that leverages the average spiking rate of neurons at equilibrium to train a neuromorphic spiking LM using an implicit differentiation technique, thereby overcoming the non-differentiability problem of spiking neural network (SNN) based algorithms without using any type of surrogate gradient. The steady-state convergence of the spiking neurons also allows us to design a spiking attention mechanism, which is critical in developing a scalable spiking LM. Moreover, the convergence of the average spiking rate of neurons at equilibrium is utilized to develop a novel ANN-SNN knowledge distillation based technique wherein we use a pre-trained BERT model as “teacher” to train our “student” spiking architecture. While the primary architecture proposed in this paper is motivated by BERT, the technique can be potentially extended to different kinds of LLMs. Our work is the first one to demonstrate the performance of an operational spiking LM architecture on multiple different tasks in the GLUE benchmark. Our implementation source code is available at https://github.com/NeuroCompLab-psu/SpikingBERT. \ No newline at end of file diff --git a/data/2024/aaai/Spot the Error: Non-autoregressive Graphic Layout Generation with Wireframe Locator b/data/2024/aaai/Spot the Error: Non-autoregressive Graphic Layout Generation with Wireframe Locator new file mode 100644 index 0000000000..3ad0676778 --- /dev/null +++ b/data/2024/aaai/Spot the Error: Non-autoregressive Graphic Layout Generation with Wireframe Locator @@ -0,0 +1 @@ +Layout generation is a critical step in graphic design to achieve meaningful compositions of elements. Most previous works view it as a sequence generation problem by concatenating element attribute tokens (i.e., category, size, position). So far the autoregressive approach (AR) has achieved promising results, but is still limited in global context modeling and suffers from error propagation since it can only attend to the previously generated tokens. Recent non-autoregressive attempts (NAR) have shown competitive results, which provide a wider context range and the flexibility to refine with iterative decoding. However, current works only use simple heuristics to recognize erroneous tokens for refinement, which is inaccurate. This paper first conducts an in-depth analysis to better understand the difference between the AR and NAR frameworks.
Furthermore, based on our observation that pixel space is more sensitive in capturing spatial patterns of graphic layouts (e.g., overlap, alignment), we propose a learning-based locator to detect erroneous tokens, which takes the wireframe image rendered from the generated layout sequence as input. We show that it serves as a complementary modality to the element sequence in object space and contributes greatly to the overall performance. Experiments on two public datasets show that our approach outperforms both AR and NAR baselines. Extensive studies further prove the effectiveness of different modules with interesting findings. Our code will be available at https://github.com/ffffatgoose/SpotError. \ No newline at end of file diff --git a/data/2024/aaai/Spotting the Unseen: Reciprocal Consensus Network Guided by Visual Archetypes b/data/2024/aaai/Spotting the Unseen: Reciprocal Consensus Network Guided by Visual Archetypes new file mode 100644 index 0000000000..1d148a54cc --- /dev/null +++ b/data/2024/aaai/Spotting the Unseen: Reciprocal Consensus Network Guided by Visual Archetypes @@ -0,0 +1 @@ +Humans often require only a few visual archetypes to spot novel objects. Based on this observation, we present a strategy rooted in ``spotting the unseen" by establishing dense correspondences between potential query image regions and a visual archetype, and we propose the Consensus Network (CoNet). Our method leverages relational patterns within and across images via Auto-Correlation Representation (ACR) and Mutual-Correlation Representation (MCR). Within each image, the ACR module is capable of encoding both local self-similarity and global context simultaneously. Between the query and support images, the MCR module computes the cross-correlation across the two image representations and introduces a reciprocal consistency constraint, which can be incorporated to exclude outliers and enhance model robustness. To overcome the challenges of low-resource training data, particularly in one-shot learning scenarios, we incorporate an adaptive margin strategy to better handle diverse instances. The experimental results indicate the effectiveness of the proposed method across diverse domains such as object detection in natural scenes and text spotting in both historical manuscripts and natural scenes, which demonstrates its remarkable generalization ability. Our code is available at: https://github.com/infinite-hwb/conet. \ No newline at end of file diff --git a/data/2024/aaai/Stability Analysis of Switched Linear Systems with Neural Lyapunov Functions b/data/2024/aaai/Stability Analysis of Switched Linear Systems with Neural Lyapunov Functions new file mode 100644 index 0000000000..946bfeb68f --- /dev/null +++ b/data/2024/aaai/Stability Analysis of Switched Linear Systems with Neural Lyapunov Functions @@ -0,0 +1,2 @@ +Neural-based, data-driven analysis and control of dynamical systems have recently been investigated and have shown great promise, e.g. for safety verification or stability analysis. Indeed, not only do neural networks allow for an entirely model-free, data-driven approach, but they also allow for handling arbitrarily complex functions via their power of representation (as opposed to, e.g. algebraic optimization techniques that are restricted to polynomial functions).
Whilst classical Lyapunov techniques allow one to provide a formal and robust guarantee of stability of a switched dynamical system, very little is yet known about correctness guarantees for Neural Lyapunov functions, nor about their performance (amount of data needed for a certain accuracy). +We formally introduce Neural Lyapunov functions for the stability analysis of switched linear systems: we benchmark them on this paradigmatic problem, which is notoriously difficult (and in general Turing-undecidable), but which admits recently developed techniques and theoretical results. Inspired by switched systems theory, we provide theoretical guarantees on the representative power of neural networks, leveraging recent results from the ML community. We additionally demonstrate experimentally how Neural Lyapunov functions compete with state-of-the-art results and techniques, while leaving a wide margin for improvement, both in theory and in practice. This study intends to improve our understanding of the opportunities and current limitations of neural-based data-driven analysis and control of complex dynamical systems. \ No newline at end of file diff --git a/data/2024/aaai/Stability in Online Coalition Formation b/data/2024/aaai/Stability in Online Coalition Formation new file mode 100644 index 0000000000..d479fa03ab --- /dev/null +++ b/data/2024/aaai/Stability in Online Coalition Formation @@ -0,0 +1 @@ +Coalition formation is concerned with the question of how to partition a set of agents into disjoint coalitions according to their preferences. Deviating from most of the previous work, we consider an online variant of the problem, where agents arrive in sequence and whenever an agent arrives, they have to be assigned to a coalition immediately and irrevocably. The scarce existing literature on online coalition formation has focused on the objective of maximizing social welfare, a demanding requirement, even in the offline setting. Instead, we seek to achieve stable coalition structures in an online setting, and focus on stability concepts based on deviations by single agents. We present a comprehensive picture in additively separable hedonic games, leading to dichotomies, where positive results are obtained by deterministic algorithms and negative results even hold for randomized algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Stability of Multi-Agent Learning in Competitive Networks: Delaying the Onset of Chaos b/data/2024/aaai/Stability of Multi-Agent Learning in Competitive Networks: Delaying the Onset of Chaos new file mode 100644 index 0000000000..2008fc832c --- /dev/null +++ b/data/2024/aaai/Stability of Multi-Agent Learning in Competitive Networks: Delaying the Onset of Chaos @@ -0,0 +1,2 @@ +The behaviour of multi-agent learning in competitive network games is often studied within the context of zero-sum games, in which convergence guarantees may be obtained. However, outside of this class the behaviour of learning is known to display complex behaviours and convergence cannot always be guaranteed. Nonetheless, in order to develop a complete picture of the behaviour of multi-agent learning in competitive settings, the zero-sum assumption must be lifted. +Motivated by this, we study the Q Learning dynamics, a popular model of exploration and exploitation in multi-agent learning, in competitive network games. We determine how the degree of competition, exploration rate and network connectivity impact the convergence of Q Learning.
To study generic competitive games, we parameterise network games in terms of correlations between agent payoffs and study the average behaviour of the Q Learning dynamics across all games drawn from a choice of this parameter. This statistical approach establishes choices of parameters for which the Q Learning dynamics converge to a stable fixed point. In contrast to previous works, we find that the stability of Q Learning is explicitly dependent only on the network connectivity rather than the total number of agents. Our experiments validate these findings and show that, under certain network structures, the total number of agents can be increased without increasing the likelihood of unstable or chaotic behaviours. \ No newline at end of file diff --git a/data/2024/aaai/Stable Model Semantics for Description Logic Terminologies b/data/2024/aaai/Stable Model Semantics for Description Logic Terminologies new file mode 100644 index 0000000000..45a0ac4ba5 --- /dev/null +++ b/data/2024/aaai/Stable Model Semantics for Description Logic Terminologies @@ -0,0 +1 @@ +This paper studies a stable model semantics for Description Logic (DL) knowledge bases (KBs) and for (possibly cyclic) terminologies, ultimately showing that terminologies under the proposed semantics can be equipped with effective reasoning algorithms. The semantics is derived using Quantified Equilibrium Logic, and---in contrast to the usual semantics of DLs based on classical logic---supports default negation and allows one to combine the open-world and the closed-world assumptions in a natural way. Towards understanding the computational properties of this and related formalisms, we show a strong undecidability result that applies not only to KBs under the stable model semantics, but also to the more basic setting of minimal model reasoning. Specifically, we show that concept satisfiability in minimal models of an ALCIO KB is undecidable. We then turn our attention to (possibly cyclic) DL terminologies, where ontological axioms are limited to definitions of concept names in terms of complex concepts. This restriction still yields a very rich setting. We show that standard reasoning problems, like concept satisfiability and subsumption, are ExpTime-complete for terminologies expressed in ALCI under the stable model semantics. \ No newline at end of file diff --git a/data/2024/aaai/Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise b/data/2024/aaai/Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise new file mode 100644 index 0000000000..5aebed87cd --- /dev/null +++ b/data/2024/aaai/Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise @@ -0,0 +1 @@ +The open sourcing of large amounts of image data promotes the development of deep learning techniques. Along with this comes the privacy risk of these image datasets being exploited by unauthorized third parties to train deep learning models for commercial or illegal purposes. To avoid the abuse of data, a poisoning-based technique, "unlearnable example", has been proposed to significantly degrade the generalization performance of models by adding imperceptible noise to the data. To further enhance its robustness against adversarial training, existing works leverage iterative adversarial training on both the defensive noise and the surrogate model.
However, it still remains unknown whether the robustness of unlearnable examples primarily comes from the effect of enhancement in the surrogate model or the defensive noise. Observing that simply removing the adversarial perturbation on the training process of the defensive noise can improve the performance of robust unlearnable examples, we identify that solely the surrogate model's robustness contributes to the performance. Furthermore, we found a negative correlation exists between the robustness of defensive noise and the protection performance, indicating defensive noise's instability issue. Motivated by this, to further boost the robust unlearnable example, we introduce Stable Error-Minimizing noise (SEM), which trains the defensive noise against random perturbation instead of the time-consuming adversarial perturbation to improve the stability of defensive noise. Through comprehensive experiments, we demonstrate that SEM achieves a new state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet Subset regarding both effectiveness and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/Statistical Spatially Inhomogeneous Diffusion Inference b/data/2024/aaai/Statistical Spatially Inhomogeneous Diffusion Inference new file mode 100644 index 0000000000..4d4b51270e --- /dev/null +++ b/data/2024/aaai/Statistical Spatially Inhomogeneous Diffusion Inference @@ -0,0 +1,4 @@ +Inferring a diffusion equation from discretely observed measurements is a statistical challenge of significant importance in a variety of fields, from single-molecule tracking in biophysical systems to modeling financial instruments. +Assuming that the underlying dynamical process obeys a d-dimensional stochastic differential equation of the form dx_t = b(x_t)dt + \Sigma(x_t)dw_t, we propose neural network-based estimators of both the drift b and the spatially-inhomogeneous diffusion tensor D = \Sigma\Sigma^T/2 and provide statistical convergence guarantees when b and D are s-Hölder continuous. +Notably, our bound aligns with the minimax optimal rate N^{-\frac{2s}{2s+d}} for nonparametric function estimation even in the presence of correlation within observational data, which necessitates careful handling when establishing fast-rate generalization bounds. +Our theoretical results are bolstered by numerical experiments demonstrating accurate inference of spatially-inhomogeneous diffusion tensors. \ No newline at end of file diff --git a/data/2024/aaai/Statistically Principled Deep Learning for SAR Image Segmentation b/data/2024/aaai/Statistically Principled Deep Learning for SAR Image Segmentation new file mode 100644 index 0000000000..c4d4438f1b --- /dev/null +++ b/data/2024/aaai/Statistically Principled Deep Learning for SAR Image Segmentation @@ -0,0 +1 @@ +This paper proposes a novel approach for Synthetic Aperture Radar (SAR) image segmentation by incorporating known statistical properties of SAR into deep learning models. We generate synthetic data using the Generalized Gamma distribution, modify the U-Net architecture to encompass statistical moments, and employ stochastic distance losses for improved segmentation performance. Evaluation against traditional methods will reveal the potential of this approach to advance SAR image analysis, with broader applications in environmental monitoring and general image segmentation tasks. 
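To make the statistical grounding of the SAR segmentation abstract above concrete, the following minimal sketch (Python, not the authors' code) draws synthetic SAR-like amplitude patches from SciPy's generalized Gamma distribution and computes the kind of moments a moment-aware segmentation network could consume; the class names and parameter values are illustrative assumptions, not values from the paper.

import numpy as np
from scipy.stats import gengamma

def synth_patch(a, c, scale, size=(64, 64)):
    # Sample a patch whose amplitudes follow a Generalized Gamma distribution.
    return gengamma.rvs(a, c, scale=scale, size=size, random_state=0)

# Two hypothetical land-cover classes with different texture statistics.
water = synth_patch(a=1.5, c=2.0, scale=0.4)
urban = synth_patch(a=3.0, c=1.2, scale=1.0)

# First three central moments, the kind of statistics a moment-aware U-Net could use.
for name, patch in [("water", water), ("urban", urban)]:
    centered = patch - patch.mean()
    print(name, patch.mean(), (centered ** 2).mean(), (centered ** 3).mean())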
\ No newline at end of file diff --git a/data/2024/aaai/Stealthy Adversarial Attacks on Stochastic Multi-Armed Bandits b/data/2024/aaai/Stealthy Adversarial Attacks on Stochastic Multi-Armed Bandits new file mode 100644 index 0000000000..7e18a1add4 --- /dev/null +++ b/data/2024/aaai/Stealthy Adversarial Attacks on Stochastic Multi-Armed Bandits @@ -0,0 +1 @@ +Adversarial attacks against stochastic multi-armed bandit (MAB) algorithms have been extensively studied in the literature. In this work, we focus on reward poisoning attacks and find most existing attacks can be easily detected by our proposed detection method based on the test of homogeneity, due to their aggressive nature in reward manipulations. This motivates us to study the notion of stealthy attack against stochastic MABs and investigate the resulting attackability. Our analysis shows that against two popularly employed MAB algorithms, UCB1 and $\epsilon$-greedy, the success of a stealthy attack depends on the environmental conditions and the realized reward of the arm pulled in the first round. We also analyze the situation for general MAB algorithms equipped with our attack detection method and find that it is possible to have a stealthy attack that almost always succeeds. This brings new insights into the security risks of MAB algorithms. \ No newline at end of file diff --git a/data/2024/aaai/StegFormer: Rebuilding the Glory of Autoencoder-Based Steganography b/data/2024/aaai/StegFormer: Rebuilding the Glory of Autoencoder-Based Steganography new file mode 100644 index 0000000000..1a33c2a0b5 --- /dev/null +++ b/data/2024/aaai/StegFormer: Rebuilding the Glory of Autoencoder-Based Steganography @@ -0,0 +1 @@ +Image hiding aims to conceal one or more secret images within a cover image of the same resolution. Due to strict capacity requirements, image hiding is commonly called large-capacity steganography. In this paper, we propose StegFormer, a novel autoencoder-based image-hiding model. StegFormer can conceal one or multiple secret images within a cover image of the same resolution while preserving the high visual quality of the stego image. In addition, to mitigate the limitations of current steganographic models in real-world scenarios, we propose a normalizing training strategy and a restrict loss to improve the reliability of the steganographic models under realistic conditions. Furthermore, we propose an efficient steganographic capacity expansion method to increase the capacity of steganography and enhance the efficiency of secret communication. Through this approach, we can increase the relative payload of StegFormer to 96 bits per pixel without any training strategy modifications. Experiments demonstrate that our StegFormer outperforms existing state-of-the-art (SOTA) models. In the case of single-image steganography, there is an improvement of more than 3 dB and 5 dB in PSNR for secret/recovery image pairs and cover/stego image pairs. 
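Since StegFormer's quality gains above are quoted in PSNR for secret/recovery and cover/stego pairs, a small illustrative snippet (Python, assuming images stored as float arrays in [0, 1]) shows how that metric is typically computed; it is a generic sketch, not the authors' evaluation code.

import numpy as np

def psnr(reference, distorted, peak=1.0):
    # Peak signal-to-noise ratio in dB for images scaled to [0, peak].
    diff = np.asarray(reference, dtype=np.float64) - np.asarray(distorted, dtype=np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

cover = np.random.rand(256, 256, 3)                                        # stand-in cover image
stego = np.clip(cover + 0.01 * np.random.randn(*cover.shape), 0.0, 1.0)    # stand-in stego image
print(f"cover/stego PSNR: {psnr(cover, stego):.2f} dB")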
\ No newline at end of file diff --git a/data/2024/aaai/Step Vulnerability Guided Mean Fluctuation Adversarial Attack against Conditional Diffusion Models b/data/2024/aaai/Step Vulnerability Guided Mean Fluctuation Adversarial Attack against Conditional Diffusion Models new file mode 100644 index 0000000000..f9a6c5434a --- /dev/null +++ b/data/2024/aaai/Step Vulnerability Guided Mean Fluctuation Adversarial Attack against Conditional Diffusion Models @@ -0,0 +1 @@ +The high-quality generation results of conditional diffusion models have brought about concerns regarding privacy and copyright issues. As a possible technique for preventing the abuse of diffusion models, the adversarial attack against diffusion models has attracted academic attention recently. In this work, utilizing the phenomenon that diffusion models are highly sensitive to the mean value of the input noise, we propose the Mean Fluctuation Attack (MFA) to introduce mean fluctuations by shifting the mean values of the estimated noises during the reverse process. In addition, we reveal that the vulnerability of different reverse steps against adversarial attacks actually varies significantly. By modeling the step vulnerability and using it as guidance to sample the target steps for generating adversarial examples, the effectiveness of adversarial attacks can be substantially enhanced. Extensive experiments show that our algorithm can steadily cause the mean shift of the predicted noises so as to disrupt the entire reverse generation process and degrade the generation results significantly. We also demonstrate that the step vulnerability is intrinsic to the reverse process by verifying its effectiveness in an attack method other than MFA. Code and supplementary material are available at https://github.com/yuhongwei22/MFA \ No newline at end of file diff --git a/data/2024/aaai/Stereo Vision Conversion from Planar Videos Based on Temporal Multiplane Images b/data/2024/aaai/Stereo Vision Conversion from Planar Videos Based on Temporal Multiplane Images new file mode 100644 index 0000000000..3498749f4c --- /dev/null +++ b/data/2024/aaai/Stereo Vision Conversion from Planar Videos Based on Temporal Multiplane Images @@ -0,0 +1 @@ +With the rapid development of 3D movie and light-field displays, there is a growing demand for stereo videos. However, generating high-quality stereo videos from planar videos remains a challenging task. Traditional depth-image-based rendering techniques struggle to effectively handle the problem of occlusion exposure, which occurs when the occluded contents become visible in other views. Recently, the single-view multiplane images (MPI) representation has shown promising performance for planar video stereoscopy. However, the MPI still lacks real details that are occluded in the current frame, resulting in blurry artifacts in occlusion exposure regions. In fact, planar videos can leverage complementary information from adjacent frames to predict a more complete scene representation for the current frame. Therefore, this paper extends the MPI from still frames to the temporal domain, introducing the temporal MPI (TMPI). By extracting complementary information from adjacent frames based on optical flow guidance, obscured regions in the current frame can be effectively repaired. Additionally, a new module called masked optical flow warping (MOFW) is introduced to improve the propagation of pixels along optical flow trajectories.
Experimental results demonstrate that the proposed method can generate high-quality stereoscopic or light-field videos from a single view and reproduce better occluded details than other state-of-the-art (SOTA) methods. https://github.com/Dio3ding/TMPI \ No newline at end of file diff --git a/data/2024/aaai/Sterling: Synergistic Representation Learning on Bipartite Graphs b/data/2024/aaai/Sterling: Synergistic Representation Learning on Bipartite Graphs new file mode 100644 index 0000000000..bc6b04ead9 --- /dev/null +++ b/data/2024/aaai/Sterling: Synergistic Representation Learning on Bipartite Graphs @@ -0,0 +1 @@ +A fundamental challenge of bipartite graph representation learning is how to extract informative node embeddings. Self-Supervised Learning (SSL) is a promising paradigm to address this challenge. Most recent bipartite graph SSL methods are based on contrastive learning, which learns embeddings by discriminating positive and negative node pairs. Contrastive learning usually requires a large number of negative node pairs, which could lead to computational burden and semantic errors. In this paper, we introduce a novel synergistic representation learning model (STERLING) to learn node embeddings without negative node pairs. STERLING preserves the unique local and global synergies in bipartite graphs. The local synergies are captured by maximizing the similarity of the inter-type and intra-type positive node pairs, and the global synergies are captured by maximizing the mutual information of co-clusters. Theoretical analysis demonstrates that STERLING could improve the connectivity between different node types in the embedding space. Extensive empirical evaluation on various benchmark datasets and tasks demonstrates the effectiveness of STERLING for extracting node embeddings. \ No newline at end of file diff --git a/data/2024/aaai/Stitching Segments and Sentences towards Generalization in Video-Text Pre-training b/data/2024/aaai/Stitching Segments and Sentences towards Generalization in Video-Text Pre-training new file mode 100644 index 0000000000..cd18e1640b --- /dev/null +++ b/data/2024/aaai/Stitching Segments and Sentences towards Generalization in Video-Text Pre-training @@ -0,0 +1 @@ +Video-language pre-training models have recently achieved remarkable results on various multi-modal downstream tasks. However, most of these models rely on contrastive learning or masked modeling to align global features across modalities, neglecting the local associations between video frames and text tokens. This limits the model’s ability to perform fine-grained matching and generalization, especially for tasks that require selecting segments in long videos based on query texts. To address this issue, we propose a novel stitching and matching pretext task for video-language pre-training that encourages fine-grained interactions between modalities. Our task involves stitching video frames or sentences into longer sequences and predicting the positions of cross-modal queries in the stitched sequences. The individual frame and sentence representations are thus aligned via the stitching and matching strategy, encouraging fine-grained interactions between videos and texts in the stitched sequences for the cross-modal query. We conduct extensive experiments on various benchmarks covering text-to-video retrieval, video question answering, video captioning, and moment retrieval.
Our results demonstrate that the proposed method significantly improves the generalization capacity of the video-text pre-training models. \ No newline at end of file diff --git a/data/2024/aaai/StockMixer: A Simple Yet Strong MLP-Based Architecture for Stock Price Forecasting b/data/2024/aaai/StockMixer: A Simple Yet Strong MLP-Based Architecture for Stock Price Forecasting new file mode 100644 index 0000000000..15e7933824 --- /dev/null +++ b/data/2024/aaai/StockMixer: A Simple Yet Strong MLP-Based Architecture for Stock Price Forecasting @@ -0,0 +1 @@ +Stock price forecasting is a fundamental yet challenging task in quantitative investment. Various researchers have developed a combination of neural network models (e.g., RNNs, GNNs, Transformers) for capturing complex indicator, temporal and stock correlations of the stock data. While complex architectures are highly expressive, they are often difficult to optimize and their performance is often compromised by the limited stock data. In this paper, we propose a simple MLP-based architecture named StockMixer, which is easy to optimize and enjoys strong predictive performance. StockMixer performs indicator mixing, followed by time mixing, and finally stock mixing. Unlike the standard MLP-based mixing, we devise the time mixing to exchange multi-scale time patch information and realize the stock mixing by exploiting stock-to-market and market-to-stock influences explicitly. Extensive experiments on real stock benchmarks demonstrate that our proposed StockMixer outperforms various state-of-the-art forecasting methods by a notable margin while reducing memory usage and runtime cost. Code is available at https://github.com/SJTU-Quant/StockMixer. \ No newline at end of file diff --git a/data/2024/aaai/Stop! Planner Time: Metareasoning for Probabilistic Planning Using Learned Performance Profiles b/data/2024/aaai/Stop! Planner Time: Metareasoning for Probabilistic Planning Using Learned Performance Profiles new file mode 100644 index 0000000000..4973b1aae2 --- /dev/null +++ b/data/2024/aaai/Stop! Planner Time: Metareasoning for Probabilistic Planning Using Learned Performance Profiles @@ -0,0 +1 @@ +The metareasoning framework aims to enable autonomous agents to factor in planning costs when making decisions. In this work, we develop the first non-myopic metareasoning algorithm for planning with Markov decision processes. Our method learns the behaviour of anytime probabilistic planning algorithms from performance data. Specifically, we propose a novel model for metareasoning, based on contextual performance profiles that predict the value of the planner's current solution given the time spent planning, the state of the planning algorithm's internal parameters, and the difficulty of the planning problem being solved. This model removes the need to assume that the current solution quality is always known, broadening the class of metareasoning problems that can be addressed. We then employ deep reinforcement learning to learn a policy that decides, at each timestep, whether to continue planning or start executing the current plan, and how to set hyperparameters of the planner to enhance its performance. We demonstrate our algorithm's ability to perform effective metareasoning in two domains.
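The contextual performance profile described above can be read as a function mapping problem context and planning time to a predicted solution value. The sketch below (Python) illustrates a deliberately simplified, myopic stop-or-continue rule built on such a profile; the paper itself learns this decision non-myopically with deep reinforcement learning, and the toy profile and cost constant here are illustrative assumptions only.

import math

def toy_profile(context, t):
    # Toy performance profile: solution value saturates with planning time,
    # more slowly for harder problems. Purely illustrative.
    return 1.0 - math.exp(-t / (5.0 * context["difficulty"]))

def should_continue(profile, context, elapsed, step=1.0, time_cost=0.05):
    # Continue planning only while the predicted value gain over the next
    # step outweighs the (assumed) cost of that extra planning time.
    gain = profile(context, elapsed + step) - profile(context, elapsed)
    return gain > time_cost * step

elapsed = 0.0
context = {"difficulty": 2.0}
while should_continue(toy_profile, context, elapsed):
    elapsed += 1.0  # placeholder for running the anytime planner one more step
print(f"stop planning after {elapsed:.0f}s and execute the current plan")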
\ No newline at end of file diff --git a/data/2024/aaai/Strategic Recommendation: Revenue Optimal Matching for Online Platforms (Student Abstract) b/data/2024/aaai/Strategic Recommendation: Revenue Optimal Matching for Online Platforms (Student Abstract) new file mode 100644 index 0000000000..0d254def80 --- /dev/null +++ b/data/2024/aaai/Strategic Recommendation: Revenue Optimal Matching for Online Platforms (Student Abstract) @@ -0,0 +1,3 @@ +We consider a platform in a two-sided market with unit-supply sellers and unit-demand buyers. Each buyer can transact with a subset of sellers it knows off platform and another seller that the platform recommends. Given the choice of sellers, transactions and prices form a competitive equilibrium. The platform selects one seller for each buyer, and charges a fixed percentage of the price on all transactions that it recommends. The platform seeks to maximize total revenue. + +We show that the platform's problem is NP-hard, even when each buyer knows at most two sellers off platform. Finally, when each buyer values all sellers equally and knows only one seller off platform, we provide a polynomial time algorithm that optimally solves the problem. \ No newline at end of file diff --git a/data/2024/aaai/Strategyproof Mechanisms for Group-Fair Obnoxious Facility Location Problems b/data/2024/aaai/Strategyproof Mechanisms for Group-Fair Obnoxious Facility Location Problems new file mode 100644 index 0000000000..dbc65115c5 --- /dev/null +++ b/data/2024/aaai/Strategyproof Mechanisms for Group-Fair Obnoxious Facility Location Problems @@ -0,0 +1 @@ +We study the group-fair obnoxious facility location problems from the mechanism design perspective, where agents belong to different groups and have private location preferences on the undesirable locations of the facility. Our main goal is to design strategyproof mechanisms that elicit the true location preferences from the agents and determine a facility location that approximately optimizes several group-fair objectives. We first consider the maximum total and average group cost (group-fair) objectives. For these objectives, we propose deterministic mechanisms that achieve 3-approximation ratios and provide matching lower bounds. We then provide the characterization of 2-candidate strategyproof randomized mechanisms. Leveraging the characterization, we design randomized mechanisms with improved approximation ratios of 2 for both objectives. We also provide randomized lower bounds of 5/4 for both objectives. Moreover, we investigate intergroup and intragroup fairness (IIF) objectives, addressing fairness between groups and within each group. We present a mechanism that achieves a 4-approximation for the IIF objectives and provide tight lower bounds. \ No newline at end of file diff --git a/data/2024/aaai/Stratified GNN Explanations through Sufficient Expansion b/data/2024/aaai/Stratified GNN Explanations through Sufficient Expansion new file mode 100644 index 0000000000..47d0b65437 --- /dev/null +++ b/data/2024/aaai/Stratified GNN Explanations through Sufficient Expansion @@ -0,0 +1 @@ +Explaining the decisions made by Graph Neural Networks (GNNs) is vital for establishing trust and ensuring fairness in critical applications such as medicine and science. The prevalence of hierarchical structure in real-world graphs/networks raises an important question on GNN interpretability: "On each level of the graph structure, which specific fraction imposes the highest influence over the prediction?"
Currently, the prevailing two categories of methods are incapable of achieving multi-level GNN explanation due to their flat or motif-centric nature. In this work, we formulate the problem of learning multi-level explanations out of GNN models and introduce a stratified explainer module, namely STFExplainer, that utilizes the concept of sufficient expansion to generate explanations on each stratum. Specifically, we learn a higher-level subgraph generator by leveraging both hierarchical structure and GNN-encoded input features. Experiment results on both synthetic and real-world datasets demonstrate the superiority of our stratified explainer on standard interpretability tasks and metrics such as fidelity and explanation recall, with an average improvement of 11% and 8% over the best alternative on each data type. The case study on material domains also confirms the value of our approach through detected multi-level graph patterns accurately reconstructing the knowledge-based ground truth. \ No newline at end of file diff --git a/data/2024/aaai/Strong Baselines for Parameter-Efficient Few-Shot Fine-Tuning b/data/2024/aaai/Strong Baselines for Parameter-Efficient Few-Shot Fine-Tuning new file mode 100644 index 0000000000..ce6bc319a5 --- /dev/null +++ b/data/2024/aaai/Strong Baselines for Parameter-Efficient Few-Shot Fine-Tuning @@ -0,0 +1 @@ +Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase on a set of base classes. Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC. Fine-tuning ViTs, however, is expensive in time, compute and storage. This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters. While these methods have shown promise, inconsistencies in experimental conditions make it difficult to disentangle their advantage from other experimental factors including the feature extractor architecture, pre-trained initialization and fine-tuning algorithm, amongst others. In our paper, we conduct a large-scale, experimentally consistent, empirical analysis to study PEFTs for few-shot image classification. Through a battery of over 1.8k controlled experiments on large-scale few-shot benchmarks including Meta-Dataset and ORBIT, we uncover novel insights on PEFTs that cast light on their efficacy in fine-tuning ViTs for few-shot classification. Through our controlled empirical study, we have two main findings: (i) Fine-tuning just the LayerNorm parameters (which we call LN-Tune) during few-shot adaptation is an extremely strong baseline across ViTs pre-trained with both self-supervised and supervised objectives, (ii) For self-supervised ViTs, we find that simply learning a set of scaling parameters for each attention matrix (which we call Attn-Scale) along with a domain-residual adapter (DRA) module leads to state-of-the-art performance (while being ~9x more parameter-efficient) on Meta-Dataset. Our empirical findings set strong baselines and call for rethinking the current design of PEFT methods for FSC. 
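As an illustration of the LN-Tune baseline mentioned above, the sketch below (PyTorch) freezes every parameter of a pre-trained ViT except the LayerNorm affine parameters; torchvision's vit_b_16 is used only as a convenient stand-in backbone, and in practice a classification head for the novel classes would also be adapted.

import torch.nn as nn
from torchvision.models import vit_b_16

model = vit_b_16(weights="IMAGENET1K_V1")  # stand-in pre-trained ViT backbone

# Freeze everything, then unfreeze only the LayerNorm affine parameters.
for param in model.parameters():
    param.requires_grad = False
for module in model.modules():
    if isinstance(module, nn.LayerNorm):
        for param in module.parameters():
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"tuning {trainable}/{total} parameters ({100 * trainable / total:.3f}%)")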
\ No newline at end of file diff --git a/data/2024/aaai/Stronger and Transferable Node Injection Attacks b/data/2024/aaai/Stronger and Transferable Node Injection Attacks new file mode 100644 index 0000000000..6371a47133 --- /dev/null +++ b/data/2024/aaai/Stronger and Transferable Node Injection Attacks @@ -0,0 +1 @@ +Despite the increasing popularity of graph neural networks (GNNs), the security risks associated with their deployment have not been well explored. Existing works follow the standard adversarial attacks to maximize cross-entropy loss within an L-infinity norm bound. We analyze the robustness of GNNs against node injection attacks (NIAs) in black-box settings by allowing new nodes to be injected and attacked. In this work, we propose to design stronger and transferable NIAs. First, we propose the margin-aware attack (MAA) that uses a maximum margin loss to generate NIAs. We then propose a novel margin- and direction-aware attack (MDA) that diversifies the initial directions of the MAA attack by minimizing the cosine similarity of the injected nodes with respect to their respective random initialization in addition to the maximization of max-margin loss. This makes the NIAs stronger. We further observe that using the L2 norm of gradients in the attack step leads to an enhanced diversity amongst the node features, thereby further enhancing the strength of the attack. We incorporate transferability in NIAs by perturbing the surrogate model before generating the attack. An analysis of the eigenspectrum density of the Hessian of the loss emphasizes that perturbing the weights of the surrogate model improves the transferability. Our experimental results demonstrate that the proposed resilient node injection attack (R-NIA) consistently outperforms PGD by margins of about 7-15% on both large and small graph datasets. R-NIA is significantly stronger and more transferable than existing NIAs on graph robustness benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Structural Entropy Based Graph Structure Learning for Node Classification b/data/2024/aaai/Structural Entropy Based Graph Structure Learning for Node Classification new file mode 100644 index 0000000000..a9f564d1c8 --- /dev/null +++ b/data/2024/aaai/Structural Entropy Based Graph Structure Learning for Node Classification @@ -0,0 +1 @@ +As one of the most common tasks in graph data analysis, node classification is frequently solved by using graph structure learning (GSL) techniques to optimize graph structures and learn suitable graph neural networks. Most of the existing GSL methods focus on fusing different structural features (basic views) extracted from the graph, but very little graph semantics, like hierarchical communities, has been incorporated. Thus, they might be insufficient when dealing with graphs containing noise from real-world complex systems. To address this issue, we propose a novel and effective GSL framework for node classification based on structural information theory. Specifically, we first prove that an encoding tree with the minimal structural entropy could contain sufficient information for node classification and eliminate redundant noise via the graph's hierarchical abstraction. Then, we provide an efficient algorithm for constructing the encoding tree to enhance the basic views. Combining the community influence deduced from the encoding tree and the prediction confidence of each view, we further fuse the enhanced views to generate the optimal structure.
Finally, we conduct extensive experiments on a variety of datasets. The results demonstrate that our method outperforms state-of-the-art competitors in effectiveness and robustness. \ No newline at end of file diff --git a/data/2024/aaai/Structural Information Enhanced Graph Representation for Link Prediction b/data/2024/aaai/Structural Information Enhanced Graph Representation for Link Prediction new file mode 100644 index 0000000000..f0c9721e7f --- /dev/null +++ b/data/2024/aaai/Structural Information Enhanced Graph Representation for Link Prediction @@ -0,0 +1 @@ +Link prediction is a fundamental task of graph machine learning, and Graph Neural Network (GNN) based methods have become the mainstream approach due to their good performance. However, the typical practice learns node representations through neighborhood aggregation, lacking awareness of the structural relationships between target nodes. Recently, some methods have attempted to address this issue by using node labeling tricks. However, they still rely on the node-centric neighborhood message passing of GNNs, which we believe involves two limitations in terms of information perception and transmission for link prediction. First, it cannot perceive long-range structural information due to the restricted receptive fields. Second, a node-centric model may lose information on a link-centric task. In addition, we empirically find that the neighbor node features could introduce noise for link prediction. To address these issues, we propose a structural information enhanced link prediction framework, which removes the neighbor node features while fitting neighborhood graph structures in a more focused manner through the GNN. Furthermore, we introduce the Binary Structural Transformer (BST) to encode the structural relationships between target nodes, compensating for the deficiency of the GNN. Our approach achieves remarkable results on multiple popular benchmarks, including ranking first on ogbl-ppa, ogbl-citation2 and Pubmed. \ No newline at end of file diff --git a/data/2024/aaai/Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception b/data/2024/aaai/Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception new file mode 100644 index 0000000000..757efc3ce9 --- /dev/null +++ b/data/2024/aaai/Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception @@ -0,0 +1 @@ +Understanding vehicles in images is important for various applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception in different tasks and might thus lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates structural information, including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions, for effective masked vehicle appearance reconstruction. To be specific, we explicitly extract the sketch lines of vehicles as a form of the spatial structure to guide vehicle reconstruction.
More comprehensive knowledge, distilled from the large CLIP model based on the similarity between paired/unpaired vehicle image-text samples, is further taken into consideration to help achieve a better understanding of vehicles. A large-scale dataset is built to pre-train our model, termed Autobot1M, which contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE. The source code and pre-trained models will be released at https://github.com/Event-AHU/VehicleMAE. \ No newline at end of file diff --git a/data/2024/aaai/Structurally Guided Task Decomposition in Spatial Navigation Tasks (Student Abstract) b/data/2024/aaai/Structurally Guided Task Decomposition in Spatial Navigation Tasks (Student Abstract) new file mode 100644 index 0000000000..5ef01ef3f5 --- /dev/null +++ b/data/2024/aaai/Structurally Guided Task Decomposition in Spatial Navigation Tasks (Student Abstract) @@ -0,0 +1 @@ +How are people able to plan so efficiently despite limited cognitive resources? We aimed to answer this question by extending an existing model of human task decomposition that can explain a wide range of simple planning problems by adding structure information to the task to facilitate planning in more complex tasks. The extended model was then applied to the more complex planning domain of spatial navigation. Our results suggest that our framework can correctly predict the navigation strategies of the majority of the participants in an online experiment. \ No newline at end of file diff --git a/data/2024/aaai/Structure-Aware Multimodal Sequential Learning for Visual Dialog b/data/2024/aaai/Structure-Aware Multimodal Sequential Learning for Visual Dialog new file mode 100644 index 0000000000..7057baf7b9 --- /dev/null +++ b/data/2024/aaai/Structure-Aware Multimodal Sequential Learning for Visual Dialog @@ -0,0 +1 @@ +With the ability to collect vast amounts of image and natural language data from the web, there has been a remarkable advancement in Large-scale Language Models (LLMs). This progress has led to the emergence of chatbots and dialogue systems capable of fluent conversations with humans. As the variety of devices enabling interactions between humans and agents expands, and the performance of text-based dialogue systems improves, research on visual dialog has recently been proposed. However, visual dialog requires understanding sequences of pairs consisting of images and sentences, making it challenging to gather sufficient data for training large-scale models from the web. In this paper, we propose a new multimodal learning method leveraging existing large-scale models designed for each modality, to enable model training for visual dialog with small visual dialog datasets. The key ideas of our approach are: 1) storing the history or context during the progression of visual dialog in the form of spatiotemporal graphs, and 2) introducing small modulation blocks between modality-specific models and the graphs to align the semantic spaces. For implementation, we introduce a novel structure-aware cross-attention method, which retrieves relevant image and text knowledge for utterance generation from the pretrained models. For experiments, we achieved a new state-of-the-art performance on three visual dialog datasets, including the most challenging one, COMET.
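A minimal sketch of the retrieval step behind the structure-aware cross-attention described above (PyTorch): an utterance query attends over node embeddings of a spatiotemporal dialog-history graph. The dimensions and random tensors are illustrative placeholders rather than the paper's actual module.

import torch
import torch.nn as nn

d_model, num_nodes = 256, 12
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

query = torch.randn(1, 1, d_model)                 # current utterance representation
graph_nodes = torch.randn(1, num_nodes, d_model)   # dialog-history graph node embeddings

# The query attends over the graph nodes; the attended summary would be fed
# to the decoder for utterance generation.
retrieved, attn_weights = cross_attn(query, graph_nodes, graph_nodes)
print(retrieved.shape, attn_weights.shape)         # (1, 1, 256) and (1, 1, 12)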
\ No newline at end of file diff --git a/data/2024/aaai/Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations b/data/2024/aaai/Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations new file mode 100644 index 0000000000..c80dc1b888 --- /dev/null +++ b/data/2024/aaai/Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations @@ -0,0 +1 @@ +Large-scale vision-language pre-training has achieved significant performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. The models cannot make a distinction between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning multi-modal representations. In this paper, we present an end-to-end framework Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. Firstly, we use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations. Moreover, a Knowledge-Enhance Encoder (KEE) is proposed to leverage SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on VG-Attribution and VG-Relation datasets, with 12.5% and 4.1% ahead of the multi-modal SOTA model respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances the structured representations while maintaining the ability of general representations. Our code is available at https://github.com/zjukg/Structure-CLIP. \ No newline at end of file diff --git a/data/2024/aaai/Students' Perceptions and Preferences of Generative Artificial Intelligence Feedback for Programming b/data/2024/aaai/Students' Perceptions and Preferences of Generative Artificial Intelligence Feedback for Programming new file mode 100644 index 0000000000..3fd23abbe7 --- /dev/null +++ b/data/2024/aaai/Students' Perceptions and Preferences of Generative Artificial Intelligence Feedback for Programming @@ -0,0 +1 @@ +The rapid evolution of artificial intelligence (AI), specifically large language models (LLMs), has opened opportunities for various educational applications. This paper explored the feasibility of utilizing ChatGPT, one of the most popular LLMs, for automating feedback for Java programming assignments in an introductory computer science (CS1) class. Specifically, this study focused on three questions: 1) To what extent do students view LLM-generated feedback as formative? 2) How do students see the comparative affordances of feedback prompts that include their code, vs. those that exclude it? 3) What enhancements do students suggest for improving LLM-generated feedback? To address these questions, we generated automated feedback using the ChatGPT API for four lab assignments in a CS1 class. The survey results revealed that students perceived the feedback as aligning well with formative feedback guidelines established by Shute. 
Additionally, students showed a clear preference for feedback generated by including the students' code as part of the LLM prompt, and our thematic study indicated that the preference was mainly attributed to the specificity, clarity, and corrective nature of the feedback. Moreover, this study found that students generally expected specific and corrective feedback with sufficient code examples, but had divergent opinions on the tone of the feedback. This study demonstrated that ChatGPT could generate Java programming assignment feedback that students perceived as formative. It also offered insights into the specific improvements that would make the ChatGPT-generated feedback useful for students. \ No newline at end of file diff --git a/data/2024/aaai/Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style b/data/2024/aaai/Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style new file mode 100644 index 0000000000..a0047f4a8f --- /dev/null +++ b/data/2024/aaai/Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style @@ -0,0 +1 @@ +Although automatically animating audio-driven talking heads has recently received growing interest, previous efforts have mainly concentrated on achieving lip synchronization with the audio, neglecting two crucial elements for generating expressive videos: emotion style and art style. In this paper, we present an innovative audio-driven talking face generation method called Style2Talker. It involves two stylized stages, namely Style-E and Style-A, which integrate text-controlled emotion style and picture-controlled art style into the final output. In order to prepare the scarce emotional text descriptions corresponding to the videos, we propose a labor-free paradigm that employs large-scale pretrained models to automatically annotate emotional text labels for existing audio-visual datasets. Incorporating the synthetic emotion texts, the Style-E stage utilizes a large-scale CLIP model to extract emotion representations, which are combined with the audio, serving as the condition for an efficient latent diffusion model designed to produce emotional motion coefficients of a 3DMM model. Moving on to the Style-A stage, we develop a coefficient-driven motion generator and an art-specific style path embedded in the well-known StyleGAN. This allows us to synthesize high-resolution artistically stylized talking head videos using the generated emotional motion coefficients and an art style source picture. Moreover, to better preserve image details and avoid artifacts, we provide StyleGAN with the multi-scale content features extracted from the identity image and refine its intermediate feature maps by the designed content encoder and refinement network, respectively. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art methods in terms of audio-lip synchronization and performance of both emotion style and art style.
\ No newline at end of file diff --git a/data/2024/aaai/StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis b/data/2024/aaai/StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis new file mode 100644 index 0000000000..7edaf1c6ba --- /dev/null +++ b/data/2024/aaai/StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis @@ -0,0 +1 @@ +Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, the endeavor to model the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expressiveness. Moreover, existing SVS methods encounter a decline in the quality of synthesized singing voices in OOD scenarios, as they rest upon the assumption that the target vocal attributes are discernible during the training phase. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches for enhanced effectiveness: 1) the Residual Style Adaptor (RSA) which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN) to perturb the style attributes within the content representation during the training phase and thus improve the model generalization. Our extensive evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Access to singing voice samples can be found at https://stylesinger.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/Submodel Enumeration for CTL Is Hard b/data/2024/aaai/Submodel Enumeration for CTL Is Hard new file mode 100644 index 0000000000..0607d0ff65 --- /dev/null +++ b/data/2024/aaai/Submodel Enumeration for CTL Is Hard @@ -0,0 +1 @@ +Expressing system specifications using Computation Tree Logic (CTL) formulas, formalising programs using Kripke structures, and then model checking the system is an established workflow in program verification and has wide applications in AI. In this paper, we consider the task of model enumeration, which asks for a uniform stream of output systems that satisfy the given specification. We show that, given a CTL formula and a system (potentially falsified by the formula), enumerating satisfying submodels is always hard for CTL--regardless of which subset of CTL-operators is considered. As a silver lining on the horizon, we present fragments via restrictions on the allowed Boolean functions that still allow for fast enumeration. \ No newline at end of file diff --git a/data/2024/aaai/Successive POI Recommendation via Brain-Inspired Spatiotemporal Aware Representation b/data/2024/aaai/Successive POI Recommendation via Brain-Inspired Spatiotemporal Aware Representation new file mode 100644 index 0000000000..147672c5a6 --- /dev/null +++ b/data/2024/aaai/Successive POI Recommendation via Brain-Inspired Spatiotemporal Aware Representation @@ -0,0 +1 @@ +Existing approaches usually perform spatiotemporal representation in the spatial and temporal dimensions, respectively, which isolates the spatial and temporal natures of the target and leads to sub-optimal embeddings. 
Neuroscience research has shown that the mammalian brain entorhinal-hippocampal system provides efficient graph representations for general knowledge. Moreover, entorhinal grid cells present concise spatial representations, while hippocampal place cells represent perception conjunctions effectively. Thus, the entorhinal-hippocampal system provides a novel angle for spatiotemporal representation, which inspires us to propose the SpatioTemporal aware Embedding framework (STE) and apply it to POIs (STEP). STEP considers two types of POI-specific representations: sequential representation and spatiotemporal conjunctive representation, learned using sparse unlabeled data based on the proposed graph-building policies. Notably, STEP jointly represents the spatiotemporal natures of POIs using both observations and contextual information from integrated spatiotemporal dimensions by constructing a spatiotemporal context graph. Furthermore, we introduce a successive POI recommendation method using STEP, which achieves state-of-the-art performance on two benchmarks. In addition, we demonstrate the excellent performance of the STE representation approach in other spatiotemporal representation-centered tasks through a case study of the traffic flow prediction problem. Therefore, this work provides a novel solution to spatiotemporal representation and paves a new way for spatiotemporal modeling-related tasks. \ No newline at end of file diff --git a/data/2024/aaai/Summarizing Stream Data for Memory-Constrained Online Continual Learning b/data/2024/aaai/Summarizing Stream Data for Memory-Constrained Online Continual Learning new file mode 100644 index 0000000000..a093fd6109 --- /dev/null +++ b/data/2024/aaai/Summarizing Stream Data for Memory-Constrained Online Continual Learning @@ -0,0 +1 @@ +Replay-based methods have proved their effectiveness on online continual learning by rehearsing past samples from an auxiliary memory. With many efforts made on improving training schemes based on the memory, however, the information carried by each sample in the memory remains under-investigated. Under circumstances with restricted storage space, the informativeness of the memory becomes critical for effective replay. Although some works design specific strategies to select representative samples, by only employing a small number of original images, the storage space is still not well utilized. To this end, we propose to Summarize the knowledge from the Stream Data (SSD) into more informative samples by distilling the training characteristics of real images. Through maintaining the consistency of training gradients and relationship to the past tasks, the summarized samples are more representative for the stream data compared to the original images. Extensive experiments are conducted on multiple online continual learning benchmarks to support that the proposed SSD method significantly enhances the replay effects. We demonstrate that with limited extra computational overhead, SSD provides more than 3% accuracy boost for sequential CIFAR-100 under extremely restricted memory buffer. Code in https://github.com/vimar-gu/SSD. 
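The gradient-based summarization idea above can be illustrated with a small PyTorch sketch: synthetic memory samples are optimized so the training gradient they induce matches that of an incoming real batch. The tiny linear model, batch sizes, and hyperparameters are illustrative stand-ins, not the SSD implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # tiny stand-in network
criterion = nn.CrossEntropyLoss()

real_x, real_y = torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))
syn_x = torch.randn(10, 3, 32, 32, requires_grad=True)           # summarized memory samples
syn_y = torch.arange(10)
optimizer = torch.optim.SGD([syn_x], lr=0.1)

# Gradient induced by the real stream batch, treated as a fixed target.
real_grads = [g.detach() for g in torch.autograd.grad(
    criterion(model(real_x), real_y), model.parameters())]

for _ in range(10):
    syn_grads = torch.autograd.grad(
        criterion(model(syn_x), syn_y), model.parameters(), create_graph=True)
    match_loss = sum(
        1.0 - F.cosine_similarity(sg.flatten(), rg.flatten(), dim=0)
        for sg, rg in zip(syn_grads, real_grads))
    optimizer.zero_grad()
    match_loss.backward()
    optimizer.step()
print(float(match_loss))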
\ No newline at end of file diff --git a/data/2024/aaai/Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection b/data/2024/aaai/Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection new file mode 100644 index 0000000000..de898423c4 --- /dev/null +++ b/data/2024/aaai/Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection @@ -0,0 +1 @@ +LiDAR-based 3D object detection models inevitably struggle under rainy conditions due to the degraded and noisy scanning signals. Previous research has attempted to address this by simulating the noise from rain to improve the robustness of detection models. However, significant disparities exist between simulated and actual rain-impacted data points. In this work, we propose a novel rain simulation method, termed DRET, that unifies Dynamics and Rainy Environment Theory to provide a cost-effective means of expanding the available realistic rain data for 3D detection training. Furthermore, we present a Sunny-to-Rainy Knowledge Distillation (SRKD) approach to enhance 3D detection under rainy conditions. Extensive experiments on the Waymo Open Dataset show that, when combined with the state-of-the-art DSVT model and other classical 3D detectors, our proposed framework demonstrates significant detection accuracy improvements, without losing efficiency. Remarkably, our framework also improves detection capabilities under sunny conditions, thereby offering a robust solution for 3D detection regardless of whether the weather is rainy or sunny. \ No newline at end of file diff --git a/data/2024/aaai/SuperJunction: Learning-Based Junction Detection for Retinal Image Registration b/data/2024/aaai/SuperJunction: Learning-Based Junction Detection for Retinal Image Registration new file mode 100644 index 0000000000..d3671c5e39 --- /dev/null +++ b/data/2024/aaai/SuperJunction: Learning-Based Junction Detection for Retinal Image Registration @@ -0,0 +1 @@ +Keypoint-based approaches have been shown to be promising for retinal image registration, which superimpose two or more images from different views based on keypoint detection and description. However, existing approaches suffer from ineffective keypoint detector and descriptor training. Meanwhile, the non-linear mapping from 3D retinal structure to 2D images is often neglected. In this paper, we propose a novel learning-based junction detection approach for retinal image registration, which enhances both the keypoint detector and descriptor training. To improve the keypoint detection, it uses multi-task vessel detection to regularize the model training, which helps to learn more representative features and reduce the risk of over-fitting. To achieve effective training for keypoint description, a new constrained negative sampling approach is proposed to compute the descriptor loss. Moreover, we also consider the non-linearity between retinal images from different views during matching. Experimental results on the FIRE dataset show that our method achieves a mean area under curve of 0.850, which is 12.6% higher than the 0.755 achieved by the state-of-the-art method. All code is available at https://github.com/samjcheng/SuperJunction. 
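The SuperJunction abstract above does not spell out how its constrained negative sampling works. The sketch below shows one plausible instantiation, where negatives for the descriptor loss are restricted to keypoints lying farther than a chosen pixel radius from the true match; the function name, the spatial-distance constraint, and the triplet-style hinge are all assumptions for illustration, not the authors' formulation:

```python
import torch
import torch.nn.functional as F

def constrained_descriptor_loss(desc_a, desc_b, kpts_b, margin=1.0, min_dist=32.0):
    """Triplet-style descriptor loss where, for each matched pair
    (desc_a[i], desc_b[i]), negatives may only come from keypoints of image B
    lying farther than `min_dist` pixels from the true match.

    desc_a, desc_b: (N, D) descriptors of matched keypoints; kpts_b: (N, 2) pixels.
    """
    pos = (desc_a - desc_b).pow(2).sum(dim=1)            # (N,) positive distances

    ddist = torch.cdist(desc_a, desc_b)                  # (N, N) descriptor distances
    sdist = torch.cdist(kpts_b, kpts_b)                  # (N, N) pixel distances

    # Forbid the true match and any spatially close keypoint as a negative.
    ddist = ddist.masked_fill(sdist < min_dist, float("inf"))
    hardest_neg = ddist.min(dim=1).values.pow(2)         # (N,) hardest allowed negative

    return F.relu(margin + pos - hardest_neg).mean()
```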
\ No newline at end of file diff --git a/data/2024/aaai/Superposed Atomic Representation for Robust High-Dimensional Data Recovery of Multiple Low-Dimensional Structures b/data/2024/aaai/Superposed Atomic Representation for Robust High-Dimensional Data Recovery of Multiple Low-Dimensional Structures new file mode 100644 index 0000000000..9f8d4fafdb --- /dev/null +++ b/data/2024/aaai/Superposed Atomic Representation for Robust High-Dimensional Data Recovery of Multiple Low-Dimensional Structures @@ -0,0 +1 @@ +This paper proposes a unified Superposed Atomic Representation (SAR) framework for high-dimensional data recovery with multiple low-dimensional structures. The data can be in various forms ranging from vectors to tensors. The goal of SAR is to recover different components from their sum, where each component has a low-dimensional structure, such as sparsity, low-rankness, or lying in a low-dimensional subspace. Examples of SAR include, but are not limited to, Robust Sparse Representation (RSR), Robust Principal Component Analysis (RPCA), Tensor RPCA (TRPCA), and Outlier Pursuit (OP). We establish a theoretical guarantee for SAR. To further improve SAR, we also develop a Weighted SAR (WSAR) framework by paying more attention to the significant atoms of each component and penalizing them less. An effective optimization algorithm is devised for WSAR and the convergence of the algorithm is rigorously proved. By leveraging WSAR as a general platform, several new methods are proposed for high-dimensional data recovery. Experiments on real data demonstrate the superiority of WSAR for various data recovery problems. \ No newline at end of file diff --git a/data/2024/aaai/Supervision Interpolation via LossMix: Generalizing Mixup for Object Detection and Beyond b/data/2024/aaai/Supervision Interpolation via LossMix: Generalizing Mixup for Object Detection and Beyond new file mode 100644 index 0000000000..f1ed16fedf --- /dev/null +++ b/data/2024/aaai/Supervision Interpolation via LossMix: Generalizing Mixup for Object Detection and Beyond @@ -0,0 +1 @@ +The success of data mixing augmentations in image classification tasks has been widely recognized. However, these techniques cannot be readily applied to object detection due to challenges such as spatial misalignment, foreground/background distinction, and plurality of instances. To tackle these issues, we first introduce a novel conceptual framework called Supervision Interpolation (SI), which offers a fresh perspective on interpolation-based augmentations by relaxing and generalizing Mixup. Based on SI, we propose LossMix, a simple yet versatile and effective regularization that enhances the performance and robustness of object detectors and more. Our key insight is that we can effectively regularize the training on mixed data by interpolating their loss errors instead of ground truth labels. Empirical results on the PASCAL VOC and MS COCO datasets demonstrate that LossMix can consistently outperform state-of-the-art methods widely adopted for detection. Furthermore, by jointly leveraging LossMix with unsupervised domain adaptation, we successfully improve existing approaches and set a new state of the art for cross-domain object detection. 
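The core idea stated in the LossMix abstract above, regularizing training on mixed inputs by interpolating loss errors rather than labels, can be sketched as follows. The classification form is shown for brevity, and the Beta-sampled mixing ratio is an assumption carried over from standard Mixup practice:

```python
import torch
import torch.nn.functional as F

def lossmix_step(model, x_a, y_a, x_b, y_b, alpha=1.0):
    """Supervision Interpolation in its simplest form: mix the inputs, then
    interpolate the loss errors of the two original targets instead of
    mixing the labels themselves."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_a + (1.0 - lam) * x_b            # standard Mixup on the inputs
    logits = model(x_mix)

    # Interpolate the per-target losses rather than the ground-truth labels.
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
```

For classification with cross-entropy this coincides with ordinary Mixup; the abstract's point is that the loss-interpolation view also carries over to detection losses, where mixing ground-truth labels is ill-defined.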
\ No newline at end of file diff --git a/data/2024/aaai/Supporting Upper Elementary Students in Learning AI Concepts with Story-Driven Game-Based Learning b/data/2024/aaai/Supporting Upper Elementary Students in Learning AI Concepts with Story-Driven Game-Based Learning new file mode 100644 index 0000000000..309ceff8f1 --- /dev/null +++ b/data/2024/aaai/Supporting Upper Elementary Students in Learning AI Concepts with Story-Driven Game-Based Learning @@ -0,0 +1 @@ +Artificial intelligence (AI) is quickly finding broad application in every sector of society. This rapid expansion of AI has increased the need to cultivate an AI-literate workforce, and it calls for introducing AI education into K-12 classrooms to foster students’ awareness and interest in AI. With rich narratives and opportunities for situated problem solving, story-driven game-based learning offers a promising approach for creating engaging and effective K-12 AI learning experiences. In this paper, we present our ongoing work to iteratively design, develop, and evaluate a story-driven game-based learning environment focused on AI education for upper elementary students (ages 8 to 11). The game features a science inquiry problem centering on an endangered species and incorporates a Use-Modify-Create scaffolding framework to promote student learning. We present findings from an analysis of data collected from 16 students playing the game's quest focused on AI planning. Results suggest that the scaffolding framework provided students with the knowledge they needed to advance through the quest and that overall, students experienced positive learning outcomes. \ No newline at end of file diff --git a/data/2024/aaai/Suppressing Uncertainty in Gaze Estimation b/data/2024/aaai/Suppressing Uncertainty in Gaze Estimation new file mode 100644 index 0000000000..c5587a175a --- /dev/null +++ b/data/2024/aaai/Suppressing Uncertainty in Gaze Estimation @@ -0,0 +1 @@ +Uncertainty in gaze estimation manifests in two aspects: 1) low-quality images caused by occlusion, blurriness, inconsistent eye movements, or even non-face images; 2) uncorrected labels resulting from the misalignment between the labeled and actual gaze points during the annotation process. Allowing these uncertainties to participate in training hinders the improvement of gaze estimation. To tackle these challenges, in this paper, we propose an effective solution, named Suppressing Uncertainty in Gaze Estimation (SUGE), which introduces a novel triplet-label consistency measurement to estimate and reduce the uncertainties. Specifically, for each training sample, we propose to estimate a novel ``neighboring label'' calculated by a linearly weighted projection from the neighbors to capture the similarity relationship between image features and their corresponding labels, which can be incorporated with the predicted pseudo label and ground-truth label for uncertainty estimation. By modeling such triplet-label consistency, we can largely reduce the negative effects of unqualified images and wrong labels through our designed sample weighting and label correction strategies. Experimental results on the gaze estimation benchmarks indicate that our proposed SUGE achieves state-of-the-art performance. 
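As a rough illustration of the "neighboring label" used by SUGE above, the sketch below averages the gaze labels of the k nearest neighbors in feature space. The paper describes a linearly weighted projection from the neighbors; the softmax-over-cosine-similarity weighting, k, and temperature here are stand-in assumptions, not the authors' exact construction:

```python
import torch
import torch.nn.functional as F

def neighboring_labels(features, labels, k=8, tau=0.1):
    """For every sample, build a 'neighboring label' as a similarity-weighted
    combination of the gaze labels of its k nearest neighbors in feature space.

    features: (N, D) image features; labels: (N, 2) gaze angles (yaw, pitch).
    Returns an (N, 2) tensor of neighboring labels.
    """
    f = F.normalize(features, dim=1)
    sim = f @ f.t()                                    # (N, N) cosine similarities
    sim.fill_diagonal_(float("-inf"))                  # a sample is not its own neighbor
    topk_sim, topk_idx = sim.topk(k, dim=1)            # (N, k)
    weights = torch.softmax(topk_sim / tau, dim=1)     # weights over the k neighbors
    return torch.einsum("nk,nkd->nd", weights, labels[topk_idx])
```

In SUGE the triplet-label consistency between this neighboring label, the model's pseudo label, and the ground truth is then used for sample weighting and label correction; that part is not sketched here.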
\ No newline at end of file diff --git a/data/2024/aaai/SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation b/data/2024/aaai/SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation new file mode 100644 index 0000000000..f375a67ffc --- /dev/null +++ b/data/2024/aaai/SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation @@ -0,0 +1 @@ +The Segment Anything Model (SAM) is a powerful foundation model that has revolutionised image segmentation. To apply SAM to surgical instrument segmentation, a common approach is to locate precise points or boxes of instruments and then use them as prompts for SAM in a zero-shot manner. However, we observe two problems with this naive pipeline: (1) the domain gap between natural objects and surgical instruments leads to inferior generalisation of SAM; and (2) SAM relies on precise point or box locations for accurate segmentation, requiring either extensive manual guidance or a well-performing specialist detector for prompt preparation, which leads to a complex multi-stage pipeline. To address these problems, we introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to effectively integrate surgical-specific information with SAM’s pre-trained knowledge for improved generalisation. Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which directly generates prompt embeddings from class prototypes and eliminates the use of explicit prompts for improved robustness and a simpler pipeline. In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning, further enhancing the discrimination of the class prototypes for more accurate class prompting. The results of extensive experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that SurgicalSAM achieves state-of-the-art performance while only requiring a small number of tunable parameters. The source code is available at https://github.com/wenxi-yue/SurgicalSAM. \ No newline at end of file diff --git a/data/2024/aaai/Sustainability of Data Center Digital Twins with Reinforcement Learning b/data/2024/aaai/Sustainability of Data Center Digital Twins with Reinforcement Learning new file mode 100644 index 0000000000..829cd27b54 --- /dev/null +++ b/data/2024/aaai/Sustainability of Data Center Digital Twins with Reinforcement Learning @@ -0,0 +1 @@ +In recent years, the increasing emphasis on sustainability and carbon footprint reduction has required the exploration of innovative optimization techniques for data center operators. In this paper, we introduce a Concurrent Carbon Footprint Reduction (C2FR) Reinforcement Learning framework, designed to optimize data center energy consumption, load shifting, and battery operation decisions in real time. The C2FR framework utilizes short-term forecasts and incorporates Reinforcement Learning Energy ($A_{E}$), Battery ($A_{BAT}$) and Load-Shifting ($A_{LS}$) agents to optimize and effectively manage the intricate dependencies and information exchange between these individual optimization strategies, thus overcoming the limitations of existing isolated approaches. When compared to state-of-the-art algorithms, the C2FR framework demonstrates its effectiveness across various data center scenarios. The AE agent achieves a 7.9% reduction in pollutant emissions and a 7.8% reduction in energy cost on average. 
Moreover, the C2FR framework enables further emission reductions through the application of the battery and load-shifting optimization, leading to a total reduction of 10.17% in pollutant emissions on average over different data center configurations. This highlights the potential of the C2FR framework in addressing data center sustainability challenges and improving real-time carbon footprint optimization. \ No newline at end of file diff --git a/data/2024/aaai/Swift-Mapping: Online Neural Implicit Dense Mapping in Urban Scenes b/data/2024/aaai/Swift-Mapping: Online Neural Implicit Dense Mapping in Urban Scenes new file mode 100644 index 0000000000..a58d7cb030 --- /dev/null +++ b/data/2024/aaai/Swift-Mapping: Online Neural Implicit Dense Mapping in Urban Scenes @@ -0,0 +1 @@ +Online dense mapping of urban scenes is of paramount importance for scene understanding in autonomous navigation. Traditional online dense mapping methods fuse sensor measurements (vision, lidar, etc.) across time and space via explicit geometric correspondence. Recently, NeRF-based methods have proved the superiority of neural implicit representations by high-fidelity reconstruction of large-scale city scenes. However, it remains an open problem how to integrate powerful neural implicit representations into online dense mapping. Existing methods are restricted to constrained indoor environments and are too computationally expensive to meet online requirements. To this end, we propose Swift-Mapping, an online neural implicit dense mapping framework in urban scenes. We introduce a novel neural implicit octomap (NIO) structure that provides efficient neural representation for large and dynamic urban scenes while retaining online update capability. Based on that, we propose an online neural dense mapping framework that effectively manages and updates neural octree voxel features. Our approach achieves SOTA reconstruction accuracy while being more than 10x faster in reconstruction speed, demonstrating the superior performance of our method in both accuracy and efficiency. \ No newline at end of file diff --git a/data/2024/aaai/SwiftPillars: High-Efficiency Pillar Encoder for Lidar-Based 3D Detection b/data/2024/aaai/SwiftPillars: High-Efficiency Pillar Encoder for Lidar-Based 3D Detection new file mode 100644 index 0000000000..0a7c5e7322 --- /dev/null +++ b/data/2024/aaai/SwiftPillars: High-Efficiency Pillar Encoder for Lidar-Based 3D Detection @@ -0,0 +1 @@ +Lidar-based 3D Detection is one of the significant components of Autonomous Driving. However, current methods over-focus on improving the performance of 3D Lidar perception, which causes network architectures to become complicated and hard to deploy. Thus, such methods are difficult to apply to real-time processing in Autonomous Driving. In this paper, we propose a high-efficiency network, SwiftPillars, which includes a Swift Pillar Encoder (SPE) and a Multi-scale Aggregation Decoder (MAD). The SPE is constructed from a concise Dual-attention Module with lightweight operators. The Dual-attention Module utilizes feature pooling, matrix multiplication, etc. to speed up point-wise and channel-wise attention extraction and fusion. The MAD interconnects multiple scale features extracted by SPE with minimal computational cost to boost performance. In our experiments, our proposal accomplishes 61.3% NDS and 53.2% mAP on the nuScenes dataset. 
In addition, we evaluate inference time on several platforms (P4, T4, A2, MLU370, RTX3080), where SwiftPillars achieves up to 13.3ms (75FPS) on NVIDIA Tesla T4. Compared with PointPillars, SwiftPillars is on average 26.58% faster in inference speed with equivalent GPUs and a higher mAP of approximately 3.2% in the nuScenes dataset. \ No newline at end of file diff --git a/data/2024/aaai/SwitchTab: Switched Autoencoders Are Effective Tabular Learners b/data/2024/aaai/SwitchTab: Switched Autoencoders Are Effective Tabular Learners new file mode 100644 index 0000000000..984388597d --- /dev/null +++ b/data/2024/aaai/SwitchTab: Switched Autoencoders Are Effective Tabular Learners @@ -0,0 +1 @@ +Self-supervised representation learning methods have achieved significant success in computer vision and natural language processing (NLP), where data samples exhibit explicit spatial or semantic dependencies. However, applying these methods to tabular data is challenging due to the less pronounced dependencies among data samples. In this paper, we address this limitation by introducing SwitchTab, a novel self-supervised method specifically designed to capture latent dependencies in tabular data. SwitchTab leverages an asymmetric encoder-decoder framework to decouple mutual and salient features among data pairs, resulting in more representative embeddings. These embeddings, in turn, contribute to better decision boundaries and lead to improved results in downstream tasks. To validate the effectiveness of SwitchTab, we conduct extensive experiments across various domains involving tabular data. The results showcase superior performance in end-to-end prediction tasks with fine-tuning. Moreover, we demonstrate that pre-trained salient embeddings can be utilized as plug-and-play features to enhance the performance of various traditional classification methods (e.g., Logistic Regression, XGBoost, etc.). Lastly, we highlight the capability of SwitchTab to create explainable representations through visualization of decoupled mutual and salient features in the latent space. \ No newline at end of file diff --git a/data/2024/aaai/SyFormer: Structure-Guided Synergism Transformer for Large-Portion Image Inpainting b/data/2024/aaai/SyFormer: Structure-Guided Synergism Transformer for Large-Portion Image Inpainting new file mode 100644 index 0000000000..e50b2197b0 --- /dev/null +++ b/data/2024/aaai/SyFormer: Structure-Guided Synergism Transformer for Large-Portion Image Inpainting @@ -0,0 +1 @@ +Image inpainting is in full bloom accompanied by the progress of convolutional neural networks (CNNs) and transformers, revolutionizing the practical management of abnormity disposal, image editing, etc. However, due to the ever-mounting image resolutions and missing areas, the challenges of distorted long-range dependencies from cluttered background distributions and reduced reference information in image domain inevitably rise, which further cause severe performance degradation. To address the challenges, we propose a novel large-portion image inpainting approach, namely the Structure-Guided Synergism Transformer (SyFormer), to rectify the discrepancies in feature representation and enrich the structural cues from limited reference. Specifically, we devise a dual-routing filtering module that employs a progressive filtering strategy to eliminate invalid noise interference and establish global-level texture correlations. 
Simultaneously, the structurally compact perception module maps an affinity matrix within the introduced structural priors from a structure-aware generator, assisting in matching and filling the corresponding patches of images with large damaged portions. Moreover, we carefully assemble the aforementioned modules to achieve feature complementarity. Finally, a feature decoding alignment scheme is introduced in the decoding process, which meticulously achieves texture amalgamation across hierarchical features. Extensive experiments are conducted on two publicly available datasets, i.e., CelebA-HQ and Places2, to qualitatively and quantitatively demonstrate the superiority of our model over state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Symbol Description Reading b/data/2024/aaai/Symbol Description Reading new file mode 100644 index 0000000000..40738d4ffc --- /dev/null +++ b/data/2024/aaai/Symbol Description Reading @@ -0,0 +1 @@ +Mathematical formulas give concise representations of a document's key ideas in many natural sciences and engineering domains. The symbols that make up formulas carry semantic meaning that may differ by document or equation. What does ? mean in a given paper? Interpreting the symbols that comprise formulas requires identifying descriptions from the surrounding text. We approach this task of symbol description reading as an application of current AI technologies targeting the tuning of large language models for particular domains and automation of machine learning. Our pipeline integrates AI question answering and natural language processing to read symbol descriptions. We consider extractive and generative AI model variations and apply our pipeline on two example tasks of symbol description reading. Promising results provide motivation for wider deployment, for which we describe a microservice architecture and related challenges. \ No newline at end of file diff --git a/data/2024/aaai/Symbolic Cognitive Diagnosis via Hybrid Optimization for Intelligent Education Systems b/data/2024/aaai/Symbolic Cognitive Diagnosis via Hybrid Optimization for Intelligent Education Systems new file mode 100644 index 0000000000..61d535c5d9 --- /dev/null +++ b/data/2024/aaai/Symbolic Cognitive Diagnosis via Hybrid Optimization for Intelligent Education Systems @@ -0,0 +1 @@ +Cognitive diagnosis assessment is a fundamental and crucial task for student learning. It models the student-exercise interaction, and discovers the students' proficiency levels on each knowledge attribute. In real-world intelligent education systems, generalization and interpretability of cognitive diagnosis methods are of equal importance. However, most existing methods can hardly make the best of both worlds due to the complicated student-exercise interaction. To this end, this paper proposes a symbolic cognitive diagnosis (SCD) framework to simultaneously enhance generalization and interpretability. The SCD framework incorporates the symbolic tree to explicably represent the complicated student-exercise interaction function, and utilizes gradient-based optimization methods to effectively learn the student and exercise parameters. Meanwhile, the accompanying challenge is that we need to bridge the discrete symbolic representation and the continuous parameter optimization. To address this challenge, we propose to hybridly optimize the representation and parameters in an alternating manner. 
To fulfill SCD, it alternately learns the symbolic tree by derivative-free genetic programming and learns the student and exercise parameters via gradient-based Adam. Extensive experimental results on various real-world datasets show the superiority of SCD in both generalization and interpretability. The ablation study verifies the efficacy of each ingredient in SCD, and the case study explicitly showcases how the interpretability of SCD works. \ No newline at end of file diff --git a/data/2024/aaai/Symbolic Numeric Planning with Patterns b/data/2024/aaai/Symbolic Numeric Planning with Patterns new file mode 100644 index 0000000000..02497dbdff --- /dev/null +++ b/data/2024/aaai/Symbolic Numeric Planning with Patterns @@ -0,0 +1 @@ +In this paper, we propose a novel approach for solving linear numeric planning problems, called Symbolic Pattern Planning. Given a planning problem Pi, a bound n and a pattern --defined as an arbitrary sequence of actions-- we encode the problem of finding a plan for Pi with bound n as a formula with fewer variables and/or clauses than the state-of-the-art rolled-up and relaxed-relaxed-exists encodings. More importantly, we prove that for any given bound, it is never the case that the latter two encodings allow finding a valid plan while ours does not. On the experimental side, we consider 6 other planning systems --including the ones which participated in this year's International Planning Competition (IPC)-- and we show that our planner Patty has remarkably good comparative performances on this year's IPC problems. \ No newline at end of file diff --git a/data/2024/aaai/Symbolic Reasoning Methods for AI Planning b/data/2024/aaai/Symbolic Reasoning Methods for AI Planning new file mode 100644 index 0000000000..4c674d47d9 --- /dev/null +++ b/data/2024/aaai/Symbolic Reasoning Methods for AI Planning @@ -0,0 +1,29 @@ +Planning is the act of deliberative thinking before acting. +It is based on a symbolic model of the world and the options to act in it, usually defined in function-free first-order logic. +The task is to find a sequence of actions (a plan) that leads from a given current state to a desired goal state. +The basic, purely physical description may be augmented with a partially ordered grammar-like structure (a Hierarchical Task Network or HTN), which can describe expert knowledge, or practical, legal, or operational requirements. + + +In this talk, I will survey a variety of methods for automatically deriving plans using symbolic methods for planning -- from both my past and future research. +These symbolic methods -- in some sense -- translate planning problems into other, simpler symbolic representations and reason over them to find plans. + + +As a basis for these methods, I will first introduce relevant theoretical results on planning. +First, I will discuss the expressive power of planning formalisms (ECAI'14, ICAPS'16) and second, the computational complexity of HTN planning and related tasks such as HTN plan verification, plan modification, and plan recognition (ICAPS'15, ICAPS'16). + + +Based on these theoretical results, I will develop why SAT-based HTN planning is possible and how it can be implemented. +To this end, I will survey several of my publications at top-tier conferences, including papers at ICAPS'17, AAAI'18, AAAI'19, IJCAI'19, AAAI'20, and ICAPS'21 -- in which I developed a highly efficient SAT-based planner for HTN problems including the ability to find optimal plans as well as the grounding as a preprocessing step. 
+Here I will also give an outlook on future developments and new ideas that I propose for SAT-based planning -- including the exploitation of structures in plans (e.g., landmarks or operator-counting constraints). + +Next, I will present the idea of expressing lifted classical planning as SAT (ICAPS'22). +The resulting planner LiSAT was the first lifted SAT-based planner -- and proved highly efficient and outperformed all other lifted planners at the time of publication. +Notably, LiSAT was the first planner (lifted or grounded) and still is the only one to solve the challenging OrganicSynthesis benchmark -- and could even prove optimality for all plans. +I will also outline future ideas to further improve the efficiency of LiSAT. + + +Lastly, I introduce the notion of planning with symbolic representations (AAAI'21 and ICAPS'23). +Here one uses Binary Decision Diagrams to encode large sets of states efficiently. +For expressing the additional structure encoded by HTNs, I show how BDDs can be suitably integrated into finite automata. +Based on this representation, an efficient and optimal planning algorithm can be derived. +Additionally, I show how this algorithm can be extended to also cover oversubscription planning. \ No newline at end of file diff --git a/data/2024/aaai/Symbolic Regression Enhanced Decision Trees for Classification Tasks b/data/2024/aaai/Symbolic Regression Enhanced Decision Trees for Classification Tasks new file mode 100644 index 0000000000..d66fd5ce27 --- /dev/null +++ b/data/2024/aaai/Symbolic Regression Enhanced Decision Trees for Classification Tasks @@ -0,0 +1 @@ +We introduce a conceptually simple yet effective method to create small, compact decision trees - by using splits found via Symbolic Regression (SR). Traditional decision tree (DT) algorithms partition a dataset on axis-parallel splits. When the true boundaries are not along the feature axes, DT is likely to have a complicated structure and a dense decision boundary. In this paper, we introduce SR-Enhanced DT (SREDT) - a method which utilizes SR to increase the richness of the class of possible DT splits. We evaluate SREDT on both synthetic and real-world datasets. Despite its simplicity, our method produces surprisingly small trees that outperform both DT and oblique DT (ODT) on supervised classification tasks in terms of accuracy and F-score. We show empirically that SREDTs decrease inference time (compared to DT and ODT) and argue that they allow us to obtain more explainable descriptions of the decision process. SREDT also performs competitively against state-of-the-art tabular classification methods, including tree ensembles and deep models. Finally, we introduce a local search mechanism to improve SREDT and evaluate it on 56 PMLB datasets. This mechanism shows improved performance on 77.2% of the datasets, outperforming DT and ODT. In terms of F-Score, local SREDT outperforms DT and ODT in 82.5% and 73.7% of the datasets, respectively, and in terms of inference time, local SREDT requires 25.8% and 26.6% less inference time than DT and ODT, respectively. 
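The SREDT abstract above enriches decision-tree splits with expressions found by symbolic regression. The toy sketch below fixes a small list of candidate expressions by hand (in SREDT they would be discovered by symbolic regression, so the candidate list and Gini criterion are illustrative assumptions) and picks the node split minimizing weighted Gini impurity, which is enough to capture a boundary no axis-parallel split can:

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, expressions):
    """Pick the best node split among candidate symbolic expressions.
    Each expression maps the feature matrix X (n, d) to one value per row;
    the split is `expr(X) <= threshold`, chosen to minimize weighted Gini."""
    best = (None, None, np.inf)
    for expr in expressions:
        v = expr(X)
        for t in np.unique(v):
            left, right = y[v <= t], y[v > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (expr, t, score)
    return best

# Candidate splits: axis-parallel features plus a few symbolic combinations.
candidates = [
    lambda X: X[:, 0],
    lambda X: X[:, 1],
    lambda X: X[:, 0] * X[:, 1],            # a non-axis-parallel boundary
    lambda X: X[:, 0] / (X[:, 1] + 1e-8),
]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # no axis-parallel split separates this
expr, thr, score = best_split(X, y, candidates)
print("chosen threshold:", thr, "weighted Gini:", round(score, 3))
```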
\ No newline at end of file diff --git a/data/2024/aaai/Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning b/data/2024/aaai/Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning new file mode 100644 index 0000000000..ff573d737c --- /dev/null +++ b/data/2024/aaai/Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning @@ -0,0 +1 @@ +In deep reinforcement learning, estimating the value function to evaluate the quality of states and actions is essential. The value function is often trained using the least squares method, which implicitly assumes a Gaussian error distribution. However, a recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman operator, and violates the implicit assumption of normal error distribution in the least squares method. To address this, we proposed a method called Symmetric Q-learning, in which the synthetic noise generated from a zero-mean distribution is added to the target values to generate a Gaussian error distribution. We evaluated the proposed method on continuous control benchmark tasks in MuJoCo. It improved the sample efficiency of a state-of-the-art reinforcement learning method by reducing the skewness of the error distribution. \ No newline at end of file diff --git a/data/2024/aaai/Symmetric Self-Paced Learning for Domain Generalization b/data/2024/aaai/Symmetric Self-Paced Learning for Domain Generalization new file mode 100644 index 0000000000..7f00135df4 --- /dev/null +++ b/data/2024/aaai/Symmetric Self-Paced Learning for Domain Generalization @@ -0,0 +1,8 @@ +Deep learning methods often suffer performance degradation due to domain shift, where discrepancies exist between training and testing data distributions. +Domain generalization mitigates this problem by leveraging information from multiple source domains to enhance model generalization capabilities for unseen domains. +However, existing domain generalization methods typically present examples to the model in a random manner, overlooking the potential benefits of structured data presentation. +To bridge this gap, we propose a novel learning strategy, Symmetric Self-Paced Learning (SSPL), for domain generalization. +SSPL consists of a Symmetric Self-Paced training scheduler and a Gradient-based Difficulty Measure (GDM). +Specifically, the proposed training scheduler initially focuses on easy examples, gradually shifting emphasis to harder examples as training progresses. +GDM dynamically evaluates example difficulty through the gradient magnitude with respect to the example itself. +Experiments across five popular benchmark datasets demonstrate the effectiveness of the proposed learning strategy. \ No newline at end of file diff --git a/data/2024/aaai/Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos b/data/2024/aaai/Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos new file mode 100644 index 0000000000..21253f7ec1 --- /dev/null +++ b/data/2024/aaai/Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos @@ -0,0 +1 @@ +Recent advancements in 4D scene reconstruction using neural radiance fields (NeRF) have demonstrated the ability to represent dynamic scenes from multi-view videos. However, they fail to reconstruct the dynamic scenes and struggle to fit even the training views in unsynchronized settings. 
This happens because they employ a single latent embedding for a frame while the multi-view images at the same frame were actually captured at different moments. To address this limitation, we introduce time offsets for individual unsynchronized videos and jointly optimize the offsets with NeRF. By design, our method is applicable to various baselines and improves them by large margins. Furthermore, finding the offsets effectively synchronizes the videos without manual effort. Experiments are conducted on the common Plenoptic Video Dataset and a newly built Unsynchronized Dynamic Blender Dataset to verify the performance of our method. Project page: https://seoha-kim.github.io/sync-nerf \ No newline at end of file diff --git a/data/2024/aaai/Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction b/data/2024/aaai/Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction new file mode 100644 index 0000000000..79565edc56 --- /dev/null +++ b/data/2024/aaai/Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction @@ -0,0 +1 @@ +Few-shot Relation Extraction (FSRE) aims to extract relational facts from a sparse set of labeled corpora. Recent studies have shown promising results in FSRE by employing Pre-trained Language Models (PLMs) within the framework of supervised contrastive learning, which considers both instances and label facts. However, how to effectively harness massive instance-label pairs so that the learned representation is semantically rich has not been fully explored in this learning paradigm. To address this gap, we introduce a novel synergistic anchored contrastive pre-training framework. This framework is motivated by the insight that the diverse viewpoints conveyed through instance-label pairs capture incomplete yet complementary intrinsic textual semantics. Specifically, our framework involves a symmetrical contrastive objective that encompasses both sentence-anchored and label-anchored contrastive losses. By combining these two losses, the model establishes a robust and uniform representation space. This space effectively captures the reciprocal alignment of feature distributions among instances and relational facts, simultaneously enhancing the maximization of mutual information across diverse perspectives within the same relation. Experimental results demonstrate that our framework achieves significant performance enhancements compared to baseline models in downstream FSRE tasks. Furthermore, our approach exhibits superior adaptability to handle the challenges of domain shift and zero-shot relation extraction. Our code is available online at https://github.com/AONE-NLP/FSRE-SaCon. \ No newline at end of file diff --git a/data/2024/aaai/Synergistic Multiscale Detail Refinement via Intrinsic Supervision for Underwater Image Enhancement b/data/2024/aaai/Synergistic Multiscale Detail Refinement via Intrinsic Supervision for Underwater Image Enhancement new file mode 100644 index 0000000000..f32b7781aa --- /dev/null +++ b/data/2024/aaai/Synergistic Multiscale Detail Refinement via Intrinsic Supervision for Underwater Image Enhancement @@ -0,0 +1 @@ +Visually restoring underwater scenes primarily involves mitigating interference from underwater media. Existing methods ignore the inherent scale-related characteristics in underwater scenes. Therefore, we present the synergistic multi-scale detail refinement via intrinsic supervision (SMDR-IS) for enhancing underwater scene details, which contains multiple stages. 
The low-degradation stage from the original images furnishes the original stage with multi-scale details, achieved through feature propagation using the Adaptive Selective Intrinsic Supervised Feature (ASISF) module. By using intrinsic supervision, the ASISF module can precisely control and guide feature transmission across multi-degradation stages, enhancing multi-scale detail refinement and minimizing the interference from irrelevant information in the low-degradation stage. In the multi-degradation encoder-decoder framework of SMDR-IS, we introduce the Bifocal Intrinsic-Context Attention Module (BICA). Based on the intrinsic supervision principles, BICA efficiently exploits multi-scale scene information in images. BICA directs higher-resolution spaces by tapping into the insights of lower-resolution ones, underscoring the pivotal role of spatial contextual relationships in underwater image restoration. Throughout training, the inclusion of a multi-degradation loss function can enhance the network, allowing it to adeptly extract information across diverse scales. When benchmarked against state-of-the-art methods, SMDR-IS consistently showcases superior performance. Our code is available at https://github.com/zhoujingchun03/SMDR-IS \ No newline at end of file diff --git a/data/2024/aaai/T-NET: Weakly Supervised Graph Learning for Combatting Human Trafficking b/data/2024/aaai/T-NET: Weakly Supervised Graph Learning for Combatting Human Trafficking new file mode 100644 index 0000000000..0323f4a4c1 --- /dev/null +++ b/data/2024/aaai/T-NET: Weakly Supervised Graph Learning for Combatting Human Trafficking @@ -0,0 +1,3 @@ +Human trafficking (HT) for forced sexual exploitation, often described as modern-day slavery, is a pervasive problem that affects millions of people worldwide. Perpetrators of this crime post advertisements (ads) on behalf of their victims on adult service websites (ASW). These websites typically contain hundreds of thousands of ads including those posted by independent escorts, massage parlor agencies and spammers (fake ads). Detecting suspicious activity in these ads is difficult and developing data-driven methods is challenging due to the hard-to-label, complex and sensitive nature of the data. + +In this paper, we propose T-Net, which, unlike previous solutions, formulates this problem as weakly supervised classification. Since it takes several months to years to investigate a case and obtain a single definitive label, we design domain-specific signals or indicators that provide weak labels. T-Net also looks into connections between ads and models the problem as a graph learning task instead of classifying ads independently. We show that T-Net outperforms all baselines on a real-world dataset of ads by 7% in average weighted F1 score. Given that this data contains personally identifiable information, we also present a realistic data generator and provide the first publicly available dataset in this domain, which may be leveraged by the wider research community. 
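The weak-supervision step described in the T-Net abstract above can be illustrated with a minimal sketch: several indicator functions vote on an ad and their votes are averaged into a soft weak label. The indicator functions shown are hypothetical placeholders, not the domain-specific signals actually used by the authors:

```python
def weak_label(ad, labeling_functions):
    """Combine domain-specific indicator functions into a soft weak label for
    one ad. Each function returns +1 (suspicious), -1 (benign) or 0 (abstain);
    the weak label is the mean of the non-abstaining votes, mapped to [0, 1]."""
    votes = [lf(ad) for lf in labeling_functions]
    votes = [v for v in votes if v != 0]
    if not votes:
        return None                              # unlabeled; excluded from the weak loss
    return (sum(votes) / len(votes) + 1) / 2

# Hypothetical indicators for illustration only.
lfs = [
    lambda ad: 1 if ad.get("shared_phone_count", 0) > 5 else 0,
    lambda ad: 1 if ad.get("third_person_language") else 0,
    lambda ad: -1 if ad.get("posted_by_verified_agency") else 0,
]

ad = {"shared_phone_count": 9, "third_person_language": True}
print(weak_label(ad, lfs))   # 1.0
```

In T-Net these weak labels supervise a graph model over connected ads rather than a per-ad classifier; the graph-learning part is not shown here.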
\ No newline at end of file diff --git a/data/2024/aaai/T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models b/data/2024/aaai/T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models new file mode 100644 index 0000000000..fe111098ba --- /dev/null +++ b/data/2024/aaai/T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models @@ -0,0 +1 @@ +The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated a strong capability for learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate control (e.g., over structure and color) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn low-cost T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications. Our code is available at https://github.com/TencentARC/T2I-Adapter. \ No newline at end of file diff --git a/data/2024/aaai/T2MAC: Targeted and Trusted Multi-Agent Communication through Selective Engagement and Evidence-Driven Integration b/data/2024/aaai/T2MAC: Targeted and Trusted Multi-Agent Communication through Selective Engagement and Evidence-Driven Integration new file mode 100644 index 0000000000..e329f53cec --- /dev/null +++ b/data/2024/aaai/T2MAC: Targeted and Trusted Multi-Agent Communication through Selective Engagement and Evidence-Driven Integration @@ -0,0 +1 @@ +Communication stands as a potent mechanism to harmonize the behaviors of multiple agents. However, existing work primarily concentrates on broadcast communication, which not only lacks practicality, but also leads to information redundancy. This surplus, one-size-fits-all information could adversely impact communication efficiency. Furthermore, existing works often resort to basic mechanisms to integrate observed and received information, impairing the learning process. To tackle these difficulties, we propose Targeted and Trusted Multi-Agent Communication (T2MAC), a straightforward yet effective method that enables agents to learn selective engagement and evidence-driven integration. With T2MAC, agents have the capability to craft individualized messages, pinpoint ideal communication windows, and engage with reliable partners, thereby refining communication efficiency. Following the reception of messages, the agents integrate information observed and received from different sources at an evidence level. This process enables agents to collectively use evidence garnered from multiple perspectives, fostering trusted and cooperative behaviors. We evaluate our method on a diverse set of cooperative multi-agent tasks, with varying difficulties, involving different scales and ranging from Hallway, MPE to SMAC. 
The experiments indicate that the proposed model not only surpasses the state-of-the-art methods in terms of cooperative performance and communication efficiency, but also exhibits impressive generalization. \ No newline at end of file diff --git a/data/2024/aaai/TA&AT: Enhancing Task-Oriented Dialog with Turn-Level Auxiliary Tasks and Action-Tree Based Scheduled Sampling b/data/2024/aaai/TA&AT: Enhancing Task-Oriented Dialog with Turn-Level Auxiliary Tasks and Action-Tree Based Scheduled Sampling new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/TACIT: A Target-Agnostic Feature Disentanglement Framework for Cross-Domain Text Classification b/data/2024/aaai/TACIT: A Target-Agnostic Feature Disentanglement Framework for Cross-Domain Text Classification new file mode 100644 index 0000000000..7c2da786c0 --- /dev/null +++ b/data/2024/aaai/TACIT: A Target-Agnostic Feature Disentanglement Framework for Cross-Domain Text Classification @@ -0,0 +1 @@ +Cross-domain text classification aims to transfer models from label-rich source domains to label-poor target domains, giving it a wide range of practical applications. Many approaches promote cross-domain generalization by capturing domain-invariant features. However, these methods rely on unlabeled samples provided by the target domains, which renders the model ineffective when the target domain is agnostic (i.e., unavailable during training). Furthermore, the models are easily disturbed by shortcut learning in the source domain, which also hinders the improvement of domain generalization ability. To solve the aforementioned issues, this paper proposes TACIT, a target-domain-agnostic feature disentanglement framework that adaptively decouples robust and unrobust features by Variational Auto-Encoders. Additionally, to encourage the separation of unrobust features from robust features, we design a feature distillation task that compels unrobust features to approximate the output of the teacher. The teacher model is trained with a few easy samples, which are likely to carry potential unknown shortcuts. Experimental results verify that our framework achieves comparable results to state-of-the-art baselines while utilizing only source domain data. 
We propose Topology-based multi-Agent Policy gradiEnt (TAPE) for both stochastic and deterministic MAPG methods. We prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation for the improved cooperation among agents. Experiment results on several benchmarks show the agent topology is able to facilitate agent cooperation and alleviate CDM issue respectively to improve performance of TAPE. Finally, multiple ablation studies and a heuristic graph search algorithm are devised to show the efficacy of the agent topology. \ No newline at end of file diff --git a/data/2024/aaai/TAU: Trajectory Data Augmentation with Uncertainty for Next POI Recommendation b/data/2024/aaai/TAU: Trajectory Data Augmentation with Uncertainty for Next POI Recommendation new file mode 100644 index 0000000000..daff4f13cc --- /dev/null +++ b/data/2024/aaai/TAU: Trajectory Data Augmentation with Uncertainty for Next POI Recommendation @@ -0,0 +1,2 @@ +Next Point-of-Interest (POI) recommendation has been proven effective at utilizing sparse, intricate spatial-temporal trajectory data to recommend subsequent POIs to users. While existing methods commonly alleviate the problem of data sparsity by integrating spatial-temporal context information, POI category features, and social relationships, they largely overlook the fact that the trajectory sequences collected in the datasets are often incomplete. This oversight limits the model’s potential to fully leverage historical context. In light of this background, we propose Trajectory Data Augmentation with Uncertainty (TAU) for Next POI Recommendation. TAU is a general graph-based trajectory data augmentation method designed to complete user mobility patterns by marrying uncertainty estimation into the next POI recommendation task. More precisely, TAU taps into the global transition pattern graph to identify sets of intermediate nodes located between every pair of locations, effectively +leveraging edge weights as transition probabilities. During trajectory sequence construction, TAU selectively prompts intermediate nodes, chosen based on their likelihood of occurrence as pseudo-labels, to establish comprehensive trajectory sequences. Furthermore, to gauge the certainty and impact of pseudo-labels on the target location, we introduce a novel confidence-aware calibration strategy using evidence deep learning (EDL) for improved performance and reliability. The experimental results clearly indicate that our TAU method achieves consistent performance improvements over existing techniques across two real-world datasets, verifying its effectiveness as the state-of-the-art approach to the task. \ No newline at end of file diff --git a/data/2024/aaai/TC-LIF: A Two-Compartment Spiking Neuron Model for Long-Term Sequential Modelling b/data/2024/aaai/TC-LIF: A Two-Compartment Spiking Neuron Model for Long-Term Sequential Modelling new file mode 100644 index 0000000000..5b5c37ac79 --- /dev/null +++ b/data/2024/aaai/TC-LIF: A Two-Compartment Spiking Neuron Model for Long-Term Sequential Modelling @@ -0,0 +1 @@ +The identification of sensory cues associated with potential opportunities and dangers is frequently complicated by unrelated events that separate useful cues by long delays. As a result, it remains a challenging task for state-of-the-art spiking neural networks (SNNs) to establish long-term temporal dependency between distant cues. 
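A minimal sketch of the graph-based augmentation described in the TAU abstract above: edge counts of a global transition graph are normalized into transition probabilities, and the most likely intermediate POI between two consecutive check-ins is inserted as a pseudo-label. The function names, the probability product score, and the threshold are illustrative assumptions rather than the paper's exact policy:

```python
from collections import defaultdict

def transition_probs(graph):
    """graph[u][v] = observed transition count u -> v; returns P(v | u)."""
    probs = defaultdict(dict)
    for u, nbrs in graph.items():
        total = sum(nbrs.values())
        for v, c in nbrs.items():
            probs[u][v] = c / total
    return probs

def augment_trajectory(traj, graph, threshold=0.05):
    """Insert the most likely intermediate POI between consecutive check-ins
    (a, b) as a pseudo-label when P(v|a) * P(b|v) clears the threshold."""
    probs = transition_probs(graph)
    out = [traj[0]]
    for a, b in zip(traj, traj[1:]):
        candidates = {
            v: probs[a][v] * probs[v].get(b, 0.0)
            for v in probs.get(a, {})
            if v != b and probs[v].get(b, 0.0) > 0.0
        }
        if candidates:
            v, score = max(candidates.items(), key=lambda kv: kv[1])
            if score >= threshold:
                out.append(v)            # pseudo-label completing the pattern
        out.append(b)
    return out

graph = {"home": {"cafe": 8, "office": 2}, "cafe": {"office": 9, "gym": 1}}
print(augment_trajectory(["home", "office"], graph))  # ['home', 'cafe', 'office']
```

The confidence-aware calibration of these pseudo-labels via evidential deep learning, also mentioned in the abstract, is not covered by this sketch.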
To address this challenge, we propose a novel biologically inspired Two-Compartment Leaky Integrate-and-Fire spiking neuron model, dubbed TC-LIF. The proposed model incorporates carefully designed somatic and dendritic compartments that are tailored to facilitate learning long-term temporal dependencies. Furthermore, the theoretical analysis is provided to validate the effectiveness of TC-LIF in propagating error gradients over an extended temporal duration. Our experimental results, on a diverse range of temporal classification tasks, demonstrate superior temporal classification capability, rapid training convergence, and high energy efficiency of the proposed TC-LIF model. Therefore, this work opens up a myriad of opportunities for solving challenging temporal processing tasks on emerging neuromorphic computing systems. Our code is publicly available at https://github.com/ZhangShimin1/TC-LIF. \ No newline at end of file diff --git a/data/2024/aaai/TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions b/data/2024/aaai/TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions new file mode 100644 index 0000000000..6de077258a --- /dev/null +++ b/data/2024/aaai/TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions @@ -0,0 +1 @@ +A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that effectively models spatio-temporal information from Trajectories and Correlated regions. TCNet's trajectory module transforms frames into aligned trajectories composed of continuous visual tokens. This facilitates extracting region trajectory patterns. In addition, for a query token, self-attention is learned along the trajectory. As such, our network can also focus on fine-grained spatio-temporal patterns, such as finger movement, of a region in motion. TCNet's correlation module utilizes a novel dynamic attention mechanism that filters out irrelevant frame regions. Additionally, it assigns dynamic key-value tokens from correlated regions to each query. Both innovations significantly reduce the computation cost and memory. We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily. Our results demonstrate that TCNet consistently achieves state-of-the-art performance. For example, we improve over the previous state-of-the-art by 1.5\% and 1.0\% word error rate on PHOENIX14 and PHOENIX14-T, respectively. Code is available at https://github.com/hotfinda/TCNet \ No newline at end of file diff --git a/data/2024/aaai/TDeLTA: A Light-Weight and Robust Table Detection Method Based on Learning Text Arrangement b/data/2024/aaai/TDeLTA: A Light-Weight and Robust Table Detection Method Based on Learning Text Arrangement new file mode 100644 index 0000000000..fd720fc4fa --- /dev/null +++ b/data/2024/aaai/TDeLTA: A Light-Weight and Robust Table Detection Method Based on Learning Text Arrangement @@ -0,0 +1 @@ +The diversity of tables makes table detection a great challenge, leading to existing models becoming more tedious and complex. Despite achieving high performance, they often overfit to the table style in training set, and suffer from significant performance degradation when encountering out-of-distribution tables in other domains. 
To tackle this problem, we start from the essence of the table, which is a set of text arranged in rows and columns. Based on this, we propose a novel, lightweight and robust Table Detection method based on Learning Text Arrangement, namely TDeLTA. TDeLTA takes the text blocks as input, and then models their arrangement with a sequential encoder and an attention module. To locate the tables precisely, we design a text-classification task, classifying the text blocks into 4 categories according to their semantic roles in the tables. Experiments are conducted on text blocks both parsed from PDFs and extracted by open-source OCR tools. Compared to several state-of-the-art methods, TDeLTA achieves competitive results with only 3.1M model parameters on the large-scale public datasets. Moreover, when faced with cross-domain data under the zero-shot setting, TDeLTA outperforms baselines by a large margin of nearly 7%, which shows the strong robustness and transferability of the proposed model. \ No newline at end of file diff --git "a/data/2024/aaai/TD\302\262-Net: Toward Denoising and Debiasing for Video Scene Graph Generation" "b/data/2024/aaai/TD\302\262-Net: Toward Denoising and Debiasing for Video Scene Graph Generation" new file mode 100644 index 0000000000..01e3bab2e8 --- /dev/null +++ "b/data/2024/aaai/TD\302\262-Net: Toward Denoising and Debiasing for Video Scene Graph Generation" @@ -0,0 +1,2 @@ +Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues, including 1) Contextual noise, as some frames might contain occluded and blurred objects. 2) Label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones. Additionally, the distribution of relationships exhibits a long-tailed pattern. To address the above problems, in this paper, we introduce a network named TD2-Net that aims at denoising and debiasing for dynamic SGG. Specifically, we first propose a denoising spatio-temporal transformer module that enhances object representation with robust contextual information. This is achieved by designing a differentiable Top-K object selector that utilizes the Gumbel-Softmax sampling strategy to select the relevant neighborhood for each object. +Second, we introduce an asymmetrical reweighting loss to relieve the issue of label bias. This loss function integrates asymmetry focusing factors and the volume of samples to adjust the weights assigned to individual samples. Systematic experimental results demonstrate the superiority of our proposed TD2-Net over existing state-of-the-art approaches on Action Genome databases. In more detail, TD2-Net outperforms the second-best competitors by 12.7% on mean-Recall@10 for predicate classification. \ No newline at end of file diff --git a/data/2024/aaai/TEAMSTER: Model-Based Reinforcement Learning for Ad Hoc Teamwork (Abstract Reprint) b/data/2024/aaai/TEAMSTER: Model-Based Reinforcement Learning for Ad Hoc Teamwork (Abstract Reprint) new file mode 100644 index 0000000000..3cda98bf77 --- /dev/null +++ b/data/2024/aaai/TEAMSTER: Model-Based Reinforcement Learning for Ad Hoc Teamwork (Abstract Reprint) @@ -0,0 +1 @@ +This paper investigates the use of model-based reinforcement learning in the context of ad hoc teamwork. 
We introduce a novel approach, named TEAMSTER, where we learn the environment's model and the model of the teammates' behavior separately. Compared to the state-of-the-art PLASTIC algorithms, our results in four different domains from the multi-agent systems literature show that TEAMSTER is more flexible than the PLASTIC-Model, by learning the environment's model instead of assuming a perfect hand-coded model, and more robust/efficient than PLASTIC-Policy, by being able to continuously adapt to newly encountered teams, without implicitly learning a new environment model from scratch. \ No newline at end of file diff --git a/data/2024/aaai/TETRIS: Towards Exploring the Robustness of Interactive Segmentation b/data/2024/aaai/TETRIS: Towards Exploring the Robustness of Interactive Segmentation new file mode 100644 index 0000000000..247d4f90bf --- /dev/null +++ b/data/2024/aaai/TETRIS: Towards Exploring the Robustness of Interactive Segmentation @@ -0,0 +1 @@ +Interactive segmentation methods rely on user inputs to iteratively update the selection mask. A click specifying the object of interest is arguably the simplest and most intuitive interaction type, and therefore the most common choice for interactive segmentation. However, user clicking patterns in the interactive segmentation context remain unexplored. Accordingly, interactive segmentation evaluation strategies rely more on intuition and common sense than on empirical studies (e.g., assuming that users tend to click in the center of the area with the largest error). In this work, we conduct a real-user study to investigate actual clicking patterns. This study reveals that the intuitive assumption made in the common evaluation strategy may not hold. As a result, interactive segmentation models may show high scores on the standard benchmarks, but this does not imply that they would perform well in a real-world scenario. To assess the applicability of interactive segmentation methods, we propose a novel evaluation strategy providing a more comprehensive analysis of a model's performance. To this end, we propose a methodology for finding extreme user inputs by direct optimization in a white-box adversarial attack on the interactive segmentation model. Based on the performance with such adversarial user inputs, we assess the robustness of interactive segmentation models w.r.t. click positions. Besides, we introduce a novel benchmark for measuring the robustness of interactive segmentation, and report the results of an extensive evaluation of dozens of models. \ No newline at end of file diff --git a/data/2024/aaai/THGFormer: Time-Aware Hypergraph Learning for Multimodal Social Media Popularity Prediction (Student Abstract) b/data/2024/aaai/THGFormer: Time-Aware Hypergraph Learning for Multimodal Social Media Popularity Prediction (Student Abstract) new file mode 100644 index 0000000000..e595fc021a --- /dev/null +++ b/data/2024/aaai/THGFormer: Time-Aware Hypergraph Learning for Multimodal Social Media Popularity Prediction (Student Abstract) @@ -0,0 +1 @@ +Social media popularity prediction of multimodal user-generated content (UGC) is a crucial task for many real-world applications. However, existing efforts are often limited by missing inter-instance correlations and UGC temporal patterns. To address these issues, we propose a novel time-aware hypergraph Transformer framework, THGFormer. 
It fully represents inter-instance and intra-instance relations by hypergraphs, captures temporal dependencies with a time encoder, and enhances UGC representations via neighborhood knowledge aggregation. Extensive experiments conducted on two real-world datasets demonstrate that THGFormer outperforms state-of-the-art popularity prediction models across several settings. \ No newline at end of file diff --git a/data/2024/aaai/TIKP: Text-to-Image Knowledge Preservation for Continual Semantic Segmentation b/data/2024/aaai/TIKP: Text-to-Image Knowledge Preservation for Continual Semantic Segmentation new file mode 100644 index 0000000000..e7aa5d3db8 --- /dev/null +++ b/data/2024/aaai/TIKP: Text-to-Image Knowledge Preservation for Continual Semantic Segmentation @@ -0,0 +1 @@ +Continual Semantic Segmentation (CSS) is an emerging trend in which catastrophic forgetting has been a perplexing problem. In this paper, we propose a Text-to-Image Knowledge Preservation (TIKP) framework to address this issue. TIKP applies Text-to-Image techniques to CSS through automatic prompt generation and content adaptation. It extracts associations between the labels of seen data and constructs text-level prompts based on these associations, which are preserved and maintained at each incremental step. During training, these prompts generate correlated images to mitigate catastrophic forgetting. Particularly, as the generated images may have different distributions from the original data, TIKP transfers knowledge via a content adaptation loss, which determines the role played by the generated images in incremental training based on their similarity. In addition, for the classifier, we use the previous model from a different perspective: misclassifying new classes into old objects instead of the background. We propose a knowledge distillation loss based on wrong labels, enabling us to attribute varying weights to individual objects during the distillation process. Extensive experiments conducted in the same setting show that TIKP outperforms state-of-the-art methods by a large margin on benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/TMFormer: Token Merging Transformer for Brain Tumor Segmentation with Missing Modalities b/data/2024/aaai/TMFormer: Token Merging Transformer for Brain Tumor Segmentation with Missing Modalities new file mode 100644 index 0000000000..321297d8ea --- /dev/null +++ b/data/2024/aaai/TMFormer: Token Merging Transformer for Brain Tumor Segmentation with Missing Modalities @@ -0,0 +1 @@ +Numerous techniques excel in brain tumor segmentation using multi-modal magnetic resonance imaging (MRI) sequences, delivering exceptional results. However, the prevalent absence of modalities in clinical scenarios hampers performance. Current approaches frequently resort to zero maps as substitutes for missing modalities, inadvertently introducing feature bias and redundant computations. To address these issues, we present the Token Merging transFormer (TMFormer) for robust brain tumor segmentation with missing modalities. TMFormer tackles these challenges by extracting and merging accessible modalities into more compact token sequences. The architecture comprises two core components: the Uni-modal Token Merging Block (UMB) and the Multi-modal Token Merging Block (MMB). 
The UMB enhances individual modality representation by adaptively consolidating spatially redundant tokens within and outside tumor-related regions, thereby refining token sequences for augmented representational capacity. Meanwhile, the MMB mitigates multi-modal feature fusion bias, exclusively leveraging tokens from present modalities and merging them into a unified multi-modal representation to accommodate varying modality combinations. Extensive experimental results on the BraTS 2018 and 2020 datasets demonstrate the superiority and efficacy of TMFormer compared to state-of-the-art methods when dealing with missing modalities. \ No newline at end of file diff --git a/data/2024/aaai/TNPAR: Topological Neural Poisson Auto-Regressive Model for Learning Granger Causal Structure from Event Sequences b/data/2024/aaai/TNPAR: Topological Neural Poisson Auto-Regressive Model for Learning Granger Causal Structure from Event Sequences new file mode 100644 index 0000000000..ca7ea10efd --- /dev/null +++ b/data/2024/aaai/TNPAR: Topological Neural Poisson Auto-Regressive Model for Learning Granger Causal Structure from Event Sequences @@ -0,0 +1 @@ +Learning Granger causality from event sequences is a challenging but essential task across various applications. Most existing methods rely on the assumption that event sequences are independent and identically distributed (i.i.d.). However, this i.i.d. assumption is often violated due to the inherent dependencies among the event sequences. Fortunately, in practice, we find these dependencies can be modeled by a topological network, suggesting a potential solution to the non-i.i.d. problem by introducing the prior topological network into Granger causal discovery. This observation prompts us to tackle two ensuing challenges: 1) how to model the event sequences while incorporating both the prior topological network and the latent Granger causal structure, and 2) how to learn the Granger causal structure. To this end, we devise a unified topological neural Poisson auto-regressive model with two processes. In the generation process, we employ a variant of the neural Poisson process to model the event sequences, considering influences from both the topological network and the Granger causal structure. In the inference process, we formulate an amortized inference algorithm to infer the latent Granger causal structure. We encapsulate these two processes within a unified likelihood function, providing an end-to-end framework for this task. Experiments on simulated and real-world data demonstrate the effectiveness of our approach. \ No newline at end of file diff --git a/data/2024/aaai/TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation b/data/2024/aaai/TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation new file mode 100644 index 0000000000..87923f212c --- /dev/null +++ b/data/2024/aaai/TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation @@ -0,0 +1 @@ +Multi-spectral object Re-identification (ReID) aims to retrieve specific objects by leveraging complementary information from different image spectra. It delivers great advantages over traditional single-spectral ReID in complex visual environment. However, the significant distribution gap among different image spectra poses great challenges for effective multi-spectral feature representations. 
In addition, most current Transformer-based ReID methods only utilize the global feature of class tokens for holistic retrieval, ignoring the locally discriminative ones. To address the above issues, we go a step further and utilize all the tokens of Transformers, proposing a cyclic token permutation framework for multi-spectral object ReID, dubbed TOP-ReID. More specifically, we first deploy a multi-stream deep network based on vision Transformers to preserve distinct information from different image spectra. Then, we propose a Token Permutation Module (TPM) for cyclic multi-spectral feature aggregation. It not only facilitates spatial feature alignment across different image spectra, but also allows the class token of each spectrum to perceive the local details of other spectra. Meanwhile, we propose a Complementary Reconstruction Module (CRM), which introduces dense token-level reconstruction constraints to reduce the distribution gap across different image spectra. With the above modules, our proposed framework can generate more discriminative multi-spectral features for robust object ReID. Extensive experiments on three ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) verify the effectiveness of our method. The code is available at https://github.com/924973292/TOP-ReID. \ No newline at end of file diff --git a/data/2024/aaai/TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection b/data/2024/aaai/TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection new file mode 100644 index 0000000000..73c9291076 --- /dev/null +++ b/data/2024/aaai/TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection @@ -0,0 +1 @@ +Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks, which aim to obtain relevant moments within videos and highlight scores for each video clip. Recently, several methods have been devoted to building DETR-based networks to solve both MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, achieving good performance. Nevertheless, these approaches underutilize the reciprocal relationship between the two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement step is designed to eliminate query-irrelevant information from visual features for modal interaction. Finally, a task cooperation module is constructed to refine the retrieval pipeline and the highlight score prediction process by utilizing the reciprocity between MR and HD. Comprehensive experiments on the QVHighlights, Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Codes are available at https://github.com/mingyao1120/TR-DETR. 
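As a brief aside on the TOP-ReID entry above: the following is a minimal, hypothetical Python sketch of what cyclic token permutation across image spectra can look like mechanically, i.e., rotating per-spectrum class tokens so each one is paired with another spectrum's patch tokens. The stream names, tensor shapes, and the simple concatenation below are illustrative assumptions, not the paper's implementation.

import torch

B, N, D = 2, 16, 64                      # batch size, patches per spectrum, embedding dim
spectra = ["rgb", "nir", "tir"]          # assumed three image spectra
cls_tokens = {s: torch.randn(B, 1, D) for s in spectra}    # per-spectrum class tokens
patch_tokens = {s: torch.randn(B, N, D) for s in spectra}  # per-spectrum patch tokens

def cyclic_permute(tokens, shift=1):
    # rotate the spectrum-to-class-token assignment by `shift` positions
    order = spectra[shift:] + spectra[:shift]
    return {dst: tokens[src] for dst, src in zip(spectra, order)}

permuted_cls = cyclic_permute(cls_tokens)   # rgb receives nir's class token, nir receives tir's, ...
fused = {s: torch.cat([permuted_cls[s], patch_tokens[s]], dim=1) for s in spectra}
for s in spectra:
    print(s, fused[s].shape)                # each spectrum now carries (B, N + 1, D) tokens

In a full model the fused sequences would be processed by further attention blocks; only the permutation step itself is shown here.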
\ No newline at end of file diff --git a/data/2024/aaai/TREE-G: Decision Trees Contesting Graph Neural Networks b/data/2024/aaai/TREE-G: Decision Trees Contesting Graph Neural Networks new file mode 100644 index 0000000000..e3dac626cc --- /dev/null +++ b/data/2024/aaai/TREE-G: Decision Trees Contesting Graph Neural Networks @@ -0,0 +1,21 @@ +When dealing with tabular data, models based on decision trees are a popular choice due to their high accuracy on these data types, their ease of application, and explainability properties. However, when it comes to graph-structured data, it is not clear how to apply them effectively, in a way that incorporates the topological information with the tabular data available on the vertices of the graph. To address this challenge, we introduce TREE-G. TREE-G modifies standard decision trees by introducing a novel split function that is specialized for graph data. Not only does this split function incorporate the node features and the topological information, but it also uses a novel pointer mechanism that allows split nodes to use information computed in previous splits. Therefore, the split function adapts to the predictive task and the graph at hand. We analyze the theoretical properties of TREE-G and demonstrate its benefits empirically on multiple graph and vertex prediction benchmarks. In these experiments, TREE-G consistently outperforms other tree-based models and often outperforms other graph-learning algorithms such as Graph Neural Networks (GNNs) and Graph Kernels, sometimes by large margins. Moreover, TREE-G's models and their predictions can be explained and visualized. \ No newline at end of file diff --git a/data/2024/aaai/TTTS: Tree Test Time Simulation for Enhancing Decision Tree Robustness against Adversarial Examples b/data/2024/aaai/TTTS: Tree Test Time Simulation for Enhancing Decision Tree Robustness against Adversarial Examples new file mode 100644 index 0000000000..9307d59588 --- /dev/null +++ b/data/2024/aaai/TTTS: Tree Test Time Simulation for Enhancing Decision Tree Robustness against Adversarial Examples @@ -0,0 +1 @@ +Decision trees are widely used for addressing learning tasks involving tabular data. Yet, they are susceptible to adversarial attacks. In this paper, we present Tree Test Time Simulation (TTTS), a novel inference-time methodology that incorporates Monte Carlo simulations into decision trees to enhance their robustness. TTTS introduces a probabilistic modification to the decision path, without altering the underlying tree structure. Our comprehensive empirical analysis of 50 datasets yields promising results. Without the presence of any attacks, TTTS has successfully improved model performance from an AUC of 0.714 to 0.773. Under the challenging conditions of white-box attacks, TTTS demonstrated its robustness by boosting performance from an AUC of 0.337 to 0.680. Even when subjected to black-box attacks, TTTS maintains high accuracy and enhances the model's performance from an AUC of 0.628 to 0.719. Compared to defenses such as Feature Squeezing, TTTS proves to be much more effective. We also found that TTTS exhibits similar robustness in decision forest settings across different attacks. 
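To make the TTTS idea above concrete, here is a minimal sketch (plain Python, under assumed simplifications) of a test-time probabilistic traversal of a fixed decision tree: each split is followed as usual, but the opposite branch is taken with a small flip probability, and many simulated root-to-leaf paths are averaged. The tiny hand-built tree and the constant flip probability are illustrative assumptions; the paper's exact probability schedule may differ.

import random

# internal node: (feature_index, threshold, left_child, right_child); leaf: positive-class probability
TREE = (0, 0.5,
        (1, 0.3, 0.1, 0.7),
        (1, 0.6, 0.4, 0.9))

def stochastic_predict(node, x, flip_p=0.05):
    if not isinstance(node, tuple):          # leaf reached
        return node
    feature, threshold, left, right = node
    go_left = x[feature] <= threshold
    if random.random() < flip_p:             # probabilistic modification of the decision path
        go_left = not go_left
    return stochastic_predict(left if go_left else right, x, flip_p)

def ttts_style_predict(x, n_simulations=200, flip_p=0.05):
    # Monte Carlo average over perturbed paths; the tree structure itself is never changed
    return sum(stochastic_predict(TREE, x, flip_p) for _ in range(n_simulations)) / n_simulations

print(ttts_style_predict([0.48, 0.29]))      # smoothed score for a point near the split boundaries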
\ No newline at end of file diff --git a/data/2024/aaai/Tackling Vision Language Tasks through Learning Inner Monologues b/data/2024/aaai/Tackling Vision Language Tasks through Learning Inner Monologues new file mode 100644 index 0000000000..c613b13bfc --- /dev/null +++ b/data/2024/aaai/Tackling Vision Language Tasks through Learning Inner Monologues @@ -0,0 +1,2 @@ +Visual language tasks such as Visual Question Answering (VQA) or Visual Entailment (VE) require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach offers low training costs and interpretability but is hard to optimize in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. +To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating Inner Monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. More specifically, we enable LLMs and VLMs to interact through natural language conversation (i.e., Inner Monologue) and propose to use a two-stage training process to learn how to perform Inner Monologue (self-asking and answering questions). IMMO is evaluated on two popular tasks and achieves competitive performance with less training data when compared with state-of-the-art models, while preserving interpretability. The results suggest that by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to a more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, broadening its potential applications across various AI challenges beyond vision and language tasks. \ No newline at end of file diff --git a/data/2024/aaai/TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training b/data/2024/aaai/TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training new file mode 100644 index 0000000000..0716d31b1e --- /dev/null +++ b/data/2024/aaai/TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training @@ -0,0 +1,2 @@ +Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture global features that distinguish different text descriptions under a contrastive loss, making it highly effective for single-label classification. However, it shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class, and the contrastive nature of the softmax operation aggravates this effect. 
+In this study, we observe that multi-label classification results heavily rely on discriminative local features, which are overlooked by CLIP. As a result, we dissect the preservation of patch-wise spatial information in CLIP and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. This framework is solely based on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. Besides, to comprehensively assess the quality and practicality of generated tags, we extend their application to a downstream task, i.e., weakly supervised semantic segmentation (WSSS), with generated tags as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validates the effectiveness of generated tags. Our code is available at https://github.com/linyq2117/TagCLIP. \ No newline at end of file diff --git a/data/2024/aaai/TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection b/data/2024/aaai/TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection new file mode 100644 index 0000000000..b9e76e8b0c --- /dev/null +++ b/data/2024/aaai/TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection @@ -0,0 +1 @@ +Out-of-distribution (OOD) detection is crucial in many real-world applications. However, intelligent models are often trained solely on in-distribution (ID) data, leading to overconfidence when misclassifying OOD data as ID classes. In this study, we propose a new learning framework which leverages simple Jigsaw-based fake OOD data and rich semantic embeddings (`anchors') from the ChatGPT description of ID knowledge to help guide the training of the image encoder. The learning framework can be flexibly combined with existing post-hoc approaches to OOD detection, and extensive empirical evaluations on multiple OOD detection benchmarks demonstrate that rich textual representations of ID knowledge and fake OOD knowledge can effectively help train a visual encoder for OOD detection. With the learning framework, new state-of-the-art performance is achieved on all the benchmarks. The code is available at https://github.com/Cverchen/TagFog. \ No newline at end of file diff --git a/data/2024/aaai/Tail-STEAK: Improve Friend Recommendation for Tail Users via Self-Training Enhanced Knowledge Distillation b/data/2024/aaai/Tail-STEAK: Improve Friend Recommendation for Tail Users via Self-Training Enhanced Knowledge Distillation new file mode 100644 index 0000000000..287a575896 --- /dev/null +++ b/data/2024/aaai/Tail-STEAK: Improve Friend Recommendation for Tail Users via Self-Training Enhanced Knowledge Distillation @@ -0,0 +1 @@ +Graph neural networks (GNNs) are commonly employed in collaborative friend recommendation systems. Nevertheless, recent studies reveal a notable performance gap, particularly for users with limited connections, commonly known as tail users, in contrast to their counterparts with abundant connections (head users). 
Uniformly treating head and tail users poses two challenges for tail user preference learning: (C1) Label Sparsity, as tail users typically possess limited labels; and (C2) Neighborhood Sparsity, where tail users exhibit sparse observable friendships, leading to distinct preference distributions and performance degradation compared to head users. In response to these challenges, we introduce Tail-STEAK, an innovative framework that combines self-training with enhanced knowledge distillation for tail user representation learning. To address (C1), we present Tail-STEAK-base, a two-stage self-training framework. In the first stage, only head users and their accurate connections are utilized for training, while pseudo links are generated for tail users in the second stage. To tackle (C2), we propose two data augmentation-based self-knowledge distillation pretext tasks. These tasks are seamlessly integrated into different stages of Tail-STEAK-base, culminating in the comprehensive Tail-STEAK framework. Extensive experiments, conducted on state-of-the-art GNN-based friend recommendation models, substantiate the efficacy of Tail-STEAK in significantly improving tail user performance. Our code and data are publicly available at https://github.com/antman9914/Tail-STEAK. \ No newline at end of file diff --git a/data/2024/aaai/Talk Funny! A Large-Scale Humor Response Dataset with Chain-of-Humor Interpretation b/data/2024/aaai/Talk Funny! A Large-Scale Humor Response Dataset with Chain-of-Humor Interpretation new file mode 100644 index 0000000000..65946442cb --- /dev/null +++ b/data/2024/aaai/Talk Funny! A Large-Scale Humor Response Dataset with Chain-of-Humor Interpretation @@ -0,0 +1,6 @@ +Humor is a crucial part of human communication. Understanding humor and generating humorous responses in dialogue can provide natural and empathic human-computer interactions. +However, most existing pre-trained language models (PLMs) perform unsatisfactorily in humor generation. +On the one hand, the serious shortage of humor corpora and datasets poses challenges for constructing models that can understand and generate humorous expressions. On the other hand, humor generation relies on rich knowledge and commonsense, which is often tacit and unspoken. +In this paper, we construct the largest Chinese Explainable Humor Response Dataset to date with chain-of-humor and humor mind map annotations, which can be used to comprehensively evaluate as well as improve the humorous response ability of PLMs. +We also design humor-related auxiliary tasks to further enhance PLMs' humorous response performance. +Extensive evaluations demonstrate that our proposed dataset and auxiliary tasks effectively help PLMs to generate humorous responses, laying the groundwork for future humor research. \ No newline at end of file diff --git a/data/2024/aaai/Taming Binarized Neural Networks and Mixed-Integer Programs b/data/2024/aaai/Taming Binarized Neural Networks and Mixed-Integer Programs new file mode 100644 index 0000000000..16a3a533ff --- /dev/null +++ b/data/2024/aaai/Taming Binarized Neural Networks and Mixed-Integer Programs @@ -0,0 +1,5 @@ +There has been a great deal of recent interest in binarized neural networks, especially because of their explainability. At the same time, automatic differentiation algorithms such as backpropagation fail for binarized neural networks, which limits their applicability. 
+We show that binarized neural networks admit a tame representation by reformulating the problem of training binarized neural networks as a subadditive dual of a mixed-integer program, which we show to have nice properties. This makes it possible to use the framework of Bolte et al. for implicit differentiation, which offers the possibility of a practical implementation of backpropagation in the context of binarized neural networks. + +This approach could also be used for a broader class of mixed-integer programs, beyond the training of binarized neural networks, as encountered in symbolic approaches to AI and beyond. \ No newline at end of file diff --git a/data/2024/aaai/Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification b/data/2024/aaai/Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification new file mode 100644 index 0000000000..2c6ee513b0 --- /dev/null +++ b/data/2024/aaai/Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification @@ -0,0 +1,2 @@ +Sigmoid output layers are widely used in multi-label classification (MLC) tasks, in which multiple labels can be assigned to any input. In many practical MLC tasks, the number of possible labels is in the thousands, often exceeding the number of input features and resulting in a low-rank output layer. In multi-class classification, it is known that such a low-rank output layer is a bottleneck that can result in unargmaxable classes: classes which cannot be predicted for any input. +In this paper, we show that for MLC tasks, the analogous sigmoid bottleneck results in exponentially many unargmaxable label combinations. We explain how to detect these unargmaxable outputs and demonstrate their presence in three widely used MLC datasets. We then show that they can be prevented in practice by introducing a Discrete Fourier Transform (DFT) output layer, which guarantees that all sparse label combinations with up to k active labels are argmaxable. Our DFT layer trains faster and is more parameter efficient, matching the F1@k score of a sigmoid layer while using up to 50% fewer trainable parameters. Our code is publicly available at https://github.com/andreasgrv/sigmoid-bottleneck. \ No newline at end of file diff --git a/data/2024/aaai/Target Focused Shallow Transformer Framework for Efficient Visual Tracking b/data/2024/aaai/Target Focused Shallow Transformer Framework for Efficient Visual Tracking new file mode 100644 index 0000000000..5a59c5934b --- /dev/null +++ b/data/2024/aaai/Target Focused Shallow Transformer Framework for Efficient Visual Tracking @@ -0,0 +1 @@ +Template learning transformer trackers have achieved significant performance improvements recently due to long-dependency learning using the self-attention (SA) mechanism. However, the typical SA mechanisms in transformers adopt a less discriminative design approach which is inadequate for focusing on the most important target information during tracking. Therefore, existing trackers are easily distracted by background information and have constraints in handling tracking challenges. The focus of our research is to develop a target-focused discriminative shallow transformer tracking framework that can learn to distinguish the target from the background and enable accurate tracking at high speed. Extensive experiments will be performed on several popular benchmarks, including OTB100, UAV123, GOT10k, LaSOT, and TrackingNet, to demonstrate the effectiveness of the proposed framework. 
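Returning briefly to the sigmoid-bottleneck entry above: the snippet below is a toy numerical illustration (not the paper's DFT construction) of why a low-rank sigmoid output layer leaves most label combinations unargmaxable. With L labels but a rank-1 layer, each logit crosses zero at most once as the single latent feature varies, so at most L + 1 of the 2^L label combinations can ever be predicted. The dimensions and the brute-force sweep are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
L = 4                                              # number of labels
w, b = rng.normal(size=L), rng.normal(size=L)      # rank-1 output layer: logits = w * z + b

patterns = set()
for z in np.linspace(-50.0, 50.0, 20001):          # sweep the one-dimensional feature space
    logits = w * z + b
    patterns.add(tuple((logits > 0).astype(int)))  # label set predicted by thresholding sigmoid at 0.5

print(f"reachable label combinations: {len(patterns)} out of {2 ** L}")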
\ No newline at end of file diff --git a/data/2024/aaai/Target-Free Domain Adaptation through Cross-Adaptation (Student Abstract) b/data/2024/aaai/Target-Free Domain Adaptation through Cross-Adaptation (Student Abstract) new file mode 100644 index 0000000000..7100607905 --- /dev/null +++ b/data/2024/aaai/Target-Free Domain Adaptation through Cross-Adaptation (Student Abstract) @@ -0,0 +1 @@ +The population characteristics of the datasets related to the same task may vary significantly and merging them may harm performance. In this paper, we propose a novel method of domain adaptation called "cross-adaptation". It allows for implicit adaptation to the target domain without the need for any labeled examples across this domain. We test our approach on 9 datasets for SARS-CoV-2 detection from complete blood count from different hospitals around the world. Results show that our solution is universal with respect to various classification algorithms and allows for up to a 10pp increase in F1 score on average. \ No newline at end of file diff --git a/data/2024/aaai/Targeted Activation Penalties Help CNNs Ignore Spurious Signals b/data/2024/aaai/Targeted Activation Penalties Help CNNs Ignore Spurious Signals new file mode 100644 index 0000000000..61b10243b1 --- /dev/null +++ b/data/2024/aaai/Targeted Activation Penalties Help CNNs Ignore Spurious Signals @@ -0,0 +1 @@ +Neural networks (NNs) can learn to rely on spurious signals in the training data, leading to poor generalisation. Recent methods tackle this problem by training NNs with additional ground-truth annotations of such signals. These methods may, however, let spurious signals re-emerge in deep convolutional NNs (CNNs). We propose Targeted Activation Penalty (TAP), a new method tackling the same problem by penalising activations to control the re-emergence of spurious signals in deep CNNs, while also lowering training times and memory usage. In addition, ground-truth annotations can be expensive to obtain. We show that TAP still works well with annotations generated by pre-trained models as effective substitutes of ground-truth annotations. We demonstrate the power of TAP against two state-of-the-art baselines on the MNIST benchmark and on two clinical image datasets, using four different CNN architectures. \ No newline at end of file diff --git a/data/2024/aaai/Task Contamination: Language Models May Not Be Few-Shot Anymore b/data/2024/aaai/Task Contamination: Language Models May Not Be Few-Shot Anymore new file mode 100644 index 0000000000..7a7189c6d7 --- /dev/null +++ b/data/2024/aaai/Task Contamination: Language Models May Not Be Few-Shot Anymore @@ -0,0 +1 @@ +Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot or few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over datasets released over time, and over LLMs released over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that datasets released prior to the LLM training data creation date perform surprisingly better than datasets released post the LLM training data creation date. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets prior to the LLMs' training data creation date. 
Additionally, we utilize training data inspection, training data extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings. \ No newline at end of file diff --git a/data/2024/aaai/Task Planning for Object Rearrangement in Multi-Room Environments b/data/2024/aaai/Task Planning for Object Rearrangement in Multi-Room Environments new file mode 100644 index 0000000000..a782715d5a --- /dev/null +++ b/data/2024/aaai/Task Planning for Object Rearrangement in Multi-Room Environments @@ -0,0 +1 @@ +Object rearrangement in a multi-room setup should produce a reasonable plan that reduces the agent's overall travel and the number of steps. Recent state-of-the-art methods fail to produce such plans because they rely on explicit exploration for discovering unseen objects due to partial observability and a heuristic planner to sequence the actions for rearrangement. This paper proposes a novel task planner to efficiently plan a sequence of actions to discover unseen objects and rearrange misplaced objects within an untidy house to achieve a desired tidy state. The proposed method introduces several innovative techniques, including (i) a method for discovering unseen objects using commonsense knowledge from large language models, (ii) a collision resolution and buffer prediction method based on Cross-Entropy Method to handle blocked goal and swap cases, (iii) a directed spatial graph-based state space for scalability, and (iv) deep reinforcement learning (RL) for producing an efficient plan to simultaneously discover unseen objects and rearrange the visible misplaced ones to minimize the overall traversal. The paper also presents new metrics and a benchmark dataset called MoPOR to evaluate the effectiveness of the rearrangement planning in a multi-room setting. The experimental results demonstrate that the proposed method effectively addresses the multi-room rearrangement problem. \ No newline at end of file diff --git a/data/2024/aaai/Task-Adaptive Prompted Transformer for Cross-Domain Few-Shot Learning b/data/2024/aaai/Task-Adaptive Prompted Transformer for Cross-Domain Few-Shot Learning new file mode 100644 index 0000000000..3152b72448 --- /dev/null +++ b/data/2024/aaai/Task-Adaptive Prompted Transformer for Cross-Domain Few-Shot Learning @@ -0,0 +1 @@ +Cross-Domain Few-Shot Learning (CD-FSL) aims at recognizing samples in novel classes from unseen domains that are vastly different from training classes, with few labeled samples. However, the large domain gap between training and novel classes makes previous FSL methods perform poorly. To address this issue, we propose MetaPrompt, a Task-adaptive Prompted Transformer model for CD-FSL, by jointly exploiting prompt learning and the parameter generation framework. The proposed MetaPrompt enjoys several merits. First, a task-conditioned prompt generator is established upon attention mechanisms. It can flexibly produce a task-adaptive prompt with arbitrary length for unseen tasks, by selectively gathering task characteristics from the contextualized support embeddings. Second, the task-adaptive prompt is attached to Vision Transformer to facilitate fast task adaptation, steering the task-agnostic representation to incorporate task knowledge. 
To the best of our knowledge, this is the first work to exploit a prompt-based parameter generation mechanism for CD-FSL. Extensive experimental results on the Meta-Dataset benchmark demonstrate that our method achieves superior results compared to state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning against Attribute Inference Attacks b/data/2024/aaai/Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning against Attribute Inference Attacks new file mode 100644 index 0000000000..596aee0465 --- /dev/null +++ b/data/2024/aaai/Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning against Attribute Inference Attacks @@ -0,0 +1,5 @@ +Federated learning (FL) has been widely studied recently due to its ability to collaboratively train models on data from different devices without sharing the raw data. Nevertheless, recent studies show that it may still be possible for an adversary to infer private information about devices' data, e.g., sensitive attributes such as income, race, and sexual orientation. To mitigate attribute inference attacks, various existing privacy-preserving FL methods can be adopted/adapted. However, all these existing methods have key limitations: they need to know the FL task in advance, or have intolerable computational overheads or utility losses, or do not have provable privacy guarantees. + +We address these issues and design a task-agnostic privacy-preserving representation learning method for FL (TAPPFL) against attribute inference attacks. TAPPFL is formulated via information theory. Specifically, TAPPFL has two mutual information goals, where one goal learns task-agnostic data representations that contain the least information about the private attribute in each device's data, and the other goal ensures that the learnt data representations include as much information as possible about the device data to maintain FL utility. We also derive privacy guarantees of TAPPFL against worst-case attribute inference attacks, as well as the inherent tradeoff between utility preservation and privacy protection. Extensive results on multiple datasets and applications validate the effectiveness of TAPPFL in protecting data privacy and maintaining FL utility while remaining efficient. +Experimental results also show that TAPPFL outperforms the existing defenses. \ No newline at end of file diff --git a/data/2024/aaai/Task-Disruptive Background Suppression for Few-Shot Segmentation b/data/2024/aaai/Task-Disruptive Background Suppression for Few-Shot Segmentation new file mode 100644 index 0000000000..36a2e52f65 --- /dev/null +++ b/data/2024/aaai/Task-Disruptive Background Suppression for Few-Shot Segmentation @@ -0,0 +1 @@ +Few-shot segmentation aims to accurately segment novel target objects within query images using only a limited number of annotated support images. Recent works exploit the support background as well as its foreground to precisely compute the dense correlations between query and support. However, they overlook the characteristics of the background, which generally contains various types of objects. In this paper, we highlight this characteristic of the background, which can lead to the following problematic cases: (1) when the query and support backgrounds are dissimilar and (2) when objects in the support background are similar to the target object in the query. 
Without any consideration of the above cases, adopting the entire support background leads to a misprediction of the query foreground as background. To address this issue, we propose Task-disruptive Background Suppression (TBS), a module to suppress those disruptive support background features based on two spatial-wise scores: query-relevant and target-relevant scores. The former aims to mitigate the impact of unshared features solely existing in the support background, while the latter aims to reduce the influence of target-similar support background features. Based on these two scores, we define a query background relevant score that captures the similarity between the backgrounds of the query and the support, and utilize it to scale support background features to adaptively restrict the impact of disruptive support backgrounds. Our proposed method achieves state-of-the-art performance on standard few-shot segmentation benchmarks. Our official code is available at github.com/SuhoPark0706/TBSNet. \ No newline at end of file diff --git a/data/2024/aaai/Task-Driven Causal Feature Distillation: Towards Trustworthy Risk Prediction b/data/2024/aaai/Task-Driven Causal Feature Distillation: Towards Trustworthy Risk Prediction new file mode 100644 index 0000000000..f0bdf473b2 --- /dev/null +++ b/data/2024/aaai/Task-Driven Causal Feature Distillation: Towards Trustworthy Risk Prediction @@ -0,0 +1 @@ +The tremendous recent successes of artificial intelligence in many areas have sparked great interest in its potential for trustworthy and interpretable risk prediction. However, most models lack causal reasoning and struggle with class imbalance, leading to poor precision and recall. To address this, we propose a Task-Driven Causal Feature Distillation model (TDCFD) to transform original feature values into causal feature attributions for the specific risk prediction task. The causal feature attribution describes how much the value of each feature contributes to the risk prediction result. After the causal feature distillation, a deep neural network is applied to produce trustworthy prediction results with causal interpretability and high precision/recall. We evaluate the performance of our TDCFD method on several synthetic and real datasets, and the results demonstrate its superiority over the state-of-the-art methods regarding precision, recall, interpretability, and causality. \ No newline at end of file diff --git a/data/2024/aaai/Task-Free Continual Generation and Representation Learning via Dynamic Expansionable Memory Cluster b/data/2024/aaai/Task-Free Continual Generation and Representation Learning via Dynamic Expansionable Memory Cluster new file mode 100644 index 0000000000..7a15ac70a6 --- /dev/null +++ b/data/2024/aaai/Task-Free Continual Generation and Representation Learning via Dynamic Expansionable Memory Cluster @@ -0,0 +1 @@ +Human brains can continually acquire and learn new skills and knowledge over time from a dynamically changing environment without forgetting previously learnt information. Such a capacity can selectively transfer some important and recently seen information to the persistent knowledge regions of the brain. Inspired by this intuition, we propose a new memory-based approach for image reconstruction and generation in continual learning, consisting of a temporary and an evolving memory, with two different storage strategies corresponding to temporary and permanent memorisation. 
The temporary memory aims to preserve up-to-date information, while the evolving memory can dynamically increase its capacity in order to preserve permanent knowledge. This is achieved by the proposed memory expansion mechanism that selectively transfers data samples deemed important from the temporary memory to new clusters defined within the evolved memory according to an information novelty criterion. Such a mechanism promotes knowledge diversity among clusters in the evolved memory, allowing more diverse information to be captured within a compact memory capacity. Furthermore, we propose a two-step optimization strategy for training a Variational Autoencoder (VAE) to implement generation and representation learning tasks, which updates the generator and inference models separately using two optimisation paths. This approach leads to a better trade-off between generation and reconstruction performance. We show empirically and theoretically that the proposed approach can learn meaningful latent representations while generating diverse images from different domains. The source code and supplementary material (SM) are available at https://github.com/dtuzi123/DEMC. \ No newline at end of file diff --git a/data/2024/aaai/Task-Free Dynamic Sparse Vision Transformer for Continual Learning b/data/2024/aaai/Task-Free Dynamic Sparse Vision Transformer for Continual Learning new file mode 100644 index 0000000000..316d1ae3f8 --- /dev/null +++ b/data/2024/aaai/Task-Free Dynamic Sparse Vision Transformer for Continual Learning @@ -0,0 +1 @@ +Vision Transformers (ViTs) are self-attention-based network backbones shown to be efficient in many individual tasks, but they have not been explored in Task-Free Continual Learning (TFCL) so far. Most existing ViT-based approaches for Continual Learning (CL) rely on task information. In this study, we explore the advantages of the ViT in a more challenging CL scenario where the task boundaries are unavailable during training. To address this learning paradigm, we propose the Task-Free Dynamic Sparse Vision Transformer (TFDSViT), which can dynamically build new sparse experts, where each expert leverages sparsity to allocate the model's capacity for capturing different information categories over time. To avoid forgetting and ensure efficiency in reusing previously learned knowledge in subsequent learning, we propose a new dynamic dual attention mechanism consisting of the Sparse Attention (SA') and Knowledge Transfer Attention (KTA) modules. The SA' refrains from updating some previously learned attention blocks to preserve prior knowledge. The KTA uses and regulates the information flow of all previously learned experts for learning new patterns. The proposed dual attention mechanism can simultaneously relieve forgetting and promote knowledge transfer for a dynamic expansion model in a task-free manner. We also propose an energy-based dynamic expansion mechanism that uses energy as a measure of novelty for incoming samples, which provides appropriate expansion signals leading to a compact network architecture for TFDSViT. Extensive empirical studies demonstrate the effectiveness of TFDSViT. The code and supplementary material (SM) are available at https://github.com/dtuzi123/TFDSViT. 
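As an illustrative aside on the dynamic expansionable memory cluster entry above: the Python sketch below shows one simple reading of novelty-driven memory expansion, where a new cluster is created only when an incoming sample is sufficiently far from every existing cluster centre. The distance-based novelty score and the fixed threshold are assumptions for illustration, not the paper's criterion.

import numpy as np

class EvolvingMemory:
    def __init__(self, dim, novelty_threshold=4.0):
        self.centers = np.empty((0, dim))          # one centre per cluster in the evolved memory
        self.threshold = novelty_threshold

    def maybe_expand(self, sample):
        # add a new cluster only when the sample is novel w.r.t. all existing clusters
        if len(self.centers) == 0:
            self.centers = sample[None, :]
            return True
        novelty = np.min(np.linalg.norm(self.centers - sample, axis=1))
        if novelty > self.threshold:
            self.centers = np.vstack([self.centers, sample])
            return True
        return False

memory = EvolvingMemory(dim=8)
for x in np.random.default_rng(1).normal(scale=2.0, size=(200, 8)):
    memory.maybe_expand(x)
print("clusters kept in the evolved memory:", len(memory.centers))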
\ No newline at end of file diff --git a/data/2024/aaai/Taxonomy Driven Fast Adversarial Training b/data/2024/aaai/Taxonomy Driven Fast Adversarial Training new file mode 100644 index 0000000000..209cd373bb --- /dev/null +++ b/data/2024/aaai/Taxonomy Driven Fast Adversarial Training @@ -0,0 +1 @@ +Adversarial training (AT) is an effective defense method against gradient-based attacks that enhances the robustness of neural networks. Among AT methods, single-step AT has emerged as a hot topic due to its simplicity and efficiency, requiring only one gradient propagation to generate adversarial examples. Nonetheless, the problem of catastrophic overfitting (CO) that causes training collapse remains poorly understood, and there exists a gap between the robust accuracy achieved through single- and multi-step AT. In this paper, we present a surprising finding that the taxonomy of adversarial examples reveals the truth of CO. Based on this conclusion, we propose taxonomy driven fast adversarial training (TDAT), which jointly optimizes the learning objective, loss function, and initialization method, and can thereby be regarded as a new paradigm of single-step AT. Compared with other fast AT methods, TDAT can boost the robustness of neural networks, alleviate the influence of misclassified examples, and prevent CO during the training process while requiring almost no additional computational and memory resources. Our method achieves robust accuracy improvements of 1.59%, 1.62%, 0.71%, and 1.26% on the CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet-100 datasets against the projected gradient descent (PGD10) attack with a perturbation budget of 8/255. Furthermore, our proposed method also achieves state-of-the-art robust accuracy against other attacks. Code is available at https://github.com/bookman233/TDAT. \ No newline at end of file diff --git a/data/2024/aaai/Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge Distillation b/data/2024/aaai/Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge Distillation new file mode 100644 index 0000000000..43c3c0ee21 --- /dev/null +++ b/data/2024/aaai/Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge Distillation @@ -0,0 +1 @@ +Data-free knowledge distillation (DFKD) aims to distill pretrained knowledge to a student model with the help of a generator without using the original data. In such data-free scenarios, achieving stable performance of DFKD is essential due to the unavailability of validation data. Unfortunately, we find that existing DFKD methods are quite sensitive to different teacher models, occasionally showing catastrophic failures of distillation, even when using well-trained teacher models. Our observation is that the generator in DFKD is not always guaranteed to produce precise yet diverse samples using the existing representative strategy of minimizing both class-prior and adversarial losses. Through our empirical study, we focus on the fact that class-prior not only decreases the diversity of generated samples, but also cannot completely address the problem of generating unexpectedly low-quality samples depending on teacher models. In this paper, we propose the teacher-agnostic data-free knowledge distillation (TA-DFKD) method, with the goal of more robust and stable performance regardless of teacher models. Our basic idea is to assign the teacher model a lenient expert role for evaluating samples, rather than a strict supervisor that enforces its class-prior on the generator. 
Specifically, we design a sample selection approach that takes only clean samples verified by the teacher model without imposing restrictions on the power of generating diverse samples. Through extensive experiments, we show that our method successfully achieves both robustness and training stability across various teacher models, while outperforming the existing DFKD methods. \ No newline at end of file diff --git a/data/2024/aaai/Teaching Large Language Models to Translate with Comparison b/data/2024/aaai/Teaching Large Language Models to Translate with Comparison new file mode 100644 index 0000000000..e479917636 --- /dev/null +++ b/data/2024/aaai/Teaching Large Language Models to Translate with Comparison @@ -0,0 +1,11 @@ +Open-sourced large language models (LLMs) have demonstrated remarkable efficacy in various tasks with instruction tuning. +However, these models can sometimes struggle with tasks that require more specialized knowledge such as translation. +One possible reason for such deficiency is that instruction tuning aims to generate fluent and coherent text that continues from a given instruction without being constrained by any task-specific requirements. +Moreover, it can be more challenging to tune smaller LLMs with lower-quality training data. +To address this issue, we propose a novel framework that uses examples in comparison to teach LLMs translation. +Our approach involves output comparison and preference comparison, presenting the model with carefully designed examples of correct and incorrect translations and an additional preference loss for better regularization. +Empirical evaluation on four language directions of the WMT2022 and FLORES-200 benchmarks shows the superiority of our proposed method over existing methods. +Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations. +Please refer to GitHub for more details: +https://github.com/lemon0830/TIM. \ No newline at end of file diff --git a/data/2024/aaai/TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling b/data/2024/aaai/TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling new file mode 100644 index 0000000000..4f86bc1424 --- /dev/null +++ b/data/2024/aaai/TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling @@ -0,0 +1 @@ +To address the limitations of traffic prediction from location-bound detectors, we present Geographical Cellular Traffic (GCT) flow, a novel data source that leverages the extensive coverage of cellular traffic to capture mobility patterns. Our extensive analysis validates its potential for transportation. Focusing on vehicle-related GCT flow prediction, we propose a graph neural network that integrates multivariate, temporal, and spatial facets for improved accuracy. Experiments reveal our model's superiority over baselines, especially in long-term predictions. We also highlight the potential for GCT flow integration into transportation systems. 
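One more illustrative aside, on the preference comparison mentioned in the translation-tuning entry above: a common way to realize such a preference signal is a pairwise logistic loss over the scores of a preferred versus a dispreferred translation. The PyTorch sketch below shows that generic formulation under assumed inputs (e.g., length-normalized sequence log-probabilities); the paper's exact objective may differ.

import torch
import torch.nn.functional as F

def preference_loss(score_preferred, score_dispreferred, margin=0.0):
    # -log sigmoid(s_preferred - s_dispreferred - margin), averaged over the batch
    return -F.logsigmoid(score_preferred - score_dispreferred - margin).mean()

# assumed scores for three sentence pairs, e.g. length-normalized log-probabilities from the model
s_good = torch.tensor([-1.2, -0.8, -2.0])
s_bad = torch.tensor([-1.9, -1.5, -1.7])
print(preference_loss(s_good, s_bad))       # loss shrinks as preferred translations score higher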
\ No newline at end of file diff --git a/data/2024/aaai/Tell Me What Is Good about This Property: Leveraging Reviews for Segment-Personalized Image Collection Summarization b/data/2024/aaai/Tell Me What Is Good about This Property: Leveraging Reviews for Segment-Personalized Image Collection Summarization new file mode 100644 index 0000000000..85a8d3b2c7 --- /dev/null +++ b/data/2024/aaai/Tell Me What Is Good about This Property: Leveraging Reviews for Segment-Personalized Image Collection Summarization @@ -0,0 +1 @@ +Image collection summarization techniques aim to present a compact representation of an image gallery through a carefully selected subset of images that captures its semantic content. When it comes to web content, however, the ideal selection can vary based on the user's specific intentions and preferences. This is particularly relevant at Booking.com, where presenting properties and their visual summaries that align with users' expectations is crucial. To address this challenge, in this work, we consider user intentions in the summarization of property visuals by analyzing property reviews and extracting the most significant aspects mentioned by users. By incorporating insights from reviews into our visual summaries, we present the content most relevant to a user. Moreover, we achieve this without the need for costly annotations. Our experiments, including human perceptual studies, demonstrate the superiority of our cross-modal approach, which we coin CrossSummarizer, over the no-personalization and image-based clustering baselines. \ No newline at end of file diff --git a/data/2024/aaai/Temporal Adaptive RGBT Tracking with Modality Prompt b/data/2024/aaai/Temporal Adaptive RGBT Tracking with Modality Prompt new file mode 100644 index 0000000000..366635f6bb --- /dev/null +++ b/data/2024/aaai/Temporal Adaptive RGBT Tracking with Modality Prompt @@ -0,0 +1 @@ +RGBT tracking has been widely used in various fields such as robotics, surveillance processing, and autonomous driving. Existing RGBT trackers fully explore the spatial information between the template and the search region and locate the target based on the appearance matching results. However, these RGBT trackers have very limited exploitation of temporal information, either ignoring temporal information or exploiting it through online sampling and training. The former struggles to cope with object state changes, while the latter neglects the correlation between spatial and temporal information. To alleviate these limitations, we propose a novel Temporal Adaptive RGBT Tracking framework, named TATrack. TATrack has a spatio-temporal two-stream structure and captures temporal information through an online updated template, where the two-stream structure refers to the multi-modal feature extraction and cross-modal interaction for the initial template and the online updated template, respectively. TATrack comprehensively exploits spatio-temporal and multi-modal information for target localization. In addition, we design a spatio-temporal interaction (STI) mechanism that bridges the two branches and enables cross-modal interaction to span longer time scales. Extensive experiments on three popular RGBT tracking benchmarks show that our method achieves state-of-the-art performance, while running at real-time speed. 
\ No newline at end of file diff --git a/data/2024/aaai/Temporal Correlation Vision Transformer for Video Person Re-Identification b/data/2024/aaai/Temporal Correlation Vision Transformer for Video Person Re-Identification new file mode 100644 index 0000000000..73b1b3788b --- /dev/null +++ b/data/2024/aaai/Temporal Correlation Vision Transformer for Video Person Re-Identification @@ -0,0 +1 @@ +Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID. \ No newline at end of file diff --git a/data/2024/aaai/Temporal Dependencies and Spatio-Temporal Patterns of Time Series Models b/data/2024/aaai/Temporal Dependencies and Spatio-Temporal Patterns of Time Series Models new file mode 100644 index 0000000000..4001d44777 --- /dev/null +++ b/data/2024/aaai/Temporal Dependencies and Spatio-Temporal Patterns of Time Series Models @@ -0,0 +1 @@ +The widespread use of Artificial Intelligence (AI) has highlighted the importance of understanding AI model behavior. This understanding is crucial for practical decision-making, assessing model reliability, and ensuring trustworthiness. Interpreting time series forecasting models faces unique challenges compared to image and text data. These challenges arise from the temporal dependencies between time steps and the evolving importance of input features over time. My thesis focuses on addressing these challenges by aiming for more precise explanations of feature interactions, uncovering spatiotemporal patterns, and demonstrating the practical applicability of these interpretability techniques using real-world datasets and state-of-the-art deep learning models. \ No newline at end of file diff --git a/data/2024/aaai/Temporal Graph Contrastive Learning for Sequential Recommendation b/data/2024/aaai/Temporal Graph Contrastive Learning for Sequential Recommendation new file mode 100644 index 0000000000..24a9dbf289 --- /dev/null +++ b/data/2024/aaai/Temporal Graph Contrastive Learning for Sequential Recommendation @@ -0,0 +1 @@ +Sequential recommendation is a crucial task in understanding users' evolving interests and predicting their future behaviors. While existing approaches on sequence or graph modeling to learn interaction sequences of users have shown promising performance, how to effectively exploit temporal information and deal with the uncertainty noise in evolving user behaviors is still quite challenging. 
To this end, in this paper, we propose a Temporal Graph Contrastive Learning method for Sequential Recommendation (TGCL4SR) which leverages not only local interaction sequences but also global temporal graphs to comprehend item correlations and analyze user behaviors from a temporal perspective. Specifically, we first devise a Temporal Item Transition Graph (TITG) to fully leverage global interactions to understand item correlations, and augment this graph by dual transformations based on neighbor sampling and time disturbance. Accordingly, we design a Temporal item Transition graph Convolutional network (TiTConv) to capture temporal item transition patterns in TITG. Then, a novel Temporal Graph Contrastive Learning (TGCL) mechanism is designed to enhance the uniformity of representations between augmented graphs from identical sequences. For local interaction sequences, we design a temporal sequence encoder to incorporate time interval embeddings into the architecture of Transformer. At the training stage, we take maximum mean discrepancy and TGCL losses as auxiliary objectives. Extensive experiments on several real-world datasets show the effectiveness of TGCL4SR against state-of-the-art baselines of sequential recommendation. \ No newline at end of file diff --git a/data/2024/aaai/Temporal Logic Explanations for Dynamic Decision Systems Using Anchors and Monte Carlo Tree Search (Abstract Reprint) b/data/2024/aaai/Temporal Logic Explanations for Dynamic Decision Systems Using Anchors and Monte Carlo Tree Search (Abstract Reprint) new file mode 100644 index 0000000000..e4f1999557 --- /dev/null +++ b/data/2024/aaai/Temporal Logic Explanations for Dynamic Decision Systems Using Anchors and Monte Carlo Tree Search (Abstract Reprint) @@ -0,0 +1 @@ +For many automated perception and decision tasks, state-of-the-art performance may be obtained by algorithms that are too complex for their behavior to be completely understandable or predictable by human users, e.g., because they employ large machine learning models. To integrate these algorithms into safety-critical decision and control systems, it is particularly important to develop methods that can promote trust into their decisions and help explore their failure modes. In this article, we combine the anchors methodology with Monte Carlo Tree Search to provide local model-agnostic explanations for the behaviors of a given black-box model making decisions by processing time-varying input signals. Our approach searches for descriptive explanations for these decisions in the form of properties of the input signals, expressed in Signal Temporal Logic, which are highly likely to reproduce the observed behavior. To illustrate the methodology, we apply it in simulations to the analysis of a hybrid (continuous-discrete) control system and a collision avoidance system for unmanned aircraft (ACAS Xu) implemented by a neural network. \ No newline at end of file diff --git a/data/2024/aaai/Temporal-Distributed Backdoor Attack against Video Based Action Recognition b/data/2024/aaai/Temporal-Distributed Backdoor Attack against Video Based Action Recognition new file mode 100644 index 0000000000..dfe550647e --- /dev/null +++ b/data/2024/aaai/Temporal-Distributed Backdoor Attack against Video Based Action Recognition @@ -0,0 +1 @@ +Deep neural networks (DNNs) have achieved tremendous success in various applications including video action recognition, yet remain vulnerable to backdoor attacks (Trojans). 
The backdoor-compromised model will misclassify a test instance (from a non-target class) to the target class chosen by the attacker when the instance is embedded with a specific trigger, while maintaining high accuracy on attack-free instances. Although there are extensive studies on backdoor attacks against image data, the susceptibility of video-based systems to backdoor attacks remains largely unexplored. Current studies are direct extensions of approaches proposed for image data, e.g., the triggers are independently embedded within the frames, which tend to be detectable by existing defenses. In this paper, we introduce a simple yet effective backdoor attack against video data. Our proposed attack, adding perturbations in a transformed domain, plants an imperceptible, temporally distributed trigger across the video frames, and is shown to be resilient to existing defensive strategies. The effectiveness of the proposed attack is demonstrated by extensive experiments with various well-known models on two video recognition benchmarks, UCF101 and HMDB51, and a sign language recognition benchmark, the Greek Sign Language (GSL) dataset. We delve into the impact of several influential factors on our proposed attack and identify an intriguing effect termed "collateral damage" through extensive studies. \ No newline at end of file diff --git a/data/2024/aaai/Temporally and Distributionally Robust Optimization for Cold-Start Recommendation b/data/2024/aaai/Temporally and Distributionally Robust Optimization for Cold-Start Recommendation new file mode 100644 index 0000000000..79523c4b66 --- /dev/null +++ b/data/2024/aaai/Temporally and Distributionally Robust Optimization for Cold-Start Recommendation @@ -0,0 +1,2 @@ +Collaborative Filtering (CF) recommender models highly depend on user-item interactions to learn CF representations, thus falling short of recommending cold-start items. To address this issue, prior studies mainly introduce item features (e.g., thumbnails) for cold-start item recommendation. They learn a feature extractor on warm-start items to align feature representations with interactions, and then leverage the feature extractor to extract the feature representations of cold-start items for interaction prediction. Unfortunately, the features of cold-start items, especially the popular ones, tend to diverge from those of warm-start ones due to temporal feature shifts, preventing the feature extractor from accurately learning feature representations of cold-start items. +To alleviate the impact of temporal feature shifts, we consider using Distributionally Robust Optimization (DRO) to enhance the generalization ability of the feature extractor. Nonetheless, existing DRO methods face an inconsistency issue: the worst-case warm-start items emphasized during DRO training might not align well with the cold-start item distribution. To capture the temporal feature shifts and combat this inconsistency issue, we propose a novel temporal DRO with new optimization objectives, namely, 1) to integrate a worst-case factor to improve the worst-case performance, and 2) to devise a shifting factor to capture the shifting trend of item features and enhance the optimization of the potentially popular groups in cold-start items. Substantial experiments on three real-world datasets validate the superiority of our temporal DRO in enhancing the generalization ability of cold-start recommender models. 
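To make the DRO idea referenced above concrete, the following is a minimal sketch of a generic worst-case-weighted (group-DRO-style) objective that up-weights poorly performing item groups. It is only an illustration under assumed placeholder names (per_item_loss, group_ids, temperature), not the paper's temporal DRO with its shifting factor.

```python
import torch

def worst_case_weighted_loss(per_item_loss: torch.Tensor,
                             group_ids: torch.Tensor,
                             num_groups: int,
                             temperature: float = 1.0) -> torch.Tensor:
    """Average the loss inside each item group, then re-weight groups by a softmax
    over their losses so that harder (worse-performing) groups dominate the objective."""
    group_losses = []
    for g in range(num_groups):
        mask = group_ids == g
        if mask.any():
            group_losses.append(per_item_loss[mask].mean())
    group_losses = torch.stack(group_losses)
    # A softmax weighting smoothly approximates the worst-case (max) group loss.
    weights = torch.softmax(group_losses.detach() / temperature, dim=0)
    return (weights * group_losses).sum()

# Usage with hypothetical per-interaction losses and item-group assignments.
loss = worst_case_weighted_loss(
    torch.rand(8), torch.tensor([0, 0, 1, 1, 1, 2, 2, 2]), num_groups=3)
```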
\ No newline at end of file diff --git a/data/2024/aaai/Tensorized Label Learning on Anchor Graph b/data/2024/aaai/Tensorized Label Learning on Anchor Graph new file mode 100644 index 0000000000..3badcb521e --- /dev/null +++ b/data/2024/aaai/Tensorized Label Learning on Anchor Graph @@ -0,0 +1 @@ +Graph-based multimedia data clustering has attracted much attention due to the impressive clustering performance for arbitrarily shaped multimedia data. However, existing graph-based clustering methods need post-processing to get labels for multimedia data with high computational complexity. Moreover, it is sub-optimal for label learning due to the fact that they exploit the complementary information embedded in data with different types pixel by pixel. To handle these problems, we present a novel label learning model with good interpretability for clustering. To be specific, our model decomposes anchor graph into the products of two matrices with orthogonal non-negative constraint to directly get soft label without any post-processing, which remarkably reduces the computational complexity. To well exploit the complementary information embedded in multimedia data, we introduce tensor Schatten p-norm regularization on the label tensor which is composed of soft labels of multimedia data. The solution can be obtained by iteratively optimizing four decoupled sub-problems, which can be solved more efficiently with good convergence. Experimental results on various datasets demonstrate the efficiency of our model. \ No newline at end of file diff --git a/data/2024/aaai/Test-Time Adaptation via Style and Structure Guidance for Histological Image Registration b/data/2024/aaai/Test-Time Adaptation via Style and Structure Guidance for Histological Image Registration new file mode 100644 index 0000000000..870dd3a0ff --- /dev/null +++ b/data/2024/aaai/Test-Time Adaptation via Style and Structure Guidance for Histological Image Registration @@ -0,0 +1,10 @@ +Image registration plays a crucial role in histological image analysis, encompassing tasks like multi-modality fusion and disease grading. +Traditional registration methods optimize objective functions for each image pair, yielding reliable accuracy but demanding heavy inference burdens. +Recently, learning-based registration methods utilize networks to learn the optimization process during training and apply a one-step forward process during testing. +While these methods offer promising registration performance with reduced inference time, they remain sensitive to appearance variances and local structure changes commonly encountered in histological image registration scenarios. +In this paper, for the first time, we propose a novel test-time adaptation method for histological image registration, aiming to improve the generalization ability of learning-based methods. +Specifically, we design two operations, style guidance and shape guidance, for the test-time adaptation process. +The former leverages style representations encoded by feature statistics to address the issue of appearance variances, while the latter incorporates shape representations encoded by HOG features to improve registration accuracy in regions with structural changes. +Furthermore, we consider the continuity of the model during the test-time adaptation process. +Different from the previous methods initialized by a given trained model, we introduce a smoothing strategy to leverage historical models for better generalization. 
+We conduct experiments with several representative learning-based backbones on a public histological dataset, demonstrating the superior registration performance of our test-time adaptation method. \ No newline at end of file diff --git a/data/2024/aaai/Test-Time Personalization with Meta Prompt for Gaze Estimation b/data/2024/aaai/Test-Time Personalization with Meta Prompt for Gaze Estimation new file mode 100644 index 0000000000..9158adf2ac --- /dev/null +++ b/data/2024/aaai/Test-Time Personalization with Meta Prompt for Gaze Estimation @@ -0,0 +1 @@ +Despite the recent remarkable achievements in gaze estimation, efficient and accurate personalization of gaze estimation without labels is a practical problem but is rarely touched on in the literature. To achieve efficient personalization, we take inspiration from recent advances in Natural Language Processing (NLP) by updating a negligible number of parameters, "prompts", at test time. Specifically, the prompt is additionally attached without perturbing the original network and can contain less than 1% of a ResNet-18's parameters. Our experiments show the high efficiency of the prompt tuning approach: it can be 10 times faster in terms of adaptation speed than the compared methods. However, it is non-trivial to update the prompt for personalized gaze estimation without labels. At test time, it is essential to ensure that minimizing a particular unsupervised loss leads to the goal of minimizing the gaze estimation error. To address this difficulty, we propose to meta-learn the prompt to ensure that its updates align with this goal. Our experiments show that the meta-learned prompt can be effectively adapted even with a simple symmetry loss. In addition, we experiment on four cross-dataset validations to show the remarkable advantages of the proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Testing Self-Reducible Samplers b/data/2024/aaai/Testing Self-Reducible Samplers new file mode 100644 index 0000000000..e50499f1af --- /dev/null +++ b/data/2024/aaai/Testing Self-Reducible Samplers @@ -0,0 +1,6 @@ +Samplers are the backbone of the implementations of any randomized algorithm. Unfortunately, an efficient algorithm to test the correctness of samplers is very hard to find. Recently, a series of works obtained testers such as Barbarik, Teq, and Flash for particular kinds of samplers, like CNF-samplers and Horn-samplers. However, their techniques have a significant limitation, because one cannot expect to use their methods to test other samplers, such as perfect matching samplers or samplers for sampling linear extensions in posets. +In this paper, we present a new testing algorithm that works for such samplers and can estimate the distance of a new sampler from a known sampler (say, the uniform sampler). + +Testing the identity of distributions is at the heart of testing the correctness of samplers. This paper's main technical contribution is developing a new distance estimation algorithm for distributions over high-dimensional cubes using the recently proposed subcube conditioning sampling model. Given subcube conditioning access to an unknown distribution P, and a known distribution Q defined over an n-dimensional Boolean hypercube, our algorithm CubeProbeEst estimates the variation distance between P and Q within additive error using subcube conditional samples from P. 
Following the testing-via-learning paradigm, we also get a tester that distinguishes between the cases when P and Q are close or far in variation distance with high probability using subcube conditional samples. + +This estimation algorithm CubeProbeEst in the subcube conditioning sampling model helps us to design the first tester for self-reducible samplers. The correctness of the tester is formally proved. Moreover, we implement CubeProbeEst to test the quality of three samplers for sampling linear extensions in posets. \ No newline at end of file diff --git a/data/2024/aaai/TexFit: Text-Driven Fashion Image Editing with Diffusion Models b/data/2024/aaai/TexFit: Text-Driven Fashion Image Editing with Diffusion Models new file mode 100644 index 0000000000..789ef25232 --- /dev/null +++ b/data/2024/aaai/TexFit: Text-Driven Fashion Image Editing with Diffusion Models @@ -0,0 +1 @@ +Fashion image editing aims to edit an input image to obtain richer or distinct visual clothing matching effects. Existing global fashion image editing methods are difficult to achieve rich outfit combination effects while local fashion image editing is more in line with the needs of diverse and personalized outfit matching. The local editing techniques typically depend on text and auxiliary modalities (e.g., human poses, human keypoints, garment sketches, etc.) for image manipulation, where the auxiliary modalities essentially assist in locating the editing region. Since these auxiliary modalities usually involve additional efforts in practical application scenarios, text-driven fashion image editing shows high flexibility. In this paper, we propose TexFit, a Text-driven Fashion image Editing method using diffusion models, which performs the local image editing only with the easily accessible text. Our approach employs a text-based editing region location module to predict precise editing region in the fashion image. Then, we take the predicted region as the generation condition of diffusion models together with the text prompt to achieve precise local editing of fashion images while keeping the rest part intact. In addition, previous fashion datasets usually focus on global description, lacking local descriptive information that can guide the precise local editing. Therefore, we develop a new DFMM-Spotlight dataset by using region extraction and attribute combination strategies. It focuses locally on clothes and accessories, enabling local editing with text input. Experimental results on the DFMM-Spotlight dataset demonstrate the effectiveness of our model. Code and Datasets are available at https://texfit.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/Text Diffusion with Reinforced Conditioning b/data/2024/aaai/Text Diffusion with Reinforced Conditioning new file mode 100644 index 0000000000..bb409c4370 --- /dev/null +++ b/data/2024/aaai/Text Diffusion with Reinforced Conditioning @@ -0,0 +1 @@ +Diffusion models have demonstrated exceptional capability in generating high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they provide a strong potential for achieving better non-autoregressive sequence generation. However, existing text diffusion models still fall short in their performance due to a challenge in handling the discreteness of language. This paper thoroughly analyzes text diffusion models and uncovers two significant limitations: degradation of self-conditioning during training and misalignment between training and sampling. 
Motivated by our findings, we propose a novel Text Diffusion model called TReC, which mitigates the degradation with Reinforced Conditioning and the misalignment by Time-Aware Variance Scaling. Our extensive experiments demonstrate the competitiveness of TReC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples. \ No newline at end of file diff --git a/data/2024/aaai/Text Image Inpainting via Global Structure-Guided Diffusion Models b/data/2024/aaai/Text Image Inpainting via Global Structure-Guided Diffusion Models new file mode 100644 index 0000000000..4b550013c1 --- /dev/null +++ b/data/2024/aaai/Text Image Inpainting via Global Structure-Guided Diffusion Models @@ -0,0 +1 @@ +Real-world text can be damaged by corrosion issues caused by environmental or human factors, which hinder the preservation of the complete styles of texts, e.g., texture and structure. These corrosion issues, such as graffiti signs and incomplete signatures, bring difficulties in understanding the texts, thereby posing significant challenges to downstream applications, e.g., scene text recognition and signature identification. Notably, current inpainting techniques often fail to adequately address this problem and have difficulties restoring accurate text images along with reasonable and consistent styles. Formulating this as an open problem of text image inpainting, this paper aims to build a benchmark to facilitate its study. In doing so, we establish two specific text inpainting datasets which contain scene text images and handwritten text images, respectively. Each of them includes images revamped by real-life and synthetic datasets, featuring pairs of original images, corrupted images, and other assistant information. On top of the datasets, we further develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution. Leveraging the global structure of the text as a prior, the proposed GSDM develops an efficient diffusion model to recover clean texts. The efficacy of our approach is demonstrated by thorough empirical study, including a substantial boost in both recognition accuracy and image quality. These findings not only highlight the effectiveness of our method but also underscore its potential to enhance the broader field of text image understanding and processing. Code and datasets are available at: https://github.com/blackprotoss/GSDM. \ No newline at end of file diff --git a/data/2024/aaai/Text-Based Occluded Person Re-identification via Multi-Granularity Contrastive Consistency Learning b/data/2024/aaai/Text-Based Occluded Person Re-identification via Multi-Granularity Contrastive Consistency Learning new file mode 100644 index 0000000000..635c1edf6a --- /dev/null +++ b/data/2024/aaai/Text-Based Occluded Person Re-identification via Multi-Granularity Contrastive Consistency Learning @@ -0,0 +1 @@ +Text-based Person Re-identification (T-ReID), which aims at retrieving a specific pedestrian image from a collection of images via text-based information, has received significant attention. However, previous research has overlooked a challenging yet practical form of T-ReID: dealing with image galleries mixed with occluded and inconsistent personal visuals, instead of ideal visuals with a full-body and clear view. 
Its major challenges lie in the insufficiency of benchmark datasets and the enlarged semantic gap incurred by arbitrary occlusions and the modality gap between the text description and the visual representation of the target person. To alleviate these issues, we first design an Occlusion Generator (OGor) for the automatic generation of artificially occluded images from generic surveillance images. Then, a fine-granularity token selection mechanism is proposed to minimize the negative impact of occlusion for robust feature learning, and a novel multi-granularity contrastive consistency alignment framework is designed to leverage intra-/inter-granularity of visual-text representations for semantic alignment of occluded visuals and query texts. Experimental results demonstrate that our method exhibits superior performance. We believe this work could inspire the community to investigate more dedicated designs for implementing T-ReID in real-world scenarios. The source code is available at https://github.com/littlexinyi/MGCC. \ No newline at end of file diff --git a/data/2024/aaai/Text-to-Image Generation for Abstract Concepts b/data/2024/aaai/Text-to-Image Generation for Abstract Concepts new file mode 100644 index 0000000000..85d35155db --- /dev/null +++ b/data/2024/aaai/Text-to-Image Generation for Abstract Concepts @@ -0,0 +1 @@ +Recent years have witnessed substantial progress of large-scale models across various domains, such as natural language processing and computer vision, facilitating the expression of concrete concepts. Unlike concrete concepts that are usually directly associated with physical objects, expressing abstract concepts through natural language requires considerable effort, since they are characterized by intricate semantics and connotations. An alternative approach is to leverage images to convey rich visual information as a supplement. Nevertheless, existing Text-to-Image (T2I) models are primarily trained on concrete physical objects and often struggle to visualize abstract concepts. Inspired by the three-layer artwork theory, which identifies the critical factors of intent, object and form during artistic creation, we propose a framework of Text-to-Image generation for Abstract Concepts (TIAC). The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity. LLMs then transform it into semantically related physical objects, and the concept-dependent form is retrieved from an LLM-extracted form pattern set. Information from these three aspects is integrated to generate prompts for T2I models via an LLM. Evaluation results from human assessments and our newly designed metric, concept score, demonstrate the effectiveness of our framework in creating images that can sufficiently express abstract concepts. \ No newline at end of file diff --git a/data/2024/aaai/Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries b/data/2024/aaai/Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries new file mode 100644 index 0000000000..2437af71d9 --- /dev/null +++ b/data/2024/aaai/Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries @@ -0,0 +1 @@ +Tabular data analysis is crucial in various fields, and large language models show promise in this area. However, current research mostly focuses on rudimentary tasks like Text2SQL and TableQA, neglecting advanced analysis like forecasting and chart generation. 
To address this gap, we developed the Text2Analysis benchmark, incorporating advanced analysis tasks that go beyond SQL-compatible operations and require more in-depth analysis. We also develop five innovative and effective annotation methods, harnessing the capabilities of large language models to enhance data quality and quantity. Additionally, we include unclear queries that resemble real-world user questions to test how well models can understand and tackle such challenges. Finally, we collect 2249 query-result pairs with 347 tables. We evaluate five state-of-the-art models using three different metrics, and the results show that our benchmark introduces a considerable challenge in the field of tabular data analysis, paving the way for more advanced research opportunities. \ No newline at end of file diff --git a/data/2024/aaai/Text2City: One-Stage Text-Driven Urban Layout Regeneration b/data/2024/aaai/Text2City: One-Stage Text-Driven Urban Layout Regeneration new file mode 100644 index 0000000000..2a49174958 --- /dev/null +++ b/data/2024/aaai/Text2City: One-Stage Text-Driven Urban Layout Regeneration @@ -0,0 +1 @@ +Regenerating the urban layout is an essential process for urban regeneration. In this paper, we propose a new task called text-driven urban layout regeneration, which provides an intuitive input modality - text - for users to specify the regeneration, instead of designing complex rules. Given the target region to be regenerated, we propose a one-stage text-driven urban layout regeneration model, Text2City, to jointly and progressively regenerate the urban layout (i.e., road and building layouts) based on textual layout descriptions and the surrounding context (i.e., urban layouts and functions of the surrounding regions). Text2City first extracts road and building attributes from the textual layout description to guide the regeneration. It includes a novel one-stage joint regenerator network based on conditioned denoising diffusion probabilistic models (DDPMs) and prior knowledge exchange. To harmonize the regenerated layouts through joint optimization, we propose an interactive & enhanced guidance module for self-enhancement and prior knowledge exchange between road and building layouts during the regeneration. We also design a series of constraints at the attribute, geometry and pixel levels to ensure rational urban layout generation. To train our model, we build a large-scale dataset containing urban layouts and layout descriptions, covering 147K regions. Qualitative and quantitative evaluations show that our proposed method outperforms the baseline methods in regenerating desirable urban layouts that meet the textual descriptions. \ No newline at end of file diff --git a/data/2024/aaai/TextGT: A Double-View Graph Transformer on Text for Aspect-Based Sentiment Analysis b/data/2024/aaai/TextGT: A Double-View Graph Transformer on Text for Aspect-Based Sentiment Analysis new file mode 100644 index 0000000000..34d3429664 --- /dev/null +++ b/data/2024/aaai/TextGT: A Double-View Graph Transformer on Text for Aspect-Based Sentiment Analysis @@ -0,0 +1 @@ +Aspect-based sentiment analysis (ABSA) is aimed at predicting the sentiment polarities of the aspects included in a sentence instead of the whole sentence itself, and is a fine-grained learning task compared to conventional text classification. 
In recent years, owing to their ability to model the connectivity relationships between the words in a sentence, graph neural networks have become increasingly popular for handling natural language processing tasks, and meanwhile many works have emerged for the ABSA task. However, most of the works utilizing graph convolution easily incur the over-smoothing problem, while graph Transformers for ABSA have not been explored yet. In addition, although some previous works are dedicated to using both GNNs and Transformers to handle text, how to tightly combine the graph view and the sequence view of text remains open to research. To address the above issues, we propose a double-view graph Transformer on text (TextGT) for ABSA. In TextGT, the graph view of text is handled by GNN layers, while Transformer layers deal with the sequence view, and these two processes are tightly coupled, alleviating the over-smoothing problem. Moreover, we propose an algorithm for implementing a kind of densely message-passing graph convolution, called TextGINConv, to employ edge features in graphs. Extensive experiments demonstrate the effectiveness of our TextGT over state-of-the-art approaches and validate the TextGINConv module. The source code is available at https://github.com/shuoyinn/TextGT. \ No newline at end of file diff --git a/data/2024/aaai/The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models b/data/2024/aaai/The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models new file mode 100644 index 0000000000..43dd21276c --- /dev/null +++ b/data/2024/aaai/The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models @@ -0,0 +1,9 @@ +Thompson sampling (TS) has been known for its outstanding empirical performance supported by theoretical guarantees across various reward models in the classical stochastic multi-armed bandit problems. +Nonetheless, its optimality is often restricted to specific priors due to the common observation that TS is fairly insensitive to the choice of the prior when it comes to asymptotic regret bounds. +However, when the model contains multiple parameters, the optimality of TS highly depends on the choice of priors, which casts doubt on the generalizability of previous findings to other models. +To address this gap, this study explores the impact of selecting noninformative priors, offering insights into the performance of TS when dealing with new models that lack theoretical understanding. +We first extend the regret analysis of TS to the model of uniform distributions with unknown supports, which would be the simplest non-regular model. +Our findings reveal that changing noninformative priors can significantly affect the expected regret, aligning with previously known results in other multiparameter bandit models. +Although the uniform prior is shown to be optimal, we highlight the inherent limitation of this optimality, which holds only for specific parameterizations and thus emphasizes the significance of the invariance property of priors. +In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which can achieve asymptotic optimality for the Gaussian and uniform models by using the reference prior and the Jeffreys prior, which are invariant under one-to-one reparameterizations. 
+This policy provides an alternative approach to achieving optimality by employing fine-tuned truncation, which would be much easier than hunting for optimal priors in practice. \ No newline at end of file diff --git a/data/2024/aaai/The CoachAI Badminton Environment: A Novel Reinforcement Learning Environment with Realistic Opponents (Student Abstract) b/data/2024/aaai/The CoachAI Badminton Environment: A Novel Reinforcement Learning Environment with Realistic Opponents (Student Abstract) new file mode 100644 index 0000000000..6b5457f998 --- /dev/null +++ b/data/2024/aaai/The CoachAI Badminton Environment: A Novel Reinforcement Learning Environment with Realistic Opponents (Student Abstract) @@ -0,0 +1 @@ +The growing demand for precise sports analysis has been explored to improve athlete performance in various sports (e.g., basketball, soccer). However, existing methods for different sports face challenges in validating strategies in environments due to simple rule-based opponents leading to performance gaps when deployed in real-world matches. In this paper, we propose the CoachAI Badminton Environment, a novel reinforcement learning (RL) environment with realistic opponents for badminton, which serves as a compelling example of a turn-based game. It supports researchers in exploring various RL algorithms with the badminton context by integrating state-of-the-art tactical-forecasting models and real badminton game records. The Badminton Benchmarks are proposed with multiple widely adopted RL algorithms to benchmark the performance of simulating matches against real players. To advance novel algorithms and developments in badminton analytics, we make our environment open-source, enabling researchers to simulate more complex badminton sports scenarios based on this foundation. Our code is available at https://github.com/wywyWang/CoachAI-Projects/tree/main/CoachAI%20Badminton%20Environment. \ No newline at end of file diff --git a/data/2024/aaai/The CoachAI Badminton Environment: Bridging the Gap between a Reinforcement Learning Environment and Real-World Badminton Games b/data/2024/aaai/The CoachAI Badminton Environment: Bridging the Gap between a Reinforcement Learning Environment and Real-World Badminton Games new file mode 100644 index 0000000000..26c241b848 --- /dev/null +++ b/data/2024/aaai/The CoachAI Badminton Environment: Bridging the Gap between a Reinforcement Learning Environment and Real-World Badminton Games @@ -0,0 +1 @@ +We present the CoachAI Badminton Environment, a reinforcement learning (RL) environment tailored for AI-driven sports analytics. In contrast to traditional environments using rule-based opponents or simplistic physics-based randomness, our environment integrates authentic opponent AIs and realistic randomness derived from real-world matches data to bridge the performance gap encountered in real-game deployments. This novel feature enables RL agents to seamlessly adapt to genuine scenarios. The CoachAI Badminton Environment empowers researchers to validate strategies in intricate real-world settings, offering: i) Realistic opponent simulation for RL training; ii) Visualizations for evaluation; and iii) Performance benchmarks for assessing agent capabilities. By bridging the RL environment with actual badminton games, our environment is able to advance the discovery of winning strategies for players. Our code is available at https://github.com/wywyWang/CoachAI-Projects/tree/main/Strategic%20Environment. 
\ No newline at end of file diff --git a/data/2024/aaai/The Complexity of Computing Robust Mediated Equilibria in Ordinal Games b/data/2024/aaai/The Complexity of Computing Robust Mediated Equilibria in Ordinal Games new file mode 100644 index 0000000000..2e5abc0e8f --- /dev/null +++ b/data/2024/aaai/The Complexity of Computing Robust Mediated Equilibria in Ordinal Games @@ -0,0 +1,17 @@ +Usually, to apply game-theoretic methods, we must specify utilities +precisely, and we run the risk that the solutions we compute are not +robust to errors in this specification. Ordinal games provide an +attractive alternative: they require specifying only which outcomes +are preferred to which other ones. Unfortunately, they provide little +guidance for how to play unless there are pure Nash equilibria; +evaluating mixed strategies appears to fundamentally require cardinal +utilities. + +In this paper, we observe that we can in fact make good use of mixed +strategies in ordinal games if we consider settings that allow for +folk theorems. These allow us to find equilibria that are robust, in +the sense that they remain equilibria no matter which cardinal +utilities are the correct ones -- as long as they are consistent with +the specified ordinal preferences. We analyze this concept and study +the computational complexity of finding such equilibria in a range of +settings. \ No newline at end of file diff --git a/data/2024/aaai/The Complexity of Fair Division of Indivisible Items with Externalities b/data/2024/aaai/The Complexity of Fair Division of Indivisible Items with Externalities new file mode 100644 index 0000000000..e230f02ab8 --- /dev/null +++ b/data/2024/aaai/The Complexity of Fair Division of Indivisible Items with Externalities @@ -0,0 +1,2 @@ +We study the computational complexity of fairly allocating a set of indivisible items under externalities. In this recently-proposed setting, in addition to the utility the agent gets from their bundle, they also receive utility from items allocated to other agents. +We focus on the extended definitions of envy-freeness up to one item (EF1) and of envy-freeness up to any item (EFX), and we provide the landscape of their complexity for several different scenarios. We prove that it is NP-complete to decide whether there exists an EFX allocation, even when there are only three agents, or even when there are only six different values for the items. We complement these negative results by showing that when both the number of agents and the number of different values for items are bounded by a parameter the problem becomes fixed-parameter tractable. Furthermore, we prove that two-valued and binary-valued instances are equivalent and that EFX and EF1 allocations coincide for this class of instances. Finally, motivated from real-life scenarios, we focus on a class of structured valuation functions, which we term agent/item-correlated. We prove their equivalence to the "standard" setting without externalities. Therefore, all previous results for EF1 and EFX apply immediately for these valuations. 
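For concreteness, the following minimal sketch checks the standard EF1 condition for additive valuations without externalities; it is a generic illustration of the fairness notion, not the extended externality-aware EF1/EFX definitions studied in the paper.

```python
def is_ef1(valuations, allocation):
    """valuations[i][g]: agent i's value for item g; allocation[i]: set of items held by agent i.
    Returns True iff every agent's envy toward every other agent vanishes after
    removing some single item from the envied agent's bundle (additive EF1)."""
    agents = range(len(valuations))
    for i in agents:
        own = sum(valuations[i][g] for g in allocation[i])
        for j in agents:
            if i == j or not allocation[j]:
                continue
            other = sum(valuations[i][g] for g in allocation[j])
            best_removal = max(valuations[i][g] for g in allocation[j])
            if own < other - best_removal:
                return False
    return True

# Two agents, three items: allocation {0, 2} / {1} is EF1 under these toy valuations.
print(is_ef1([[3, 5, 1], [2, 4, 6]], [{0, 2}, {1}]))
```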
\ No newline at end of file diff --git a/data/2024/aaai/The Complexity of Optimizing Atomic Congestion b/data/2024/aaai/The Complexity of Optimizing Atomic Congestion new file mode 100644 index 0000000000..b6fa0de86f --- /dev/null +++ b/data/2024/aaai/The Complexity of Optimizing Atomic Congestion @@ -0,0 +1 @@ +Atomic congestion games are a classic topic in network design, routing, and algorithmic game theory, and are capable of modeling congestion and flow optimization tasks in various application areas. While both the price of anarchy for such games as well as the computational complexity of computing their Nash equilibria are by now well-understood, the computational complexity of computing a system-optimal set of strategies - that is, a centrally planned routing that minimizes the average cost of agents - is severely understudied in the literature. We close this gap by identifying the exact boundaries of tractability for the problem through the lens of the parameterized complexity paradigm. After showing that the problem remains highly intractable even on extremely simple networks, we obtain a set of results which demonstrate that the structural parameters which control the computational (in)tractability of the problem are not vertex-separator based in nature (such as, e.g., treewidth), but rather based on edge separators. We conclude by extending our analysis towards the (even more challenging) min-max variant of the problem. \ No newline at end of file diff --git a/data/2024/aaai/The Defeat of the Winograd Schema Challenge (Abstract Reprint) b/data/2024/aaai/The Defeat of the Winograd Schema Challenge (Abstract Reprint) new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/The Evidence Contraction Issue in Deep Evidential Regression: Discussion and Solution b/data/2024/aaai/The Evidence Contraction Issue in Deep Evidential Regression: Discussion and Solution new file mode 100644 index 0000000000..68277053fe --- /dev/null +++ b/data/2024/aaai/The Evidence Contraction Issue in Deep Evidential Regression: Discussion and Solution @@ -0,0 +1 @@ +Deep Evidential Regression (DER) places a prior on the original Gaussian likelihood and treats learning as an evidence acquisition process to quantify uncertainty. For the validity of the evidence theory, DER requires specialized activation functions to ensure that the prior parameters remain non-negative. However, such constraints will trigger evidence contraction, causing sub-optimal performance. In this paper, we analyse DER theoretically, revealing the intrinsic limitations for sub-optimal performance: the non-negativity constraints on the Normal Inverse-Gamma (NIG) prior parameter trigger the evidence contraction under the specialized activation function, which hinders the optimization of DER performance. On this basis, we design a Non-saturating Uncertainty Regularization term, which effectively ensures that the performance is further optimized in the right direction. Experiments on real-world datasets show that our proposed approach improves the performance of DER while maintaining the ability to quantify uncertainty. 
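As an illustration of the non-negativity constraints discussed above, the following sketch shows a standard DER output head that maps network features to the four Normal Inverse-Gamma parameters via softplus activations. It follows the commonly used DER parameterization and is only a sketch; the paper's Non-saturating Uncertainty Regularization term is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Maps features to the NIG parameters (gamma, nu, alpha, beta) of evidential regression."""
    def __init__(self, in_features: int):
        super().__init__()
        self.out = nn.Linear(in_features, 4)  # raw (gamma, nu, alpha, beta)

    def forward(self, h):
        gamma, raw_nu, raw_alpha, raw_beta = self.out(h).chunk(4, dim=-1)
        nu = F.softplus(raw_nu)              # nu > 0
        alpha = F.softplus(raw_alpha) + 1.0  # alpha > 1
        beta = F.softplus(raw_beta)          # beta > 0
        return gamma, nu, alpha, beta

head = EvidentialHead(16)
gamma, nu, alpha, beta = head(torch.randn(2, 16))
# Under the usual NIG reading: aleatoric ~ beta / (alpha - 1), epistemic ~ beta / (nu * (alpha - 1)).
```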
\ No newline at end of file diff --git a/data/2024/aaai/The Expected Loss of Preconditioned Langevin Dynamics Reveals the Hessian Rank b/data/2024/aaai/The Expected Loss of Preconditioned Langevin Dynamics Reveals the Hessian Rank new file mode 100644 index 0000000000..7017fe7f81 --- /dev/null +++ b/data/2024/aaai/The Expected Loss of Preconditioned Langevin Dynamics Reveals the Hessian Rank @@ -0,0 +1 @@ +Langevin dynamics (LD) is widely used for sampling from distributions and for optimization. In this work, we derive a closed-form expression for the expected loss of preconditioned LD near stationary points of the objective function. We use the fact that in the vicinity of such points, LD reduces to an Ornstein–Uhlenbeck process, which is amenable to convenient mathematical treatment. Our analysis reveals that when the preconditioning matrix satisfies a particular relation with respect to the noise covariance, LD's expected loss becomes proportional to the rank of the objective's Hessian. We illustrate the applicability of this result in the context of neural networks, where the Hessian rank has been shown to capture the complexity of the predictor function but is usually computationally hard to probe. Finally, we use our analysis to compare SGD-like and Adam-like preconditioners and identify the regimes under which each of them leads to a lower expected loss. \ No newline at end of file diff --git a/data/2024/aaai/The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning b/data/2024/aaai/The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning new file mode 100644 index 0000000000..4c8932cc6c --- /dev/null +++ b/data/2024/aaai/The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning @@ -0,0 +1 @@ +The advent of powerful transformer-based discriminative language models and, more recently, generative GPT-family models has led to notable advancements in a range of natural language processing (NLP) tasks. One such task is commonsense reasoning, where performance is usually evaluated through multiple-choice question-answering benchmarks. To date, many such benchmarks have been proposed, and `leaderboards' tracking state-of-the-art performance on those benchmarks suggest that transformer-based models are approaching human-like performance. However, due to documented problems such as hallucination and bias, the research focus is shifting from merely quantifying accuracy on the task to an in-depth, context-sensitive probing of LLMs' generalization and robustness. To gain deeper insight into diagnosing these models' performance in commonsense reasoning scenarios, this thesis addresses three main studies: the generalization ability of transformer-based language models on commonsense reasoning, the trend in the confidence distribution of these language models when confronted with ambiguous inference tasks, and a proposed risk-centric evaluation framework for both discriminative and generative language models. 
\ No newline at end of file diff --git a/data/2024/aaai/The Inter-batch Diversity of Samples in Experience Replay for Continual Learning b/data/2024/aaai/The Inter-batch Diversity of Samples in Experience Replay for Continual Learning new file mode 100644 index 0000000000..6964f6b357 --- /dev/null +++ b/data/2024/aaai/The Inter-batch Diversity of Samples in Experience Replay for Continual Learning @@ -0,0 +1 @@ +In a Continual Learning setting, models are trained on data with occasional distribution shifts, resulting in forgetting the information learned before each shift. Experience Replay (ER) addresses this challenge by retaining part of the old training samples and replaying them alongside current data, improving the model's understanding of the overall distribution in training batches. The crucial factor in ER performance is the diversity of samples within batches. The impact of sample diversity across a sequence of batches is investigated, introducing a new metric and an associated approach to assess and leverage this diversity. This exploration opens up significant potential for future work, as various strategies can be devised to ensure inter-batch diversity. Achieving optimal results may involve striking a balance between this novel metric and other inherent properties of a batch or sequence. \ No newline at end of file diff --git a/data/2024/aaai/The Irrelevance of Influencers: Information Diffusion with Re-Activation and Immunity Lasts Exponentially Long on Social Network Models b/data/2024/aaai/The Irrelevance of Influencers: Information Diffusion with Re-Activation and Immunity Lasts Exponentially Long on Social Network Models new file mode 100644 index 0000000000..efc16a2beb --- /dev/null +++ b/data/2024/aaai/The Irrelevance of Influencers: Information Diffusion with Re-Activation and Immunity Lasts Exponentially Long on Social Network Models @@ -0,0 +1,3 @@ +Information diffusion models on networks are at the forefront of AI research. The dynamics of such models typically follow stochastic models from epidemiology, used to model not only infections but various phenomena, including the behavior of computer viruses and viral marketing campaigns. A core question in this setting is how to efficiently detect the most influential vertices in the host graph such that the infection survives the longest. In processes that incorporate re-infection of the vertices, such as the SIS process, theoretical studies identify parameter thresholds where the survival time of the process rapidly transitions from logarithmic to super-polynomial. These results contradict the intuition that the starting configuration is relevant, since the process will always either die out fast or survive almost indefinitely. A shortcoming of these results is that models incorporating short-term immunity (or creative advertisement fatigue) have not been subjected to such a theoretical analysis so far. + +We reduce this gap in the literature by studying the SIRS process, a more realistic model, which besides re-infection additionally incorporates short-term immunity. On complex network models, we identify parameter regimes for which the process survives exponentially long, and we get a tight threshold for random graphs. Underlying these results is our main technical contribution, showing a threshold behavior for the survival time of the SIRS process on graphs with large expander subgraphs, such as social network models. 
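To make the SIRS dynamics referenced above concrete, the following is a minimal discrete-time simulation sketch on an arbitrary graph. The infection, recovery, and immunity-loss probabilities and the toy graph are placeholders, and the paper's analysis concerns survival times on specific network models rather than this simulation.

```python
import random

def sirs_step(graph, state, beta=0.3, gamma=0.2, xi=0.05):
    """graph: dict node -> list of neighbors; state: dict node -> 'S' | 'I' | 'R'."""
    new_state = dict(state)
    for v, s in state.items():
        if s == 'S':
            # Each infected neighbor independently infects v with probability beta.
            if any(state[u] == 'I' and random.random() < beta for u in graph[v]):
                new_state[v] = 'I'
        elif s == 'I' and random.random() < gamma:
            new_state[v] = 'R'   # recover, gaining short-term immunity
        elif s == 'R' and random.random() < xi:
            new_state[v] = 'S'   # immunity wears off, allowing re-infection
    return new_state

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
state = {0: 'I', 1: 'S', 2: 'S', 3: 'S'}
for _ in range(50):
    state = sirs_step(graph, state)
print(state)
```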
\ No newline at end of file diff --git a/data/2024/aaai/The Language Model Can Have the Personality: Joint Learning for Personality Enhanced Language Model (Student Abstract) b/data/2024/aaai/The Language Model Can Have the Personality: Joint Learning for Personality Enhanced Language Model (Student Abstract) new file mode 100644 index 0000000000..dcd7dd936d --- /dev/null +++ b/data/2024/aaai/The Language Model Can Have the Personality: Joint Learning for Personality Enhanced Language Model (Student Abstract) @@ -0,0 +1,2 @@ +With the introduction of large language models, chatbots are becoming more conversational to communicate effectively and capable of handling increasingly complex tasks. To make a chatbot more relatable and engaging, we propose a new language model idea that maps the human-like personality. +In this paper, we propose a systematic Personality-Enhanced Language Model (PELM) approach by using a joint learning mechanism of personality classification and language generation tasks. The proposed PELM leverages a dataset of defined personality typology, Myers-Briggs Type Indicator, and produces a Personality-Enhanced Language Model by using a joint learning and cross-teaching structure consisting of a classification and language modelling to incorporate personalities via both distinctive types and textual information. The results show that PELM can generate better personality-based outputs than baseline models. \ No newline at end of file diff --git a/data/2024/aaai/The Logic of Doxastic Strategies b/data/2024/aaai/The Logic of Doxastic Strategies new file mode 100644 index 0000000000..8ce2c7c15a --- /dev/null +++ b/data/2024/aaai/The Logic of Doxastic Strategies @@ -0,0 +1,3 @@ +In many real-world situations, there is often not enough information to know that a certain strategy will succeed in achieving the goal, but there is a good reason to believe that it will. The paper introduces the term "doxastic" for such strategies. + +The main technical contribution is a sound and complete logical system that describes the interplay between doxastic strategy and belief modalities. \ No newline at end of file diff --git a/data/2024/aaai/The Moderating Effect of Instant Runoff Voting b/data/2024/aaai/The Moderating Effect of Instant Runoff Voting new file mode 100644 index 0000000000..4149271843 --- /dev/null +++ b/data/2024/aaai/The Moderating Effect of Instant Runoff Voting @@ -0,0 +1 @@ +Instant runoff voting (IRV) has recently gained popularity as an alternative to plurality voting for political elections, with advocates claiming a range of advantages, including that it produces more moderate winners than plurality and could thus help address polarization. However, there is little theoretical backing for this claim, with existing evidence focused on case studies and simulations. In this work, we prove that IRV has a moderating effect relative to plurality voting in a precise sense, developed in a 1-dimensional Euclidean model of voter preferences. We develop a theory of exclusion zones, derived from properties of the voter distribution, which serve to show how moderate and extreme candidates interact during IRV vote tabulation. The theory allows us to prove that if voters are symmetrically distributed and not too concentrated at the extremes, IRV cannot elect an extreme candidate over a moderate. In contrast, we show plurality can and validate our results computationally. 
Our methods provide new frameworks for the analysis of voting systems, deriving exact winner distributions geometrically and establishing a connection between plurality voting and stick-breaking processes. \ No newline at end of file diff --git a/data/2024/aaai/The Promise of Serverless Computing within Peer-to-Peer Architectures for Distributed ML Training b/data/2024/aaai/The Promise of Serverless Computing within Peer-to-Peer Architectures for Distributed ML Training new file mode 100644 index 0000000000..dc6f6b0d36 --- /dev/null +++ b/data/2024/aaai/The Promise of Serverless Computing within Peer-to-Peer Architectures for Distributed ML Training @@ -0,0 +1 @@ +My thesis focuses on the integration of serverless computing with Peer-to-Peer (P2P) architectures in distributed Machine Learning (ML). This research aims to harness the decentralized, resilient nature of P2P systems, combined with the scalability and automation of serverless platforms. We explore using databases not just for communication but also for in-database model updates and gradient averaging, addressing the challenges of statelessness in serverless environments. \ No newline at end of file diff --git a/data/2024/aaai/The Role of Over-Parameterization in Machine Learning - the Good, the Bad, the Ugly b/data/2024/aaai/The Role of Over-Parameterization in Machine Learning - the Good, the Bad, the Ugly new file mode 100644 index 0000000000..2f3053bdc6 --- /dev/null +++ b/data/2024/aaai/The Role of Over-Parameterization in Machine Learning - the Good, the Bad, the Ugly @@ -0,0 +1,3 @@ +The conventional wisdom of simple models in machine learning misses the bigger picture, especially for over-parameterized neural networks (NNs), where the number of parameters is much larger than the number of training samples. Our goal is to explore the mystery behind over-parameterized models from a theoretical side. + +In this talk, I will discuss the role of over-parameterization in neural networks, to theoretically understand why they can perform well. First, I will discuss the role of over-parameterization in neural networks from the perspective of models, to theoretically understand why they can generalize well. Second, the effects of over-parameterization on robustness and privacy are discussed. Third, I will talk about over-parameterization from kernel methods to neural networks in a function space theory view. Besides, moving from classical statistical learning to sequential decision making, I will talk about the benefits of over-parameterization in explaining why deep reinforcement learning works well with function approximation. Potential future directions on the theory of over-parameterized ML will also be discussed. \ No newline at end of file diff --git a/data/2024/aaai/The Virtual Driving Instructor: Multi-Agent System Collaborating via Knowledge Graph for Scalable Driver Education b/data/2024/aaai/The Virtual Driving Instructor: Multi-Agent System Collaborating via Knowledge Graph for Scalable Driver Education new file mode 100644 index 0000000000..85f20c03ba --- /dev/null +++ b/data/2024/aaai/The Virtual Driving Instructor: Multi-Agent System Collaborating via Knowledge Graph for Scalable Driver Education @@ -0,0 +1,6 @@ +This paper introduces the design, development, and deployment of a Virtual Driving Instructor (VDI) for enhanced driver education. +The VDI provides personalized, real-time feedback to students in a driving simulator, addressing some of the limitations of traditional driver instruction. 
+Employing a hybrid AI system, the VDI combines rule-based agents, learning-based agents, knowledge graphs, and Bayesian networks to assess and monitor student performance in a comprehensive manner. +Implemented in multiple simulators at a driving school in Norway, the system aims to leverage AI and driving simulation to improve both the learning experience and the efficiency of instruction. +Initial feedback from students has been largely positive, highlighting the effectiveness of this integration while also pointing to areas for further improvement. +This work marks a significant stride in infusing technology into driver education, offering a scalable and efficient approach to instruction. \ No newline at end of file diff --git a/data/2024/aaai/Theoretical Aspects of Generating Instances with Unique Solutions: Pre-assignment Models for Unique Vertex Cover b/data/2024/aaai/Theoretical Aspects of Generating Instances with Unique Solutions: Pre-assignment Models for Unique Vertex Cover new file mode 100644 index 0000000000..8a01163ff6 --- /dev/null +++ b/data/2024/aaai/Theoretical Aspects of Generating Instances with Unique Solutions: Pre-assignment Models for Unique Vertex Cover @@ -0,0 +1,3 @@ +The uniqueness of an optimal solution to a combinatorial optimization problem attracts the attention of researchers in many fields because it has a wide range of applications, it is related to important classes in computational complexity, and the existence of only one solution is often critical for algorithm designs in theory. However, to the best of the authors' knowledge, there is no major benchmark set consisting only of instances with unique solutions, and no algorithm for generating instances with unique solutions is known; a systematic approach to obtaining a problem instance guaranteed to have a unique solution would therefore be helpful. A possible approach is as follows: given a problem instance, we specify a small part of a solution in advance so that only one optimal solution meets the specification. This paper formulates such a "pre-assignment" approach for the vertex cover problem, as a typical combinatorial optimization problem, and discusses its computational complexity. +First, we show that the problem is Σ^P_2-complete in general, while the problem becomes NP-complete when an input graph is bipartite. +We then present an O(2.1996^n)-time algorithm for general graphs and an O(1.9181^n)-time algorithm for bipartite graphs, where n is the number of vertices. The latter is based on an FPT algorithm with O*(3.6791^τ) time for vertex cover number τ. Furthermore, we show that the problem for trees can be solved in O(1.4143^n) time. \ No newline at end of file diff --git a/data/2024/aaai/Theoretical and Empirical Analysis of Cost-Function Merging for Implicit Hitting Set WCSP Solving b/data/2024/aaai/Theoretical and Empirical Analysis of Cost-Function Merging for Implicit Hitting Set WCSP Solving new file mode 100644 index 0000000000..ba2cc44307 --- /dev/null +++ b/data/2024/aaai/Theoretical and Empirical Analysis of Cost-Function Merging for Implicit Hitting Set WCSP Solving @@ -0,0 +1,2 @@ +The Implicit Hitting Set (HS) approach has proven very effective for MaxSAT solving. However, only preliminary promising results have been obtained for the very similar Weighted CSP framework. In this paper we contribute towards both a better theoretical understanding of the HS approach and more effective HS-based solvers for WCSP. First, we bound the minimum number of iterations of HS thanks to what we call distinguished cores.
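(Editor's illustration, not part of the paper: a minimal, self-contained sketch of the generic implicit hitting set loop, shown here on a toy weighted MaxSAT instance rather than a WCSP, with brute-force routines standing in for the SAT/CSP oracle and the optimal hitting-set solver used in practice. The cost-function merging studied in the paper is not shown, and the example instance and all names are illustrative.)

```python
from itertools import combinations, product

def implicit_hitting_set_maxsat(n_vars, soft_clauses, weights):
    """Toy IHS loop for weighted MaxSAT (all clauses soft). A clause is a tuple of
    literals; literal +i / -i means variable i is True / False; soft_clauses[k] may
    be left unsatisfied at cost weights[k]."""
    def satisfied(clause, assignment):
        return any((lit > 0) == assignment[abs(lit) - 1] for lit in clause)

    def min_cost_hitting_set(cores):
        # Brute force: cheapest subset of soft-clause indices intersecting every core.
        universe = sorted(set().union(*cores)) if cores else []
        best, best_cost = set(), float("inf") if cores else 0
        if not cores:
            return best, 0
        for r in range(len(universe) + 1):
            for subset in combinations(universe, r):
                s, cost = set(subset), sum(weights[i] for i in subset)
                if cost < best_cost and all(s & core for core in cores):
                    best, best_cost = s, cost
        return best, best_cost

    def find_core(relaxed):
        # Oracle: can all soft clauses outside `relaxed` be satisfied simultaneously?
        active = [k for k in range(len(soft_clauses)) if k not in relaxed]
        for bits in product([False, True], repeat=n_vars):
            if all(satisfied(soft_clauses[k], bits) for k in active):
                return bits, None
        return None, set(active)  # unsatisfiable subset: a (non-minimal) core

    cores = []
    while True:
        hs, cost = min_cost_hitting_set(cores)   # lower bound from collected cores
        assignment, core = find_core(hs)
        if core is None:
            return assignment, cost              # lower bound attained: optimal
        cores.append(core)                       # otherwise add the new core and iterate

if __name__ == "__main__":
    # x1..x3; the conflicting soft clauses force a minimum violation cost of 3.
    clauses = [(1,), (-1,), (2, 3), (-2,), (-3,)]
    print(implicit_hitting_set_maxsat(3, clauses, weights=[3, 2, 1, 1, 1]))
```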
Then, we show a source of inefficiency by +introducing two simple problems where HS is unfeasible. Next, we propose two reformulation methods that merge cost-functions to overcome the problem. We provide a theoretical analysis that quantifies the magnitude of the improvement of each method with respect to the number of iterations of the algorithm. In particular, we show that the reformulations can bring an exponential number of iterations down to a constant number in our working examples. Finally, we complement our theoretical analysis with two sets of experiments. First, we show that our results are aligned with real executions. Second, and most importantly, we conduct experiments on typical benchmark problems and show that cost-function merging may be heuristically applied and it may accelerate HS algorithms by several orders of magnitude. In some cases, it even outperforms state-of-the-art solvers. \ No newline at end of file diff --git a/data/2024/aaai/Thesis Summary: Operationalizing User-Inclusive Transparency in Artificial Intelligence Systems b/data/2024/aaai/Thesis Summary: Operationalizing User-Inclusive Transparency in Artificial Intelligence Systems new file mode 100644 index 0000000000..46d6427714 --- /dev/null +++ b/data/2024/aaai/Thesis Summary: Operationalizing User-Inclusive Transparency in Artificial Intelligence Systems @@ -0,0 +1 @@ +Artificial intelligence system architects can increase user trust by designing systems that are inherently transparent. We propose the idea of representing an AI system as an amalgamation of the AI Model (algorithms), data (input and output, including outcomes), and the user interface with visual interpretations (e.g. graphs, Venn diagrams). By designing human controls and feedback mechanisms for AI systems that allow users to exert control over them we can integrate transparency into existing user interfaces. Our plan is to design prototypes of transparent user interfaces for AI systems using well-known usability principles. By conducting surveys we will study their impact to see if these principles help the user to work with the AI system with confidence and if the user perceives the system to be adequately transparent. \ No newline at end of file diff --git a/data/2024/aaai/Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit b/data/2024/aaai/Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit new file mode 100644 index 0000000000..a581b30e24 --- /dev/null +++ b/data/2024/aaai/Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit @@ -0,0 +1 @@ +We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given stochastic arms, and the reward of each arm follows an unknown distribution. In each time step, a player pulls a single arm and observes its reward. The player's goal is to identify the optimal action from a finite-sized real-valued action set with as few arm pulls as possible. Previous methods in the R-CPE-MAB require enumerating all of the feasible actions of the combinatorial optimization problem one is considering. In general, since the size of the action set grows exponentially large with respect to the number of arms, this is almost practically impossible when the number of arms is large. 
We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large with respect to the number of arms. We also introduce a novel problem-dependent sample complexity lower bound of the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor. \ No newline at end of file diff --git a/data/2024/aaai/Three Heads Are Better than One: Complementary Experts for Long-Tailed Semi-supervised Learning b/data/2024/aaai/Three Heads Are Better than One: Complementary Experts for Long-Tailed Semi-supervised Learning new file mode 100644 index 0000000000..9e698bebe9 --- /dev/null +++ b/data/2024/aaai/Three Heads Are Better than One: Complementary Experts for Long-Tailed Semi-supervised Learning @@ -0,0 +1 @@ +We address the challenging problem of Long-Tailed Semi-Supervised Learning (LTSSL) where labeled data exhibit imbalanced class distribution and unlabeled data follow an unknown distribution. Unlike in balanced SSL, the generated pseudo-labels are skewed towards head classes, intensifying the training bias. Such a phenomenon is even amplified as more unlabeled data will be mislabeled as head classes when the class distribution of labeled and unlabeled datasets are mismatched. To solve this problem, we propose a novel method named ComPlementary Experts (CPE). Specifically, we train multiple experts to model various class distributions, each of them yielding high-quality pseudo-labels within one form of class distribution. Besides, we introduce Classwise Batch Normalization for CPE to avoid performance degradation caused by feature distribution mismatch between head and non-head classes. CPE achieves state-of-the-art performances on CIFAR-10-LT, CIFAR-100-LT, and STL-10-LT dataset benchmarks. For instance, on CIFAR-10-LT, CPE improves test accuracy by over >2.22% compared to baselines. Code is available at https://github.com/machengcheng2016/CPE-LTSSL. \ No newline at end of file diff --git a/data/2024/aaai/Three Heads Are Better than One: Improving Cross-Domain NER with Progressive Decomposed Network b/data/2024/aaai/Three Heads Are Better than One: Improving Cross-Domain NER with Progressive Decomposed Network new file mode 100644 index 0000000000..896312c5b6 --- /dev/null +++ b/data/2024/aaai/Three Heads Are Better than One: Improving Cross-Domain NER with Progressive Decomposed Network @@ -0,0 +1 @@ +Cross-domain named entity recognition (NER) tasks encourage NER models to transfer knowledge from data-rich source domains to sparsely labeled target domains. Previous works adopt the paradigms of pre-training on the source domain followed by fine-tuning on the target domain. However, these works ignore that general labeled NER source domain data can be easily retrieved in the real world, and soliciting more source domains could bring more benefits. Unfortunately, previous paradigms cannot efficiently transfer knowledge from multiple source domains. In this work, to transfer multiple source domains' knowledge, we decouple the NER task into the pipeline tasks of mention detection and entity typing, where the mention detection unifies the training object across domains, thus providing the entity typing with higher-quality entity mentions. 
Additionally, we request multiple general source domain models to suggest the potential named entities for sentences in the target domain explicitly, and transfer their knowledge to the target domain models through the knowledge progressive networks implicitly. Furthermore, we propose two methods to analyze in which source domain knowledge transfer occurs, thus helping us judge which source domain brings the greatest benefit. In our experiment, we develop a Chinese cross-domain NER dataset. Our model improved the F1 score by an average of 12.50% across 8 Chinese and English datasets compared to models without source domain data. \ No newline at end of file diff --git a/data/2024/aaai/Threshold-Based Responsive Simulated Annealing for Directed Feedback Vertex Set Problem b/data/2024/aaai/Threshold-Based Responsive Simulated Annealing for Directed Feedback Vertex Set Problem new file mode 100644 index 0000000000..1141c54bd6 --- /dev/null +++ b/data/2024/aaai/Threshold-Based Responsive Simulated Annealing for Directed Feedback Vertex Set Problem @@ -0,0 +1 @@ +As a classical NP-hard problem and the topic of the PACE 2022 competition, the directed feedback vertex set problem (DFVSP) aims to find a minimum subset of vertices such that, when vertices in the subset and all their adjacent edges are removed from the directed graph, the remainder graph is acyclic. In this paper, we propose a threshold-based responsive simulated annealing algorithm called TRSA for solving DFVSP. First, we simplify the problem instances with two new reduction rules proposed in this paper and eight reduction rules from the literature. Then, based on a new solution representation, TRSA solves DFVSP with a fast local search procedure featured by a swap-based neighborhood structure and three neighborhood acceleration strategies. Finally, all these strategies are incorporated into a threshold-based responsive simulated annealing framework. Computational experiments on 140 benchmark instances show that TRSA is highly competitive compared to the state-of-the-art methods. Specifically, TRSA can improve the best known results for 53 instances, while matching the best known results for 79 ones. Furthermore, some important features of TRSA are analyzed to identify its success factors. \ No newline at end of file diff --git a/data/2024/aaai/TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training b/data/2024/aaai/TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training new file mode 100644 index 0000000000..82075c485e --- /dev/null +++ b/data/2024/aaai/TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training @@ -0,0 +1 @@ +Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noises in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. 
The experimental results demonstrate that TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios. Our code is available on https://github.com/chaoyajiang/TiMiX/tree/main. \ No newline at end of file diff --git a/data/2024/aaai/Tiered Coalition Formation Game Stability and Simulation b/data/2024/aaai/Tiered Coalition Formation Game Stability and Simulation new file mode 100644 index 0000000000..6e1fbffe9c --- /dev/null +++ b/data/2024/aaai/Tiered Coalition Formation Game Stability and Simulation @@ -0,0 +1 @@ +Expanding on a 2017 paper by Siler that introduced tiered coalition formation games, I have introduced a variant game and examined the stabilizability of both the original game and its variant. My thesis will contain further theoretical stability findings and the results and interpretation of a simulation based upon real data from video game matchups. \ No newline at end of file diff --git a/data/2024/aaai/Time-Aware Knowledge Representations of Dynamic Objects with Multidimensional Persistence b/data/2024/aaai/Time-Aware Knowledge Representations of Dynamic Objects with Multidimensional Persistence new file mode 100644 index 0000000000..b1942a0d48 --- /dev/null +++ b/data/2024/aaai/Time-Aware Knowledge Representations of Dynamic Objects with Multidimensional Persistence @@ -0,0 +1,3 @@ +Learning time-evolving objects such as multivariate time series and dynamic networks requires the development of novel knowledge representation mechanisms and neural network architectures, which allow for capturing implicit time-dependent information contained in the data. Such information is typically not directly observed but plays a key role in the learning task performance. In turn, lack of time dimension in knowledge encoding mechanisms for time-dependent data leads to frequent model updates, poor learning performance, and, as a result, subpar decision-making. Here we propose a new approach to a time-aware knowledge representation mechanism that notably focuses on implicit time-dependent topological information along multiple geometric dimensions. In particular, we propose a new approach, named Temporal MultiPersistence (TMP), which produces multidimensional topological fingerprints of the data by using the existing single parameter topological summaries. The main idea behind TMP is to merge the two newest directions in topological representation learning, that is, multi-persistence which simultaneously describes data shape evolution along multiple key parameters, and zigzag persistence to enable us to extract the most salient data shape information over time. + +We derive theoretical guarantees of TMP vectorizations and show its utility, in application to forecasting on benchmark traffic flow, Ethereum blockchain, and electrocardiogram datasets, demonstrating the competitive performance, especially, in scenarios of limited data records. In addition, our TMP method improves the computational efficiency of the state-of-the-art multipersistence summaries up to 59.5 times. 
\ No newline at end of file diff --git a/data/2024/aaai/To Know the Causes of Things: Text Mining for Causal Relations b/data/2024/aaai/To Know the Causes of Things: Text Mining for Causal Relations new file mode 100644 index 0000000000..92515b0a07 --- /dev/null +++ b/data/2024/aaai/To Know the Causes of Things: Text Mining for Causal Relations @@ -0,0 +1 @@ +Causality expresses the relation between two arguments, one of which represents the cause and the other the effect (or consequence). Causal text mining refers to the extraction and usage of causal information from text. Given an input sequence, we are interested to know if and where causal information occurs. My research is focused on the end-to-end challenges of causal text mining. This involves extracting, representing, and applying causal knowledge from unstructured text. The corresponding research questions are: (1) How to extract causal information from unstructured text effectively? (2) How to represent extracted causal relationships in a graph that is interpretable and useful for some application? (3) How can we capitalize on extracted causal knowledge for downstream tasks? What tasks or fields will benefit from such knowledge? In this paper, I outline past and on-going works, and highlight future research challenges. \ No newline at end of file diff --git a/data/2024/aaai/Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition b/data/2024/aaai/Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition new file mode 100644 index 0000000000..939f9f3594 --- /dev/null +++ b/data/2024/aaai/Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition @@ -0,0 +1 @@ +Multimodal intent recognition aims to leverage diverse modalities such as expressions, body movements and tone of speech to comprehend user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and own limitations in effectively learning semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address the above challenges. To establish an optimal multimodal semantic environment for text modality, we develop a modality-aware prompting module (MAP), which effectively aligns and fuses features from text, video and audio modalities with similarity-based modality alignment and cross-modality attention mechanism. Based on the modality-aware prompt and ground truth labels, the proposed token-level contrastive learning framework (TCL) constructs augmented samples and employs NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal textual semantic insights derived from intent labels to guide the learning processes of other modalities in return. Extensive experiments show that our method achieves remarkable improvements compared to state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over the handcrafted prompt, which holds substantial significance for multimodal prompt learning. The codes are released at https://github.com/thuiar/TCL-MAP. 
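(Editor's illustration, not part of the paper: NT-Xent is the standard normalized temperature-scaled cross-entropy loss popularized by SimCLR, which the abstract above says is applied on the label token. The sketch below shows a generic batch-level NT-Xent between embeddings of original and augmented samples; the token-level construction, modality-aware prompts, and label-token specifics of TCL-MAP are not reproduced, and the tensor shapes and temperature are illustrative.)

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent loss: z1[i] and z2[i] are embeddings of two views of the same sample;
    every other embedding in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # (2N, d)
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))           # exclude self-similarity
    # The positive of sample i is i + n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

if __name__ == "__main__":
    torch.manual_seed(0)
    anchor = torch.randn(8, 64)                          # e.g., label-token embeddings
    augmented = anchor + 0.05 * torch.randn(8, 64)       # embeddings of augmented samples
    print(nt_xent_loss(anchor, augmented).item())
```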
\ No newline at end of file diff --git a/data/2024/aaai/Tools Identification By On-Board Adaptation of Vision-and-Language Models b/data/2024/aaai/Tools Identification By On-Board Adaptation of Vision-and-Language Models new file mode 100644 index 0000000000..6dd1233ed7 --- /dev/null +++ b/data/2024/aaai/Tools Identification By On-Board Adaptation of Vision-and-Language Models @@ -0,0 +1 @@ +A robotic workshop assistant has been a long-standing grand challenge for robotics, speech, computer vision, and artificial intelligence (AI) research. We revisit the goal of visual identification of tools from human queries in the current era of Large Vision-and-Language models (like GPT-4). We find that current off-the-shelf models (that are trained on internet images) are unable to overcome the domain shift and unable to identify small, obscure tools in cluttered environments. Furthermore, these models are unable to match tools to their intended purpose or affordances. We present a novel system for online domain adaptation that can be run directly on a small on-board processor. The system uses Hyperdimensional Computing (HD), a fast and efficient neuromorphic method. We adapted CLIP to work with explicit ("I need the hammer") and implicit purpose-driven queries ("Drive these nails"), and even with depth images as input. This demo allows the user to try out various real tools and interact via free-form audio. \ No newline at end of file diff --git a/data/2024/aaai/Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation b/data/2024/aaai/Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation new file mode 100644 index 0000000000..bffacc6381 --- /dev/null +++ b/data/2024/aaai/Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation @@ -0,0 +1,5 @@ +This paper introduces a novel approach for topic modeling utilizing latent codebooks from Vector-Quantized Variational Auto-Encoder~(VQ-VAE), discretely encapsulating the rich information of the pre-trained embeddings such as the pre-trained language model. +From the novel interpretation of the latent codebooks and embeddings as conceptual bag-of-words, we propose a new generative topic model called Topic-VQ-VAE~(TVQ-VAE) which inversely generates the original documents related to the respective latent codebook. +The TVQ-VAE can visualize the topics with various generative distributions including the traditional BoW distribution and the autoregressive image generation. +Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures the topic context which reveals the underlying structures of the dataset and supports flexible forms of document generation. +Official implementation of the proposed TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE. \ No newline at end of file diff --git a/data/2024/aaai/TopoGCL: Topological Graph Contrastive Learning b/data/2024/aaai/TopoGCL: Topological Graph Contrastive Learning new file mode 100644 index 0000000000..a884cfc2f3 --- /dev/null +++ b/data/2024/aaai/TopoGCL: Topological Graph Contrastive Learning @@ -0,0 +1 @@ +Graph contrastive learning (GCL) has recently emerged as a new concept which allows for capitalizing on the strengths of graph neural networks (GNNs) to learn rich representations in a wide variety of applications which involve abundant unlabeled information. 
However, existing GCL approaches largely tend to overlook the important latent information on higher-order graph substructures. We address this limitation by introducing the concepts of topological invariance and extended persistence on graphs to GCL. In particular, we propose a new contrastive mode which targets topological representations of the two augmented views from the same graph, yielded by extracting latent shape properties of the graph at multiple resolutions. Along with the extended topological layer, we introduce a new extended persistence summary, namely, extended persistence landscapes (EPL) and derive its theoretical stability guarantees. Our extensive numerical results on biological, chemical, and social interaction graphs show that the new Topological Graph Contrastive Learning (TopoGCL) model delivers significant performance gains in unsupervised graph classification for 8 out of 12 considered datasets and also exhibits robustness under noisy scenarios. \ No newline at end of file diff --git a/data/2024/aaai/Topological and Node Noise Filtering on 3D Meshes Using Graph Neural Networks (Student Abstract) b/data/2024/aaai/Topological and Node Noise Filtering on 3D Meshes Using Graph Neural Networks (Student Abstract) new file mode 100644 index 0000000000..990c732f75 --- /dev/null +++ b/data/2024/aaai/Topological and Node Noise Filtering on 3D Meshes Using Graph Neural Networks (Student Abstract) @@ -0,0 +1 @@ +Topological and node noise filtration are typically considered separately. Graph Neural Networks (GNN) are commonly used for node noise filtration, as they offer high efficiency and low exploitation costs. This paper explores the solution of joint node and topological noise filtration through the use of graph neural networks. Since treating a 3D mesh as a graph is challenging, an indicator function grid representation is employed as input for GNNs to perform the joint filtering. The resulting machine learning model is inspired by point cloud to mesh reconstruction algorithms and demonstrates low computational requirements during inference, producing successful results for smooth, watertight 3D models. \ No newline at end of file diff --git a/data/2024/aaai/Toward More Generalized Malicious URL Detection Models b/data/2024/aaai/Toward More Generalized Malicious URL Detection Models new file mode 100644 index 0000000000..ca4e0f1de9 --- /dev/null +++ b/data/2024/aaai/Toward More Generalized Malicious URL Detection Models @@ -0,0 +1 @@ +This paper reveals a data bias issue that can profoundly hinder the performance of machine learning models in malicious URL detection. We describe how such bias can be diagnosed using interpretable machine learning techniques and further argue that such biases naturally exist in the real world security data for training a classification model. To counteract these challenges, we propose a debiased training strategy that can be applied to most deep-learning based models to alleviate the negative effects of the biased features. The solution is based on the technique of adversarial training to train deep neural networks learning invariant embedding from biased data. Through extensive experimentation, we substantiate that our innovative strategy fosters superior generalization capabilities across both CNN-based and RNN-based detection models. 
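(Editor's illustration, not part of the paper: the abstract above describes adversarial training for bias-invariant embeddings but does not spell out the architecture; one common realization of that idea is a gradient-reversal adversary that tries to predict the biased attribute from the shared embedding. The sketch below uses purely illustrative feature dimensions, attribute names, and random toy data, and is not the authors' model.)

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DebiasedURLModel(nn.Module):
    """Main head predicts malicious vs. benign; the adversary head tries to predict a
    biased attribute (e.g., which source a URL came from) from the same embedding.
    Reversed gradients push the encoder toward bias-invariant embeddings."""
    def __init__(self, in_dim=32, hid=64, n_sources=3, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                     nn.Linear(hid, hid), nn.ReLU())
        self.classifier = nn.Linear(hid, 2)          # malicious / benign
        self.adversary = nn.Linear(hid, n_sources)   # predicts the biased attribute

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.adversary(GradReverse.apply(z, self.lam))

if __name__ == "__main__":
    torch.manual_seed(0)
    model = DebiasedURLModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    x = torch.randn(256, 32)                # toy URL feature vectors (illustrative)
    y = torch.randint(0, 2, (256,))         # malicious / benign labels
    b = torch.randint(0, 3, (256,))         # biased attribute (e.g., data source)
    for _ in range(5):
        logits, bias_logits = model(x)
        loss = ce(logits, y) + ce(bias_logits, b)   # adversary gradient reversed inside
        opt.zero_grad()
        loss.backward()
        opt.step()
    print("final loss:", loss.item())
```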
The findings presented in this work not only expose a latent issue in the field but also provide an actionable remedy, marking a significant step forward in the pursuit of more reliable and robust malicious URL detection. \ No newline at end of file diff --git a/data/2024/aaai/Toward Open-Set Human Object Interaction Detection b/data/2024/aaai/Toward Open-Set Human Object Interaction Detection new file mode 100644 index 0000000000..eab6227d80 --- /dev/null +++ b/data/2024/aaai/Toward Open-Set Human Object Interaction Detection @@ -0,0 +1 @@ +This work is oriented toward the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to in-domain ones which have seen improvements in zero-shot HOI detection. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We utilize a disentangled image-text contrastive learning metric for training and connect the bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model can benefit from the open-set object detector and the VLM to detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes. \ No newline at end of file diff --git a/data/2024/aaai/Toward Robustness in Multi-Label Classification: A Data Augmentation Strategy against Imbalance and Noise b/data/2024/aaai/Toward Robustness in Multi-Label Classification: A Data Augmentation Strategy against Imbalance and Noise new file mode 100644 index 0000000000..8670d1967c --- /dev/null +++ b/data/2024/aaai/Toward Robustness in Multi-Label Classification: A Data Augmentation Strategy against Imbalance and Noise @@ -0,0 +1 @@ +Multi-label classification poses challenges due to imbalanced and noisy labels in training data. In this paper, we propose a unified data augmentation method, named BalanceMix, to address these challenges. Our approach includes two samplers for imbalanced labels, generating minority-augmented instances with high diversity. It also refines multi-labels at the label-wise granularity, categorizing noisy labels as clean, re-labeled, or ambiguous for robust optimization. Extensive experiments on three benchmark datasets demonstrate that BalanceMix outperforms existing state-of-the-art methods. We release the code at https://github.com/DISL-Lab/BalanceMix. \ No newline at end of file diff --git a/data/2024/aaai/Towards Automated Chinese Ancient Character Restoration: A Diffusion-Based Method with a New Dataset b/data/2024/aaai/Towards Automated Chinese Ancient Character Restoration: A Diffusion-Based Method with a New Dataset new file mode 100644 index 0000000000..5fa0ba5d50 --- /dev/null +++ b/data/2024/aaai/Towards Automated Chinese Ancient Character Restoration: A Diffusion-Based Method with a New Dataset @@ -0,0 +1 @@ +Automated Chinese ancient character restoration (ACACR) remains a challenging task due to its historical significance and aesthetic complexity. 
Existing methods are constrained by non-professional masks and even overfitting when training on small-scale datasets, which hinder their interdisciplinary application to traditional fields. In this paper, we are proud to introduce the Chinese Ancient Rubbing and Manuscript Character Dataset (ARMCD), which consists of 15,553 real-world ancient single-character images with 42 rubbings and manuscripts, covering the works of over 200 calligraphy artists spanning from 200 to 1,800 AD. We are also dedicated to providing professional synthetic masks by extracting localized erosion from real eroded images. Moreover, we propose DiffACR (Diffusion model for automated Chinese Ancient Character Restoration), a diffusion-based method for the ACACR task. Specifically, we regard the synthesis of eroded images as a special form of cold diffusion on uneroded ones and extract the prior mask directly from the eroded images. Our experiments demonstrate that our method comprehensively outperforms most existing methods on the proposed ARMCD. Dataset and code are available at https://github.com/lhl322001/DiffACR. \ No newline at end of file diff --git a/data/2024/aaai/Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education b/data/2024/aaai/Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education new file mode 100644 index 0000000000..1d919e0891 --- /dev/null +++ b/data/2024/aaai/Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education @@ -0,0 +1 @@ +The recent large language models (LLMs), e.g., ChatGPT, have been able to generate human-like and fluent responses when provided with specific instructions. While admitting the convenience brought by technological advancement, educators also have concerns that students might leverage LLMs to complete their writing assignments and pass them off as their original work. Although many AI content detection studies have been conducted as a result of such concerns, most of these prior studies modeled AI content detection as a classification problem, assuming that a text is either entirely human-written or entirely AI-generated. In this study, we investigated AI content detection in a rarely explored yet realistic setting where the text to be detected is collaboratively written by human and generative LLMs (termed as hybrid text for simplicity). We first formalized the detection task as identifying the transition points between human-written content and AI-generated content from a given hybrid text (boundary detection). We constructed a hybrid essay dataset by partially and randomly removing sentences from the original student-written essays and then instructing ChatGPT to fill in for the incomplete essays. Then we proposed a two-step detection approach where we (1) separated AI-generated content from human-written content during the encoder training process; and (2) calculated the distances between every two adjacent prototypes (a prototype is the mean of a set of consecutive sentences from the hybrid text in the embedding space) and assumed that the boundaries exist between the two adjacent prototypes that have the furthest distance from each other. 
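(Editor's illustration, not part of the paper: a minimal sketch of the second step described above, placing the boundary between the two adjacent prototypes that are furthest apart. The encoder-training step is omitted, sentence embeddings are assumed to be given by some off-the-shelf encoder, and the window size, cosine distance, and toy data are illustrative assumptions.)

```python
import numpy as np

def detect_boundary(sentence_embeddings, proto_size=3):
    """Form prototypes as means of consecutive windows of `proto_size` sentence
    embeddings, then place the boundary between the two adjacent prototypes with
    the largest cosine distance. Returns the index of the first sentence after
    the predicted boundary."""
    E = np.asarray(sentence_embeddings, dtype=float)
    protos, spans = [], []
    for start in range(0, len(E), proto_size):
        window = E[start:start + proto_size]
        protos.append(window.mean(axis=0))
        spans.append((start, min(start + proto_size, len(E))))
    protos = np.stack(protos)
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    dists = 1.0 - np.sum(protos[:-1] * protos[1:], axis=1)   # adjacent cosine distances
    k = int(np.argmax(dists))
    return spans[k][1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 16
    human_center, ai_center = rng.normal(size=dim), rng.normal(size=dim)
    human = human_center + 0.3 * rng.normal(size=(9, dim))   # stand-in human-written embeddings
    ai = ai_center + 0.3 * rng.normal(size=(6, dim))         # stand-in AI-generated embeddings
    essay = np.vstack([human, ai])
    print("predicted index of the first AI-written sentence:", detect_boundary(essay, proto_size=3))
```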
Through extensive experiments, we observed the following main findings: (1) the proposed approach consistently outperformed the baseline methods across different experiment settings; (2) the encoder training process (i.e., step 1 of the above two-step approach) can significantly boost the performance of the proposed approach; (3) when detecting boundaries for single-boundary hybrid essays, the proposed approach could be enhanced by adopting a relatively large prototype size (i.e., the number of sentences needed to calculate a prototype), leading to a 22% improvement (against the best baseline method) in the In-Domain evaluation and an 18% improvement in the Out-of-Domain evaluation. \ No newline at end of file diff --git a/data/2024/aaai/Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval b/data/2024/aaai/Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval new file mode 100644 index 0000000000..bfba9c9721 --- /dev/null +++ b/data/2024/aaai/Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval @@ -0,0 +1 @@ +Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, these existing strategies are often sub-optimal since they ignore the modality imbalance problem, i.e., the semantic richness inherent in videos far exceeds that of a given limited-length sentence. Therefore, in pursuit of better alignment, a natural idea is enhancing the video modality to filter out query-irrelevant semantics, and enhancing the text modality to capture more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment through enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions associated with query words in frame-level features while suppressing irrelevant parts. Therefore, the enhanced video contains less redundant semantics and is more balanced with the textual modality. Second, we enhance the textual modality at the segment-sentence level by learning complementary knowledge from context sentences and ground-truth segments. With the knowledge added to the query, the textual modality thus maintains more meaningful semantics and is more balanced with the video modality. By implementing two levels of MESM, the semantic information from both modalities is more balanced to align, thereby bridging the modality gap. Experiments on three widely used benchmarks, including the out-of-distribution settings, show that the proposed framework achieves a new state-of-the-art performance with notable generalization ability (e.g., 4.42% and 7.69% average gains of R1@0.7 on Charades-STA and Charades-CG). The code will be available at https://github.com/lntzm/MESM.
However, the scarcity of human-labeled data for languages beyond English poses a significant challenge in developing such systems. In this work, we propose a Language-Independent scoring approach to evaluate speech without relying on labeled data in the target language. We introduce a multilingual speech scoring system that leverages representations from the wav2vec 2.0 XLSR model and a forced-alignment technique based on CTC-Segmentation to construct speech features. These features are used to train a machine learning model to predict pronunciation and fluency scores. We demonstrate the potential of our method by predicting expert ratings on a speech dataset spanning five languages (English, French, Spanish, German, and Portuguese), and comparing its performance against Language-Specific models trained individually on each language, as well as a jointly-trained model on all languages. Results indicate that our approach shows promise as an initial step towards universal, language-independent speech scoring. \ No newline at end of file diff --git a/data/2024/aaai/Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders b/data/2024/aaai/Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders new file mode 100644 index 0000000000..fb448fb76f --- /dev/null +++ b/data/2024/aaai/Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders @@ -0,0 +1 @@ +Learning 3D representations plays a critical role in masked autoencoder (MAE) based pre-training methods for point clouds, including both single-modal and cross-modal MAE. Specifically, although cross-modal MAE methods learn strong 3D representations with the aid of knowledge from other modalities, they often suffer from heavy computational burdens and heavily rely on massive cross-modal data pairs that are often unavailable, which hinders their applications in practice. Instead, single-modal methods with solely point clouds as input are preferred in real applications due to their simplicity and efficiency. However, such methods easily learn limited 3D representations when only global random masking is applied to the input. To learn compact 3D representations, we propose a simple yet effective Point Feature Enhancement Masked Autoencoder (Point-FEMAE), which mainly consists of a global branch and a local branch to capture latent semantic features. Specifically, to learn more compact features, a shared-parameter Transformer encoder is introduced to extract point features from the global and local unmasked patches obtained by global random and local block mask strategies, followed by a specific decoder for reconstruction. Meanwhile, to further enhance features in the local branch, we propose a Local Enhancement Module with local patch convolution to perceive fine-grained local context at larger scales. Our method significantly improves the pre-training efficiency compared to cross-modal alternatives, and extensive downstream experiments underscore the state-of-the-art effectiveness, particularly outperforming our baseline (Point-MAE) by 5.16%, 5.00%, and 5.04% in three variants of ScanObjectNN, respectively. Code is available at https://github.com/zyh16143998882/AAAI24-PointFEMAE.
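(Editor's illustration, not part of the paper: a minimal sketch contrasting the two masking strategies named above, global random masking versus local block masking over point-cloud patches. The mask ratio, patch count, and use of patch-centre nearest neighbours are illustrative assumptions; the shared-parameter encoder, decoder, and Local Enhancement Module are not shown.)

```python
import numpy as np

def global_random_mask(n_patches, mask_ratio=0.6, rng=None):
    """Global branch: mask a random subset of patches anywhere in the cloud."""
    rng = rng or np.random.default_rng()
    n_mask = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_mask, replace=False)] = True
    return mask

def local_block_mask(patch_centers, mask_ratio=0.6, rng=None):
    """Local branch: mask a contiguous spatial block by hiding the patches nearest
    to a random seed patch, so an entire local region must be reconstructed."""
    rng = rng or np.random.default_rng()
    n = len(patch_centers)
    n_mask = int(round(n * mask_ratio))
    seed = rng.integers(n)
    d = np.linalg.norm(patch_centers - patch_centers[seed], axis=1)
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(d)[:n_mask]] = True   # the n_mask patches closest to the seed
    return mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.uniform(-1, 1, size=(64, 3))   # centres of 64 point patches (illustrative)
    g, l = global_random_mask(64, rng=rng), local_block_mask(centers, rng=rng)
    print("global masked:", int(g.sum()), "local masked:", int(l.sum()))
    # The unmasked patches from each branch would feed a shared-parameter encoder.
```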
\ No newline at end of file diff --git a/data/2024/aaai/Towards Continual Knowledge Graph Embedding via Incremental Distillation b/data/2024/aaai/Towards Continual Knowledge Graph Embedding via Incremental Distillation new file mode 100644 index 0000000000..779535a007 --- /dev/null +++ b/data/2024/aaai/Towards Continual Knowledge Graph Embedding via Incremental Distillation @@ -0,0 +1 @@ +Traditional knowledge graph embedding (KGE) methods typically require preserving the entire knowledge graph (KG) with significant training costs when new knowledge emerges. To address this issue, the continual knowledge graph embedding (CKGE) task has been proposed to train the KGE model by learning emerging knowledge efficiently while simultaneously preserving decent old knowledge. However, the explicit graph structure in KGs, which is critical for the above goal, has been heavily ignored by existing CKGE methods. On the one hand, existing methods usually learn new triples in a random order, destroying the inner structure of new KGs. On the other hand, old triples are preserved with equal priority, failing to alleviate catastrophic forgetting effectively. In this paper, we propose a competitive method for CKGE based on incremental distillation (IncDE), which considers the full use of the explicit graph structure in KGs. First, to optimize the learning order, we introduce a hierarchical strategy, ranking new triples for layer-by-layer learning. By employing the inter- and intra-hierarchical orders together, new triples are grouped into layers based on the graph structure features. Secondly, to preserve the old knowledge effectively, we devise a novel incremental distillation mechanism, which facilitates the seamless transfer of entity representations from the previous layer to the next one, promoting old knowledge preservation. Finally, we adopt a two-stage training paradigm to avoid the over-corruption of old knowledge influenced by under-trained new knowledge. Experimental results demonstrate the superiority of IncDE over state-of-the-art baselines. Notably, the incremental distillation mechanism contributes to improvements of 0.2%-6.5% in the mean reciprocal rank (MRR) score. More exploratory experiments validate the effectiveness of IncDE in proficiently learning new knowledge while preserving old knowledge across all time steps. \ No newline at end of file diff --git a/data/2024/aaai/Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding b/data/2024/aaai/Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding new file mode 100644 index 0000000000..fbd186dc5d --- /dev/null +++ b/data/2024/aaai/Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding @@ -0,0 +1 @@ +Deep neural networks are susceptible to catastrophic forgetting when trained on sequential tasks. Various continual learning (CL) methods often rely on exemplar buffers or/and network expansion for balancing model stability and plasticity, which, however, compromises their practical value due to privacy and memory concerns. Instead, this paper considers a strict yet realistic setting, where the training data from previous tasks is unavailable and the model size remains relatively constant during sequential training. To achieve such desiderata, we propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion. 
This is achieved by the synergy between two key components: HSIC-Bottleneck Orthogonalization (HBO) implements non-overwritten parameter updates mediated by the Hilbert-Schmidt independence criterion in an orthogonal space, and EquiAngular Embedding (EAE) enhances decision boundary adaptation between old and new tasks with predefined basis vectors. Extensive experiments demonstrate that our method achieves competitive accuracy performance, even while using no exemplar buffer and a model only 1.02x the size of the base model. \ No newline at end of file diff --git a/data/2024/aaai/Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model b/data/2024/aaai/Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model new file mode 100644 index 0000000000..96fcb70ee1 --- /dev/null +++ b/data/2024/aaai/Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model @@ -0,0 +1 @@ +Text-guided motion synthesis aims to generate 3D human motion that not only precisely reflects the textual description but also reveals the motion details as much as possible. Pioneering methods explore the diffusion model for text-to-motion synthesis and obtain significant superiority. However, these methods conduct diffusion processes either on the raw data distribution or the low-dimensional latent space, which typically suffer from modality inconsistency or a scarcity of detail. To tackle this problem, we propose a novel Basic-to-Advanced Hierarchical Diffusion Model, named B2A-HDM, to collaboratively exploit low-dimensional and high-dimensional diffusion models for high-quality, detailed motion synthesis. Specifically, the basic diffusion model in low-dimensional latent space provides an intermediate denoising result that is consistent with the textual description, while the advanced diffusion model in high-dimensional latent space focuses on the following detail-enhancing denoising process. Besides, we introduce a multi-denoiser framework for the advanced diffusion model to ease the learning of the high-dimensional model and fully explore the generative potential of the diffusion model. Quantitative and qualitative experiment results on two text-to-motion benchmarks (HumanML3D and KIT-ML) demonstrate that B2A-HDM can outperform existing state-of-the-art methods in terms of fidelity, modality consistency, and diversity. \ No newline at end of file diff --git a/data/2024/aaai/Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings b/data/2024/aaai/Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings new file mode 100644 index 0000000000..fcbea7b291 --- /dev/null +++ b/data/2024/aaai/Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings @@ -0,0 +1 @@ +In Time Series Classification (TSC), temporal pooling methods that consider sequential information have been proposed. However, we found that each temporal pooling has a distinct mechanism, and can perform better or worse depending on the time series data. We term this fixed pooling mechanism a single perspective of temporal poolings. In this paper, we propose a novel temporal pooling method with diverse perspective learning: Selection over Multiple Temporal Poolings (SoM-TP). SoM-TP dynamically selects the optimal temporal pooling among multiple methods for each data instance via attention.
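(Editor's illustration, not part of the paper: a minimal sketch of attention-based selection over several temporal poolings of per-timestep features. The pooling candidates (max, average, last step), layer sizes, and soft weighting are illustrative assumptions; SoM-TP's actual pooling set, DPLN, perspective loss, and MCL-style selection are not reproduced.)

```python
import torch
import torch.nn as nn

class SelectionOverPoolings(nn.Module):
    """Compute several temporal poolings of per-timestep features and let a per-sample
    attention over the pooling outputs softly pick the most suitable one."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)          # scores each pooling candidate
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, h):                           # h: (batch, time, feat_dim)
        pooled = torch.stack(
            [h.max(dim=1).values, h.mean(dim=1), h[:, -1, :]], dim=1
        )                                           # (batch, 3 poolings, feat_dim)
        weights = torch.softmax(self.attn(pooled).squeeze(-1), dim=1)   # (batch, 3)
        selected = (weights.unsqueeze(-1) * pooled).sum(dim=1)          # weighted mix
        return self.classifier(selected), weights

if __name__ == "__main__":
    torch.manual_seed(0)
    model = SelectionOverPoolings(feat_dim=32, n_classes=5)
    features = torch.randn(4, 100, 32)              # e.g., CNN features of 4 time series
    logits, pooling_weights = model(features)
    print(logits.shape, pooling_weights[0])         # which pooling each series leans on
```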
The dynamic pooling selection is motivated by the ensemble concept of Multiple Choice Learning (MCL), which selects the best among multiple outputs. The pooling selection by SoM-TP's attention enables a non-iterative pooling ensemble within a single classifier. Additionally, we define a perspective loss and Diverse Perspective Learning Network (DPLN). The loss works as a regularizer to reflect all the pooling perspectives from DPLN. Our perspective analysis using Layer-wise Relevance Propagation (LRP) reveals the limitation of a single perspective and ultimately demonstrates diverse perspective learning of SoM-TP. We also show that SoM-TP outperforms CNN models based on other temporal poolings and state-of-the-art models in TSC with extensive UCR/UEA repositories. \ No newline at end of file diff --git a/data/2024/aaai/Towards Dynamic Spatial-Temporal Graph Learning: A Decoupled Perspective b/data/2024/aaai/Towards Dynamic Spatial-Temporal Graph Learning: A Decoupled Perspective new file mode 100644 index 0000000000..104b852200 --- /dev/null +++ b/data/2024/aaai/Towards Dynamic Spatial-Temporal Graph Learning: A Decoupled Perspective @@ -0,0 +1 @@ +With the progress of urban transportation systems, a significant amount of high-quality traffic data is continuously collected through streaming manners, which has propelled the prosperity of the field of spatial-temporal graph prediction. In this paper, rather than solely focusing on designing powerful models for static graphs, we shift our focus to spatial-temporal graph prediction in the dynamic scenario, which involves a continuously expanding and evolving underlying graph. To address inherent challenges, a decoupled learning framework (DLF) is proposed in this paper, which consists of a spatial-temporal graph learning network (DSTG) with a specialized decoupling training strategy. Incorporating inductive biases of time-series structures, DSTG can interpret time dependencies into latent trend and seasonal terms. To enable prompt adaptation to the evolving distribution of the dynamic graph, our decoupling training strategy is devised to iteratively update these two types of patterns. Specifically, for learning seasonal patterns, we conduct thorough training for the model using a long time series (e.g., three months of data). To enhance the learning ability of the model, we also introduce the masked auto-encoding mechanism. During this period, we frequently update trend patterns to expand new information from dynamic graphs. Considering both effectiveness and efficiency, we develop a subnet sampling strategy to select a few representative nodes for fine-tuning the weights of the model. These sampled nodes cover unseen patterns and previously learned patterns. Experiments on dynamic spatial-temporal graph datasets further demonstrate the competitive performance, superior efficiency, and strong scalability of the proposed framework. \ No newline at end of file diff --git a/data/2024/aaai/Towards Effective and General Graph Unlearning via Mutual Evolution b/data/2024/aaai/Towards Effective and General Graph Unlearning via Mutual Evolution new file mode 100644 index 0000000000..dbcbe6ac69 --- /dev/null +++ b/data/2024/aaai/Towards Effective and General Graph Unlearning via Mutual Evolution @@ -0,0 +1 @@ +With the rapid advancement of AI applications, the growing needs for data privacy and model robustness have highlighted the importance of machine unlearning, especially in thriving graph-based scenarios. 
However, most existing graph unlearning strategies primarily rely on well-designed architectures or manual processes, rendering them less user-friendly and posing challenges in terms of deployment efficiency. Furthermore, striking a balance between unlearning performance and framework generalization is also a pivotal concern. To address the above issues, we propose Mutual Evolution Graph Unlearning (MEGU), a new mutual evolution paradigm that simultaneously evolves the predictive and unlearning capacities of graph unlearning. By incorporating the aforementioned two components, MEGU ensures complementary optimization in a unified training framework that aligns with the prediction and unlearning requirements. Extensive experiments on 9 graph benchmark datasets demonstrate the superior performance of MEGU in addressing unlearning requirements at the feature, node, and edge levels. Specifically, MEGU achieves average performance improvements of 2.7%, 2.5%, and 3.2% across these three levels of unlearning tasks when compared to state-of-the-art baselines. Furthermore, MEGU exhibits satisfactory training efficiency, reducing time and space overhead by an average of 159.8x and 9.6x, respectively, in comparison to retraining the GNN from scratch. \ No newline at end of file diff --git a/data/2024/aaai/Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks b/data/2024/aaai/Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks new file mode 100644 index 0000000000..7b68fdd1db --- /dev/null +++ b/data/2024/aaai/Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks @@ -0,0 +1 @@ +Diffusion-based Image Editing (DIE) is an emerging research hotspot, which often applies a semantic mask to control the target area for diffusion-based editing. However, most existing solutions obtain these masks via manual operations or off-line processing, greatly reducing their efficiency. In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps. To reduce the noise of attention maps and achieve full automation, we equip InstDiffEdit with a training-free refinement scheme to adaptively aggregate the attention distributions for automatic yet accurate mask generation. Meanwhile, to supplement the existing evaluations of DIE, we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods. To validate InstDiffEdit, we also conduct extensive experiments on ImageNet and Imagen, and compare it with a range of SOTA methods. The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed, i.e., +5 to +6 times.
Our code is available at https://anonymous.4open.science/r/InstDiffEdit-C306 \ No newline at end of file diff --git a/data/2024/aaai/Towards Efficient Verification of Quantized Neural Networks b/data/2024/aaai/Towards Efficient Verification of Quantized Neural Networks new file mode 100644 index 0000000000..299bcd85c1 --- /dev/null +++ b/data/2024/aaai/Towards Efficient Verification of Quantized Neural Networks @@ -0,0 +1 @@ +Quantization replaces floating point arithmetic with integer arithmetic in deep neural network models, providing more efficient on-device inference with less power and memory. In this work, we propose a framework for formally verifying the properties of quantized neural networks. Our baseline technique is based on integer linear programming which guarantees both soundness and completeness. We then show how efficiency can be improved by utilizing gradient-based heuristic search methods and also bound-propagation techniques. We evaluate our approach on perception networks quantized with PyTorch. Our results show that we can verify quantized networks with better scalability and efficiency than the previous state of the art. \ No newline at end of file diff --git a/data/2024/aaai/Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning b/data/2024/aaai/Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning new file mode 100644 index 0000000000..bf0c116a85 --- /dev/null +++ b/data/2024/aaai/Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning @@ -0,0 +1 @@ +In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, those methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture in the retrieval phase. This solution ingeniously balances the coarse and fine granularity of retrieval content. Moreover, it also strikes a harmonious equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate its efficiency and effectiveness. Notably, our method achieves comparable performance with the current state-of-the-art methods while being nearly 50 times faster.
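(Editor's illustration, not part of the paper: a minimal sketch of the two-stage retrieval flow described above, namely fast top-k recall with coarse video-level embeddings followed by reranking of only those candidates with a more expensive fine-grained scorer. The embedding dimensions, candidate count k, and the stand-in fine scorer are illustrative assumptions; the TIB and Pearson Constraint are not shown.)

```python
import numpy as np

def retrieve_two_stage(text_emb, coarse_video_embs, fine_score_fn, k=50):
    """Stage 1: rank all videos by a cheap coarse similarity and keep the top-k.
    Stage 2: rerank only those candidates with an expensive fine-grained scorer."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    t = normalize(text_emb)
    v = normalize(coarse_video_embs)
    coarse_scores = v @ t                            # cosine similarity to every video
    candidates = np.argsort(-coarse_scores)[:k]      # fast recall of top-k candidates
    fine_scores = np.array([fine_score_fn(i) for i in candidates])
    return candidates[np.argsort(-fine_scores)]      # reranked candidate indices

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=128)
    videos = rng.normal(size=(10_000, 128))          # coarse (video-level) embeddings
    # Stand-in for a fine-grained frame/word interaction score (illustrative only).
    fine = lambda idx: float(videos[idx] @ text) + rng.normal(scale=0.01)
    top = retrieve_two_stage(text, videos, fine, k=50)
    print("top-5 after reranking:", top[:5])
```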
\ No newline at end of file diff --git a/data/2024/aaai/Towards Epistemic-Doxastic Planning with Observation and Revision b/data/2024/aaai/Towards Epistemic-Doxastic Planning with Observation and Revision new file mode 100644 index 0000000000..ff9debf5e9 --- /dev/null +++ b/data/2024/aaai/Towards Epistemic-Doxastic Planning with Observation and Revision @@ -0,0 +1 @@ +Epistemic planning is useful in situations where multiple agents have different knowledge and beliefs about the world, such as in robot-human interaction. One aspect that has been largely neglected in the literature is planning with observations in the presence of false beliefs. This is a particularly challenging problem because it requires belief revision. We introduce a simple specification language for reasoning about actions with knowledge and belief. We demonstrate our approach on well-known false-belief tasks such as the Sally-Anne Task and compare it to other action languages. Our logic leads to an epistemic planning formalism that is expressive enough to model second-order false-belief tasks, yet has the same computational complexity as classical planning. \ No newline at end of file diff --git a/data/2024/aaai/Towards Equipping Transformer with the Ability of Systematic Compositionality b/data/2024/aaai/Towards Equipping Transformer with the Ability of Systematic Compositionality new file mode 100644 index 0000000000..e89a33b095 --- /dev/null +++ b/data/2024/aaai/Towards Equipping Transformer with the Ability of Systematic Compositionality @@ -0,0 +1 @@ +One of the key factors in language productivity and human cognition is the ability of Systematic Compositionality, which refers to understanding composed, unseen examples of seen primitives. However, recent evidence reveals that the Transformers have difficulty in generalizing the composed context based on the seen primitives. To this end, we take the first step to propose a compositionality-aware Transformer called CAT and two novel pre-training tasks to facilitate the systematic compositionality. We tentatively provide a successful implementation of a multi-layer CAT on the basis of the especially popular BERT. The experimental results demonstrate that CAT outperforms baselines on compositionality-aware tasks with minimal impact on effectiveness on standardized language understanding tasks. \ No newline at end of file diff --git a/data/2024/aaai/Towards Evidential and Class Separable Open Set Object Detection b/data/2024/aaai/Towards Evidential and Class Separable Open Set Object Detection new file mode 100644 index 0000000000..b1df3e2818 --- /dev/null +++ b/data/2024/aaai/Towards Evidential and Class Separable Open Set Object Detection @@ -0,0 +1 @@ +Detecting in open-world scenarios poses a formidable challenge for models intended for real-world deployment. The advanced closed set object detectors achieve impressive performance under the closed set setting, but often produce overconfident misprediction on unknown objects due to the lack of supervision. In this paper, we propose a novel Evidential Object Detector (EOD) to formulate the Open Set Object Detection (OSOD) problem from the perspective of Evidential Deep Learning (EDL) theory, which quantifies classification uncertainty by placing the Dirichlet Prior over the categorical distribution parameters. The task-specific customized evidential framework, equipped with meticulously designed model architecture and loss function, effectively bridges the gap between EDL theory and detection tasks. 
Moreover, we utilize contrastive learning as an implicit means of evidential regularization and to encourage class separation in the latent space. In addition, we innovatively model the background uncertainty to further improve the unknown discovery ability. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms existing ones. \ No newline at end of file diff --git a/data/2024/aaai/Towards Explainable Joint Models via Information Theory for Multiple Intent Detection and Slot Filling b/data/2024/aaai/Towards Explainable Joint Models via Information Theory for Multiple Intent Detection and Slot Filling new file mode 100644 index 0000000000..8657fdb5de --- /dev/null +++ b/data/2024/aaai/Towards Explainable Joint Models via Information Theory for Multiple Intent Detection and Slot Filling @@ -0,0 +1 @@ +Recent joint models for multi-intent detection and slot filling have obtained promising results through modeling the unidirectional or bidirectional guidance between intent and slot. However, existing works design joint models heuristically and lack theoretical exploration, including (1) theoretical measurement of the joint-interaction quality; (2) explainability of design and optimization methods of joint models, which may limit the performance and efficiency of designs. In this paper, we mathematically define the cross-task information gain (CIG) to measure the quality of joint processes from an information-theoretic perspective and discover an implicit optimization of CIG in previous models. Based on this, we propose a novel multi-stage iterative framework with theoretical effectiveness, explainability, and convergence, which can explicitly optimize information for cross-task interactions. Further, we devise an information-based joint model (InfoJoint) that conforms to this theoretical framework to gradually reduce the cross-task propagation of erroneous semantics through CIG iterative maximization. Extensive experimental results on two public datasets show that InfoJoint outperforms the state-of-the-art models by a large margin. \ No newline at end of file diff --git a/data/2024/aaai/Towards Fair Graph Federated Learning via Incentive Mechanisms b/data/2024/aaai/Towards Fair Graph Federated Learning via Incentive Mechanisms new file mode 100644 index 0000000000..901d7d5b63 --- /dev/null +++ b/data/2024/aaai/Towards Fair Graph Federated Learning via Incentive Mechanisms @@ -0,0 +1,2 @@ +Graph federated learning (FL) has emerged as a pivotal paradigm enabling multiple agents to collaboratively train a graph model while preserving local data privacy. Yet, current efforts overlook a key issue: agents are self-interested and would be hesitant to share data without fair and satisfactory incentives. This paper is the first endeavor to address this issue by studying the incentive mechanism for graph federated learning. We identify a unique phenomenon in graph federated learning: the presence of agents posing potential harm to the federation and agents contributing with delays. This stands in contrast to previous FL incentive mechanisms that assume all agents contribute positively and in a timely manner. +In view of this, this paper presents a novel incentive mechanism tailored for fair graph federated learning, integrating incentives derived from both model gradient and payoff.
To achieve this, we first introduce an agent valuation function aimed at quantifying agent contributions through the introduction of two criteria: gradient alignment and graph diversity. Moreover, due to the high heterogeneity in graph federated learning, striking a balance between accuracy and fairness becomes particularly crucial. We introduce motif prototypes to enhance accuracy, communicated between the server and agents, enhancing global model aggregation and aiding agents in local model optimization. Extensive experiments show that our model achieves the best trade-off between accuracy and the fairness of model gradient, as well as superior payoff fairness. \ No newline at end of file diff --git a/data/2024/aaai/Towards Fairer Centroids in K-means Clustering b/data/2024/aaai/Towards Fairer Centroids in K-means Clustering new file mode 100644 index 0000000000..f83632eae9 --- /dev/null +++ b/data/2024/aaai/Towards Fairer Centroids in K-means Clustering @@ -0,0 +1 @@ +There has been much recent interest in developing fair clustering algorithms that seek to do justice to the representation of groups defined along sensitive attributes such as race and sex. Within the centroid clustering paradigm, these algorithms are seen to generate clusterings where different groups are disadvantaged within different clusters with respect to their representativity, i.e., distance to centroid. In view of this deficiency, we propose a novel notion of cluster-level centroid fairness that targets the representativity unfairness borne by groups within each cluster, along with a metric to quantify the same. Towards operationalising this notion, we draw on ideas from political philosophy aligned with consideration for the worst-off group to develop Fair-Centroid; a new clustering method that focusses on enhancing the representativity of the worst-off group within each cluster. Our method uses an iterative optimisation paradigm wherein an initial cluster assignment is refined by reassigning objects to clusters such that the worst-off group in each cluster is benefitted. We compare our notion with a related fairness notion and show through extensive empirical evaluations on real-world datasets that our method significantly enhances cluster-level centroid fairness at low impact on cluster coherence. \ No newline at end of file diff --git a/data/2024/aaai/Towards Fairness in Online Service with K Servers and Its Application on Fair Food Delivery b/data/2024/aaai/Towards Fairness in Online Service with K Servers and Its Application on Fair Food Delivery new file mode 100644 index 0000000000..56ead50a57 --- /dev/null +++ b/data/2024/aaai/Towards Fairness in Online Service with K Servers and Its Application on Fair Food Delivery @@ -0,0 +1 @@ +The k-SERVER problem is one of the most prominent problems in online algorithms with several variants and extensions. However, simplifying assumptions like instantaneous server movements and zero service time has hitherto limited its applicability to real-world problems. In this paper, we introduce a realistic generalization of k-SERVER without such assumptions – the k-FOOD problem, where requests with source-destination locations and an associated pickup time window arrive in an online fashion, and each has to be served by exactly one of the available k servers. The k-FOOD problem offers the versatility to model a variety of real-world use cases such as food delivery, ride sharing, and quick commerce. 
Moreover, motivated by the need for fairness in online platforms, we introduce the FAIR k-FOOD problem with the max-min objective. We establish that both k-FOOD and FAIR k-FOOD problems are strongly NP-hard and develop an optimal offline algorithm that arises naturally from a time-expanded flow network. Subsequently, we propose an online algorithm DOC4FOOD involving virtual movements of servers to the nearest request location. Experiments on a real-world food-delivery dataset, alongside synthetic datasets, establish the efficacy of the proposed algorithm against state-of-the-art fair food delivery algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Towards Fine-Grained HBOE with Rendered Orientation Set and Laplace Smoothing b/data/2024/aaai/Towards Fine-Grained HBOE with Rendered Orientation Set and Laplace Smoothing new file mode 100644 index 0000000000..e5f2ef7c95 --- /dev/null +++ b/data/2024/aaai/Towards Fine-Grained HBOE with Rendered Orientation Set and Laplace Smoothing @@ -0,0 +1 @@ +Human body orientation estimation (HBOE) aims to estimate the orientation of a human body relative to the camera’s frontal view. Despite recent advancements in this field, there still exist limitations in achieving fine-grained results. We identify certain defects and propose corresponding approaches as follows: 1). Existing datasets suffer from non-uniform angle distributions, resulting in sparse image data for certain angles. To provide comprehensive and high-quality data, we introduce RMOS (Rendered Model Orientation Set), a rendered dataset comprising 150K accurately labeled human instances with a wide range of orientations. 2). Directly using one-hot vector as labels may overlook the similarity between angle labels, leading to poor supervision. And converting the predictions from radians to degrees enlarges the regression error. To enhance supervision, we employ Laplace smoothing to vectorize the label, which contains more information. For fine-grained predictions, we adopt weighted Smooth-L1-loss to align predictions with the smoothed-label, thus providing robust supervision. 3). Previous works ignore body-part-specific information, resulting in coarse predictions. By employing local-window self-attention, our model could utilize different body part information for more precise orientation estimations. We validate the effectiveness of our method in the benchmarks with extensive experiments and show that our method outperforms state-of-the-art. Project is available at: https://github.com/Whalesong-zrs/Towards-Fine-grained-HBOE. \ No newline at end of file diff --git a/data/2024/aaai/Towards Holistic, Pragmatic and Multimodal Conversational Systems b/data/2024/aaai/Towards Holistic, Pragmatic and Multimodal Conversational Systems new file mode 100644 index 0000000000..b2f629534f --- /dev/null +++ b/data/2024/aaai/Towards Holistic, Pragmatic and Multimodal Conversational Systems @@ -0,0 +1 @@ +Language acquisition and utilization transcend the mere exchange of lexical units. Visual cues, prosody, gestures, body movements, and context play an undeniably crucial role. Humans naturally communicate multimodally, employing multiple channels and synthesizing information from diverse modalities. My research delves into the characterization and construction of multimodal models that seamlessly integrate data from multiple independent modalities. I will cover recent work that highlights the challenges, achievements, and opportunities towards developing capable multimodal discursive models. 
\ No newline at end of file diff --git a/data/2024/aaai/Towards Human-like Learning from Relational Structured Data b/data/2024/aaai/Towards Human-like Learning from Relational Structured Data new file mode 100644 index 0000000000..ef355ec653 --- /dev/null +++ b/data/2024/aaai/Towards Human-like Learning from Relational Structured Data @@ -0,0 +1 @@ +Relational structured data is a way of representing knowledge using nodes and edges, while also capturing the meaning of that knowledge in a structured form that can be used for machine learning. Compared with vision and natural language data, relational structured data represents and manipulates structured knowledge, which can be beneficial for tasks that involve reasoning or inference. On the other hand, vision and NLP deal more with unstructured data (like images and text), and they often require different types of models and algorithms to extract useful information or features from the data. Human-like Learning develops methods that can harness relational structures and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. With Human-like Learning, the learning algorithm is efficient and can adapt to new or unseen situations, which is crucial in real-world applications where environments may change unpredictably. Moreover, the models are easier for humans to understand and interpret, which is important for transparency and trust in AI systems. In this talk, we present our recent attempts towards human-like learning from relational structured data. \ No newline at end of file diff --git a/data/2024/aaai/Towards Large Certified Radius in Randomized Smoothing Using Quasiconcave Optimization b/data/2024/aaai/Towards Large Certified Radius in Randomized Smoothing Using Quasiconcave Optimization new file mode 100644 index 0000000000..d1f95379ec --- /dev/null +++ b/data/2024/aaai/Towards Large Certified Radius in Randomized Smoothing Using Quasiconcave Optimization @@ -0,0 +1 @@ +Randomized smoothing is currently the state-of-the-art method that provides certified robustness for deep neural networks. However, due to its excessively conservative nature, this method of incomplete verification often cannot achieve an adequate certified radius on real-world datasets. One way to obtain a larger certified radius is to use an input-specific algorithm instead of using a fixed Gaussian filter for all data points. Several methods based on this idea have been proposed, but they either suffer from high computational costs or gain marginal improvement in certified radius. In this work, we show that by exploiting the quasiconvex problem structure, we can find the optimal certified radii for most data points with slight computational overhead. This observation leads to an efficient and effective input-specific randomized smoothing algorithm. We conduct extensive experiments and empirical analysis on CIFAR-10 and ImageNet. The results show that the proposed method significantly enhances the certified radii with low computational overhead. 
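For context on the quantity that the randomized-smoothing abstract above seeks to enlarge, a minimal sketch of the standard certificate of Cohen et al. (2019) follows; the paper's input-specific, quasiconvex optimization of the noise level is not reproduced here, and the variable names are ours.

    from scipy.stats import norm

    def certified_radius(p_a, p_b, sigma):
        # Standard randomized-smoothing certificate: with Gaussian noise
        # N(0, sigma^2 I), if p_a lower-bounds the top-class probability and
        # p_b upper-bounds the runner-up probability, the smoothed prediction
        # is constant within an L2 ball of this radius.
        return 0.5 * sigma * (norm.ppf(p_a) - norm.ppf(p_b))

    # e.g. certified_radius(0.9, 0.05, 0.5) is roughly 0.73

An input-specific method, as described in the abstract, would then choose the noise level per sample to maximize this radius while keeping the smoothed classifier's prediction correct.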
\ No newline at end of file diff --git a/data/2024/aaai/Towards Learning and Explaining Indirect Causal Effects in Neural Networks b/data/2024/aaai/Towards Learning and Explaining Indirect Causal Effects in Neural Networks new file mode 100644 index 0000000000..60a2e6d1a9 --- /dev/null +++ b/data/2024/aaai/Towards Learning and Explaining Indirect Causal Effects in Neural Networks @@ -0,0 +1 @@ +Recently, there has been a growing interest in learning and explaining causal effects within Neural Network (NN) models. By virtue of NN architectures, previous approaches consider only direct and total causal effects assuming independence among input variables. We view an NN as a structural causal model (SCM) and extend our focus to include indirect causal effects by introducing feedforward connections among input neurons. We propose an ante-hoc method that captures and maintains direct, indirect, and total causal effects during NN model training. We also propose an algorithm for quantifying learned causal effects in an NN model and efficient approximation strategies for quantifying causal effects in high-dimensional data. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the causal effects learned by our ante-hoc method better approximate the ground truth effects compared to existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Towards Making Learnware Specification and Market Evolvable b/data/2024/aaai/Towards Making Learnware Specification and Market Evolvable new file mode 100644 index 0000000000..5e078ee6d4 --- /dev/null +++ b/data/2024/aaai/Towards Making Learnware Specification and Market Evolvable @@ -0,0 +1 @@ +The learnware paradigm aims to establish a market of numerous well-performed machine learning models, enabling users to leverage existing helpful models for their tasks instead of starting from scratch. Each learnware in the market is a model submitted by its developer, associated with a specification generated with the help of learnware market, representing the model's specialty and utility and enabling it to be identified for new user tasks. As the market continuously scales up, accommodating an ever-increasing number of learnwares, the critical challenge of the learnware paradigm is to effectively and efficiently identify the most helpful learnware(s) for a new user task without accessing the user's raw data. In this paper, to achieve increasingly accurate learnware characterization and identification along with a growing number of learnwares in the market, we propose an approach called Evolvable Learnware Specification with Index (ELSI). Specifically, based on the key idea of leveraging the task information within learnware specifications, we tackle the challenge of ascertaining the capabilities of models beyond their original training tasks, thereby enabling learnware specifications and the entire market to evolve continuously. Furthermore, through organizing learnwares and constructing specification indexes, we design a practical procedure to accurately and efficiently identify helpful learnwares without examining the entire market. Theoretical analysis and extensive experiments on a learnware market prototype encompassing thousands of models and covering six real-world scenarios validate the effectiveness and efficiency of our approach. 
\ No newline at end of file diff --git a/data/2024/aaai/Towards Model Extraction Attacks in GAN-Based Image Translation via Domain Shift Mitigation b/data/2024/aaai/Towards Model Extraction Attacks in GAN-Based Image Translation via Domain Shift Mitigation new file mode 100644 index 0000000000..845d110de0 --- /dev/null +++ b/data/2024/aaai/Towards Model Extraction Attacks in GAN-Based Image Translation via Domain Shift Mitigation @@ -0,0 +1 @@ +Model extraction attacks (MEAs) enable an attacker to replicate the functionality of a victim deep neural network (DNN) model by only querying its API service remotely, posing a severe threat to the security and integrity of pay-per-query DNN-based services. Although the majority of current research on MEAs has primarily concentrated on neural classifiers, there is a growing prevalence of image-to-image translation (I2IT) tasks in our everyday activities. However, techniques developed for MEA of DNN classifiers cannot be directly transferred to the case of I2IT, rendering the vulnerability of I2IT models to MEA attacks often underestimated. This paper unveils the threat of MEA in I2IT tasks from a new perspective. Diverging from the traditional approach of bridging the distribution gap between attacker queries and victim training samples, we opt to mitigate the effect caused by the different distributions, known as the domain shift. This is achieved by introducing a new regularization term that penalizes high-frequency noise, and seeking a flatter minimum to avoid overfitting to the shifted distribution. Extensive experiments on different image translation tasks, including image super-resolution and style transfer, are performed on different backbone victim models, and the new design consistently outperforms the baseline by a large margin across all metrics. A few real-life I2IT APIs are also verified to be extremely vulnerable to our attack, emphasizing the need for enhanced defenses and potentially revised API publishing policies. \ No newline at end of file diff --git a/data/2024/aaai/Towards Modeling Uncertainties of Self-Explaining Neural Networks via Conformal Prediction b/data/2024/aaai/Towards Modeling Uncertainties of Self-Explaining Neural Networks via Conformal Prediction new file mode 100644 index 0000000000..62bbdaaa7e --- /dev/null +++ b/data/2024/aaai/Towards Modeling Uncertainties of Self-Explaining Neural Networks via Conformal Prediction @@ -0,0 +1 @@ +Despite the recent progress in deep neural networks (DNNs), it remains challenging to explain the predictions made by DNNs. Existing explanation methods for DNNs mainly focus on post-hoc explanations where another explanatory model is employed to provide explanations. The fact that post-hoc methods can fail to reveal the actual original reasoning process of DNNs raises the need to build DNNs with built-in interpretability. Motivated by this, many self-explaining neural networks have been proposed to generate not only accurate predictions but also clear and intuitive insights into why a particular decision was made. However, existing self-explaining networks are limited in providing distribution-free uncertainty quantification for the two simultaneously generated prediction outcomes (i.e., a sample's final prediction and its corresponding explanations for interpreting that prediction). 
Importantly, they also fail to establish a connection between the confidence values assigned to the generated explanations in the interpretation layer and those allocated to the final predictions in the ultimate prediction layer. To tackle the aforementioned challenges, in this paper, we design a novel uncertainty modeling framework for self-explaining networks, which not only demonstrates strong distribution-free uncertainty modeling performance for the generated explanations in the interpretation layer but also excels in producing efficient and effective prediction sets for the final predictions based on the informative high-level basis explanations. We perform the theoretical analysis for the proposed framework. Extensive experimental evaluation demonstrates the effectiveness of the proposed uncertainty framework. \ No newline at end of file diff --git a/data/2024/aaai/Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA b/data/2024/aaai/Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA new file mode 100644 index 0000000000..c4bf22aa17 --- /dev/null +++ b/data/2024/aaai/Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA @@ -0,0 +1 @@ +Natural language explanation in visual question answer (VQA-NLE) aims to explain the decision-making process of models by generating natural language sentences to increase users' trust in the black-box systems. Existing post-hoc methods have achieved significant progress in obtaining a plausible explanation. However, such post-hoc explanations are not always aligned with human logical inference, suffering from the issues on: 1) Deductive unsatisfiability, the generated explanations do not logically lead to the answer; 2) Factual inconsistency, the model falsifies its counterfactual explanation for answers without considering the facts in images; and 3) Semantic perturbation insensitivity, the model can not recognize the semantic changes caused by small perturbations. These problems reduce the faithfulness of explanations generated by models. To address the above issues, we propose a novel self-supervised Multi-level Contrastive Learning based natural language Explanation model (MCLE) for VQA with semantic-level, image-level, and instance-level factual and counterfactual samples. MCLE extracts discriminative features and aligns the feature spaces from explanations with visual question and answer to generate more consistent explanations. We conduct extensive experiments, ablation analysis, and case study to demonstrate the effectiveness of our method on two VQA-NLE benchmarks. \ No newline at end of file diff --git a/data/2024/aaai/Towards Multi-Intent Spoken Language Understanding via Hierarchical Attention and Optimal Transport b/data/2024/aaai/Towards Multi-Intent Spoken Language Understanding via Hierarchical Attention and Optimal Transport new file mode 100644 index 0000000000..77be97afc3 --- /dev/null +++ b/data/2024/aaai/Towards Multi-Intent Spoken Language Understanding via Hierarchical Attention and Optimal Transport @@ -0,0 +1 @@ +Multi-Intent spoken language understanding (SLU) can handle complicated utterances expressing multiple intents, which has attracted increasing attention from researchers. 
Although existing models have achieved promising performance, most of them still suffer from two leading problems: (1) each intent has its specific scope and the semantic information outside the scope might potentially hinder accurate predictions, i.e., the scope barrier; (2) only the guidance from intent to slot is modeled but the guidance from slot to intent is often neglected, i.e., unidirectional guidance. In this paper, we propose a novel Multi-Intent SLU framework termed HAOT, which utilizes hierarchical attention to divide the scopes of each intent and applies optimal transport to achieve the mutual guidance between slot and intent. Experiments demonstrate that our model achieves state-of-the-art performance on two public Multi-Intent SLU datasets, obtaining a 3.4 improvement in overall accuracy on the MixATIS dataset compared to the previous best models. \ No newline at end of file diff --git a/data/2024/aaai/Towards Multi-Mode Outlier Robust Tensor Ring Decomposition b/data/2024/aaai/Towards Multi-Mode Outlier Robust Tensor Ring Decomposition new file mode 100644 index 0000000000..c72203c7f9 --- /dev/null +++ b/data/2024/aaai/Towards Multi-Mode Outlier Robust Tensor Ring Decomposition @@ -0,0 +1 @@ +Conventional Outlier Robust Tensor Decomposition (ORTD) approaches generally represent sparse outlier corruption within a specific mode. However, such an assumption, which may hold for matrices, proves inadequate when applied to high-order tensors. In the tensor domain, outlier corruption is prone to occur in multiple modes simultaneously. Addressing this limitation, this study proposes a novel ORTD approach by recovering low-rank tensors contaminated by outliers spanning multiple modes. In particular, we conceptualize outliers within high-order tensors as latent tensor group sparsity by decomposing the corrupted tensor into a sum of multiple latent components, where each latent component is exclusive to outliers within a particular direction. Thus, it can effectively mitigate the outlier corruptions prevalent in high-order tensors across multiple modes. To theoretically guarantee recovery performance, we rigorously analyze a non-asymptotic upper bound of the estimation error for the proposed ORTD approach. In the optimization process, we develop an efficient alternating direction method of multipliers (ADMM) algorithm. Empirical validation of the approach's efficacy is undertaken through comprehensive experimentation. \ No newline at end of file diff --git a/data/2024/aaai/Towards Real-World Test-Time Adaptation: Tri-net Self-Training with Balanced Normalization b/data/2024/aaai/Towards Real-World Test-Time Adaptation: Tri-net Self-Training with Balanced Normalization new file mode 100644 index 0000000000..b1eb9a5dad --- /dev/null +++ b/data/2024/aaai/Towards Real-World Test-Time Adaptation: Tri-net Self-Training with Balanced Normalization @@ -0,0 +1,2 @@ +Test-Time Adaptation aims to adapt a source-domain model to testing data at inference stage, with success demonstrated in adapting to unseen corruptions. However, these attempts may fail under more challenging real-world scenarios. Existing works mainly consider real-world test-time adaptation under non-i.i.d. data streams and continual domain shift. In this work, we first complement the existing real-world TTA protocol with a globally class imbalanced testing set. We demonstrate that combining all settings together poses new challenges to existing methods.
We argue the failure of state-of-the-art methods is first caused by indiscriminately adapting normalization layers to imbalanced testing data. To remedy this shortcoming, we propose a balanced batchnorm layer to swap out the regular batchnorm at inference stage. The new batchnorm layer is capable of adapting without biasing towards majority classes. We are further inspired by the success of self-training (ST) in learning from unlabeled data and adapt ST for test-time adaptation. However, ST alone is prone to over adaption which is responsible for the poor performance under continual domain shift. Hence, we propose to improve self-training under continual domain shift by regularizing model updates with an anchored loss. The final TTA model, termed as TRIBE, is built upon a tri-net architecture with balanced batchnorm layers. We evaluate TRIBE on four datasets representing real-world TTA settings. TRIBE consistently achieves the state-of-the-art performance across multiple evaluation protocols. +The code is available at https://github.com/Gorilla-Lab-SCUT/TRIBE. \ No newline at end of file diff --git a/data/2024/aaai/Towards Reliable Learning in the Wild: Generalization and Adaptation b/data/2024/aaai/Towards Reliable Learning in the Wild: Generalization and Adaptation new file mode 100644 index 0000000000..a57e36c02c --- /dev/null +++ b/data/2024/aaai/Towards Reliable Learning in the Wild: Generalization and Adaptation @@ -0,0 +1 @@ +The real-world deployment of machine learning algorithms often poses challenges due to shifts in data distributions and tasks. These shifts can lead to a degradation in model performance, as the model may not have encountered such changes during training. Additionally, they can make it difficult for the model to generalize to new scenarios and can result in poor performance in real-world applications. In this talk, I will present our research on building machine learning models that are highly generalizable and easily adaptable to different shifts. Specifically, I will first discuss our approach to improving out-of-distribution robustness and mitigating spurious correlations by training environment-invariant models through selective augmentation and post-hoc rectification. Second, I will present our techniques for continuous and rapid adaptation of models to new tasks and environments. This includes methods to facilitate compositional generalization and adaptation by extracting relationships from historical observations and to enhance reliable adaptation even in the face of imperfect observations. Additionally, I will showcase our successful practices for addressing shifts in real-world applications, such as in the healthcare, e-commerce, and transportation industries. The talk will also touch upon the remaining challenges and outline future research directions in this area. \ No newline at end of file diff --git a/data/2024/aaai/Towards Reproducible, Automated, and Scalable Anomaly Detection b/data/2024/aaai/Towards Reproducible, Automated, and Scalable Anomaly Detection new file mode 100644 index 0000000000..4eba02e9b4 --- /dev/null +++ b/data/2024/aaai/Towards Reproducible, Automated, and Scalable Anomaly Detection @@ -0,0 +1 @@ +Anomaly detection (AD), often termed outlier detection, is a key machine learning (ML) task, aiming to identify uncommon yet crucial patterns in data. 
With the increasing complexity of the modern world, the applications of AD span widely, from NASA's spacecraft monitoring to early patient prioritization at the University of Pittsburgh Medical Center. Technology giants like Google and Amazon also leverage AD for service disruption identification. Here, I will traverse my AD work along with promising new directions, particularly emphasizing reproducible benchmarks (Part 1), automated algorithms (Part 2), and scalable systems (Part 3). \ No newline at end of file diff --git a/data/2024/aaai/Towards Robust Image Stitching: An Adaptive Resistance Learning against Compatible Attacks b/data/2024/aaai/Towards Robust Image Stitching: An Adaptive Resistance Learning against Compatible Attacks new file mode 100644 index 0000000000..cb020fd040 --- /dev/null +++ b/data/2024/aaai/Towards Robust Image Stitching: An Adaptive Resistance Learning against Compatible Attacks @@ -0,0 +1 @@ +Image stitching seamlessly integrates images captured from varying perspectives into a single wide field-of-view image. Such integration not only broadens the captured scene but also augments holistic perception in computer vision applications. Given a pair of captured images, subtle perturbations and distortions that go unnoticed by the human visual system tend to attack the correspondence matching, impairing the performance of image stitching algorithms. In light of this challenge, this paper presents the first attempt to improve the robustness of image stitching against adversarial attacks. Specifically, we introduce a stitching-oriented attack (SoA), tailored to amplify the alignment loss within overlapping regions, thereby targeting the feature matching procedure. To establish an attack-resistant model, we delve into the robustness of the stitching architecture and develop adaptive adversarial training (AAT) to balance attack resistance with stitching precision. In this way, we narrow the gap between routine adversarial training and benign models, ensuring resilience without quality compromise. Comprehensive evaluations across real-world and synthetic datasets validate that SoA degrades stitching performance. Furthermore, AAT emerges as a more robust solution against adversarial perturbations, delivering superior stitching results. Code is available at: https://github.com/Jzy2017/TRIS. \ No newline at end of file diff --git a/data/2024/aaai/Towards Robust Visual Understanding: from Recognition to Reasoning b/data/2024/aaai/Towards Robust Visual Understanding: from Recognition to Reasoning new file mode 100644 index 0000000000..69c6581a8b --- /dev/null +++ b/data/2024/aaai/Towards Robust Visual Understanding: from Recognition to Reasoning @@ -0,0 +1 @@ +Models that learn from data are widely and rapidly being deployed today for real-world use, but they suffer from unforeseen failures due to distribution shift, adversarial attacks, noise and corruption, and data scarcity. However, many failures also occur because many modern AI tasks require reasoning beyond pattern matching, and such reasoning abilities are difficult to formulate as data-based input-output function fitting. The reliability problem has become increasingly important under the new paradigm of semantic ``multimodal'' learning. My research provides avenues to develop robust and reliable computer vision systems, particularly by leveraging the interactions between vision and language.
In this AAAI New Faculty Highlights talk, I will cover three thematic areas of my research, spanning robustness in computer vision, open-domain reliability in visual reasoning, and challenges and opportunities in the evaluation of generative models. Readers are encouraged to refer to my website (www.tejasgokhale.com) for more details and updates from my lab's activities towards the goal of robust visual understanding. \ No newline at end of file diff --git a/data/2024/aaai/Towards Robustness to Natural Variations and Distribution Shift (Student Abstract) b/data/2024/aaai/Towards Robustness to Natural Variations and Distribution Shift (Student Abstract) new file mode 100644 index 0000000000..335a074efb --- /dev/null +++ b/data/2024/aaai/Towards Robustness to Natural Variations and Distribution Shift (Student Abstract) @@ -0,0 +1 @@ +This research focuses on improving the robustness of machine learning systems to natural variations and distribution shifts. A design trade space is presented, and various methods are compared, including adversarial training, data augmentation techniques, and novel approaches inspired by model-based robust optimization formulations. \ No newline at end of file diff --git a/data/2024/aaai/Towards Running Time Analysis of Interactive Multi-Objective Evolutionary Algorithms b/data/2024/aaai/Towards Running Time Analysis of Interactive Multi-Objective Evolutionary Algorithms new file mode 100644 index 0000000000..46a1d14733 --- /dev/null +++ b/data/2024/aaai/Towards Running Time Analysis of Interactive Multi-Objective Evolutionary Algorithms @@ -0,0 +1 @@ +Evolutionary algorithms (EAs) are widely used for multi-objective optimization due to their population-based nature. Traditional multi-objective EAs (MOEAs) generate a large set of solutions to approximate the Pareto front, leaving a decision maker (DM) with the task of selecting a preferred solution. However, this process can be inefficient and time-consuming, especially when there are many objectives or the DM has subjective preferences. To address this issue, interactive MOEAs (iMOEAs) integrate decision making into the optimization process, i.e., they update the population with the help of the DM. In contrast to their wide application, only two theoretical works on iMOEAs have existed, and they only considered interactive variants of the two simple single-objective algorithms, RLS and (1+1)-EA. This paper provides the first running time analysis (the essential theoretical aspect of EAs) for practical iMOEAs. Specifically, we prove that the expected running times of the well-developed interactive NSGA-II (called R-NSGA-II) for solving the OneMinMax and OneJumpZeroJump problems are asymptotically smaller than those of the traditional NSGA-II. Meanwhile, we present a variant of OneMinMax, and prove that R-NSGA-II can be exponentially slower than NSGA-II. These results provide theoretical justification for the effectiveness of iMOEAs while identifying situations where they may fail. Experiments are also conducted to validate the theoretical results.
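For reference, the OneMinMax benchmark named in the running-time analysis abstract above is commonly defined as the bi-objective pseudo-Boolean problem below (the exact variant analyzed in the paper may differ):

\[
\mathrm{OneMinMax}(x) \;=\; \Bigl(\, n - \sum_{i=1}^{n} x_i,\;\; \sum_{i=1}^{n} x_i \,\Bigr), \qquad x \in \{0,1\}^n,
\]

where both objectives are maximized and every search point is Pareto-optimal, so the difficulty lies in covering the whole front (or, for an interactive algorithm, the DM's preferred region of it).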
\ No newline at end of file diff --git a/data/2024/aaai/Towards Safe Policy Learning under Partial Identifiability: A Causal Approach b/data/2024/aaai/Towards Safe Policy Learning under Partial Identifiability: A Causal Approach new file mode 100644 index 0000000000..cffcc220fc --- /dev/null +++ b/data/2024/aaai/Towards Safe Policy Learning under Partial Identifiability: A Causal Approach @@ -0,0 +1 @@ +Learning personalized treatment policies is a formative challenge in many real-world applications, including healthcare, econometrics, and artificial intelligence. However, the effectiveness of candidate policies is not always identifiable, i.e., it is not uniquely computable from the combination of the available data and assumptions about the generating mechanisms. This paper studies policy learning from data collected in various non-identifiable settings, i.e., (1) observational studies with unobserved confounding; (2) randomized experiments with partial observability; and (3) their combinations. We derive sharp, closed-form bounds over the conditional treatment effects from observational and experimental data. Based on these novel bounds, we further characterize the problem of safe policy learning and develop an algorithm that trains a policy from data that is guaranteed to achieve at least the performance of the baseline policy currently deployed. Finally, we validate our proposed algorithm on synthetic data and a large clinical trial, demonstrating that it guarantees safe behaviors and robust performance. \ No newline at end of file diff --git a/data/2024/aaai/Towards Squeezing-Averse Virtual Try-On via Sequential Deformation b/data/2024/aaai/Towards Squeezing-Averse Virtual Try-On via Sequential Deformation new file mode 100644 index 0000000000..a87d40e42b --- /dev/null +++ b/data/2024/aaai/Towards Squeezing-Averse Virtual Try-On via Sequential Deformation @@ -0,0 +1 @@ +In this paper, we first investigate a visual quality degradation problem observed in recent high-resolution virtual try-on approaches. We empirically find that the textures of clothes tend to be squeezed at the sleeve, as visualized in the upper row of Fig.1(a). A main reason for the issue is a gradient conflict between two popular losses, the Total Variation (TV) and adversarial losses. Specifically, the TV loss aims to disconnect boundaries between the sleeve and torso in a warped clothing mask, whereas the adversarial loss aims to combine them. Such contrary objectives feed misaligned gradients back into the cascaded appearance flow estimation, resulting in undesirable squeezing artifacts. To reduce this, we propose a Sequential Deformation (SD-VITON) that disentangles the appearance flow prediction layers into TV objective-dominant (TVOB) layers and a task-coexistence (TACO) layer. Specifically, we coarsely fit the clothes onto a human body via the TVOB layers, and then keep refining them via the TACO layer. In addition, the bottom row of Fig.1(a) shows a different type of squeezing artifact around the waist. To address it, we further propose to first warp the clothes into a tucked-out shirt style, and then partially erase the texture from the warped clothes without hurting the smoothness of the appearance flows. Experimental results show that our SD-VITON successfully resolves both types of artifacts and outperforms the baseline methods. Source code will be available at https://github.com/SHShim0513/SD-VITON.
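For readers unfamiliar with the Total Variation loss referenced in the SD-VITON abstract above, its standard anisotropic form over a warped clothing mask M is (the exact variant used in the paper may differ):

\[
\mathcal{L}_{\mathrm{TV}}(M) \;=\; \sum_{i,j} \bigl(\, \lvert M_{i+1,j} - M_{i,j} \rvert + \lvert M_{i,j+1} - M_{i,j} \rvert \,\bigr),
\]

which penalizes spatial variation of the mask; as the abstract explains, this objective pulls the warped mask in a different direction than the adversarial loss, producing the conflicting gradients that cause the squeezing artifacts.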
\ No newline at end of file diff --git a/data/2024/aaai/Towards Stability and Generalization Bounds in Decentralized Minibatch Stochastic Gradient Descent b/data/2024/aaai/Towards Stability and Generalization Bounds in Decentralized Minibatch Stochastic Gradient Descent new file mode 100644 index 0000000000..b7b11a1738 --- /dev/null +++ b/data/2024/aaai/Towards Stability and Generalization Bounds in Decentralized Minibatch Stochastic Gradient Descent @@ -0,0 +1 @@ +Decentralized Stochastic Gradient Descent (D-SGD) represents a communication-efficient approach tailored for extracting insights from vast, distributed datasets. Inspired by parallel optimization paradigms, the incorporation of minibatches serves to diminish variance, consequently expediting the optimization process. Nevertheless, to the best of our knowledge, the existing literature has not thoroughly explored the learning theory foundation of Decentralized Minibatch Stochastic Gradient Descent (DM-SGD). In this paper, we try to address this theoretical gap by investigating the generalization properties of DM-SGD. We establish sharper generalization bounds for the DM-SGD algorithm with (and without) replacement in (non)convex and (non)smooth cases. Moreover, our results consistently recover the results of Centralized Stochastic Gradient Descent (C-SGD). In addition, we derive generalization analysis for the Zero-Order (ZO) version of DM-SGD. \ No newline at end of file diff --git a/data/2024/aaai/Towards Transferable Adversarial Attacks with Centralized Perturbation b/data/2024/aaai/Towards Transferable Adversarial Attacks with Centralized Perturbation new file mode 100644 index 0000000000..2acd56a885 --- /dev/null +++ b/data/2024/aaai/Towards Transferable Adversarial Attacks with Centralized Perturbation @@ -0,0 +1 @@ +Adversarial transferability enables black-box attacks on unknown victim deep neural networks (DNNs), rendering attacks viable in real-world scenarios. Current transferable attacks create adversarial perturbation over the entire image, resulting in excessive noise that overfits the source model. Concentrating perturbation on dominant image regions that are model-agnostic is crucial to improving adversarial efficacy. However, limiting perturbation to local regions in the spatial domain proves inadequate in augmenting transferability. To this end, we propose a transferable adversarial attack with fine-grained perturbation optimization in the frequency domain, creating centralized perturbation. We devise a systematic pipeline to dynamically constrain perturbation optimization to dominant frequency coefficients. The constraint is optimized in parallel at each iteration, ensuring the directional alignment of perturbation optimization with model prediction. Our approach allows us to centralize perturbation towards sample-specific important frequency features, which are shared by DNNs, effectively mitigating source model overfitting. Experiments demonstrate that by dynamically centralizing perturbation on dominating frequency coefficients, crafted adversarial examples exhibit stronger transferability, allowing them to bypass various defenses.
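As a loose illustration of restricting a perturbation to dominant frequency coefficients, as described in the centralized-perturbation abstract above, the sketch below keeps only the largest-magnitude DCT coefficients per channel and zeroes the rest. The paper's dynamic, per-iteration optimization of this constraint is not modeled; the names and the keep_ratio value are assumptions.

    import numpy as np
    from scipy.fft import dctn, idctn

    def centralize(perturbation, keep_ratio=0.1):
        # perturbation: H x W x C array; keep the top fraction of DCT
        # coefficients by magnitude in each channel, discard the rest.
        out = np.zeros_like(perturbation, dtype=float)
        for c in range(perturbation.shape[-1]):
            coef = dctn(perturbation[..., c], norm="ortho")
            k = max(1, int(keep_ratio * coef.size))
            thresh = np.sort(np.abs(coef), axis=None)[-k]
            mask = np.abs(coef) >= thresh
            out[..., c] = idctn(coef * mask, norm="ortho")
        return out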
\ No newline at end of file diff --git a/data/2024/aaai/Towards Trustworthy Autonomous Systems via Conversations and Explanations b/data/2024/aaai/Towards Trustworthy Autonomous Systems via Conversations and Explanations new file mode 100644 index 0000000000..679769456d --- /dev/null +++ b/data/2024/aaai/Towards Trustworthy Autonomous Systems via Conversations and Explanations @@ -0,0 +1 @@ +Autonomous systems fulfil an increasingly important role in our societies, however, AI-powered systems have seen less success over the years, as they are expected to tackle a range of social, legal, or technological challenges and modern neural network-based AI systems cannot yet provide guarantees to many of these challenges. Particularly important is that these systems are black box decision makers, eroding human oversight, contestation, and agency. To address this particular concern, my thesis focuses on integrating social explainable AI with cognitive methods and natural language processing to shed light on the internal processes of autonomous systems in a way accessible to lay users. I propose a causal explanation generation model for decision-making called CEMA based on counterfactual simulations in multi-agent systems. I also plan to integrate CEMA with a broader natural language processing pipeline to support targeted and personalised explanations that address people's cognitive biases. I hope that my research will have a positive impact on the public acceptance of autonomous agents by building towards more trustworthy AI. \ No newline at end of file diff --git a/data/2024/aaai/Towards Trustworthy Deep Learning b/data/2024/aaai/Towards Trustworthy Deep Learning new file mode 100644 index 0000000000..2223dd605f --- /dev/null +++ b/data/2024/aaai/Towards Trustworthy Deep Learning @@ -0,0 +1,7 @@ +Deep neural networks (DNNs) have achieved unprecedented success across many scientific and engineering fields in the last decades. Despite its empirical success, unfortunately, recent studies have shown that there are various failure modes and blindspots in DNN models which may result in unexpected serious failures and potential harms, e.g. the existence of adversarial examples and small perturbations. This is not acceptable especially for safety critical and high stakes applications in the real-world, including healthcare, self-driving cars, aircraft control systems, hiring and malware detection protocols. Moreover, it has been challenging to understand why and when DNNs will fail due to their complicated structures and black-box behaviors. Lacking interpretability is one critical issue that may seriously hinder the deployment of DNNs in high-stake applications, which need interpretability to trust the prediction, to understand potential failures, and to be able to mitigate harms and eliminate biases in the model. + + +To make DNNs trustworthy and reliable for deployment, it is necessary and urgent to develop methods and tools that can (i) quantify and improve their robustness against adversarial and natural perturbations, and (ii) understand their underlying behaviors and further correct errors to prevent injuries and damages. These are the important first steps to enable Trustworthy AI and Trustworthy Machine Learning. In this talk, I will survey a series of research efforts in my lab contributed to tackling the grand challenges in (i) and (ii). 
In the first part of my talk, I will overview our research effort in Robust Machine Learning since 2017, where we have proposed the first attack-agnostic robustness evaluation metric, the first efficient robustness certification algorithms for various types of perturbations, and efficient robust learning algorithms ranging from supervised learning to deep reinforcement learning. + + +In the second part of my talk, I will survey a series of exciting results in my lab on accelerating interpretable machine learning and explainable AI. Specifically, I will show how we could bring interpretability into deep learning by leveraging recent advances in multi-modal models. I'll present recent works in our group on automatically dissecting neural networks with open-vocabulary concepts and designing interpretable neural networks without concept labels, and briefly overview our recent efforts on demystifying the black-box DNN training process, automated neuron explanations for Large Language Models, and the first robustness evaluation of a family of neuron-level interpretation techniques. \ No newline at end of file diff --git a/data/2024/aaai/Towards Understanding Future: Consistency Guided Probabilistic Modeling for Action Anticipation b/data/2024/aaai/Towards Understanding Future: Consistency Guided Probabilistic Modeling for Action Anticipation new file mode 100644 index 0000000000..6547f5f913 --- /dev/null +++ b/data/2024/aaai/Towards Understanding Future: Consistency Guided Probabilistic Modeling for Action Anticipation @@ -0,0 +1,7 @@ +Action anticipation aims to infer the action in the unobserved segment (future segment) from the observed segment (past segment). +Existing methods focus on learning key past semantics to predict the future, but they do not model the temporal continuity between the past and the future. However, past actions are always highly uncertain in anticipating the unobserved future. +The absence of temporal continuity smoothing in the video's past-and-future segments may result in an inconsistent anticipation of future action. +In this work, we aim to smooth the global semantic changes in the past and future segments. We propose a Consistency-guided Probabilistic Model (CPM), which focuses on learning the global temporal probabilistic consistency to inhibit unexpected temporal inconsistency. +The CPM is deployed on the Transformer architecture and includes three modules: future semantics estimation, global semantics estimation, and global distribution estimation, involving the learning of past-to-future semantics, past-and-future semantics, and semantic probabilistic distributions. +To achieve the smoothness of temporal continuity, we follow the principle of variational analysis and describe two probabilistic distributions, i.e., a past-aware distribution and a global-aware distribution, which help to estimate the evidence lower bound of future anticipation. +In this study, we maximize the evidence lower bound of future semantics by reducing the distribution distance between the above two distributions for model optimization. Extensive experiments demonstrate the effectiveness of our method, and CPM achieves state-of-the-art performance on Epic-Kitchen100, Epic-Kitchen55, and EGTEA-GAZE.
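One standard way to write the evidence lower bound that the CPM abstract above refers to, with a global-aware posterior q and a past-aware prior p over a latent variable z, is sketched below; the notation is ours rather than the paper's:

\[
\log p\bigl(y_{\text{future}} \mid x_{\text{past}}\bigr) \;\ge\; \mathbb{E}_{q(z \mid x_{\text{past}}, x_{\text{future}})}\bigl[\log p(y_{\text{future}} \mid z)\bigr] \;-\; \mathrm{KL}\bigl(q(z \mid x_{\text{past}}, x_{\text{future}}) \,\big\|\, p(z \mid x_{\text{past}})\bigr).
\]

Maximizing such a bound simultaneously improves future prediction and shrinks the distance between the two distributions, which matches the optimization described in the abstract.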
\ No newline at end of file diff --git a/data/2024/aaai/Towards a More Burkean Approach to Computational Social Choice b/data/2024/aaai/Towards a More Burkean Approach to Computational Social Choice new file mode 100644 index 0000000000..0edcb1d3bf --- /dev/null +++ b/data/2024/aaai/Towards a More Burkean Approach to Computational Social Choice @@ -0,0 +1 @@ +In the last few years, a lot of the activity of the computational social choice community has focused on novel mechanisms for reaching decisions by large groups of people. While this research makes meaningful scientific contributions, many of these mechanisms are not quite useful in realistic decision-making settings. Moreover, their radicalism ignores the centuries-old experience we have with large-scale human decision-making, and what it teaches us about what works. We believe it is important the community engage with mechanisms which are widely-used in the real world, as they may hold a key to a deeper understanding of how people reach decisions and the way that helps them do that productively. Moreover, letting the community bring its analysis and understanding to these will allow for algorithmic suggestions that have some chance of being implemented (and, thus, can contribute to the public debate on these topics). In particular, we highlight the relatively less-investigated role of parties and grouping of voters and candidates, and the role of executive capacity in analyzing decision-making structures. \ No newline at end of file diff --git a/data/2024/aaai/Towards a Theoretical Understanding of Why Local Search Works for Clustering with Fair-Center Representation b/data/2024/aaai/Towards a Theoretical Understanding of Why Local Search Works for Clustering with Fair-Center Representation new file mode 100644 index 0000000000..bc509ed872 --- /dev/null +++ b/data/2024/aaai/Towards a Theoretical Understanding of Why Local Search Works for Clustering with Fair-Center Representation @@ -0,0 +1,4 @@ +The representative k-median problem generalizes the classical clustering formulations in that it partitions the data points into several disjoint demographic groups and poses a lower-bound constraint on the number of opened facilities from each group, such that all the groups are fairly represented by the opened facilities. Due to its simplicity, the local-search heuristic that optimizes an initial solution by iteratively swapping at most a constant number of closed facilities for the same number of opened ones (denoted by the O(1)-swap heuristic) has been frequently used in the representative k-median problem. Unfortunately, despite its good performance exhibited in experiments, whether the O(1)-swap heuristic has provable approximation guarantees for the case where the number of groups is more than 2 remains an open question for a long time. As an answer to this question, we show that the O(1)-swap heuristic +(1) is guaranteed to yield a constant-factor approximation solution if the number of groups is a constant, and +(2) has an unbounded approximation ratio otherwise. +Our main technical contribution is a new approach for theoretically analyzing local-search heuristics, which derives the approximation ratio of the O(1)-swap heuristic via linearly combining the increased clustering costs induced by a set of hierarchically organized swaps. 
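A minimal sketch of the swap-based local search analyzed in the clustering abstract above is given below for the single-swap case; the clustering cost and the group lower-bound (fair-representation) constraints are left abstract, and all names are hypothetical.

    def local_search_1_swap(points, initial_centers, cost, feasible):
        # Repeatedly swap one opened facility for one closed facility whenever
        # the swap keeps the solution feasible (e.g. respects per-group lower
        # bounds) and strictly lowers the clustering cost.
        centers = set(initial_centers)
        improved = True
        while improved:
            improved = False
            for out in list(centers):
                for cand in points:
                    if cand in centers:
                        continue
                    trial = (centers - {out}) | {cand}
                    if feasible(trial) and cost(trial) < cost(centers):
                        centers, improved = trial, True
                        break
                if improved:
                    break
        return centers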
\ No newline at end of file diff --git a/data/2024/aaai/Towards a Transformer-Based Reverse Dictionary Model for Quality Estimation of Definitions (Student Abstract) b/data/2024/aaai/Towards a Transformer-Based Reverse Dictionary Model for Quality Estimation of Definitions (Student Abstract) new file mode 100644 index 0000000000..45eba6f9de --- /dev/null +++ b/data/2024/aaai/Towards a Transformer-Based Reverse Dictionary Model for Quality Estimation of Definitions (Student Abstract) @@ -0,0 +1 @@ +In the last years, several variants of transformers have emerged. In this paper, we compare different transformer-based models for solving the reverse dictionary task and explore their use in the context of a serious game called The Dictionary Game. \ No newline at end of file diff --git a/data/2024/aaai/Towards the Disappearing Truth: Fine-Grained Joint Causal Influences Learning with Hidden Variable-Driven Causal Hypergraphs in Time Series b/data/2024/aaai/Towards the Disappearing Truth: Fine-Grained Joint Causal Influences Learning with Hidden Variable-Driven Causal Hypergraphs in Time Series new file mode 100644 index 0000000000..33b819b654 --- /dev/null +++ b/data/2024/aaai/Towards the Disappearing Truth: Fine-Grained Joint Causal Influences Learning with Hidden Variable-Driven Causal Hypergraphs in Time Series @@ -0,0 +1 @@ +Causal discovery under Granger causality framework has yielded widespread concerns in time series analysis task. Nevertheless, most previous methods are unaware of the underlying causality disappearing problem, that is, certain weak causalities are less focusable and may be lost during the modeling process, thus leading to biased causal conclusions. Therefore, we propose to introduce joint causal influences (i.e., causal influences from the union of multiple variables) as additional causal indication information to help identify weak causalities. Further, to break the limitation of existing methods that implicitly and coarsely model joint causal influences, we propose a novel hidden variable-driven causal hypergraph neural network to meticulously explore the locality and diversity of joint causal influences, and realize its explicit and fine-grained modeling. Specifically, we introduce hidden variables to construct a causal hypergraph for explicitly characterizing various fine-grained joint causal influences. Then, we customize a dual causal information transfer mechanism (encompassing a multi-level causal path and an information aggregation path) to realize the free diffusion and meticulous aggregation of joint causal influences and facilitate its adaptive learning. Finally, we design a multi-view collaborative optimization constraint to guarantee the characterization diversity of causal hypergraph and capture remarkable forecasting relationships (i.e., causalities). Experiments are conducted to demonstrate the superiority of the proposed model. \ No newline at end of file diff --git a/data/2024/aaai/Towards the Robustness of Differentially Private Federated Learning b/data/2024/aaai/Towards the Robustness of Differentially Private Federated Learning new file mode 100644 index 0000000000..68118aa7ff --- /dev/null +++ b/data/2024/aaai/Towards the Robustness of Differentially Private Federated Learning @@ -0,0 +1 @@ +Robustness and privacy protection are two important factors of trustworthy federated learning (FL). 
Existing FL works usually secure data privacy by perturbing local model gradients via the differential privacy (DP) technique, or defend against poisoning attacks by filtering out local gradients that lie in the outlier region of the gradient distribution before aggregation. However, these two issues are often addressed independently in existing works, and how to secure federated learning in terms of both privacy and robustness still needs further exploration. In this paper, we unveil that although DP noisy perturbation can improve learning robustness, DP-FL frameworks are not inherently robust and are vulnerable to a carefully designed attack method. Furthermore, we reveal that it is challenging for existing robust FL methods to defend against attacks on DP-FL. This can be attributed to the fact that the local gradients of DP-FL are perturbed by random noise, and the selected central gradients inevitably incorporate a higher proportion of poisoned gradients compared to conventional FL. To address this problem, we further propose a new defense method for DP-FL (named Robust-DPFL), which can effectively distinguish poisoned and clean local gradients in DP-FL and robustly update the global model. Experiments on three benchmark datasets demonstrate that baseline methods cannot ensure task accuracy, data privacy, and robustness simultaneously, while Robust-DPFL can effectively enhance the privacy protection and robustness of federated learning while maintaining task performance. \ No newline at end of file diff --git a/data/2024/aaai/TraceEvader: Making DeepFakes More Untraceable via Evading the Forgery Model Attribution b/data/2024/aaai/TraceEvader: Making DeepFakes More Untraceable via Evading the Forgery Model Attribution new file mode 100644 index 0000000000..eece5fe980 --- /dev/null +++ b/data/2024/aaai/TraceEvader: Making DeepFakes More Untraceable via Evading the Forgery Model Attribution @@ -0,0 +1 @@ +In recent years, DeepFakes have posed severe threats and concerns to both individuals and celebrities, as realistic DeepFakes facilitate the spread of disinformation. Model attribution techniques aim at attributing the adopted forgery models of DeepFakes for provenance purposes and providing explainable results for DeepFake forensics. However, existing model attribution techniques rely on the traces left during DeepFake creation, and can become futile if such traces are disrupted. We observe that certain traces used for model attribution appear in both the high-frequency and low-frequency domains and play divergent roles in model attribution. Motivated by this observation, in this work we propose, for the first time, a novel training-free evasion attack, TraceEvader, in the most practical non-box setting. Specifically, TraceEvader injects universal imitated traces learned from wild DeepFakes into the high-frequency component and introduces adversarial blur into the low-frequency component, where the added distortion confuses the extraction of the traces used for model attribution. A comprehensive evaluation on 4 state-of-the-art (SOTA) model attribution techniques and fake images generated by 8 generative models, including generative adversarial networks (GANs) and diffusion models (DMs), demonstrates the effectiveness of our method. Overall, our TraceEvader achieves the highest average attack success rate of 79% and is also robust against image transformations and dedicated denoising techniques, where the average attack success rate remains around 75%. 
Our TraceEvader confirms the limitations of current model attribution techniques and calls the attention of DeepFake researchers and practitioners to the need for more robust model attribution techniques. \ No newline at end of file diff --git a/data/2024/aaai/Trade-Offs in Fine-Tuned Diffusion Models between Accuracy and Interpretability b/data/2024/aaai/Trade-Offs in Fine-Tuned Diffusion Models between Accuracy and Interpretability new file mode 100644 index 0000000000..0d1b61550c --- /dev/null +++ b/data/2024/aaai/Trade-Offs in Fine-Tuned Diffusion Models between Accuracy and Interpretability @@ -0,0 +1 @@ +Recent advancements in diffusion models have significantly impacted the trajectory of generative machine learning research, with many adopting the strategy of fine-tuning pre-trained models using domain-specific text-to-image datasets. Notably, this method has been readily employed for medical applications, such as X-ray image synthesis, leveraging the plethora of associated radiology reports. Yet, a prevailing concern is the lack of assurance on whether these models genuinely comprehend their generated content. With the evolution of text-conditional image generation, these models have grown potent enough to facilitate object localization scrutiny. Our research underscores this advancement in the critical realm of medical imaging, emphasizing the crucial role of interpretability. We further unravel a consequential trade-off between image fidelity – as gauged by conventional metrics – and model interpretability in generative diffusion models. Specifically, the adoption of learnable text encoders when fine-tuning results in diminished interpretability. Our in-depth exploration uncovers the underlying factors responsible for this divergence. Consequently, we present a set of design principles for the development of truly interpretable generative models. Code is available at https://github.com/MischaD/chest-distillation. \ No newline at end of file diff --git a/data/2024/aaai/Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding b/data/2024/aaai/Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding new file mode 100644 index 0000000000..aebf3fb5e6 --- /dev/null +++ b/data/2024/aaai/Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding @@ -0,0 +1 @@ +Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics that asks us to compute collision-free paths for a team of agents, all moving across a shared map. Although many works appear on this topic, all current algorithms struggle as the number of agents grows. The principal reason is that existing approaches typically plan free-flow optimal paths, which creates congestion. To tackle this issue, we propose a new approach for MAPF where agents are guided to their destinations by following congestion-avoiding paths. We evaluate the idea in two large-scale settings: one-shot MAPF, where each agent has a single destination, and lifelong MAPF, where agents are continuously assigned new destinations. Empirically, we report large improvements in solution quality for one-shot MAPF and in overall throughput for lifelong MAPF. 
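The congestion-avoiding guidance described in the MAPF abstract above can be illustrated with a single-agent search whose edge costs are inflated by an estimate of traffic on each edge, so agents are steered away from crowded corridors rather than along free-flow shortest paths. This is a minimal Python sketch of the general idea, not the paper's algorithm; graph, congestion, and alpha are assumed inputs.

import heapq

def congestion_aware_path(graph, start, goal, congestion, alpha=1.0):
    # graph[u] maps neighbour v -> base edge length; congestion[(u, v)] is an
    # assumed per-edge traffic estimate, weighted by alpha.
    dist = {start: 0.0}
    parent = {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, length in graph[u].items():
            nd = d + length + alpha * congestion.get((u, v), 0.0)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(pq, (nd, v))
    if goal != start and goal not in parent:
        return None  # goal unreachable in the given graph
    path, node = [], goal
    while node != start:
        path.append(node)
        node = parent[node]
    path.append(start)
    return list(reversed(path))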
\ No newline at end of file diff --git a/data/2024/aaai/Training-Free Quantum Architecture Search b/data/2024/aaai/Training-Free Quantum Architecture Search new file mode 100644 index 0000000000..a178b1d20a --- /dev/null +++ b/data/2024/aaai/Training-Free Quantum Architecture Search @@ -0,0 +1 @@ +Variational quantum algorithm (VQA) derives advantages from its error resilience and high flexibility in quantum resource requirements, rendering it broadly applicable in the noisy intermediate-scale quantum era. As the performance of VQA highly relies on the structure of the parameterized quantum circuit, it is worthwhile to propose quantum architecture search (QAS) algorithms to automatically search for high-performance circuits. Nevertheless, existing QAS methods are time-consuming, requiring circuit training to assess circuit performance. This study pioneers training-free QAS by utilizing two training-free proxies to rank quantum circuits, in place of the expensive circuit training employed in conventional QAS. Taking into account the precision and computational overhead of the path-based and expressibility-based proxies, we devise a two-stage progressive training-free QAS (TF-QAS). Initially, directed acyclic graphs (DAGs) are employed for circuit representation, and a zero-cost proxy based on the number of paths in the DAG is designed to filter out a substantial portion of unpromising circuits. Subsequently, an expressibility-based proxy, finely reflecting circuit performance, is employed to identify high-performance circuits from the remaining candidates. These proxies evaluate circuit performance without circuit training, resulting in a remarkable reduction in computational cost compared to current training-based QAS methods. Simulations on three VQE tasks demonstrate that TF-QAS achieves a substantial enhancement of sampling efficiency ranging from 5 to 57 times compared to state-of-the-art QAS, while also being 6 to 17 times faster. \ No newline at end of file diff --git a/data/2024/aaai/TransGOP: Transformer-Based Gaze Object Prediction b/data/2024/aaai/TransGOP: Transformer-Based Gaze Object Prediction new file mode 100644 index 0000000000..1bab2ee1a8 --- /dev/null +++ b/data/2024/aaai/TransGOP: Transformer-Based Gaze Object Prediction @@ -0,0 +1 @@ +Gaze object prediction aims to predict the location and category of the object that is watched by a human. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors can predict more accurate object location for dense objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer can help to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces Transformer into the fields of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the location of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism to let the queries of the gaze autoencoder learn the global-memory position knowledge from the object detector. 
Finally, to make the whole framework end-to-end trained, we propose a Gaze Box loss to jointly optimize the object detector and gaze regressor by enhancing the gaze heatmap energy in the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that our TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git. \ No newline at end of file diff --git a/data/2024/aaai/Transfer and Alignment Network for Generalized Category Discovery b/data/2024/aaai/Transfer and Alignment Network for Generalized Category Discovery new file mode 100644 index 0000000000..b9a3849b21 --- /dev/null +++ b/data/2024/aaai/Transfer and Alignment Network for Generalized Category Discovery @@ -0,0 +1,3 @@ +Generalized Category Discovery (GCD) is a crucial real-world task that aims to recognize both known and novel categories from an unlabeled dataset by leveraging another labeled dataset with only known categories. Despite the improved performance on known categories, current methods perform poorly on novel categories. We attribute the poor performance to two reasons: biased knowledge transfer between labeled and unlabeled data and noisy representation learning on the unlabeled data. The former leads to unreliable estimation of learning targets for novel categories and the latter hinders models from learning discriminative features. To mitigate these two issues, we propose a Transfer and Alignment Network (TAN), which incorporates two knowledge transfer mechanisms to calibrate the biased knowledge and two feature alignment mechanisms to learn discriminative features. +Specifically, we model different categories with prototypes and transfer the prototypes in labeled data to correct model bias towards known categories. On the one hand, we pull instances with known categories in unlabeled data closer to these prototypes to form more compact clusters and avoid boundary overlap between known and novel categories. On the other hand, we use these prototypes to calibrate noisy prototypes estimated from unlabeled data based on category similarities, which allows for more accurate estimation of prototypes for novel categories that can be used as reliable learning targets later. After knowledge transfer, we further propose two feature alignment mechanisms to acquire both instance- and category-level knowledge from unlabeled data by aligning instance features with both augmented features and the calibrated prototypes, which can boost model performance on both known and novel categories with less noise. Experiments on three benchmark datasets show that our model outperforms SOTA methods, especially on novel categories. Theoretical analysis is provided for an in-depth understanding of our model in general. +Our code and data are available at https://github.com/Lackel/TAN. 
\ No newline at end of file diff --git a/data/2024/aaai/Transferable Adversarial Attacks for Object Detection Using Object-Aware Significant Feature Distortion b/data/2024/aaai/Transferable Adversarial Attacks for Object Detection Using Object-Aware Significant Feature Distortion new file mode 100644 index 0000000000..3db4bf5999 --- /dev/null +++ b/data/2024/aaai/Transferable Adversarial Attacks for Object Detection Using Object-Aware Significant Feature Distortion @@ -0,0 +1 @@ +Transferable black-box adversarial attacks against classifiers by disturbing the intermediate-layer features have been extensively studied in recent years. However, these methods have not yet achieved satisfactory performances when directly applied to object detectors. This is largely because the features of detectors are fundamentally different from that of the classifiers. In this study, we propose a simple but effective method to improve the transferability of adversarial examples for object detectors by leveraging the properties of spatial consistency and limited equivariance of object detectors’ features. Specifically, we combine a novel loss function and deliberately designed data augmentation to distort the backbone features of object detectors by suppressing significant features corresponding to objects and amplifying the surrounding vicinal features corresponding to object boundaries. As such the target object and background area on the generated adversarial samples are more likely to be confused by other detectors. Extensive experimental results show that our proposed method achieves state-of-the-art black-box transferability for untargeted attacks on various models, including one/two-stage, CNN/Transformer-based, and anchor-free/anchor-based detectors. \ No newline at end of file diff --git a/data/2024/aaai/Transferable Video Moment Localization by Moment-Guided Query Prompting b/data/2024/aaai/Transferable Video Moment Localization by Moment-Guided Query Prompting new file mode 100644 index 0000000000..1c61fba4fb --- /dev/null +++ b/data/2024/aaai/Transferable Video Moment Localization by Moment-Guided Query Prompting @@ -0,0 +1 @@ +Video moment localization stands as a crucial task within the realm of computer vision, entailing the identification of temporal moments in untrimmed videos that bear semantic relevance to the supplied natural language queries. This work delves into a relatively unexplored facet of the task: the transferability of video moment localization models. This concern is addressed by evaluating moment localization models within a cross-domain transfer setting. In this setup, we curate multiple datasets distinguished by substantial domain gaps. The model undergoes training on one of these datasets, while validation and testing are executed using the remaining datasets. To confront the challenges inherent in this scenario, we draw inspiration from the recently introduced large-scale pre-trained vision-language models. Our focus is on exploring how the strategic utilization of these resources can bolster the capabilities of a model designed for video moment localization. Nevertheless, the distribution of language queries in video moment localization usually diverges from the text used by pre-trained models, exhibiting distinctions in aspects such as length, content, expression, and more. To mitigate the gap, this work proposes a Moment-Guided Query Prompting (MGQP) method for video moment localization. 
Our key idea is to generate multiple distinct and complementary prompt primitives through stratification of the original queries. Our approach is comprised of a prompt primitive constructor, a multimodal prompt refiner, and a holistic prompt incorporator. We carry out extensive experiments on Charades-STA, TACoS, DiDeMo, and YouCookII datasets, and investigate the efficacy of the proposed method using various pre-trained models, such as CLIP, ActionCLIP, CLIP4Clip, and VideoCLIP. The experimental results demonstrate the effectiveness of our proposed method. \ No newline at end of file diff --git a/data/2024/aaai/Transformer-Based No-Reference Image Quality Assessment via Supervised Contrastive Learning b/data/2024/aaai/Transformer-Based No-Reference Image Quality Assessment via Supervised Contrastive Learning new file mode 100644 index 0000000000..c32fadd9c1 --- /dev/null +++ b/data/2024/aaai/Transformer-Based No-Reference Image Quality Assessment via Supervised Contrastive Learning @@ -0,0 +1 @@ +Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Due to the powerful feature extraction ability, existing Convolution Neural Network (CNN) and Transformers based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic distortion datasets. To further improve NR-IQA performance, in this paper, a novel supervised contrastive learning (SCL) and Transformer-based NR-IQA model SaTQA is proposed. We first train a model on a large-scale synthetic dataset by SCL (no image subjective score is required) to extract degradation features of images with various distortion types and levels. To further extract distortion information from images, we propose a backbone network incorporating the Multi-Stream Block (MSB) by combining the CNN inductive bias and Transformer long-term dependence modeling capability. Finally, we propose the Patch Attention Block (PAB) to obtain the final distorted image quality score by fusing the degradation features learned from contrastive learning with the perceptual distortion information extracted by the backbone network. Experimental results on six standard IQA datasets show that SaTQA outperforms the state-of-the-art methods for both synthetic and authentic datasets. Code is available at https://github.com/I2-Multimedia-Lab/SaTQA. \ No newline at end of file diff --git a/data/2024/aaai/Transformer-Based Selective Super-resolution for Efficient Image Refinement b/data/2024/aaai/Transformer-Based Selective Super-resolution for Efficient Image Refinement new file mode 100644 index 0000000000..cb67a8b04a --- /dev/null +++ b/data/2024/aaai/Transformer-Based Selective Super-resolution for Efficient Image Refinement @@ -0,0 +1 @@ +Conventional super-resolution methods suffer from two drawbacks: substantial computational cost in upscaling an entire large image, and the introduction of extraneous or potentially detrimental information for downstream computer vision tasks during the refinement of the background. To solve these issues, we propose a novel transformer-based algorithm, Selective Super-Resolution (SSR), which partitions images into non-overlapping tiles, selects tiles of interest at various scales with a pyramid architecture, and exclusively reconstructs these selected tiles with deep features. 
Experimental results on three datasets demonstrate the efficiency and robust performance of our approach for super-resolution. Compared to the state-of-the-art methods, the FID score is reduced from 26.78 to 10.41 with 40% reduction in computation cost for the BDD100K dataset. \ No newline at end of file diff --git a/data/2024/aaai/Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification b/data/2024/aaai/Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification new file mode 100644 index 0000000000..ffe93e9cd0 --- /dev/null +++ b/data/2024/aaai/Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification @@ -0,0 +1 @@ +Pathological images play a vital role in clinical cancer diagnosis. Computer-aided diagnosis utilized on digital Whole Slide Images (WSIs) has been widely studied. The major challenge of using deep learning models for WSI analysis is the huge size of WSI images and existing methods struggle between end-to-end learning and proper modeling of contextual information. Most state-of-the-art methods utilize a two-stage strategy, in which they use a pre-trained model to extract features of small patches cut from a WSI and then input these features into a classification model. These methods can not perform end-to-end learning and consider contextual information at the same time. To solve this problem, we propose a framework that models a WSI as a pathologist's observing video and utilizes Transformer to process video clips with a divide-and-conquer strategy, which helps achieve both context-awareness and end-to-end learning. Extensive experiments on three public WSI datasets show that our proposed method outperforms existing SOTA methods in both WSI classification and positive region detection. \ No newline at end of file diff --git a/data/2024/aaai/Transformer-Empowered Multi-Modal Item Embedding for Enhanced Image Search in E-commerce b/data/2024/aaai/Transformer-Empowered Multi-Modal Item Embedding for Enhanced Image Search in E-commerce new file mode 100644 index 0000000000..4472aeb16b --- /dev/null +++ b/data/2024/aaai/Transformer-Empowered Multi-Modal Item Embedding for Enhanced Image Search in E-commerce @@ -0,0 +1 @@ +Over the past decade, significant advances have been made in the field of image search for e-commerce applications. Traditional image-to-image retrieval models, which focus solely on image details such as texture, tend to overlook useful semantic information contained within the images. As a result, the retrieved products might possess similar image details, but fail to fulfil the user's search goals. Moreover, the use of image-to-image retrieval models for products containing multiple images results in significant online product feature storage overhead and complex mapping implementations. In this paper, we report the design and deployment of the proposed Multi-modal Item Embedding Model (MIEM) to address these limitations. It is capable of utilizing both textual information and multiple images about a product to construct meaningful product features. By leveraging semantic information from images, MIEM effectively supplements the image search process, improving the overall accuracy of retrieval results. MIEM has become an integral part of the Shopee image search platform. 
Since its deployment in March 2023, it has achieved a remarkable 9.90% increase in terms of clicks per user and a 4.23% boost in terms of orders per user for the image search feature on the Shopee e-commerce platform. \ No newline at end of file diff --git a/data/2024/aaai/Transforming Healthcare: A Comprehensive Approach to Mitigating Bias and Fostering Empathy through AI-Driven Augmented Reality b/data/2024/aaai/Transforming Healthcare: A Comprehensive Approach to Mitigating Bias and Fostering Empathy through AI-Driven Augmented Reality new file mode 100644 index 0000000000..99e5871bb5 --- /dev/null +++ b/data/2024/aaai/Transforming Healthcare: A Comprehensive Approach to Mitigating Bias and Fostering Empathy through AI-Driven Augmented Reality @@ -0,0 +1 @@ +The integration of Artificial Intelligence (AI) into Augmented Reality (AR) for medical applications is propelled by the aim to address evident healthcare disparities. Certain communities have encountered disparities in medical diagnoses, exemplified by Black individuals exhibiting a 2.4 times higher likelihood of schizophrenia diagnosis compared to their white counterparts (Faber et al., 2023). These disparities often arise from structured interview assessments overlooking cultural nuances, resulting in increased misdiagnosis rates. This study leverages AI and AR to develop unbiased diagnostic tools and enhance empathy in healthcare professionals' training. Uniquely prioritizing the reduction of biased language and the fostering of empathy through AI-driven Natural Language Processing (NLP) and AI-driven virtual patients, the research aims to enhance diagnostic accuracy while promoting cultural sensitivity among healthcare professionals. Aligned with broader goals of achieving equitable healthcare and reducing disparities, the evaluation involves pre- and post-training assessments to measure language improvements and empathy enhancements. Successful implementation could lead to a more equitable healthcare landscape, fostering trust in AI-driven systems and ensuring fairer medical care for diverse communities. \ No newline at end of file diff --git a/data/2024/aaai/Transient Glimpses: Unveiling Occluded Backgrounds through the Spike Camera b/data/2024/aaai/Transient Glimpses: Unveiling Occluded Backgrounds through the Spike Camera new file mode 100644 index 0000000000..92e594769d --- /dev/null +++ b/data/2024/aaai/Transient Glimpses: Unveiling Occluded Backgrounds through the Spike Camera @@ -0,0 +1 @@ +The de-occlusion problem, involving extracting clear background images by removing foreground occlusions, holds significant practical importance but poses considerable challenges. Most current research predominantly focuses on generating discrete images from calibrated camera arrays, but this approach often struggles with dense occlusions and fast motions due to limited perspectives and motion blur. To overcome these limitations, an effective solution requires the integration of multi-view visual information. The spike camera, as an innovative neuromorphic sensor, shows promise with its ultra-high temporal resolution and dynamic range. In this study, we propose a novel approach that utilizes a single spike camera for continuous multi-view imaging to address occlusion removal. By rapidly moving the spike camera, we capture a dense stream of spikes from occluded scenes. 
Our model, SpkOccNet, processes these spikes by integrating multi-view spatial-temporal information via long-short-window feature extractor (LSW) and employs a novel cross-view mutual attention-based module (CVA) for effective fusion and refinement. Additionally, to facilitate research in occlusion removal, we introduce the S-OCC dataset, which consists of real-world spike-based data. Experimental results demonstrate the efficiency and generalization capabilities of our model in effectively removing dense occlusions across diverse scenes. Public project page: https://github.com/Leozhangjiyuan/SpikeDeOcclusion. \ No newline at end of file diff --git a/data/2024/aaai/Transition-Informed Reinforcement Learning for Large-Scale Stackelberg Mean-Field Games b/data/2024/aaai/Transition-Informed Reinforcement Learning for Large-Scale Stackelberg Mean-Field Games new file mode 100644 index 0000000000..6d455e4b6b --- /dev/null +++ b/data/2024/aaai/Transition-Informed Reinforcement Learning for Large-Scale Stackelberg Mean-Field Games @@ -0,0 +1 @@ +Many real-world scenarios including fleet management and Ad auctions can be modeled as Stackelberg mean-field games (SMFGs) where a leader aims to incentivize a large number of homogeneous self-interested followers to maximize her utility. Existing works focus on cases with a small number of heterogeneous followers, e.g., 5-10, and suffer from scalability issue when the number of followers increases. There are three major challenges in solving large-scale SMFGs: i) classical methods based on solving differential equations fail as they require exact dynamics parameters, ii) learning by interacting with environment is data-inefficient, and iii) complex interaction between the leader and followers makes the learning performance unstable. We address these challenges through transition-informed reinforcement learning. Our main contributions are threefold: i) we first propose an RL framework, the Stackelberg mean-field update, to learn the leader's policy without priors of the environment, ii) to improve the data efficiency and accelerate the learning process, we then propose the Transition-Informed Reinforcement Learning (TIRL) by leveraging the instantiated empirical Fokker-Planck equation, and iii) we develop a regularized TIRL by employing various regularizers to alleviate the sensitivity of the learning performance to the initialization of the leader's policy. Extensive experiments on fleet management and food gathering demonstrate that our approach can scale up to 100,000 followers and significantly outperform existing baselines. \ No newline at end of file diff --git a/data/2024/aaai/Transitivity-Preserving Graph Representation Learning for Bridging Local Connectivity and Role-Based Similarity b/data/2024/aaai/Transitivity-Preserving Graph Representation Learning for Bridging Local Connectivity and Role-Based Similarity new file mode 100644 index 0000000000..d11c9cf3aa --- /dev/null +++ b/data/2024/aaai/Transitivity-Preserving Graph Representation Learning for Bridging Local Connectivity and Role-Based Similarity @@ -0,0 +1 @@ +Graph representation learning (GRL) methods, such as graph neural networks and graph transformer models, have been successfully used to analyze graph-structured data, mainly focusing on node classification and link prediction tasks. However, the existing studies mostly only consider local connectivity while ignoring long-range connectivity and the roles of nodes. 
In this paper, we propose Unified Graph Transformer Networks (UGT) that effectively integrate local and global structural information into fixed-length vector representations. First, UGT learns local structure by identifying the local sub-structures and aggregating features of the k-hop neighborhoods of each node. Second, we construct virtual edges, bridging distant nodes with structural similarity to capture the long-range dependencies. Third, UGT learns unified representations through self-attention, encoding structural distance and p-step transition probability between node pairs. Furthermore, we propose a self-supervised learning task that effectively learns transition probability to fuse local and global structural features, which could then be transferred to other downstream tasks. Experimental results on real-world benchmark datasets over various downstream tasks showed that UGT significantly outperformed baselines that consist of state-of-the-art models. In addition, UGT reaches the third-order Weisfeiler-Lehman power to distinguish non-isomorphic graph pairs. \ No newline at end of file diff --git a/data/2024/aaai/Translate Meanings, Not Just Words: IdiomKB's Role in Optimizing Idiomatic Translation with Language Models b/data/2024/aaai/Translate Meanings, Not Just Words: IdiomKB's Role in Optimizing Idiomatic Translation with Language Models new file mode 100644 index 0000000000..227601bca7 --- /dev/null +++ b/data/2024/aaai/Translate Meanings, Not Just Words: IdiomKB's Role in Optimizing Idiomatic Translation with Language Models @@ -0,0 +1 @@ +To translate well, machine translation (MT) systems and general-purposed language models (LMs) need a deep understanding of both source and target languages and cultures. Therefore, idioms, with their non-compositional nature, pose particular challenges for Transformer-based systems, as literal translations often miss the intended meaning. Traditional methods, which replace idioms using existing knowledge bases (KBs), often lack scale and context-awareness. Addressing these challenges, our approach prioritizes context-awareness and scalability, allowing for offline storage of idioms in a manageable KB size. This ensures efficient serving with smaller models and provides a more comprehensive understanding of idiomatic expressions. We introduce a multilingual idiom KB (IdiomKB) developed using large LMs to address this. This KB facilitates better translation by smaller models, such as BLOOMZ (7.1B), Alpaca (7B), and InstructGPT (6.7B), by retrieving idioms' figurative meanings. We present a novel, GPT-4-powered metric for human-aligned evaluation, demonstrating that IdiomKB considerably boosts model performance. Human evaluations further validate our KB's quality. \ No newline at end of file diff --git a/data/2024/aaai/Transportable Representations for Domain Generalization b/data/2024/aaai/Transportable Representations for Domain Generalization new file mode 100644 index 0000000000..88c10dc57e --- /dev/null +++ b/data/2024/aaai/Transportable Representations for Domain Generalization @@ -0,0 +1 @@ +One key assumption in machine learning literature is that the testing and training data come from the same distribution, which is often violated in practice. The anchors that allow generalizations to take place are causal, and provenient in terms of the stability and modularity of the mechanisms underlying the system of variables. 
Building on the theory of causal transportability, we define the notion of ``transportable representations", and show that these representations are suitable candidates for the domain generalization task. Specifically, considering that the graphical assumptions about the underlying system are provided, the transportable representations can be characterized accordingly, and the distribution of label conditioned on the representation can be computed in terms of the source distributions. Finally, we relax the assumption of having access to the underlying graph by proving a graphical-invariance duality theorem, which delineates certain probabilistic invariances present in the source data as a sound and complete criterion for generalizable classification. Our findings provide a unifying theoretical basis for several existing approaches to the domain generalization problem. \ No newline at end of file diff --git a/data/2024/aaai/Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation b/data/2024/aaai/Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation new file mode 100644 index 0000000000..3fdedad5f7 --- /dev/null +++ b/data/2024/aaai/Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation @@ -0,0 +1 @@ +Object detection in low-light scenarios has attracted much attention in the past few years. A mainstream and representative scheme introduces enhancers as the pre-processing for regular detectors. However, because of the disparity in task objectives between the enhancer and detector, this paradigm cannot shine at its best ability. In this work, we try to arouse the potential of enhancer + detector. Different from existing works, we extend the illumination-based enhancers (our newly designed or existing) as a scene decomposition module, whose removed illumination is exploited as the auxiliary in the detector for extracting detection-friendly features. A semantic aggregation module is further established for integrating multi-scale scene-related semantic information in the context space. Actually, our built scheme successfully transforms the "trash" (i.e., the ignored illumination in the detector) into the "treasure" for the detector. Plenty of experiments are conducted to reveal our superiority against other state-of-the-art methods. The code will be public if it is accepted. \ No newline at end of file diff --git a/data/2024/aaai/Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization b/data/2024/aaai/Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization new file mode 100644 index 0000000000..6b89fa62f3 --- /dev/null +++ b/data/2024/aaai/Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization @@ -0,0 +1 @@ +While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein. Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences, recombination of mutations, and running new rounds of screening. To enhance the efficiency of such a process, we propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence with the guidance of a bandit machine learning model. 
Under simplified assumptions and a Gaussian Process prior, we provide theoretical analysis and a Bayesian regret bound, demonstrating that the method can efficiently discover a near-optimal design. The full algorithm is compatible with a suite of randomized tree search heuristics, machine learning models, pre-trained embeddings, and bandit techniques. We test various instances of the algorithm across benchmark protein datasets using simulated screens. Experimental results demonstrate that the algorithm is sample-efficient and diversity-promoting, and is able to find top designs using reasonably small mutation counts. \ No newline at end of file diff --git a/data/2024/aaai/Tree-of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models b/data/2024/aaai/Tree-of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models new file mode 100644 index 0000000000..598fd8f25d --- /dev/null +++ b/data/2024/aaai/Tree-of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) have recently demonstrated remarkable performance across various Natural Language Processing tasks. In the field of multi-hop reasoning, the Chain-of-Thought (CoT) prompting method has emerged as a paradigm, using curated stepwise reasoning demonstrations to enhance LLMs' ability to reason and produce coherent reasoning paths. To ensure the accuracy, reliability, and traceability of the generated answers, many studies have incorporated information retrieval (IR) to provide LLMs with external knowledge. However, existing CoT-with-IR methods decompose questions into sub-questions based on a single compositionality type, which limits their effectiveness for questions involving multiple compositionality types. Additionally, these methods suffer from inefficient retrieval, as complex questions often contain abundant information, leading to the retrieval of irrelevant information inconsistent with the query's intent. In this work, we propose a novel question decomposition framework called TRQA for multi-hop question answering, which addresses these limitations. Our framework introduces a reasoning tree (RT) to represent the structure of complex questions. It consists of four components: the Reasoning Tree Constructor (RTC), the Question Generator (QG), the Retrieval and LLM Interaction Module (RAIL), and the Answer Aggregation Module (AAM). Specifically, the RTC predicts diverse sub-question structures to construct the reasoning tree, allowing a more comprehensive representation of complex questions. The QG generates sub-questions for the leaf nodes in the reasoning tree, and we explore two methods for QG: prompt-based and T5-based approaches. The IR module retrieves documents aligned with the sub-questions, while the LLM formulates answers based on the retrieved information. Finally, the AAM aggregates answers along the reasoning tree, producing a definitive response from bottom to top. 
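A small Python sketch of the bottom-up traversal implied by the TRQA abstract above: leaf sub-questions are answered from retrieved documents, and each internal node aggregates its children's answers into the parent question. The callables retrieve and ask_llm are hypothetical stand-ins for the retrieval component and the LLM; this is an editorial illustration, not the authors' implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class RTNode:
    question: str
    children: List["RTNode"] = field(default_factory=list)

def answer_tree(node, retrieve, ask_llm):
    if not node.children:
        # Leaf: retrieve evidence for the sub-question, then answer it.
        docs = retrieve(node.question)
        return ask_llm(question=node.question, context=docs)
    # Internal node: answer the children first, then aggregate bottom-up.
    child_answers = [answer_tree(c, retrieve, ask_llm) for c in node.children]
    context = "\n".join(f"{c.question} -> {a}" for c, a in zip(node.children, child_answers))
    return ask_llm(question=node.question, context=context)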
\ No newline at end of file diff --git a/data/2024/aaai/Trend-Aware Supervision: On Learning Invariance for Semi-supervised Facial Action Unit Intensity Estimation b/data/2024/aaai/Trend-Aware Supervision: On Learning Invariance for Semi-supervised Facial Action Unit Intensity Estimation new file mode 100644 index 0000000000..afe2663971 --- /dev/null +++ b/data/2024/aaai/Trend-Aware Supervision: On Learning Invariance for Semi-supervised Facial Action Unit Intensity Estimation @@ -0,0 +1 @@ +With the increasing need for facial behavior analysis, semi-supervised AU intensity estimation using only keyframe annotations has emerged as a practical and effective solution to relieve the burden of annotation. However, the lack of annotations makes the spurious correlation problem caused by AU co-occurrences and subject variation much more prominent, leading to non-robust intensity estimation that is entangled among AUs and biased among subjects. We observe that trend information inherent in keyframe annotations could act as extra supervision, and that raising awareness of AU-specific facial appearance changing trends during training is the key to learning invariant AU-specific features. To this end, we propose Trend-Aware Supervision (TAS), which pursues three kinds of trend awareness, including intra-trend ranking awareness, intra-trend speed awareness, and inter-trend subject awareness. TAS alleviates the spurious correlation problem by raising trend awareness during training to learn AU-specific features that represent the corresponding facial appearance changes, thereby achieving intensity estimation invariance. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of each kind of awareness. Moreover, under trend-aware supervision, performance can be improved without extra computational or storage costs during inference. \ No newline at end of file diff --git a/data/2024/aaai/TriSampler: A Better Negative Sampling Principle for Dense Retrieval b/data/2024/aaai/TriSampler: A Better Negative Sampling Principle for Dense Retrieval new file mode 100644 index 0000000000..56f5d32700 --- /dev/null +++ b/data/2024/aaai/TriSampler: A Better Negative Sampling Principle for Dense Retrieval @@ -0,0 +1 @@ +Negative sampling stands as a pivotal technique in dense retrieval, essential for training effective retrieval models and significantly impacting retrieval performance. While existing negative sampling methods have made commendable progress by leveraging hard negatives, a comprehensive guiding principle for constructing negative candidates and designing negative sampling distributions is still lacking. To bridge this gap, we embark on a theoretical analysis of negative sampling in dense retrieval. This exploration culminates in the unveiling of the quasi-triangular principle, a novel framework that elucidates the triangular-like interplay between query, positive document, and negative document. Fueled by this guiding principle, we introduce TriSampler, a straightforward yet highly effective negative sampling method. The key point of TriSampler lies in its ability to selectively sample more informative negatives within a prescribed constrained region. Experimental evaluation shows that TriSampler consistently attains superior retrieval performance across a diverse set of representative retrieval models. 
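In the spirit of the constrained-region sampling that the TriSampler abstract above describes, the following Python sketch keeps only candidate negatives whose query similarity falls just below that of the positive document and samples from that pool. The region definition and the sim_margin parameter are assumptions for illustration; the paper's exact quasi-triangular rule may differ.

import numpy as np

def constrained_negative_sampling(q, pos, candidates, k, sim_margin=0.1, rng=None):
    rng = rng or np.random.default_rng()

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    s_pos = cos(q, pos)
    # Constrained region: informative negatives close to, but not above, the positive's similarity.
    region = [n for n in candidates if s_pos - sim_margin <= cos(q, n) < s_pos]
    pool = region if len(region) >= k else candidates  # fall back if the region is too small
    idx = rng.choice(len(pool), size=min(k, len(pool)), replace=False)
    return [pool[i] for i in idx]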
\ No newline at end of file diff --git a/data/2024/aaai/Triple Feature Disentanglement for One-Stage Adaptive Object Detection b/data/2024/aaai/Triple Feature Disentanglement for One-Stage Adaptive Object Detection new file mode 100644 index 0000000000..af8a1cebce --- /dev/null +++ b/data/2024/aaai/Triple Feature Disentanglement for One-Stage Adaptive Object Detection @@ -0,0 +1 @@ +In recent advancements concerning Domain Adaptive Object Detection (DAOD), unsupervised domain adaptation techniques have proven instrumental. These methods enable enhanced detection capabilities within unlabeled target domains by mitigating distribution differences between source and target domains. A subset of DAOD methods employs disentangled learning to segregate Domain-Specific Representations (DSR) and Domain-Invariant Representations (DIR), with ultimate predictions relying on the latter. Current practices in disentanglement, however, often lead to DIR containing residual domain-specific information. To address this, we introduce the Multi-level Disentanglement Module (MDM) that progressively disentangles DIR, enhancing comprehensive disentanglement. Additionally, our proposed Cyclic Disentanglement Module (CDM) facilitates DSR separation. To refine the process further, we employ the Categorical Features Disentanglement Module (CFDM) to isolate DIR and DSR, coupled with category alignment across scales for improved source-target domain alignment. Given its practical suitability, our model is constructed upon the foundational framework of the Single Shot MultiBox Detector (SSD), which is a one-stage object detection approach. Experimental validation highlights the effectiveness of our method, demonstrating its state-of-the-art performance across three benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Trust Region Methods for Nonconvex Stochastic Optimization beyond Lipschitz Smoothness b/data/2024/aaai/Trust Region Methods for Nonconvex Stochastic Optimization beyond Lipschitz Smoothness new file mode 100644 index 0000000000..06c5b6f01f --- /dev/null +++ b/data/2024/aaai/Trust Region Methods for Nonconvex Stochastic Optimization beyond Lipschitz Smoothness @@ -0,0 +1,4 @@ +In many important machine learning applications, the standard assumption of having a globally Lipschitz continuous gradient may fail to hold. This paper delves into a more general (L0, L1)-smoothness setting, which gains particular significance within the realms of deep neural networks and distributionally robust optimization (DRO). We demonstrate the significant advantage of trust region methods for stochastic nonconvex optimization under such generalized smoothness assumption. + We show that first-order trust region methods can recover the normalized and clipped stochastic gradient as special cases and then provide a unified analysis to show their convergence to first-order stationary conditions. + Motivated by the important application of DRO, we propose a generalized high-order smoothness condition, under which second-order trust region methods can achieve a complexity of O(epsilon(-3.5)) for convergence to second-order stationary points. By incorporating variance reduction, the second-order trust region method obtains an even better complexity of O(epsilon(-3)), matching the optimal bound for standard smooth optimization. To our best knowledge, this is the first work to show convergence beyond the first-order stationary condition for generalized smooth optimization. 
+ Preliminary experiments show that our proposed algorithms perform favorably compared with existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning b/data/2024/aaai/Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning new file mode 100644 index 0000000000..1109a82835 --- /dev/null +++ b/data/2024/aaai/Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning @@ -0,0 +1 @@ +Despite the great success of large language models (LLMs) in various tasks, they suffer from generating hallucinations. We introduce Truth Forest, a method that enhances truthfulness in LLMs by uncovering hidden truth representations using multi-dimensional orthogonal probes. Specifically, it creates multiple orthogonal bases for modeling truth by incorporating orthogonal constraints into the probes. Moreover, we introduce Random Peek, a systematic technique considering an extended range of positions within the sequence, reducing the gap between discerning and generating truth features in LLMs. By employing this approach, we improved the truthfulness of Llama-2-7B from 40.8% to 74.5% on TruthfulQA. Likewise, significant improvements are observed in fine-tuned models. We conducted a thorough analysis of truth features using probes. Our visualization results show that orthogonal probes capture complementary truth-related features, forming well-defined clusters that reveal the inherent structure of the dataset. \ No newline at end of file diff --git a/data/2024/aaai/Tuning-Free Inversion-Enhanced Control for Consistent Image Editing b/data/2024/aaai/Tuning-Free Inversion-Enhanced Control for Consistent Image Editing new file mode 100644 index 0000000000..199d25cdd0 --- /dev/null +++ b/data/2024/aaai/Tuning-Free Inversion-Enhanced Control for Consistent Image Editing @@ -0,0 +1 @@ +Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes. To guarantee consistent attributes, some existing methods fine-tune the entire model or the textual embedding for structural consistency, but they are time-consuming and fail to perform non-rigid edits. Other works are tuning-free, but their performances are weakened by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction. Specifically, our method effectively obtains inversion features from the key and value features in the self-attention layers, and enhances the sampling process by these inversion features, thus achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces impressive results in various settings. 
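The core mechanism in the TIC abstract above, correlating features saved during inversion with the sampling process, can be pictured as self-attention whose keys and values are augmented with the stored inversion features so that queries can attend to both. The snippet below is a generic Python/PyTorch sketch of that idea under assumed (batch, tokens, dim) shapes; it is not the TIC implementation.

import torch

def attention_with_inversion_kv(q, k_sample, v_sample, k_inv, v_inv):
    # Concatenate sampling-time and stored inversion keys/values along the token axis.
    k = torch.cat([k_sample, k_inv], dim=1)
    v = torch.cat([v_sample, v_inv], dim=1)
    # Standard scaled dot-product attention over the enlarged key/value set.
    attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    return attn @ v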
\ No newline at end of file diff --git a/data/2024/aaai/TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients b/data/2024/aaai/TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients new file mode 100644 index 0000000000..7b0df057a9 --- /dev/null +++ b/data/2024/aaai/TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients @@ -0,0 +1 @@ +Federated learning is a distributed collaborative machine learning paradigm that has gained strong momentum in recent years. In federated learning, a central server periodically coordinates models with clients and aggregates the models trained locally by clients without necessitating access to local data. Despite its potential, the implementation of federated learning continues to encounter several challenges, predominantly the slow convergence that is largely due to data heterogeneity. The slow convergence becomes particularly problematic in cross-device federated learning scenarios where clients may be strongly limited by computing power and storage space, and hence counteracting methods that induce additional computation or memory cost on the client side such as auxiliary objective terms and larger training iterations can be impractical. In this paper, we propose a novel federated aggregation strategy, TurboSVM-FL, that poses no additional computation burden on the client side and can significantly accelerate convergence for federated classification task, especially when clients are "lazy" and train their models solely for few epochs for next global aggregation. TurboSVM-FL extensively utilizes support vector machine to conduct selective aggregation and max-margin spread-out regularization on class embeddings. We evaluate TurboSVM-FL on multiple datasets including FEMNIST, CelebA, and Shakespeare using user-independent validation with non-iid data distribution. Our results show that TurboSVM-FL can significantly outperform existing popular algorithms on convergence rate and reduce communication rounds while delivering better test metrics including accuracy, F1 score, and MCC. \ No newline at end of file diff --git a/data/2024/aaai/Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data b/data/2024/aaai/Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data new file mode 100644 index 0000000000..04376a53d3 --- /dev/null +++ b/data/2024/aaai/Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data @@ -0,0 +1 @@ +Large Language Models (LLMs) have performed well on various reasoning tasks, but their inaccessibility and numerous parameters hinder wide application in practice. One promising way is distilling the reasoning ability from LLMs to small models by the generated chain-of-thought reasoning paths. In some cases, however, LLMs may produce incorrect reasoning chains, especially when facing complex mathematical problems. Previous studies only transfer knowledge from positive samples and drop the synthesized data with wrong answers. In this work, we illustrate the merit of negative data and propose a model specialization framework to distill LLMs with negative samples besides positive ones. The framework consists of three progressive steps, covering from training to inference stages, to absorb knowledge from negative data. 
We conduct extensive experiments across arithmetic reasoning tasks to demonstrate the role of negative data in distillation from LLM. \ No newline at end of file diff --git a/data/2024/aaai/Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks b/data/2024/aaai/Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks new file mode 100644 index 0000000000..0ad1cd9526 --- /dev/null +++ b/data/2024/aaai/Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks @@ -0,0 +1 @@ +Continuous Conditional Generative Adversarial Networks (CcGANs) enable generative modeling conditional on continuous scalar variables (termed regression labels). However, they can produce subpar fake images due to limited training data. Although Negative Data Augmentation (NDA) effectively enhances unconditional and class-conditional GANs by introducing anomalies into real training images, guiding the GANs away from low-quality outputs, its impact on CcGANs is limited, as it fails to replicate negative samples that may occur during the CcGAN sampling. We present a novel NDA approach called Dual-NDA specifically tailored for CcGANs to address this problem. Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN and label-inconsistent images created by manipulating real images' labels. Leveraging these negative samples, we introduce a novel discriminator objective alongside a modified CcGAN training algorithm. Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA consistently enhances the visual fidelity and label consistency of fake images generated by CcGANs, exhibiting a substantial performance gain over the vanilla NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable advancement beyond the capabilities of state-of-the-art conditional GANs and diffusion models, establishing a new pinnacle of performance. Our codes can be found at https://github.com/UBCDingXin/Dual-NDA. \ No newline at end of file diff --git a/data/2024/aaai/Two-Stage Evolutionary Reinforcement Learning for Enhancing Exploration and Exploitation b/data/2024/aaai/Two-Stage Evolutionary Reinforcement Learning for Enhancing Exploration and Exploitation new file mode 100644 index 0000000000..1aa6c0a38a --- /dev/null +++ b/data/2024/aaai/Two-Stage Evolutionary Reinforcement Learning for Enhancing Exploration and Exploitation @@ -0,0 +1 @@ +The integration of Evolutionary Algorithm (EA) and Reinforcement Learning (RL) has emerged as a promising approach for tackling some challenges in RL, such as sparse rewards, lack of exploration, and brittle convergence properties. However, existing methods often employ actor networks as individuals of EA, which may constrain their exploratory capabilities, as the entire actor population will stop evolution when the critic network in RL falls into local optimal. To alleviate this issue, this paper introduces a Two-stage Evolutionary Reinforcement Learning (TERL) framework that maintains a population containing both actor and critic networks. TERL divides the learning process into two stages. In the initial stage, individuals independently learn actor-critic networks, which are optimized alternatively by RL and Particle Swarm Optimization (PSO). 
This dual optimization fosters greater exploration, curbing susceptibility to local optima. Shared information from a common replay buffer and the PSO algorithm substantially mitigates the computational load of training multiple agents. In the subsequent stage, TERL shifts to a refined exploitation phase. Here, only the best individual undergoes further refinement, while the remaining individuals continue PSO-based optimization. This allocates more computational resources to the best individual, yielding superior performance. Empirical assessments, conducted across a range of continuous control problems, validate the efficacy of the proposed TERL paradigm. \ No newline at end of file diff --git a/data/2024/aaai/U-Mixer: An Unet-Mixer Architecture with Stationarity Correction for Time Series Forecasting b/data/2024/aaai/U-Mixer: An Unet-Mixer Architecture with Stationarity Correction for Time Series Forecasting new file mode 100644 index 0000000000..f67c4b2455 --- /dev/null +++ b/data/2024/aaai/U-Mixer: An Unet-Mixer Architecture with Stationarity Correction for Time Series Forecasting @@ -0,0 +1 @@ +Time series forecasting is a crucial task in various domains. Caused by factors such as trends, seasonality, or irregular fluctuations, time series often exhibit non-stationarity. This non-stationarity obstructs stable feature propagation through deep layers, disrupts feature distributions, and complicates learning data distribution changes. As a result, many existing models struggle to capture the underlying patterns, leading to degraded forecasting performance. In this study, we tackle the challenge of non-stationarity in time series forecasting with our proposed framework called U-Mixer. By combining Unet and Mixer, U-Mixer effectively captures local temporal dependencies between different patches and channels separately to avoid the influence of distribution variations among channels, and merges low- and high-level features to obtain comprehensive data representations. The key contribution is a novel stationarity correction method, which explicitly restores the data distribution by constraining the difference in stationarity between the data before and after model processing to recover the non-stationarity information, while ensuring the temporal dependencies are preserved. Through extensive experiments on various real-world time series datasets, U-Mixer demonstrates its effectiveness and robustness, and achieves 14.5% and 7.7% improvements over state-of-the-art (SOTA) methods. \ No newline at end of file diff --git a/data/2024/aaai/U-trustworthy Models. Reliability, Competence, and Confidence in Decision-Making b/data/2024/aaai/U-trustworthy Models. Reliability, Competence, and Confidence in Decision-Making new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/aaai/UCMCTrack: Multi-Object Tracking with Uniform Camera Motion Compensation b/data/2024/aaai/UCMCTrack: Multi-Object Tracking with Uniform Camera Motion Compensation new file mode 100644 index 0000000000..8bd819b790 --- /dev/null +++ b/data/2024/aaai/UCMCTrack: Multi-Object Tracking with Uniform Camera Motion Compensation @@ -0,0 +1 @@ +Multi-object tracking (MOT) in video sequences remains a challenging task, especially in scenarios with significant camera movements. This is because targets can drift considerably on the image plane, leading to erroneous tracking outcomes. Addressing such challenges typically requires supplementary appearance cues or Camera Motion Compensation (CMC). 
While these strategies are effective, they also introduce a considerable computational burden, posing challenges for real-time MOT. In response to this, we introduce UCMCTrack, a novel motion model-based tracker robust to camera movements. Unlike conventional CMC that computes compensation parameters frame-by-frame, UCMCTrack consistently applies the same compensation parameters throughout a video sequence. It employs a Kalman filter on the ground plane and introduces the Mapped Mahalanobis Distance (MMD) as an alternative to the traditional Intersection over Union (IoU) distance measure. By leveraging projected probability distributions on the ground plane, our approach efficiently captures motion patterns and adeptly manages uncertainties introduced by homography projections. Remarkably, UCMCTrack, relying solely on motion cues, achieves state-of-the-art performance across a variety of challenging datasets, including MOT17, MOT20, DanceTrack and KITTI. More details and code are available at https://github.com/corfyi/UCMCTrack. \ No newline at end of file diff --git a/data/2024/aaai/UFDA: Universal Federated Domain Adaptation with Practical Assumptions b/data/2024/aaai/UFDA: Universal Federated Domain Adaptation with Practical Assumptions new file mode 100644 index 0000000000..c156452912 --- /dev/null +++ b/data/2024/aaai/UFDA: Universal Federated Domain Adaptation with Practical Assumptions @@ -0,0 +1 @@ +Conventional Federated Domain Adaptation (FDA) approaches usually demand an abundance of assumptions, which makes them significantly less feasible for real-world situations and introduces security hazards. This paper relaxes the assumptions from previous FDAs and studies a more practical scenario named Universal Federated Domain Adaptation (UFDA). It only requires the black-box model and the label set information of each source domain, while the label sets of different source domains could be inconsistent, and the target-domain label set is totally blind. Towards a more effective solution for our newly proposed UFDA scenario, we propose a corresponding methodology called Hot-Learning with Contrastive Label Disambiguation (HCLD). It particularly tackles UFDA's domain shifts and category gaps problems by using one-hot outputs from the black-box models of various source domains. Moreover, to better distinguish the shared and unknown classes, we further present a cluster-level strategy named Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer classes from both source and target domains. Extensive experiments on three benchmark datasets demonstrate that our method achieves comparable performance for our UFDA scenario with much fewer assumptions, compared to previous methodologies with comprehensive additional assumptions. \ No newline at end of file diff --git a/data/2024/aaai/UMA: Facilitating Backdoor Scanning via Unlearning-Based Model Ablation b/data/2024/aaai/UMA: Facilitating Backdoor Scanning via Unlearning-Based Model Ablation new file mode 100644 index 0000000000..6bb068b99f --- /dev/null +++ b/data/2024/aaai/UMA: Facilitating Backdoor Scanning via Unlearning-Based Model Ablation @@ -0,0 +1 @@ +Recent advances in backdoor attacks, like leveraging complex triggers or stealthy implanting techniques, have introduced new challenges in backdoor scanning, limiting the usability of Deep Neural Networks (DNNs) in various scenarios. 
In this paper, we propose Unlearning-based Model Ablation (UMA), a novel approach to facilitate backdoor scanning and defend against advanced backdoor attacks. UMA filters out backdoor-irrelevant features by ablating the inherent features of the target class within the model and subsequently reveals the backdoor through dynamic trigger optimization. We evaluate our method on 1700 models (700 benign and 1000 trojaned) with 6 model structures, 7 different backdoor attacks and 4 datasets. Our results demonstrate that the proposed methodology effectively detects these advanced backdoors. Specifically, our method can achieve 91% AUC-ROC and 86.6% detection accuracy on average, which outperforms the baselines, including Neural Cleanse, ABS, K-Arm and MNTD. \ No newline at end of file diff --git a/data/2024/aaai/UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer b/data/2024/aaai/UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer new file mode 100644 index 0000000000..986f0a74ed --- /dev/null +++ b/data/2024/aaai/UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer @@ -0,0 +1 @@ +Traditional channel-wise pruning methods, which reduce network channels, struggle to effectively prune efficient CNN models with depth-wise convolutional layers and certain efficient modules, such as popular inverted residual blocks. Prior depth pruning methods, which reduce network depth, are not suitable for pruning some efficient models due to the existence of some normalization layers. Moreover, fine-tuning a subnet after directly removing activation layers would corrupt the original model weights, hindering the pruned model from achieving high performance. To address these issues, we propose a novel depth pruning method for efficient models. Our approach proposes a novel block pruning strategy and progressive training method for the subnet. Additionally, we extend our pruning method to vision transformer models. Experimental results demonstrate that our method consistently outperforms existing depth pruning methods across various pruning configurations. By applying our method to ConvNeXtV1, we obtained three pruned ConvNeXtV1 models that surpass most SOTA efficient models with comparable inference performance. Our method also achieves state-of-the-art pruning performance on the vision transformer model. \ No newline at end of file diff --git a/data/2024/aaai/UV-SAM: Adapting Segment Anything Model for Urban Village Identification b/data/2024/aaai/UV-SAM: Adapting Segment Anything Model for Urban Village Identification new file mode 100644 index 0000000000..f14f62ddf3 --- /dev/null +++ b/data/2024/aaai/UV-SAM: Adapting Segment Anything Model for Urban Village Identification @@ -0,0 +1 @@ +Urban villages, defined as informal residential areas in or around urban centers, are characterized by inadequate infrastructures and poor living conditions, closely related to the Sustainable Development Goals (SDGs) on poverty, adequate housing, and sustainable cities. Traditionally, governments heavily depend on field survey methods to monitor the urban villages, which, however, are time-consuming, labor-intensive, and possibly delayed. Thanks to widely available and timely updated satellite images, recent studies develop computer vision techniques to detect urban villages efficiently. However, existing studies either focus on simple urban village image classification or fail to provide accurate boundary information. 
To accurately identify urban village boundaries from satellite images, we harness the power of a vision foundation model and adapt the Segment Anything Model (SAM) to urban village segmentation, yielding UV-SAM. Specifically, UV-SAM first leverages a small-sized semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification. Extensive experimental results on two datasets in China demonstrate that UV-SAM outperforms existing baselines, and identification results over multiple years show that both the number and area of urban villages are decreasing over time, providing deeper insights into the development trends of urban villages and shedding light on vision foundation models for sustainable cities. The dataset and codes of this study are available at https://github.com/tsinghua-fib-lab/UV-SAM. \ No newline at end of file diff --git a/data/2024/aaai/UVAGaze: Unsupervised 1-to-2 Views Adaptation for Gaze Estimation b/data/2024/aaai/UVAGaze: Unsupervised 1-to-2 Views Adaptation for Gaze Estimation new file mode 100644 index 0000000000..e73cbd5e65 --- /dev/null +++ b/data/2024/aaai/UVAGaze: Unsupervised 1-to-2 Views Adaptation for Gaze Estimation @@ -0,0 +1 @@ +Gaze estimation has become a subject of growing interest in recent research. Most of the current methods rely on single-view facial images as input. Yet, it is hard for these approaches to handle large head angles, leading to potential inaccuracies in the estimation. To address this issue, adding a second-view camera can help better capture eye appearance. However, existing multi-view methods have two limitations. 1) They require multi-view annotations for training, which are expensive. 2) More importantly, during testing, the exact positions of the multiple cameras must be known and match those used in training, which limits the application scenario. To address these challenges, we propose a novel 1-view-to-2-views (1-to-2 views) adaptation solution in this paper, the Unsupervised 1-to-2 Views Adaptation framework for Gaze estimation (UVAGaze). Our method adapts a traditional single-view gaze estimator for flexibly placed dual cameras. Here, "flexibly" means we place the dual cameras in arbitrary places regardless of the training data, without knowing their extrinsic parameters. Specifically, UVAGaze builds a dual-view mutual supervision adaptation strategy, which takes advantage of the intrinsic consistency of gaze directions between both views. In this way, our method can not only benefit from common single-view pre-training, but also achieve more advanced dual-view gaze estimation. The experimental results show that a single-view estimator, when adapted for dual views, can achieve much higher accuracy, especially in cross-dataset settings, with a substantial improvement of 47.0%. Project page: https://github.com/MickeyLLG/UVAGaze. 
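The dual-view mutual supervision in the UVAGaze abstract can be pictured with a small sketch: each view's prediction, mapped through a jointly learned relative rotation, supervises the other view. The rotation parameterization, the stop-gradient placement, and the cosine form of the loss below are assumptions of this sketch rather than the released method.

import torch
import torch.nn.functional as F

def mutual_consistency_loss(gaze_a, gaze_b, rot_ab):
    """gaze_a, gaze_b: (B, 3) unit gaze vectors predicted from views A and B.
    rot_ab: (3, 3) learnable rotation mapping view-A coordinates into view-B coordinates."""
    gaze_a_in_b = F.normalize(gaze_a @ rot_ab.T, dim=-1)
    # Symmetric supervision: each view is pulled toward the (detached) prediction of the other.
    loss_a = (1 - F.cosine_similarity(gaze_a_in_b, gaze_b.detach(), dim=-1)).mean()
    loss_b = (1 - F.cosine_similarity(gaze_b, gaze_a_in_b.detach(), dim=-1)).mean()
    return loss_a + loss_b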
\ No newline at end of file diff --git a/data/2024/aaai/Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation b/data/2024/aaai/Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation new file mode 100644 index 0000000000..9e72a117b0 --- /dev/null +++ b/data/2024/aaai/Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation @@ -0,0 +1 @@ +Accurately detecting multiple change-points is critical for various applications, but determining the optimal number of change-points remains a challenge. Existing approaches based on information criteria attempt to balance goodness-of-fit and model complexity, but their performance varies depending on the model. Recently, data-driven selection criteria based on cross-validation have been proposed, but these methods can be prone to slight overfitting in finite samples. In this paper, we introduce a method that controls the probability of overestimation and provides uncertainty quantification for learning multiple change-points via cross-validation. We frame this problem as a sequence of model comparison problems and leverage high-dimensional inferential procedures. We demonstrate the effectiveness of our approach through experiments on finite-sample data, showing superior uncertainty quantification for overestimation compared to existing methods. Our approach has broad applicability and can be used in diverse change-point models. \ No newline at end of file diff --git a/data/2024/aaai/Uncertainty Quantification for Forward and Inverse Problems of PDEs via Latent Global Evolution b/data/2024/aaai/Uncertainty Quantification for Forward and Inverse Problems of PDEs via Latent Global Evolution new file mode 100644 index 0000000000..62862be902 --- /dev/null +++ b/data/2024/aaai/Uncertainty Quantification for Forward and Inverse Problems of PDEs via Latent Global Evolution @@ -0,0 +1 @@ +Deep learning-based surrogate models have demonstrated remarkable advantages over classical solvers in terms of speed, often achieving speedups of 10 to 1000 times over traditional partial differential equation (PDE) solvers. However, a significant challenge hindering their widespread adoption in both scientific and industrial domains is the lack of understanding about their prediction uncertainties, particularly in scenarios that involve critical decision making. To address this limitation, we propose a method that integrates efficient and precise uncertainty quantification into a deep learning-based surrogate model. Our method, termed Latent Evolution of PDEs with Uncertainty Quantification (LE-PDE-UQ), endows deep learning-based surrogate models with robust and efficient uncertainty quantification capabilities for both forward and inverse problems. LE-PDE-UQ leverages latent vectors within a latent space to evolve both the system's state and its corresponding uncertainty estimation. The latent vectors are decoded to provide predictions for the system's state as well as estimates of its uncertainty. In extensive experiments, we demonstrate the accurate uncertainty quantification performance of our approach, surpassing that of strong baselines including deep ensembles, Bayesian neural network layers, and dropout. Our method excels at propagating uncertainty over extended auto-regressive rollouts, making it suitable for scenarios involving long-term predictions. Our code is available at: https://github.com/AI4Science-WestlakeU/le-pde-uq. 
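A minimal sketch of the latent-evolution idea in the LE-PDE-UQ abstract: the state and its uncertainty are carried as two halves of a latent vector, rolled out autoregressively in latent space, and decoded separately into a prediction and an uncertainty estimate. Layer choices and shapes are illustrative assumptions, not the released architecture.

import torch
import torch.nn as nn

class ToyLatentEvolverUQ(nn.Module):
    def __init__(self, field_dim=64, latent_dim=128):
        super().__init__()
        self.encode = nn.Linear(field_dim, 2 * latent_dim)        # state latent + uncertainty latent
        self.evolve = nn.Sequential(nn.Linear(2 * latent_dim, 2 * latent_dim), nn.Tanh(),
                                    nn.Linear(2 * latent_dim, 2 * latent_dim))
        self.decode_mean = nn.Linear(latent_dim, field_dim)
        self.decode_logvar = nn.Linear(latent_dim, field_dim)

    def forward(self, u0, steps):
        z = self.encode(u0)                                        # (B, 2*latent_dim)
        means, logvars = [], []
        for _ in range(steps):                                     # autoregressive rollout in latent space
            z = z + self.evolve(z)                                 # residual latent update
            z_state, z_unc = z.chunk(2, dim=-1)
            means.append(self.decode_mean(z_state))
            logvars.append(self.decode_logvar(z_unc))
        return torch.stack(means, dim=1), torch.stack(logvars, dim=1)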
\ No newline at end of file diff --git a/data/2024/aaai/Uncertainty Quantification in Heterogeneous Treatment Effect Estimation with Gaussian-Process-Based Partially Linear Model b/data/2024/aaai/Uncertainty Quantification in Heterogeneous Treatment Effect Estimation with Gaussian-Process-Based Partially Linear Model new file mode 100644 index 0000000000..f158d6d351 --- /dev/null +++ b/data/2024/aaai/Uncertainty Quantification in Heterogeneous Treatment Effect Estimation with Gaussian-Process-Based Partially Linear Model @@ -0,0 +1 @@ +Estimating heterogeneous treatment effects across individuals has attracted growing attention as a statistical tool for performing critical decision-making. We propose a Bayesian inference framework that quantifies the uncertainty in treatment effect estimation to support decision-making in a relatively small sample size setting. Our proposed model places Gaussian process priors on the nonparametric components of a semiparametric model called a partially linear model. This model formulation has three advantages. First, we can analytically compute the posterior distribution of a treatment effect without relying on the computationally demanding posterior approximation. Second, we can guarantee that the posterior distribution concentrates around the true one as the sample size goes to infinity. Third, we can incorporate prior knowledge about a treatment effect into the prior distribution, improving the estimation efficiency. Our experimental results show that even in the small sample size setting, our method can accurately estimate the heterogeneous treatment effects and effectively quantify its estimation uncertainty. \ No newline at end of file diff --git a/data/2024/aaai/Uncertainty Regularized Evidential Regression b/data/2024/aaai/Uncertainty Regularized Evidential Regression new file mode 100644 index 0000000000..6401340c28 --- /dev/null +++ b/data/2024/aaai/Uncertainty Regularized Evidential Regression @@ -0,0 +1 @@ +The Evidential Regression Network (ERN) represents a novel approach that integrates deep learning with Dempster-Shafer's theory to predict a target and quantify the associated uncertainty. Guided by the underlying theory, specific activation functions must be employed to enforce non-negative values, which is a constraint that compromises model performance by limiting its ability to learn from all samples. This paper provides a theoretical analysis of this limitation and introduces an improvement to overcome it. Initially, we define the region where the models can't effectively learn from the samples. Following this, we thoroughly analyze the ERN and investigate this constraint. Leveraging the insights from our analysis, we address the limitation by introducing a novel regularization term that empowers the ERN to learn from the whole training set. Our extensive experiments substantiate our theoretical findings and demonstrate the effectiveness of the proposed solution. \ No newline at end of file diff --git a/data/2024/aaai/Uncertainty-Aware GAN for Single Image Super Resolution b/data/2024/aaai/Uncertainty-Aware GAN for Single Image Super Resolution new file mode 100644 index 0000000000..3f777da83a --- /dev/null +++ b/data/2024/aaai/Uncertainty-Aware GAN for Single Image Super Resolution @@ -0,0 +1 @@ +Generative adversarial network (GAN) has become a popular tool in the perceptual-oriented single image super-resolution (SISR) for its excellent capability to hallucinate details. 
However, the performance of most GAN-based SISR methods is impeded due to the limited discriminative ability of their discriminators. Specifically, these discriminators only focus on the global image reconstruction quality and ignore the more fine-grained reconstruction quality for constraining the generator, as they predict the overall realness of an image instead of the pixel-level realness. Here, we first introduce uncertainty into the GAN and propose an Uncertainty-aware GAN (UGAN) to regularize SISR solutions, where the challenging pixels with large reconstruction uncertainty and importance (e.g., texture and edge) are prioritized for optimization. The uncertainty-aware adversarial training strategy enables the discriminator to capture the pixel-level SR uncertainty, which constrains the generator to focus on image areas with high reconstruction difficulty, while also improving the interpretability of the SR. To balance the weights of multiple training losses, we introduce an uncertainty-aware loss weighting strategy to adaptively learn the optimal loss weights. Extensive experiments demonstrate the effectiveness of our approach in extracting the SR uncertainty and the superiority of UGAN over state-of-the-art methods in terms of reconstruction accuracy and perceptual quality. \ No newline at end of file diff --git a/data/2024/aaai/Uncertainty-Aware Yield Prediction with Multimodal Molecular Features b/data/2024/aaai/Uncertainty-Aware Yield Prediction with Multimodal Molecular Features new file mode 100644 index 0000000000..44912e005e --- /dev/null +++ b/data/2024/aaai/Uncertainty-Aware Yield Prediction with Multimodal Molecular Features @@ -0,0 +1,2 @@ +Predicting chemical reaction yields is pivotal for efficient chemical synthesis, an area that focuses on the creation of novel compounds for diverse uses. +Yield prediction demands accurate representations of reactions for forecasting practical transformation rates. Yet, the uncertainty that pervades real-world situations prevents current models from excelling at this task, owing to the high sensitivity of yield activities and the uncertainty in yield measurements. Existing models often utilize single-modal feature representations, such as molecular fingerprints, SMILES sequences, or molecular graphs, which are not sufficient to capture the complex interactions and dynamic behavior of molecules in reactions. In this paper, we present an advanced Uncertainty-Aware Multimodal model (UAM) to tackle these challenges. Our approach seamlessly integrates data sources from multiple modalities by encompassing sequence representations, molecular graphs, and expert-defined chemical reaction features for a comprehensive representation of reactions. Additionally, we address both the model and data-based uncertainty, refining the model's predictive capability. Extensive experiments on three datasets, including two high-throughput experiment (HTE) datasets and one chemist-constructed Amide coupling reaction dataset, demonstrate that UAM outperforms the state-of-the-art methods. The code and used datasets are available at https://github.com/jychen229/Multimodal-reaction-yield-prediction. 
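As a toy illustration of the two ingredients named in the UAM abstract, multimodal fusion and data uncertainty, the sketch below fuses three modality embeddings and predicts a yield together with a log-variance trained by a Gaussian negative log-likelihood. The dimensions, the fusion layer, and the specific loss are assumptions of this sketch, not the UAM architecture.

import torch
import torch.nn as nn

class ToyUncertainYieldModel(nn.Module):
    def __init__(self, dim_seq=256, dim_graph=128, dim_expert=32, hidden=128):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(dim_seq + dim_graph + dim_expert, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.logvar_head = nn.Linear(hidden, 1)    # data (aleatoric) uncertainty

    def forward(self, seq_emb, graph_emb, expert_feat):
        h = self.fuse(torch.cat([seq_emb, graph_emb, expert_feat], dim=-1))
        return self.mean_head(h).squeeze(-1), self.logvar_head(h).squeeze(-1)

def heteroscedastic_nll(mean, logvar, target):
    # Gaussian negative log-likelihood: noisy yield measurements are down-weighted automatically.
    return (0.5 * torch.exp(-logvar) * (target - mean) ** 2 + 0.5 * logvar).mean()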
\ No newline at end of file diff --git a/data/2024/aaai/Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification b/data/2024/aaai/Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification new file mode 100644 index 0000000000..755ccc440e --- /dev/null +++ b/data/2024/aaai/Uncovering and Mitigating the Hidden Chasm: A Study on the Text-Text Domain Gap in Euphemism Identification @@ -0,0 +1 @@ +Euphemisms are commonly used on social media and darknet marketplaces to evade platform regulations by masking their true meanings with innocent ones. For instance, “weed” is used instead of “marijuana” for illicit transactions. Thus, euphemism identification, i.e., mapping a given euphemism (“weed”) to its specific target word (“marijuana”), is essential for improving content moderation and combating underground markets. Existing methods employ self-supervised schemes to automatically construct labeled training datasets for euphemism identification. However, they overlook the text-text domain gap caused by the discrepancy between the constructed training data and the test data, leading to performance deterioration. In this paper, we present the text-text domain gap and explain how it forms in terms of the data distribution and the cone effect. Moreover, to bridge this gap, we introduce a feature alignment network (FA-Net), which can both align the in-domain and cross-domain features, thus mitigating the domain gap from training data to test data and improving the performance of the base models for euphemism identification. We apply this FA-Net to the base models, obtaining markedly better results, and creating a state-of-the-art model which beats the large language models. \ No newline at end of file diff --git a/data/2024/aaai/Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution b/data/2024/aaai/Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution new file mode 100644 index 0000000000..453035342a --- /dev/null +++ b/data/2024/aaai/Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution @@ -0,0 +1 @@ +Modern language modeling tasks are often underspecified: for a given token prediction, many words may satisfy the user’s intent of producing natural language at inference time, however only one word will minimize the task’s loss function at training time. We introduce a simple causal mechanism to describe the role underspecification plays in the generation of spurious correlations. Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods, that we apply to gendered pronoun resolution tasks on a wide range of LLMs to 1) aid in the detection of inference-time task underspecification by exploiting 2) previously unreported gender vs. time and gender vs. location spurious correlations on LLMs with a range of A) sizes: from BERT-base to GPT-3.5, B) pre-training objectives: from masked & autoregressive language modeling to a mixture of these objectives, and C) training stages: from pre-training only to reinforcement learning from human feedback (RLHF). Code and open-source demos available at https://github.com/2dot71mily/uspec. 
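The black-box probing style described in the underspecification abstract above can be illustrated with a toy template that varies an attribute that should be irrelevant (a year) and records how a masked language model's preference between gendered pronouns shifts. The model choice and template here are this sketch's own; the paper's evaluation code is linked in the abstract.

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")  # illustrative model choice

def pronoun_scores(year):
    out = fill(f"In {year}, the doctor said [MASK] would arrive soon.", targets=["he", "she"])
    return {o["token_str"]: o["score"] for o in out}

for year in (1920, 2020):
    # A large shift between years hints at a spurious gender-vs-time correlation.
    print(year, pronoun_scores(year))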
\ No newline at end of file diff --git a/data/2024/aaai/Understanding Distributed Representations of Concepts in Deep Neural Networks without Supervision b/data/2024/aaai/Understanding Distributed Representations of Concepts in Deep Neural Networks without Supervision new file mode 100644 index 0000000000..15004644fc --- /dev/null +++ b/data/2024/aaai/Understanding Distributed Representations of Concepts in Deep Neural Networks without Supervision @@ -0,0 +1 @@ +Understanding intermediate representations of the concepts learned by deep learning classifiers is indispensable for interpreting general model behaviors. Existing approaches to reveal learned concepts often rely on human supervision, such as pre-defined concept sets or segmentation processes. In this paper, we propose a novel unsupervised method for discovering distributed representations of concepts by selecting a principal subset of neurons. Our empirical findings demonstrate that instances with similar neuron activation states tend to share coherent concepts. Based on the observations, the proposed method selects principal neurons that construct an interpretable region, namely a Relaxed Decision Region (RDR), encompassing instances with coherent concepts in the feature space. It can be utilized to identify unlabeled subclasses within data and to detect the causes of misclassifications. Furthermore, the applicability of our method across various layers discloses distinct distributed representations over the layers, which provides deeper insights into the internal mechanisms of the deep learning model. \ No newline at end of file diff --git a/data/2024/aaai/Understanding Likelihood of Normalizing Flow and Image Complexity through the Lens of Out-of-Distribution Detection b/data/2024/aaai/Understanding Likelihood of Normalizing Flow and Image Complexity through the Lens of Out-of-Distribution Detection new file mode 100644 index 0000000000..bd9070eabc --- /dev/null +++ b/data/2024/aaai/Understanding Likelihood of Normalizing Flow and Image Complexity through the Lens of Out-of-Distribution Detection @@ -0,0 +1,8 @@ +Out-of-distribution (OOD) detection is crucial to safety-critical machine learning applications and has been extensively studied. +While recent studies have predominantly focused on classifier-based methods, research on deep generative model (DGM)-based methods has lagged behind. +This disparity may be attributed to a perplexing phenomenon: DGMs often assign higher likelihoods to unknown OOD inputs than to their known training data. +This paper focuses on explaining the underlying mechanism of this phenomenon. +We propose a hypothesis that less complex images concentrate in high-density regions in the latent space, resulting in a higher likelihood assignment in the Normalizing Flow (NF). +We experimentally demonstrate its validity for five NF architectures, concluding that their likelihood is untrustworthy. +Additionally, we show that this problem can be alleviated by treating image complexity as an independent variable. +Finally, we provide evidence of the potential applicability of our hypothesis in another DGM, PixelCNN++. 
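One concrete way to "treat image complexity as an independent variable", in the spirit of the abstract above and of earlier complexity-aware OOD scores, is to estimate complexity with a lossless compressor and examine the flow's negative log-likelihood relative to it. The PNG proxy and the simple subtraction below are assumptions of this sketch, not necessarily the paper's exact correction.

import io
import numpy as np
from PIL import Image

def complexity_bits_per_dim(img_uint8):
    """Proxy for image complexity: PNG-compressed size in bits per pixel-channel."""
    buf = io.BytesIO()
    Image.fromarray(img_uint8).save(buf, format="PNG")
    return 8.0 * buf.getbuffer().nbytes / img_uint8.size

def complexity_adjusted_score(nll_bits_per_dim, img_uint8):
    # Subtracting the complexity proxy removes the trend that makes simple
    # images look deceptively likely under the flow.
    return nll_bits_per_dim - complexity_bits_per_dim(img_uint8)

# Example with a random "image"; in practice nll_bits_per_dim comes from the trained flow.
dummy = (np.random.rand(32, 32, 3) * 255).astype(np.uint8)
print(complexity_adjusted_score(nll_bits_per_dim=3.2, img_uint8=dummy))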
\ No newline at end of file diff --git a/data/2024/aaai/Understanding Surprising Generalization Phenomena in Deep Learning b/data/2024/aaai/Understanding Surprising Generalization Phenomena in Deep Learning new file mode 100644 index 0000000000..22ee72c690 --- /dev/null +++ b/data/2024/aaai/Understanding Surprising Generalization Phenomena in Deep Learning @@ -0,0 +1 @@ +Deep learning has exhibited a number of surprising generalization phenomena that are not captured by classical statistical learning theory. This talk will survey some of my work on the theoretical characterizations of several such intriguing phenomena: (1) Implicit regularization: A major mystery in deep learning is that deep neural networks can often generalize well despite their excessive expressive capacity. Towards explaining this mystery, it has been suggested that commonly used gradient-based optimization algorithms enforce certain implicit regularization which effectively constrains the model capacity. (2) Benign overfitting: In certain scenarios, a model can perfectly fit noisily labeled training data, but still achieves near-optimal test error at the same time, which is very different from the classical notion of overfitting. (3) Grokking: In certain scenarios, a model initially achieves perfect training accuracy but no generalization (i.e., no better than a random predictor), and upon further training, transitions to almost perfect generalization. Theoretically establishing these properties often involves making appropriate high-dimensional assumptions on the problem as well as a careful analysis of the training dynamics. \ No newline at end of file diff --git a/data/2024/aaai/Understanding and Improving Optimization in Predictive Coding Networks b/data/2024/aaai/Understanding and Improving Optimization in Predictive Coding Networks new file mode 100644 index 0000000000..069f9c9e6b --- /dev/null +++ b/data/2024/aaai/Understanding and Improving Optimization in Predictive Coding Networks @@ -0,0 +1 @@ +Backpropagation (BP), the standard learning algorithm for artificial neural networks, is often considered biologically implausible. In contrast, the standard learning algorithm for predictive coding (PC) models in neuroscience, known as the inference learning algorithm (IL), is a promising, bio-plausible alternative. However, several challenges and questions hinder IL's application to real-world problems. For example, IL is computationally demanding, and without memory-intensive optimizers like Adam, IL may converge to poor local minima. Moreover, although IL can reduce loss more quickly than BP, the reasons for these speedups or their robustness remain unclear. In this paper, we tackle these challenges by 1) altering the standard implementation of PC circuits to substantially reduce computation, 2) developing a novel optimizer that improves the convergence of IL without increasing memory usage, and 3) establishing theoretical results that help elucidate the conditions under which IL is sensitive to second and higher-order information. \ No newline at end of file diff --git a/data/2024/aaai/Understanding and Leveraging the Learning Phases of Neural Networks b/data/2024/aaai/Understanding and Leveraging the Learning Phases of Neural Networks new file mode 100644 index 0000000000..2ab55a1dc7 --- /dev/null +++ b/data/2024/aaai/Understanding and Leveraging the Learning Phases of Neural Networks @@ -0,0 +1 @@ +The learning dynamics of deep neural networks are not well understood. 
The information bottleneck (IB) theory proclaimed separate fitting and compression phases. But they have since been heavily debated. We comprehensively analyze the learning dynamics by investigating a layer's reconstruction ability of the input and prediction performance based on the evolution of parameters during training. We empirically show the existence of three phases using common datasets and architectures such as ResNet and VGG: (i) near constant reconstruction loss, (ii) decrease, and (iii) increase. We also derive an empirically grounded data model and prove the existence of phases for single-layer networks. Technically, our approach leverages classical complexity analysis. It differs from IB by relying on measuring reconstruction loss rather than information theoretic measures to relate information of intermediate layers and inputs. Our work implies a new best practice for transfer learning: We show empirically that the pre-training of a classifier should stop well before its performance is optimal. \ No newline at end of file diff --git a/data/2024/aaai/Understanding the Generalization of Pretrained Diffusion Models on Out-of-Distribution Data b/data/2024/aaai/Understanding the Generalization of Pretrained Diffusion Models on Out-of-Distribution Data new file mode 100644 index 0000000000..cef45332b9 --- /dev/null +++ b/data/2024/aaai/Understanding the Generalization of Pretrained Diffusion Models on Out-of-Distribution Data @@ -0,0 +1 @@ +This work tackles the important task of understanding out-of-distribution behavior in two prominent types of generative models, i.e., GANs and Diffusion models. Understanding this behavior is crucial in understanding their broader utility and risks as these systems are increasingly deployed in our daily lives. Our first contribution is demonstrating that diffusion spaces outperform GANs' latent spaces in inverting high-quality OOD images. We also provide a theoretical analysis attributing this to the lack of prior holes in diffusion spaces. Our second significant contribution is to provide a theoretical hypothesis that diffusion spaces can be projected onto a bounded hypersphere, enabling image manipulation through geodesic traversal between inverted images. Our analysis shows that different geodesics share common attributes for the same manipulation, which we leverage to perform various image manipulations. We conduct thorough empirical evaluations to support and validate our claims. Finally, our third and final contribution introduces a novel approach to the few-shot sampling for out-of-distribution data by inverting a few images to sample from the cluster formed by the inverted latents. The proposed technique achieves state-of-the-art results for the few-shot generation task in terms of image quality. Our research underscores the promise of diffusion spaces in out-of-distribution imaging and offers avenues for further exploration. Please find more details about our project at \url{http://cvit.iiit.ac.in/research/projects/cvit-projects/diffusionOOD} \ No newline at end of file diff --git a/data/2024/aaai/Understanding the Role of the Projector in Knowledge Distillation b/data/2024/aaai/Understanding the Role of the Projector in Knowledge Distillation new file mode 100644 index 0000000000..079a32024c --- /dev/null +++ b/data/2024/aaai/Understanding the Role of the Projector in Knowledge Distillation @@ -0,0 +1 @@ +In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. 
In doing so we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet. Code and models are publicly available. \ No newline at end of file diff --git a/data/2024/aaai/Underwater Organism Color Fine-Tuning via Decomposition and Guidance b/data/2024/aaai/Underwater Organism Color Fine-Tuning via Decomposition and Guidance new file mode 100644 index 0000000000..a4b9ef4038 --- /dev/null +++ b/data/2024/aaai/Underwater Organism Color Fine-Tuning via Decomposition and Guidance @@ -0,0 +1 @@ +Due to wavelength-dependent light attenuation and scattering, the color of underwater organisms usually appears distorted. The existing underwater image enhancement methods mainly focus on designing networks capable of generating enhanced underwater organisms with fixed color. Due to the complexity of the underwater environment, ground truth labels are difficult to obtain, which means that no perfect enhancement effect exists. Different from the existing methods, this paper proposes an algorithm with color enhancement and color fine-tuning (CECF) capabilities. The color enhancement behavior of CECF is the same as that of existing methods, aiming to restore the color of distorted underwater organisms. Beyond this general purpose, the color fine-tuning behavior of CECF can adjust the color of organisms in a controlled manner, which can generate enhanced organisms with diverse colors. To achieve this purpose, four processes are used in CECF. A supervised enhancement process learns the mapping from a distorted image to an enhanced image by the decomposition of the color code. A self-reconstruction process and a cross-reconstruction process are used for content-invariant learning. A color fine-tuning process is designed based on guidance to obtain various enhanced results with different colors. Experimental results have proven the enhancement ability and color fine-tuning ability of the proposed CECF. The source code is provided at https://github.com/Xiaofeng-life/CECF. 
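A toy sketch of the decomposition idea behind CECF: an image is split into a spatial content code and a compact color code, and re-decoding the same content with an edited or interpolated color code yields organisms with adjusted colors. Module shapes and names are illustrative assumptions, not the released code linked above.

import torch
import torch.nn as nn

class ToyColorDecomposer(nn.Module):
    def __init__(self, ch=64, color_dim=8):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.color_enc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, color_dim))
        self.decoder = nn.Sequential(
            nn.Conv2d(ch + color_dim, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, img, color_code=None):
        content = self.content_enc(img)                    # (B, ch, H, W) content code
        if color_code is None:
            color_code = self.color_enc(img)               # (B, color_dim) color code
        c = color_code[:, :, None, None].expand(-1, -1, *content.shape[2:])
        return self.decoder(torch.cat([content, c], dim=1)), color_code

# Color fine-tuning then amounts to calling the model with the same image but a
# hand-edited or interpolated color_code instead of the one inferred from the input.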
\ No newline at end of file diff --git a/data/2024/aaai/Uni-MIS: United Multiple Intent Spoken Language Understanding via Multi-View Intent-Slot Interaction b/data/2024/aaai/Uni-MIS: United Multiple Intent Spoken Language Understanding via Multi-View Intent-Slot Interaction new file mode 100644 index 0000000000..6299a6ff9c --- /dev/null +++ b/data/2024/aaai/Uni-MIS: United Multiple Intent Spoken Language Understanding via Multi-View Intent-Slot Interaction @@ -0,0 +1 @@ +Multi-intent spoken language understanding (SLU) has become a research hotspot in the field of natural language processing (NLP) due to its ability to recognize and extract multiple expressed intents and annotate corresponding sequence slot tags within a single utterance. Previous research has primarily concentrated on the token-level intent-slot interaction to model joint intent detection and slot filling, which resulted in a failure to fully utilize anisotropic intent-guiding information during joint training. In this work, we present a novel architecture by modeling the multi-intent SLU as a multi-view intent-slot interaction. The architecture resolves the kernel bottleneck of unified multi-intent SLU by effectively modeling the intent-slot relations with utterance, chunk, and token-level interaction. We further develop a neural framework, namely Uni-MIS, in which the unified multi-intent SLU is modeled as a three-view intent-slot interaction fusion to better capture the interaction information after special encoding. A chunk-level intent detection decoder is used to sufficiently capture the multiple intents, and an adaptive intent-slot graph network is used to capture the fine-grained intent information to guide final slot filling. We perform extensive experiments on two widely used benchmark datasets for multi-intent SLU, where our model beats all the current strong baselines, pushing the state-of-the-art performance of unified multi-intent SLU. Additionally, the ChatGPT benchmark that we have developed demonstrates that there is a considerable amount of potential research value in the field of multi-intent SLU. \ No newline at end of file diff --git a/data/2024/aaai/UniADS: Universal Architecture-Distiller Search for Distillation Gap b/data/2024/aaai/UniADS: Universal Architecture-Distiller Search for Distillation Gap new file mode 100644 index 0000000000..7c64e42b33 --- /dev/null +++ b/data/2024/aaai/UniADS: Universal Architecture-Distiller Search for Distillation Gap @@ -0,0 +1 @@ +In this paper, we present UniADS, the first Universal Architecture-Distiller Search framework for co-optimizing student architecture and distillation policies. The teacher-student distillation gap limits distillation gains. Previous approaches seek to discover the ideal student architecture while ignoring distillation settings. In UniADS, we construct a comprehensive search space encompassing an architectural search for student models, knowledge transformations in distillation strategies, distance functions, loss weights, and other vital settings. To efficiently explore the search space, we utilize the NSGA-II genetic algorithm for better crossover and mutation configurations and employ the Successive Halving algorithm for search space pruning, resulting in improved search efficiency and promising results. Extensive experiments are performed on different teacher-student pairs using CIFAR-100 and ImageNet datasets. The experimental results consistently demonstrate the superiority of our method over existing approaches. 
Furthermore, we provide a detailed analysis of the search results, examining the impact of each variable and extracting valuable insights and practical guidance for distillation design and implementation. \ No newline at end of file diff --git a/data/2024/aaai/UniAP: Towards Universal Animal Perception in Vision via Few-Shot Learning b/data/2024/aaai/UniAP: Towards Universal Animal Perception in Vision via Few-Shot Learning new file mode 100644 index 0000000000..693b9ecede --- /dev/null +++ b/data/2024/aaai/UniAP: Towards Universal Animal Perception in Vision via Few-Shot Learning @@ -0,0 +1 @@ +Animal visual perception is an important technique for automatically monitoring animal health, understanding animal behaviors, and assisting animal-related research. However, it is challenging to design a deep learning-based perception model that can freely adapt to different animals across various perception tasks, due to the varying poses of a large diversity of animals, lacking data on rare species, and the semantic inconsistency of different tasks. We introduce UniAP, a novel Universal Animal Perception model that leverages few-shot learning to enable cross-species perception among various visual tasks. Our proposed model takes support images and labels as prompt guidance for a query image. Images and labels are processed through a Transformer-based encoder and a lightweight label encoder, respectively. Then a matching module is designed for aggregating information between prompt guidance and the query image, followed by a multi-head label decoder to generate outputs for various tasks. By capitalizing on the shared visual characteristics among different animals and tasks, UniAP enables the transfer of knowledge from well-studied species to those with limited labeled data or even unseen species. We demonstrate the effectiveness of UniAP through comprehensive experiments in pose estimation, segmentation, and classification tasks on diverse animal species, showcasing its ability to generalize and adapt to new classes with minimal labeled examples. \ No newline at end of file diff --git a/data/2024/aaai/UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding b/data/2024/aaai/UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding new file mode 100644 index 0000000000..8b81da00d9 --- /dev/null +++ b/data/2024/aaai/UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding @@ -0,0 +1 @@ +The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models are restricted to generate speech only in a left-to-right direction, making them unsuitable for speech editing where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, which have audio quality limitations imposed by the performance of audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav. 
CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and maintain seamless concatenation with the surrounding context. Following that, CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into waveforms, taking into consideration the acoustic context. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing. Audio samples are available at https://cpdu.github.io/unicats. \ No newline at end of file diff --git a/data/2024/aaai/UniCell: Universal Cell Nucleus Classification via Prompt Learning b/data/2024/aaai/UniCell: Universal Cell Nucleus Classification via Prompt Learning new file mode 100644 index 0000000000..11f8b551b5 --- /dev/null +++ b/data/2024/aaai/UniCell: Universal Cell Nucleus Classification via Prompt Learning @@ -0,0 +1 @@ +The recognition of multi-class cell nuclei can significantly facilitate the process of histopathological diagnosis. Numerous pathological datasets are currently available, but their annotations are inconsistent. Most existing methods require individual training on each dataset to deduce the relevant labels and lack the use of common knowledge across datasets, consequently restricting the quality of recognition. In this paper, we propose a universal cell nucleus classification framework (UniCell), which employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains. In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets. Moreover, we develop a Dynamic Prompt Module (DPM) that exploits the properties of multiple datasets to enhance features. The DPM first integrates the embeddings of datasets and semantic categories, and then employs the integrated prompts to refine image representations, efficiently harvesting the shared knowledge among the related cell types and data sources. Experimental results demonstrate that the proposed method effectively achieves the state-of-the-art results on four nucleus detection and classification benchmarks. Code and models are available at https://github.com/lhaof/UniCell \ No newline at end of file diff --git a/data/2024/aaai/UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models b/data/2024/aaai/UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models new file mode 100644 index 0000000000..13ec965a5a --- /dev/null +++ b/data/2024/aaai/UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models @@ -0,0 +1 @@ +Generative information retrieval, encompassing two major tasks of Generative Document Retrieval (GDR) and Grounded Answer Generation (GAR), has gained significant attention in natural language processing. Existing methods for GDR and GAR rely on separate retrieval and reader modules, which hinder simultaneous optimization. To overcome this, we present UniGen, a Unified Generative framework for retrieval and question answering that integrates both tasks into a single generative model leveraging the capabilities of large language models. 
UniGen employs a shared encoder and two distinct decoders for generative retrieval and question answering. To facilitate the learning of both tasks, we introduce connectors, generated by large language models, to bridge the gaps between query inputs and generation targets, as well as between document identifiers and answers. Furthermore, we propose an iterative enhancement strategy that leverages generated answers and retrieved documents to iteratively improve both tasks. Through extensive experiments on the MS MARCO and NQ datasets, we demonstrate the effectiveness of UniGen, showcasing its superior performance in both retrieval and question answering tasks. \ No newline at end of file diff --git a/data/2024/aaai/Unified Framework for Diffusion Generative Models in SO(3): Applications in Computer Vision and Astrophysics b/data/2024/aaai/Unified Framework for Diffusion Generative Models in SO(3): Applications in Computer Vision and Astrophysics new file mode 100644 index 0000000000..c2b6fdc3af --- /dev/null +++ b/data/2024/aaai/Unified Framework for Diffusion Generative Models in SO(3): Applications in Computer Vision and Astrophysics @@ -0,0 +1 @@ +Diffusion-based generative models represent the current state-of-the-art for image generation. However, standard diffusion models are based on Euclidean geometry and do not translate directly to manifold-valued data. In this work, we develop extensions of both score-based generative models (SGMs) and Denoising Diffusion Probabilistic Models (DDPMs) to the Lie group of 3D rotations, SO(3). SO(3) is of particular interest in many disciplines such as robotics, biochemistry and astronomy/cosmology science. Contrary to more general Riemannian manifolds, SO(3) admits a tractable solution to heat diffusion, and allows us to implement efficient training of diffusion models. We apply both SO(3) DDPMs and SGMs to synthetic densities on SO(3) and demonstrate state-of-the-art results. Additionally, we demonstrate the practicality of our model on pose estimation tasks and in predicting correlated galaxy orientations for astrophysics/cosmology. \ No newline at end of file diff --git a/data/2024/aaai/Unify Named Entity Recognition Scenarios via Contrastive Real-Time Updating Prototype b/data/2024/aaai/Unify Named Entity Recognition Scenarios via Contrastive Real-Time Updating Prototype new file mode 100644 index 0000000000..a9b9e05fa5 --- /dev/null +++ b/data/2024/aaai/Unify Named Entity Recognition Scenarios via Contrastive Real-Time Updating Prototype @@ -0,0 +1 @@ +Supervised named entity recognition (NER) aims to classify entity mentions into a fixed number of pre-defined types. However, in real-world scenarios, unknown entity types continually emerge. Naive fine-tuning will result in catastrophic forgetting on old entity types. Existing continual methods usually depend on knowledge distillation to alleviate forgetting, which is less effective on long task sequences. Moreover, most of them are specific to the class-incremental scenario and cannot adapt to the online scenario, which is more common in practice. In this paper, we propose a unified framework called Contrastive Real-time Updating Prototype (CRUP) that can handle different scenarios for NER. Specifically, we train a Gaussian projection model with a regularized contrastive objective. After training on each batch, we store the mean vectors of representations belonging to new entity types as their prototypes. 
Meanwhile, we update existing prototypes belonging to old types based only on representations of the current batch. The final prototypes are used for nearest-class-mean classification. In this way, CRUP can handle different scenarios through its batch-wise learning. Moreover, CRUP can alleviate forgetting in continual scenarios only with current data instead of old data. To comprehensively evaluate CRUP, we construct extensive benchmarks based on various datasets. Experimental results show that CRUP significantly outperforms baselines in continual scenarios and is also competitive in the supervised scenario. \ No newline at end of file diff --git a/data/2024/aaai/Unifying Decision and Function Queries in Stochastic Boolean Satisfiability b/data/2024/aaai/Unifying Decision and Function Queries in Stochastic Boolean Satisfiability new file mode 100644 index 0000000000..e267c45067 --- /dev/null +++ b/data/2024/aaai/Unifying Decision and Function Queries in Stochastic Boolean Satisfiability @@ -0,0 +1 @@ +Stochastic Boolean satisfiability (SSAT) is a natural formalism for optimization under uncertainty. Its decision version implicitly imposes a final threshold quantification on an SSAT formula. However, the single threshold quantification restricts the expressive power of SSAT. In this work, we enrich SSAT with an additional threshold quantifier, resulting in a new formalism SSAT(θ). The increased expressiveness allows SSAT(θ), which remains in the PSPACE complexity class, to subsume and encode the languages in the counting hierarchy. An SSAT(θ) solver, ClauSSat(θ), is developed. Experiments show the applicability of the solver in uniquely solving complex SSAT(θ) instances of parameter synthesis and SSAT extension. \ No newline at end of file diff --git a/data/2024/aaai/Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification b/data/2024/aaai/Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification new file mode 100644 index 0000000000..8e8cb42a73 --- /dev/null +++ b/data/2024/aaai/Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification @@ -0,0 +1 @@ +Text-to-Image person re-identification (TI-ReID) aims to retrieve images of a target identity according to a given textual description. The existing methods in TI-ReID focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize the image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, leading to limited image-text relationship expression and semantic alignment. To address the above problem, in this paper, we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of pedestrians as Gaussian distributions, where the multi-granularity uncertainty of the distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. The multi-modal uncertainty modeling acts as a feature augmentation and provides a richer image-text semantic relationship. Then we present a bi-directional cross-modal circle loss to more effectively align the probabilistic features between image and text in a self-paced manner.
To further promote more comprehensive image-text semantic alignment, we design a task that complements the masked language modeling, focusing on the cross-modality semantic recovery of global masked token after cross-modal interaction. Extensive experiments conducted on three TI-ReID datasets highlight the effectiveness and superiority of our method over state-of-the-arts. \ No newline at end of file diff --git a/data/2024/aaai/Union Subgraph Neural Networks b/data/2024/aaai/Union Subgraph Neural Networks new file mode 100644 index 0000000000..f718c7734b --- /dev/null +++ b/data/2024/aaai/Union Subgraph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) are widely used for graph representation learning in many application domains. The expressiveness of vanilla GNNs is upper-bounded by 1-dimensional Weisfeiler-Leman (1-WL) test as they operate on rooted subtrees through iterative message passing. In this paper, we empower GNNs by injecting neighbor-connectivity information extracted from a new type of substructure. We first investigate different kinds of connectivities existing in a local neighborhood and identify a substructure called union subgraph, which is able to capture the complete picture of the 1-hop neighborhood of an edge. We then design a shortest-path-based substructure descriptor that possesses three nice properties and can effectively encode the high-order connectivities in union subgraphs. By infusing the encoded neighbor connectivities, we propose a novel model, namely Union Subgraph Neural Network (UnionSNN), which is proven to be strictly more powerful than 1-WL in distinguishing non-isomorphic graphs. Additionally, the local encoding from union subgraphs can also be injected into arbitrary message-passing neural networks (MPNNs) and Transformer-based models as a plugin. Extensive experiments on 18 benchmarks of both graph-level and node-level tasks demonstrate that UnionSNN outperforms state-of-the-art baseline models, with competitive computational efficiency. The injection of our local encoding to existing models is able to boost the performance by up to 11.09%. Our code is available at https://github.com/AngusMonroe/UnionSNN. \ No newline at end of file diff --git a/data/2024/aaai/Unit Selection with Nonbinary Treatment and Effect b/data/2024/aaai/Unit Selection with Nonbinary Treatment and Effect new file mode 100644 index 0000000000..d8828154ee --- /dev/null +++ b/data/2024/aaai/Unit Selection with Nonbinary Treatment and Effect @@ -0,0 +1 @@ +The unit selection problem aims to identify a set of individuals who are most likely to exhibit a desired mode of behavior or to evaluate the percentage of such individuals in a given population, for example, selecting individuals who would respond one way if encouraged and a different way if not encouraged. Using a combination of experimental and observational data, Li and Pearl solved the binary unit selection problem (binary treatment and effect) by deriving tight bounds on the "benefit function," which is the payoff/cost associated with selecting an individual with given characteristics. This paper extends the benefit function to the general form such that the treatment and effect are not restricted to binary. We then propose an algorithm to test the identifiability of the nonbinary benefit function and an algorithm to compute the bounds of the nonbinary benefit function using experimental and observational data. 
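To make the union-subgraph construction described above concrete, the following is a minimal Python sketch with networkx; the closed 1-hop induced subgraph and the shortest-path-length histogram used as a descriptor are illustrative assumptions, not the UnionSNN reference implementation.

```python
# Illustrative sketch (assumptions, not the UnionSNN reference code):
# for an edge (u, v), take the subgraph induced by the closed 1-hop
# neighborhoods of both endpoints, then describe it with a histogram
# of pairwise shortest-path lengths inside that subgraph.
import networkx as nx
from collections import Counter

def union_subgraph(G: nx.Graph, u, v) -> nx.Graph:
    nodes = set(G.neighbors(u)) | set(G.neighbors(v)) | {u, v}
    return G.subgraph(nodes).copy()

def shortest_path_descriptor(S: nx.Graph, max_len: int = 4) -> list:
    # Bin k counts node pairs at shortest-path distance k; distances larger
    # than max_len and disconnected pairs fall into the last bin.
    lengths = dict(nx.all_pairs_shortest_path_length(S))
    counts = Counter()
    nodes = list(S.nodes)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            d = lengths[a].get(b)
            counts[min(d, max_len) if d is not None else max_len] += 1
    return [counts.get(k, 0) for k in range(1, max_len + 1)]

G = nx.karate_club_graph()
S = union_subgraph(G, 0, 1)
print(S.number_of_nodes(), shortest_path_descriptor(S))
```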
\ No newline at end of file diff --git a/data/2024/aaai/United We Stand: Accelerating Privacy-Preserving Neural Inference by Conjunctive Optimization with Interleaved Nexus b/data/2024/aaai/United We Stand: Accelerating Privacy-Preserving Neural Inference by Conjunctive Optimization with Interleaved Nexus new file mode 100644 index 0000000000..f13ef5046b --- /dev/null +++ b/data/2024/aaai/United We Stand: Accelerating Privacy-Preserving Neural Inference by Conjunctive Optimization with Interleaved Nexus @@ -0,0 +1 @@ +Privacy-preserving Machine Learning as a Service (MLaaS) enables the powerful cloud server to run its well-trained neural model upon the input from resource-limited client, with both of server's model parameters and client's input data protected. While computation efficiency is critical for the practical implementation of privacy-preserving MLaaS and it is inspiring to witness recent advances towards efficiency improvement, there still exists a significant performance gap to real-world applications. In general, state-of-the-art frameworks perform function-wise efficiency optimization based on specific cryptographic primitives. Although it is logical, such independent optimization for each function makes noticeable amount of expensive operations unremovable and misses the opportunity to further accelerate the performance by jointly considering privacy-preserving computation among adjacent functions. As such, we propose COIN: Conjunctive Optimization with Interleaved Nexus, which remodels mainstream computation for each function to conjunctive counterpart for composite function, with a series of united optimization strategies. Specifically, COIN jointly computes a pair of consecutive nonlinear-linear functions in the neural model by reconstructing the intermediates throughout the whole procedure, which not only eliminates the most expensive crypto operations without invoking extra encryption enabler, but also makes the online crypto complexity independent of filter size. Experimentally, COIN demonstrates 11.2x to 29.6x speedup over various function dimensions from modern networks, and 6.4x to 12x speedup on the total computation time when applied in networks with model input from small-scale CIFAR10 to large-scale ImageNet. \ No newline at end of file diff --git a/data/2024/aaai/United We Stand: Using Epoch-Wise Agreement of Ensembles to Combat Overfit b/data/2024/aaai/United We Stand: Using Epoch-Wise Agreement of Ensembles to Combat Overfit new file mode 100644 index 0000000000..1e21f442e8 --- /dev/null +++ b/data/2024/aaai/United We Stand: Using Epoch-Wise Agreement of Ensembles to Combat Overfit @@ -0,0 +1,3 @@ +Deep neural networks have become the method of choice for solving many classification tasks, largely because they can fit very complex functions defined over raw data. The downside of such powerful learners is the danger of overfit. In this paper, we introduce a novel ensemble classifier for deep networks that effectively overcomes overfitting by combining models generated at specific intermediate epochs during training. Our method allows for the incorporation of useful knowledge obtained by the models during the overfitting phase without deterioration of the general performance, which is usually missed when early stopping is used. + +To motivate this approach, we begin with the theoretical analysis of a regression model, whose prediction - that the variance among classifiers increases when overfit occurs - is demonstrated empirically in deep networks in common use. 
Guided by these results, we construct a new ensemble-based prediction method, where the prediction is determined by the class that attains the most consensual prediction throughout the training epochs. Using multiple image and text classification datasets, we show that when regular ensembles suffer from overfit, our method eliminates the harmful reduction in generalization due to overfit, and often even surpasses the performance obtained by early stopping. Our method is easy to implement and can be integrated with any training scheme and architecture, without additional prior knowledge beyond the training set. It is thus a practical and useful tool to overcome overfit. \ No newline at end of file diff --git a/data/2024/aaai/Universal Weak Coreset b/data/2024/aaai/Universal Weak Coreset new file mode 100644 index 0000000000..1c2648d0bc --- /dev/null +++ b/data/2024/aaai/Universal Weak Coreset @@ -0,0 +1 @@ +Coresets for k-means and k-median problems yield a small summary of the data, which preserves the clustering cost with respect to any set of k centers. Recently coresets have also been constructed for constrained k-means and k-median problems. However, the notion of coresets has the drawback that (i) they can only be applied in settings where the input points are allowed to have weights, and (ii) in general metric spaces, the size of the coresets can depend logarithmically on the number of points. The notion of weak coresets, which has less stringent requirements than coresets, has been studied in the context of classical k-means and k-median problems. A weak coreset is a pair (J,S) of subsets of points, where S acts as a summary of the point set and J as a set of potential centers. This pair satisfies the properties that (i) S is a good summary of the data as long as the k centers are chosen from J only, and (ii) there is a good choice of k centers in J with a cost close to the optimal cost. We develop this framework, which we call universal weak coresets, for constrained clustering settings. In conjunction with recent coreset constructions for constrained settings, our designs give greater data compression, are conceptually simpler, and apply to a wide range of constrained k-median and k-means problems. \ No newline at end of file diff --git a/data/2024/aaai/Unknown-Aware Graph Regularization for Robust Semi-supervised Learning from Uncurated Data b/data/2024/aaai/Unknown-Aware Graph Regularization for Robust Semi-supervised Learning from Uncurated Data new file mode 100644 index 0000000000..32837c793f --- /dev/null +++ b/data/2024/aaai/Unknown-Aware Graph Regularization for Robust Semi-supervised Learning from Uncurated Data @@ -0,0 +1 @@ +Recent advances in semi-supervised learning (SSL) have relied on the optimistic assumption that labeled and unlabeled data share the same class distribution. However, this assumption is often violated in real-world scenarios, where unlabeled data may contain out-of-class samples. SSL with such uncurated unlabeled data leads training models to be corrupted. In this paper, we propose a robust SSL method for learning from uncurated real-world data within the context of open-set semi-supervised learning (OSSL). Unlike previous works that rely on feature similarity distance, our method exploits uncertainty in logits. By leveraging task-dependent predictions of logits, our method is capable of robust learning even in the presence of highly correlated outliers. 
Our key contribution is to present an unknown-aware graph regularization (UAG), a novel technique that enhances the performance of uncertainty-based OSSL frameworks. The technique addresses not only the conflict between training objectives for inliers and outliers but also the limitation of applying the same training rule to all outlier classes, both of which exist in previous uncertainty-based approaches. Extensive experiments demonstrate that UAG surpasses state-of-the-art OSSL methods by a large margin across various protocols. Code is available at https://github.com/heejokong/UAGreg. \ No newline at end of file diff --git a/data/2024/aaai/Unlocking the Power of Open Set: A New Perspective for Open-Set Noisy Label Learning b/data/2024/aaai/Unlocking the Power of Open Set: A New Perspective for Open-Set Noisy Label Learning new file mode 100644 index 0000000000..fb23300636 --- /dev/null +++ b/data/2024/aaai/Unlocking the Power of Open Set: A New Perspective for Open-Set Noisy Label Learning @@ -0,0 +1 @@ +Learning from noisy data has attracted much attention, where most methods focus on closed-set label noise. However, a more common scenario in the real world is the presence of both open-set and closed-set noise. Existing methods typically identify and handle these two types of label noise separately by designing a specific strategy for each type. However, in many real-world scenarios, it would be challenging to identify open-set examples, especially when the dataset has been severely corrupted. Unlike previous works, we explore how models behave when faced with open-set examples, and find that some open-set examples gradually get integrated into certain known classes, which is beneficial for the separation among known classes. Motivated by this phenomenon, we propose a novel two-step contrastive learning method, CECL (Class Expansion Contrastive Learning), which aims to deal with both types of label noise by exploiting the useful information of open-set examples. Specifically, we incorporate some open-set examples into closed-set classes to enhance performance while treating others as delimiters to improve representative ability. Extensive experiments on synthetic and real-world datasets with diverse label noise demonstrate the effectiveness of CECL. \ No newline at end of file diff --git a/data/2024/aaai/Unplugged K-12 AI Learning: Exploring Representation and Reasoning with a Facial Recognition Game b/data/2024/aaai/Unplugged K-12 AI Learning: Exploring Representation and Reasoning with a Facial Recognition Game new file mode 100644 index 0000000000..a38be68c43 --- /dev/null +++ b/data/2024/aaai/Unplugged K-12 AI Learning: Exploring Representation and Reasoning with a Facial Recognition Game @@ -0,0 +1 @@ +With the growing prevalence of AI, the need for K-12 AI education is becoming more crucial, which is prompting active research in developing engaging and age-appropriate AI learning activities. Efforts are underway, such as those by the AI4K12 initiative, to establish guidelines for organizing K-12 AI education; however, effective instructional resources are needed by educators. In this paper, we describe our work to design, develop, and implement an unplugged activity centered on facial recognition technology for middle school students. Facial recognition is integrated into a wide range of applications throughout daily life, which makes it a familiar and engaging tool for students and an effective medium for conveying AI concepts.
Our unplugged activity, “Guess Whose Face,” is designed as a board game that focuses on Representation and Reasoning from AI4K12’s 5 Big Ideas in AI. The game is crafted to enable students to develop AI competencies naturally through physical interaction. In the game, one student uses tracing paper to extract facial features from a familiar face shown on a card, such as a cartoon character or celebrity, and then other students try to guess the identity of the hidden face. We discuss details of the game, its iterative refinement, and initial findings from piloting the activity during a summer camp for rural middle school students. \ No newline at end of file diff --git a/data/2024/aaai/Unraveling Batch Normalization for Realistic Test-Time Adaptation b/data/2024/aaai/Unraveling Batch Normalization for Realistic Test-Time Adaptation new file mode 100644 index 0000000000..a45cc53975 --- /dev/null +++ b/data/2024/aaai/Unraveling Batch Normalization for Realistic Test-Time Adaptation @@ -0,0 +1 @@ +While recent test-time adaptations exhibit efficacy by adjusting batch normalization to narrow domain disparities, their effectiveness diminishes with realistic mini-batches due to inaccurate target estimation. As previous attempts merely introduce source statistics to mitigate this issue, the fundamental problem of inaccurate target estimation still persists, leaving the intrinsic test-time domain shifts unresolved. This paper delves into the problem of mini-batch degradation. By unraveling batch normalization, we discover that the inexact target statistics largely stem from the substantially reduced class diversity in batch. Drawing upon this insight, we introduce a straightforward tool, Test-time Exponential Moving Average (TEMA), to bridge the class diversity gap between training and testing batches. Importantly, our TEMA adaptively extends the scope of typical methods beyond the current batch to incorporate a diverse set of class information, which in turn boosts an accurate target estimation. Built upon this foundation, we further design a novel layer-wise rectification strategy to consistently promote test-time performance. Our proposed method enjoys a unique advantage as it requires neither training nor tuning parameters, offering a truly hassle-free solution. It significantly enhances model robustness against shifted domains and maintains resilience in diverse real-world scenarios with various batch sizes, achieving state-of-the-art performance on several major benchmarks. Code is available at https://github.com/kiwi12138/RealisticTTA. \ No newline at end of file diff --git a/data/2024/aaai/Unraveling Pain Levels: A Data-Uncertainty Guided Approach for Effective Pain Assessment b/data/2024/aaai/Unraveling Pain Levels: A Data-Uncertainty Guided Approach for Effective Pain Assessment new file mode 100644 index 0000000000..c3f185d5e4 --- /dev/null +++ b/data/2024/aaai/Unraveling Pain Levels: A Data-Uncertainty Guided Approach for Effective Pain Assessment @@ -0,0 +1 @@ +Pain, a primary reason for seeking medical help, requires essential pain assessment for effective management. Studies have recognized electrodermal activity (EDA) signaling's potential for automated pain assessment, but traditional algorithms often ignore the noise and uncertainty inherent in pain data. To address this, we propose a learning framework predicated on data uncertainty, introducing two forms: a) subject-level stimulation-reaction drift; b) ambiguity in self-reporting scores. 
We formulate an uncertainty assessment using Heart Rate Variability (HRV) features to guide the selection of responsive pain profiles and reweight subtask importance based on the vagueness of self-reported data. These methods are integrated within an end-to-end neural network learning paradigm, focusing the detector on more accurate insights within the uncertainty domain. Extensive experimentation on both the publicly available BioVid dataset and the proprietary Apon dataset demonstrates our approach's effectiveness. On the BioVid dataset, we achieved a 6% enhancement over the state-of-the-art methodology, and on the Apon dataset, our method outperformed baseline approaches by over 20%. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Action Segmentation via Fast Learning of Semantically Consistent Actoms b/data/2024/aaai/Unsupervised Action Segmentation via Fast Learning of Semantically Consistent Actoms new file mode 100644 index 0000000000..908285e0cd --- /dev/null +++ b/data/2024/aaai/Unsupervised Action Segmentation via Fast Learning of Semantically Consistent Actoms @@ -0,0 +1 @@ +Action segmentation serves as a pivotal component in comprehending videos, encompassing the learning of a sequence of semantically consistent action units known as actoms. Conventional methodologies tend to require significant time for both the training and learning phases. This paper introduces an innovative unsupervised framework for action segmentation in video, characterized by its fast learning capability and absence of mandatory training. The core idea involves splitting the video into distinct actoms, which are then merged based on shared actions. The key challenge here is to prevent the inadvertent creation of singular actoms that attempt to represent multiple actions during the splitting phase. Additionally, it is crucial to avoid situations where actoms associated with the same action are incorrectly grouped into multiple clusters during the merging phase. In this paper, we present a method for calculating the similarity between adjacent frames under a subspace assumption. Then, we employ a local minimum searching procedure, which effectively splits the video into coherent actoms aligned with their semantic meaning and provides an action segmentation proposal. Subsequently, we calculate a spatio-temporal similarity between actoms, followed by developing a merging process to merge actoms representing identical actions within the action segmentation proposals. Our approach is evaluated on four benchmark datasets, and the results demonstrate that our method achieves state-of-the-art performance. Moreover, our method achieves the best balance between accuracy and learning time compared to existing unsupervised techniques. Code is available at https://github.com/y66y/SaM. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Continual Anomaly Detection with Contrastively-Learned Prompt b/data/2024/aaai/Unsupervised Continual Anomaly Detection with Contrastively-Learned Prompt new file mode 100644 index 0000000000..a162a4114f --- /dev/null +++ b/data/2024/aaai/Unsupervised Continual Anomaly Detection with Contrastively-Learned Prompt @@ -0,0 +1 @@ +Unsupervised Anomaly Detection (UAD) with incremental training is crucial in industrial manufacturing, as unpredictable defects make obtaining sufficient labeled data infeasible.
However, continual learning methods primarily rely on supervised annotations, while the application in UAD is limited due to the absence of supervision. Current UAD methods train separate models for different classes sequentially, leading to catastrophic forgetting and a heavy computational burden. To address this issue, we introduce a novel Unsupervised Continual Anomaly Detection framework called UCAD, which equips the UAD with continual learning capability through contrastively-learned prompts. In the proposed UCAD, we design a Continual Prompting Module (CPM) by utilizing a concise key-prompt-knowledge memory bank to guide task-invariant 'anomaly' model predictions using task-specific 'normal' knowledge. Moreover, Structure-based Contrastive Learning (SCL) is designed with the Segment Anything Model (SAM) to improve prompt learning and anomaly segmentation results. Specifically, by treating SAM's masks as structure, we draw features within the same mask closer and push others apart for general feature representations. We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation, demonstrating that our method is significantly better than anomaly detection methods, even with rehearsal training. The code will be available at https://github.com/shirowalker/UCAD. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Cross-Domain Image Retrieval via Prototypical Optimal Transport b/data/2024/aaai/Unsupervised Cross-Domain Image Retrieval via Prototypical Optimal Transport new file mode 100644 index 0000000000..65e8c34b65 --- /dev/null +++ b/data/2024/aaai/Unsupervised Cross-Domain Image Retrieval via Prototypical Optimal Transport @@ -0,0 +1 @@ +Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images sharing the same category across diverse domains without relying on labeled data. Prior approaches have typically decomposed the UCIR problem into two distinct tasks: intra-domain representation learning and cross-domain feature alignment. However, these segregated strategies overlook the potential synergies between these tasks. This paper introduces ProtoOT, a novel Optimal Transport formulation explicitly tailored for UCIR, which integrates intra-domain feature representation learning and cross-domain alignment into a unified framework. ProtoOT leverages the strengths of the K-means clustering method to effectively manage distribution imbalances inherent in UCIR. By utilizing K-means for generating initial prototypes and approximating class marginal distributions, we modify the constraints in Optimal Transport accordingly, significantly enhancing its performance in UCIR scenarios. Furthermore, we incorporate contrastive learning into the ProtoOT framework to further improve representation learning. This encourages local semantic consistency among features with similar semantics, while also explicitly enforcing separation between features and unmatched prototypes, thereby enhancing global discriminativeness. ProtoOT surpasses existing state-of-the-art methods by a notable margin across benchmark datasets. Notably, on DomainNet, ProtoOT achieves an average P@200 enhancement of 24.44%, and on Office-Home, it demonstrates a P@15 improvement of 12.12%. Code is available at https://github.com/HCVLAB/ProtoOT. 
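As a rough illustration of how K-means prototypes and non-uniform class marginals can enter an entropic optimal-transport assignment of the kind sketched in the ProtoOT abstract above, here is a minimal numpy sketch; the cosine cost, the marginal estimate, and the Sinkhorn parameters are assumptions for illustration, not ProtoOT's actual formulation.

```python
# Illustrative sketch (assumptions, not the ProtoOT implementation):
# assign unlabeled features to K-means-style prototypes via entropic
# optimal transport, with the prototype-side marginal taken from rough
# cluster-size estimates instead of a uniform distribution.
import numpy as np

def sinkhorn(cost, r, c, eps=0.1, n_iter=200):
    # cost: (n, k) pairwise costs; r: (n,) source marginal; c: (k,) target marginal
    K = np.exp(-cost / eps)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan; rows give soft assignments

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos = feats[rng.choice(500, 10, replace=False)]        # stand-in for K-means centroids

cost = 1.0 - feats @ protos.T                             # cosine distance in [0, 2]
sizes = np.bincount(cost.argmin(axis=1), minlength=10) + 1
c = sizes / sizes.sum()                                   # imbalance-aware target marginal
r = np.full(500, 1.0 / 500)

plan = sinkhorn(cost, r, c)
print(plan.shape, np.bincount(plan.argmax(axis=1), minlength=10))
```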
\ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Domain Adaptative Temporal Sentence Localization with Mutual Information Maximization b/data/2024/aaai/Unsupervised Domain Adaptative Temporal Sentence Localization with Mutual Information Maximization new file mode 100644 index 0000000000..67404bd7b9 --- /dev/null +++ b/data/2024/aaai/Unsupervised Domain Adaptative Temporal Sentence Localization with Mutual Information Maximization @@ -0,0 +1 @@ +Temporal sentence localization (TSL) aims to localize a target segment in a video according to a given sentence query. Though prior works have made notable progress on this task, they rely heavily on abundant yet expensive manual annotations for training. Moreover, these trained data-dependent models usually cannot generalize well to unseen scenarios because of the inherent domain shift. To address this issue, in this paper, we target another more practical but challenging setting: unsupervised domain adaptative temporal sentence localization (UDA-TSL), which explores whether the localization knowledge can be transferred from a fully-annotated data domain (source domain) to a new unannotated data domain (target domain). Particularly, we propose an effective and novel baseline for UDA-TSL to bridge the multi-modal gap across different domains and learn the potential correspondence between the video-query pairs in the target domain. We first develop separate modality-specific domain adaptation modules to smoothly balance the minimization of the domain shifts in cross-dataset video and query domains. Then, to fully exploit the semantic correspondence of both modalities in the target domain for unsupervised localization, we devise a mutual information learning module to adaptively align the video-query pairs which are more likely to be relevant in the target domain, leading to more truly aligned target pairs and ensuring the discriminability of target features. In this way, our model can learn domain-invariant and semantic-aligned cross-modal representations. Three sets of migration experiments show that our model achieves competitive performance compared to existing methods. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Extractive Summarization with Learnable Length Control Strategies b/data/2024/aaai/Unsupervised Extractive Summarization with Learnable Length Control Strategies new file mode 100644 index 0000000000..7541edfbf6 --- /dev/null +++ b/data/2024/aaai/Unsupervised Extractive Summarization with Learnable Length Control Strategies @@ -0,0 +1 @@ +Unsupervised extractive summarization is an important technique in information extraction and retrieval. Compared with supervised methods, it does not require high-quality human-labelled summaries for training and thus can be easily applied to documents of different types, domains, or languages. Most existing unsupervised methods, including TextRank and PACSUM, rely on graph-based ranking of sentence centrality. However, this scorer cannot be directly applied in end-to-end training, and a position-related prior assumption is often needed to achieve good summaries. In addition, less attention has been paid to length-controllable extractors, where users can decide to summarize texts under a particular length constraint. This paper introduces an unsupervised extractive summarization model based on a siamese network, for which we develop a trainable bidirectional prediction objective between the selected summary and the original document.
Different from centrality-based ranking methods, our extractive scorer can be trained in an end-to-end manner without requiring any positional assumption. In addition, we introduce a differentiable length-control module that approximates a 0-1 knapsack solver for end-to-end length-controllable extraction. Experiments show that our unsupervised method largely outperforms the centrality-based baseline using the same sentence encoder. In terms of length control ability, via our trainable knapsack module, the performance consistently outperforms the strong baseline without utilizing end-to-end training. Human evaluation further confirms that our method performs best among the baselines in terms of relevance and consistency. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Gene-Cell Collective Representation Learning with Optimal Transport b/data/2024/aaai/Unsupervised Gene-Cell Collective Representation Learning with Optimal Transport new file mode 100644 index 0000000000..519c1e80e0 --- /dev/null +++ b/data/2024/aaai/Unsupervised Gene-Cell Collective Representation Learning with Optimal Transport @@ -0,0 +1 @@ +Cell type identification plays a vital role in single-cell RNA sequencing (scRNA-seq) data analysis. Although many deep embedded methods to cluster scRNA-seq data have been proposed, they still fail to elucidate the intrinsic properties of cells and genes. Here, we present a novel end-to-end deep graph clustering model for single-cell transcriptomics data based on unsupervised Gene-Cell Collective representation learning and Optimal Transport (scGCOT), which integrates both cell and gene correlations. Specifically, scGCOT learns the latent embedding of cells and genes simultaneously and reconstructs the cell graph, the gene graph, and the gene expression count matrix. A zero-inflated negative binomial (ZINB) model is estimated via the reconstructed count matrix to capture the essential properties of scRNA-seq data. By leveraging the optimal transport-based joint representation alignment, scGCOT learns the clustering process and the latent representations through a mutually supervised self-optimization strategy. Extensive experiments with 14 competing methods on 15 real scRNA-seq datasets demonstrate the competitive edges of scGCOT. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Group Re-identification via Adaptive Clustering-Driven Progressive Learning b/data/2024/aaai/Unsupervised Group Re-identification via Adaptive Clustering-Driven Progressive Learning new file mode 100644 index 0000000000..8bf44b6585 --- /dev/null +++ b/data/2024/aaai/Unsupervised Group Re-identification via Adaptive Clustering-Driven Progressive Learning @@ -0,0 +1 @@ +Group re-identification (G-ReID) aims to correctly associate groups with the same members captured by different cameras. However, supervised approaches for this task often suffer from the high cost of cross-camera sample labeling. Unsupervised methods based on clustering can avoid sample labeling, but the problem of member variations often makes clustering unstable, leading to incorrect pseudo-labels. To address these challenges, we propose an adaptive clustering-driven progressive learning approach (ACPL), which consists of a group adaptive clustering (GAC) module and a global dynamic prototype update (GDPU) module. Specifically, GAC defines a quasi-distance between groups, thus fully capitalizing on both individual-level and holistic information within groups.
In the case of great uncertainty in intra-group members, GAC effectively minimizes the impact of non-discriminative features and reduces the noise in the model's pseudo-labels. Additionally, our GDPU devises a dynamic weight to update the prototypes and effectively mine the hard samples with complex member variations, which improves the model's robustness. Extensive experiments conducted on four popular G-ReID datasets demonstrate that our method not only achieves state-of-the-art performance on unsupervised G-ReID but also performs comparably to several fully supervised approaches. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Layer-Wise Score Aggregation for Textual OOD Detection b/data/2024/aaai/Unsupervised Layer-Wise Score Aggregation for Textual OOD Detection new file mode 100644 index 0000000000..c5e0c3b91c --- /dev/null +++ b/data/2024/aaai/Unsupervised Layer-Wise Score Aggregation for Textual OOD Detection @@ -0,0 +1 @@ +Out-of-distribution (OOD) detection is a rapidly growing field due to new robustness and security requirements driven by an increased number of AI-based systems. Existing OOD textual detectors often rely on anomaly scores (\textit{e.g.}, Mahalanobis distance) computed on the embedding output of the last layer of the encoder. In this work, we observe that OOD detection performance varies greatly depending on the task and layer output. More importantly, we show that the usual choice (the last layer) is rarely the best one for OOD detection and that far better results can be achieved, provided that an oracle selects the best layer. We propose a data-driven, unsupervised method to leverage this observation to combine layer-wise anomaly scores. In addition, we extend classical textual OOD benchmarks by including classification tasks with a more significant number of classes (up to 150), which reflects more realistic settings. On this augmented benchmark, we show that the proposed post-aggregation methods achieve robust and consistent results comparable to using the best layer according to an oracle while removing manual feature selection altogether. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Neighborhood Propagation Kernel Layers for Semi-supervised Node Classification b/data/2024/aaai/Unsupervised Neighborhood Propagation Kernel Layers for Semi-supervised Node Classification new file mode 100644 index 0000000000..c3faa9f8c2 --- /dev/null +++ b/data/2024/aaai/Unsupervised Neighborhood Propagation Kernel Layers for Semi-supervised Node Classification @@ -0,0 +1 @@ +We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. The method is built of two main types of blocks: (i) We introduce unsupervised kernel machine layers propagating the node features in a one-hop neighborhood, using implicit node feature mappings. (ii) We specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. We derive an effective initialization scheme and efficient end-to-end training algorithm in the dual variables for the full architecture. The main idea underlying GCKM is that, because of the unsupervised core, the final model can achieve higher performance in semi-supervised node classification when few labels are available for training. Experimental results demonstrate the effectiveness of the proposed framework. 
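To illustrate the layer-wise anomaly-score aggregation idea from the textual OOD-detection abstract above, a toy numpy sketch follows; the per-layer Mahalanobis score and the z-normalized averaging are stand-in assumptions, not the paper's data-driven aggregator.

```python
# Toy sketch (assumptions, not the paper's data-driven aggregator):
# compute a Mahalanobis anomaly score per encoder layer, z-normalise each
# layer's scores against unlabeled in-domain data, and average across layers.
import numpy as np

def mahalanobis_scores(ref_feats, feats):
    mu = ref_feats.mean(axis=0)
    cov = np.cov(ref_feats, rowvar=False) + 1e-3 * np.eye(ref_feats.shape[1])
    prec = np.linalg.inv(cov)
    diff = feats - mu
    return np.einsum("nd,de,ne->n", diff, prec, diff)

def aggregate_layers(per_layer_ref, per_layer_test):
    combined = []
    for ref, test in zip(per_layer_ref, per_layer_test):
        s_ref = mahalanobis_scores(ref, ref)     # reference distribution of scores
        s_test = mahalanobis_scores(ref, test)
        combined.append((s_test - s_ref.mean()) / (s_ref.std() + 1e-8))
    return np.mean(combined, axis=0)             # higher value => more likely OOD

rng = np.random.default_rng(0)
ref_layers = [rng.normal(size=(1000, 16)) for _ in range(4)]          # in-domain embeddings
test_layers = [rng.normal(loc=1.5, size=(50, 16)) for _ in range(4)]  # shifted inputs
print(aggregate_layers(ref_layers, test_layers)[:5])
```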
\ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Object Interaction Learning with Counterfactual Dynamics Models b/data/2024/aaai/Unsupervised Object Interaction Learning with Counterfactual Dynamics Models new file mode 100644 index 0000000000..55d7fe36dc --- /dev/null +++ b/data/2024/aaai/Unsupervised Object Interaction Learning with Counterfactual Dynamics Models @@ -0,0 +1 @@ +We present COIL (Counterfactual Object Interaction Learning), a novel way of learning object-interaction skills in entity-centric environments. The goal is to learn primitive behaviors that can induce interactions without external reward or any supervision. Existing skill discovery methods are limited to locomotion, simple navigation tasks, or single-object manipulation tasks, mostly not inducing interaction between objects. Unlike a monolithic representation usually used in prior skill learning methods, we propose to use a structured goal representation that can query and scope which objects to interact with, which can serve as a basis for solving more complex downstream tasks. We design a novel counterfactual intrinsic reward through the use of either a forward model or successor features that can learn an interaction skill between a pair of objects given as a goal. Through experiments on continuous control environments such as Magnetic Block and 2.5-D Stacking Box, we demonstrate that an agent can learn object interaction behaviors (e.g., attaching or stacking one block to another) without any external rewards or domain-specific knowledge. \ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Pan-Sharpening via Mutually Guided Detail Restoration b/data/2024/aaai/Unsupervised Pan-Sharpening via Mutually Guided Detail Restoration new file mode 100644 index 0000000000..2c5e30a541 --- /dev/null +++ b/data/2024/aaai/Unsupervised Pan-Sharpening via Mutually Guided Detail Restoration @@ -0,0 +1 @@ +Pan-sharpening is a task that aims to super-resolve the low-resolution multispectral (LRMS) image with the guidance of a corresponding high-resolution panchromatic (PAN) image. The key challenge in pan-sharpening is to accurately model the relationship between the MS and PAN images. While supervised deep learning methods are commonly employed to address this task, the unavailability of ground truth severely limits their effectiveness. In this paper, we propose a mutually guided detail restoration method for unsupervised pan-sharpening. Specifically, we treat pan-sharpening as a blind image deblurring task, in which the blur kernel can be estimated by a CNN. Constrained by the blur kernel, the pan-sharpened image retains spectral information consistent with the LRMS image. Once the pan-sharpened image is obtained, the PAN image is blurred using a pre-defined blur operator. The pan-sharpened image, in turn, is used to guide the detail restoration of the blurred PAN image. By leveraging the mutual guidance between MS and PAN images, the pan-sharpening network can implicitly learn the spatial relationship between the two modalities. Extensive experiments show that the proposed method significantly outperforms existing unsupervised pan-sharpening methods.
\ No newline at end of file diff --git a/data/2024/aaai/Unsupervised Training Sequence Design: Efficient and Generalizable Agent Training b/data/2024/aaai/Unsupervised Training Sequence Design: Efficient and Generalizable Agent Training new file mode 100644 index 0000000000..cb536f72c2 --- /dev/null +++ b/data/2024/aaai/Unsupervised Training Sequence Design: Efficient and Generalizable Agent Training @@ -0,0 +1 @@ +To train generalizable Reinforcement Learning (RL) agents, researchers recently proposed the Unsupervised Environment Design (UED) framework, in which a teacher agent creates a very large number of training environments and a student agent trains on the experiences in these environments to be robust against unseen testing scenarios. For example, to train a student to master the “stepping over stumps” task, the teacher will create numerous training environments with varying stump heights and shapes. In this paper, we argue that UED neglects training efficiency and its need for a very large number of environments (henceforth referred to as infinite horizon training) makes it less suitable for training robots and non-expert humans. In real-world applications where either creating new training scenarios is expensive or training efficiency is of critical importance, we want to maximize both the learning efficiency and learning outcome of the student. To achieve efficient finite horizon training, we propose a novel Markov Decision Process (MDP) formulation for the teacher agent, referred to as Unsupervised Training Sequence Design (UTSD). Specifically, we encode salient information from the student policy (e.g., behaviors and learning progress) into the teacher's state space, enabling the teacher to closely track the student's learning progress and consequently discover the optimal training sequences with finite lengths. Additionally, we explore the teacher's efficient adaptation to unseen students at test time by employing a context-based meta-learning approach, which leverages the teacher's past experiences with various students. Finally, we empirically demonstrate our teacher's capability to design efficient and effective training sequences for students with varying capabilities. \ No newline at end of file diff --git a/data/2024/aaai/Unveiling Details in the Dark: Simultaneous Brightening and Zooming for Low-Light Image Enhancement b/data/2024/aaai/Unveiling Details in the Dark: Simultaneous Brightening and Zooming for Low-Light Image Enhancement new file mode 100644 index 0000000000..dc6956b120 --- /dev/null +++ b/data/2024/aaai/Unveiling Details in the Dark: Simultaneous Brightening and Zooming for Low-Light Image Enhancement @@ -0,0 +1 @@ +Existing super-resolution methods exhibit limitations when applied to nighttime scenes, primarily due to their lack of adaptation to low dynamic range and noise-heavy dark-light images. In response, this research introduces an innovative customized framework to simultaneously Brighten and Zoom in low-resolution images captured in low-light conditions, dubbed BrZoNet. The core method begins by feeding low-light, low-resolution images and their corresponding ground truths into the Retinex-induced siamese decoupling network. This process yields distinct reflectance maps and illuminance maps, guided by supervision from the ground truth’s decomposition maps. Subsequently, these reflectance and illuminance maps transition into an intricate super-resolution sub-network.
This sub-network employs a meticulously designed cross-layer content-aware interactor - Illumination-aware Interaction Unit(IaIU), elegantly endowed with a gating mechanism. The IaIU facilitates meaningful feature interaction between illuminance and reflectance features while effectively reducing unwanted noise. An intricate super-resolution cage is also constructed to comprehensively integrate information, ultimately resulting in the generation of high-resolution images featuring intricate details. Thorough and diverse experiments validate the superiority of the proposed BrZoNet, surpassing contemporary cutting-edge technologies by proficiently augmenting brightness and intricately recovering complex details, showcasing advancements of 7.1% in PSNR, 2.4% in SSIM, and an impressive 36.8% in LPIPS metrics. \ No newline at end of file diff --git a/data/2024/aaai/Unveiling Implicit Deceptive Patterns in Multi-Modal Fake News via Neuro-Symbolic Reasoning b/data/2024/aaai/Unveiling Implicit Deceptive Patterns in Multi-Modal Fake News via Neuro-Symbolic Reasoning new file mode 100644 index 0000000000..a16a0379b5 --- /dev/null +++ b/data/2024/aaai/Unveiling Implicit Deceptive Patterns in Multi-Modal Fake News via Neuro-Symbolic Reasoning @@ -0,0 +1 @@ +In the current Internet landscape, the rampant spread of fake news, particularly in the form of multi-modal content, poses a great social threat. While automatic multi-modal fake news detection methods have shown promising results, the lack of explainability remains a significant challenge. Existing approaches provide superficial explainability by displaying learned important components or views from well-trained networks, but they often fail to uncover the implicit deceptive patterns that reveal how fake news is fabricated. To address this limitation, we begin by predefining three typical deceptive patterns, namely image manipulation, cross-modal inconsistency, and image repurposing, which shed light on the mechanisms underlying fake news fabrication. Then, we propose a novel Neuro-Symbolic Latent Model called NSLM, that not only derives accurate judgments on the veracity of news but also uncovers the implicit deceptive patterns as explanations. Specifically, the existence of each deceptive pattern is expressed as a two-valued learnable latent variable, which is acquired through amortized variational inference and weak supervision based on symbolic logic rules. Additionally, we devise pseudo-siamese networks to capture distinct deceptive patterns effectively. Experimental results on two real-world datasets demonstrate that our NSLM achieves the best performance in fake news detection while providing insightful explanations of deceptive patterns. \ No newline at end of file diff --git a/data/2024/aaai/Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning b/data/2024/aaai/Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning new file mode 100644 index 0000000000..83d1452f1b --- /dev/null +++ b/data/2024/aaai/Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning @@ -0,0 +1 @@ +Toddlers evolve from free exploration with sparse feedback to exploiting prior experiences for goal-directed learning with denser rewards. 
Drawing inspiration from this Toddler-Inspired Reward Transition, we set out to explore the implications of varying reward transitions when incorporated into Reinforcement Learning (RL) tasks. Central to our inquiry is the transition from sparse to potential-based dense rewards, which share optimal strategies regardless of reward changes. Through various experiments, including those in egocentric navigation and robotic arm manipulation tasks, we found that proper reward transitions significantly influence sample efficiency and success rates. Of particular note is the efficacy of the toddler-inspired Sparse-to-Dense (S2D) transition. Beyond these performance metrics, using Cross-Density Visualizer technique, we observed that transitions, especially the S2D, smooth the policy loss landscape, promoting wide minima that enhance generalization in RL models. \ No newline at end of file diff --git a/data/2024/aaai/Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability b/data/2024/aaai/Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability new file mode 100644 index 0000000000..1aea471009 --- /dev/null +++ b/data/2024/aaai/Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability @@ -0,0 +1 @@ +Automatic Essay Scoring (AES) is a well-established educational pursuit that employs machine learning to evaluate student-authored essays. While much effort has been made in this area, current research primarily focuses on either (i) boosting the predictive accuracy of an AES model for a specific prompt (i.e., developing prompt-specific models), which often heavily relies on the use of the labeled data from the same target prompt; or (ii) assessing the applicability of AES models developed on non-target prompts to the intended target prompt (i.e., developing the AES models in a cross-prompt setting). Given the inherent bias in machine learning and its potential impact on marginalized groups, it is imperative to investigate whether such bias exists in current AES methods and, if identified, how it intervenes with an AES model's accuracy and generalizability. Thus, our study aimed to uncover the intricate relationship between an AES model's accuracy, fairness, and generalizability, contributing practical insights for developing effective AES models in real-world education. To this end, we meticulously selected nine prominent AES methods and evaluated their performance using seven distinct metrics on an open-sourced dataset, which contains over 25,000 essays and various demographic information about students such as gender, English language learner status, and economic status. Through extensive evaluations, we demonstrated that: (1) prompt-specific models tend to outperform their cross-prompt counterparts in terms of predictive accuracy; (2) prompt-specific models frequently exhibit a greater bias towards students of different economic statuses compared to cross-prompt models; (3) in the pursuit of generalizability, traditional machine learning models (e.g., SVM) coupled with carefully engineered features hold greater potential for achieving both high accuracy and fairness than complex neural network models. 
\ No newline at end of file diff --git a/data/2024/aaai/Upper Bounding Barlow Twins: A Novel Filter for Multi-Relational Clustering b/data/2024/aaai/Upper Bounding Barlow Twins: A Novel Filter for Multi-Relational Clustering new file mode 100644 index 0000000000..4e12754218 --- /dev/null +++ b/data/2024/aaai/Upper Bounding Barlow Twins: A Novel Filter for Multi-Relational Clustering @@ -0,0 +1 @@ +Multi-relational clustering is a challenging task due to the fact that diverse semantic information conveyed in multi-layer graphs is difficult to extract and fuse. Recent methods integrate topology structure and node attribute information through graph filtering. However, they often use a low-pass filter without fully considering the correlation among multiple graphs. To overcome this drawback, we propose to learn a graph filter motivated by the theoretical analysis of Barlow Twins. We find that input with a negative semi-definite inner product provides a lower bound for Barlow Twins loss, which prevents it from reaching a better solution. We thus learn a filter that yields an upper bound for Barlow Twins. Afterward, we design a simple clustering architecture and demonstrate its state-of-the-art performance on four benchmark datasets. The source code is available at https://github.com/XweiQ/BTGF. \ No newline at end of file diff --git a/data/2024/aaai/Urban Region Embedding via Multi-View Contrastive Prediction b/data/2024/aaai/Urban Region Embedding via Multi-View Contrastive Prediction new file mode 100644 index 0000000000..e1cfc96750 --- /dev/null +++ b/data/2024/aaai/Urban Region Embedding via Multi-View Contrastive Prediction @@ -0,0 +1 @@ +Recently, learning urban region representations utilizing multi-modal data (information views) has become increasingly popular, for deep understanding of the distributions of various socioeconomic features in cities. However, previous methods usually blend multi-view information in a posteriors stage, falling short in learning coherent and consistent representations across different views. In this paper, we form a new pipeline to learn consistent representations across varying views, and propose the multi-view Contrastive Prediction model for urban Region embedding (ReCP), which leverages the multiple information views from point-of-interest (POI) and human mobility data. Specifically, ReCP comprises two major modules, namely an intra-view learning module utilizing contrastive learning and feature reconstruction to capture the unique information from each single view, and inter-view learning module that perceives the consistency between the two views using a contrastive prediction learning scheme. We conduct thorough experiments on two downstream tasks to assess the proposed model, i.e., land use clustering and region popularity prediction. The experimental results demonstrate that our model outperforms state-of-the-art baseline methods significantly in urban region representation learning. 
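As a hedged sketch of the kind of cross-view contrastive objective described in the ReCP abstract above (the InfoNCE form, temperature, and synthetic features are assumptions for illustration, not the paper's implementation):

```python
# Hedged sketch (assumptions, not the ReCP implementation): an InfoNCE-style
# cross-view term that pulls the two views of the same region together and
# pushes apart views of different regions.
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    # z1, z2: (n_regions, d) embeddings of the same regions from two views
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                  # diagonal entries are positives
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
poi_view = rng.normal(size=(180, 64))                           # e.g. POI-based features
mobility_view = poi_view + 0.1 * rng.normal(size=(180, 64))     # correlated mobility view
print(info_nce(poi_view, mobility_view))
```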
\ No newline at end of file diff --git a/data/2024/aaai/Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health b/data/2024/aaai/Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health new file mode 100644 index 0000000000..be3d1f3f08 --- /dev/null +++ b/data/2024/aaai/Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health @@ -0,0 +1 @@ +Digital mental health (DMH) interventions, such as text-message-based lessons and activities, offer immense potential for accessible mental health support. While these interventions can be effective, real-world experimental testing can further enhance their design and impact. Adaptive experimentation, utilizing algorithms like Thompson Sampling for (contextual) multi-armed bandit (MAB) problems, can lead to continuous improvement and personalization. However, it remains unclear when these algorithms can simultaneously increase user experience rewards and facilitate appropriate data collection for social-behavioral scientists to analyze with sufficient statistical confidence. Although a growing body of research addresses the practical and statistical aspects of MAB and other adaptive algorithms, further exploration is needed to assess their impact across diverse real-world contexts. This paper presents a software system developed over two years that allows text-messaging intervention components to be adapted using bandit and other algorithms while collecting data for side-by-side comparison with traditional uniform random non-adaptive experiments. We evaluate the system by deploying a text-message-based DMH intervention to 1100 users, recruited through a large mental health non-profit organization, and share the path forward for deploying this system at scale. This system not only enables applications in mental health but could also serve as a model testbed for adaptive experimentation algorithms in other domains. \ No newline at end of file diff --git a/data/2024/aaai/Using Artificial Populations to Study Psychological Phenomena in Neural Models b/data/2024/aaai/Using Artificial Populations to Study Psychological Phenomena in Neural Models new file mode 100644 index 0000000000..4239f77e70 --- /dev/null +++ b/data/2024/aaai/Using Artificial Populations to Study Psychological Phenomena in Neural Models @@ -0,0 +1 @@ +The recent proliferation of research into transformer based natural language processing has led to a number of studies which attempt to detect the presence of human-like cognitive behavior in the models. We contend that, as is true of human psychology, the investigation of cognitive behavior in language models must be conducted in an appropriate population of an appropriate size for the results to be meaningful. We leverage work in uncertainty estimation in a novel approach to efficiently construct experimental populations. The resultant tool, PopulationLM, has been made open source. We provide theoretical grounding in the uncertainty estimation literature and motivation from current cognitive work regarding language models. We discuss the methodological lessons from other scientific communities and attempt to demonstrate their application to two artificial population studies. Through population based experimentation we find that language models exhibit behavior consistent with typicality effects among categories highly represented in training. However, we find that language models don't tend to exhibit structural priming effects. 
Generally, our results show that single models tend to over estimate the presence of cognitive behaviors in neural models. \ No newline at end of file diff --git a/data/2024/aaai/Using Clustering to Strengthen Decision Diagram Bounds for Discrete Optimization b/data/2024/aaai/Using Clustering to Strengthen Decision Diagram Bounds for Discrete Optimization new file mode 100644 index 0000000000..adfa5ced9e --- /dev/null +++ b/data/2024/aaai/Using Clustering to Strengthen Decision Diagram Bounds for Discrete Optimization @@ -0,0 +1 @@ +Offering a generic approach to obtaining both upper and lower bounds, decision diagrams (DDs) are becoming an increasingly important tool for solving discrete optimization problems. In particular, they provide a powerful and often complementary alternative to other well-known generic bounding mechanisms such as the LP relaxation. A standard approach to employ DDs for discrete optimization is to formulate the problem as a Dynamic Program and use that formulation to compile a DD top-down in a layer-by-layer fashion. To limit the size of the resulting DD and to obtain bounds, one typically imposes a maximum width for each layer which is then enforced by either merging nodes (resulting in a so-called relaxed DD that provides a dual bound) or by dropping nodes (resulting in a so-called restricted DD that provides a primal bound). The quality of the DD bounds obtained from this top-down compilation process heavily depends on the heuristics used for the selection of the nodes to merge or drop. While it is sometimes possible to engineer problem-specific heuristics for this selection problem, the most generic approach relies on sorting the layer’s nodes based on objective function information. In this paper, we propose a generic and problem-agnostic approach that relies on clustering nodes based on the state information associated with each node. In a set of computational experiments with different knapsack and scheduling problems, we show that our approach generally outperforms the classical generic approach, and often achieves drastically better bounds both with respect to the size of the DD and the time used for compiling the DD. \ No newline at end of file diff --git a/data/2024/aaai/Using Reinforcement Learning to Iteratively Construct Road Networks from Satellite Images and GPS Data b/data/2024/aaai/Using Reinforcement Learning to Iteratively Construct Road Networks from Satellite Images and GPS Data new file mode 100644 index 0000000000..c0e26a269b --- /dev/null +++ b/data/2024/aaai/Using Reinforcement Learning to Iteratively Construct Road Networks from Satellite Images and GPS Data @@ -0,0 +1 @@ +Constructing road networks manually is a time consuming and labor-intensive process. This paper proposes a new method to iteratively construct road networks using reinforcement learning from a combined tensor-based representation of satellite image and GPS trajectory data. \ No newline at end of file diff --git a/data/2024/aaai/Using Stratified Sampling to Improve LIME Image Explanations b/data/2024/aaai/Using Stratified Sampling to Improve LIME Image Explanations new file mode 100644 index 0000000000..85f67b04de --- /dev/null +++ b/data/2024/aaai/Using Stratified Sampling to Improve LIME Image Explanations @@ -0,0 +1,5 @@ +We investigate the use of a stratified sampling approach for LIME Image, a popular model-agnostic explainable AI method for computer vision tasks, in order to reduce the artifacts generated by typical Monte Carlo sampling. 
+Such artifacts are due to the undersampling of the dependent variable in the synthetic neighborhood around the image being explained, which may result in inadequate explanations due to the impossibility of fitting a linear regressor on the sampled data. +We then highlight a connection with the Shapley theory, where similar arguments about undersampling and sample relevance were suggested in the past. +We derive all the formulas and adjustment factors required for an unbiased stratified sampling estimator. +Experiments show the efficacy of the proposed approach. \ No newline at end of file diff --git a/data/2024/aaai/Using Symmetries to Lift Satisfiability Checking b/data/2024/aaai/Using Symmetries to Lift Satisfiability Checking new file mode 100644 index 0000000000..1f964809e6 --- /dev/null +++ b/data/2024/aaai/Using Symmetries to Lift Satisfiability Checking @@ -0,0 +1,4 @@ +We analyze how symmetries can be used to compress structures (also known as interpretations) onto a smaller domain without loss of information. This analysis suggests the possibility to solve satisfiability problems in the compressed domain for better performance. Thus, we propose a 2-step novel method: (i) the sentence to be satisfied is automatically translated into an equisatisfiable sentence over a ``lifted'' vocabulary that allows domain compression; (ii) satisfiability of the lifted sentence is checked by growing the (initially unknown) compressed domain until a satisfying structure is found. +The key issue is to ensure that this satisfying structure can always be expanded into an uncompressed structure that satisfies the original sentence to be satisfied. + +We present an adequate translation for sentences in typed first-order logic extended with aggregates. Our experimental evaluation shows large speedups for generative configuration problems. The method also has applications in the verification of software operating on complex data structures. Our results justify further research in automatic translation of sentences for symmetry reduction. \ No newline at end of file diff --git a/data/2024/aaai/V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models b/data/2024/aaai/V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models new file mode 100644 index 0000000000..ddc4bd0267 --- /dev/null +++ b/data/2024/aaai/V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models @@ -0,0 +1 @@ +Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. 
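As a rough illustration of how such a CLIP-CLAP domain gap could be probed (this is an assumed protocol, not the one used in the V2A-Mapper paper), one might compare paired CLIP image embeddings and CLAP audio embeddings with cosine similarity. The arrays below are assumed to be precomputed and projected to a shared dimensionality.

import numpy as np

# Assumed precomputed, paired embeddings: clip_img (N, D) from a CLIP image
# encoder and clap_aud (N, D) from a CLAP audio encoder, same D after projection.
def mean_paired_cosine(clip_img: np.ndarray, clap_aud: np.ndarray) -> float:
    a = clip_img / np.linalg.norm(clip_img, axis=1, keepdims=True)
    b = clap_aud / np.linalg.norm(clap_aud, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

# A low mean similarity for matched pairs (relative to mismatched pairs)
# would indicate a large gap between the two latent spaces.
rng = np.random.default_rng(0)
clip_img, clap_aud = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
print(mean_paired_cosine(clip_img, clap_aud))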
Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively. Supplementary materials such as audio samples are provided at our demo website: https://v2a-mapper.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/V2Meow: Meowing to the Visual Beat via Video-to-Music Generation b/data/2024/aaai/V2Meow: Meowing to the Visual Beat via Video-to-Music Generation new file mode 100644 index 0000000000..35b0f93492 --- /dev/null +++ b/data/2024/aaai/V2Meow: Meowing to the Visual Beat via Video-to-Music Generation @@ -0,0 +1 @@ +Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow. \ No newline at end of file diff --git a/data/2024/aaai/VITA: 'Carefully Chosen and Weighted Less' Is Better in Medication Recommendation b/data/2024/aaai/VITA: 'Carefully Chosen and Weighted Less' Is Better in Medication Recommendation new file mode 100644 index 0000000000..2afca6b705 --- /dev/null +++ b/data/2024/aaai/VITA: 'Carefully Chosen and Weighted Less' Is Better in Medication Recommendation @@ -0,0 +1 @@ +We address the medication recommendation problem, which aims to recommend effective medications for a patient's current visit by utilizing information (e.g., diagnoses and procedures) given at the patient's current and past visits. 
While there exist a number of recommender systems designed for this problem, we point out that they are challenged in accurately capturing the relation (spec., the degree of relevance) between the current and each of the past visits for the patient when obtaining her current health status, which is the basis for recommending medications. To address this limitation, we propose a novel medication recommendation framework, named VITA, based on the following two novel ideas: (1) relevant-Visit selectIon; (2) Target-aware Attention. Through extensive experiments using real-world datasets, we demonstrate the superiority of VITA (spec., up to 5.67% higher accuracy, in terms of Jaccard, than the best competitor) and the effectiveness of its two core ideas. The code is available at https://github.com/jhheo0123/VITA. \ No newline at end of file diff --git a/data/2024/aaai/VIXEN: Visual Text Comparison Network for Image Difference Captioning b/data/2024/aaai/VIXEN: Visual Text Comparison Network for Image Difference Captioning new file mode 100644 index 0000000000..f8f014228c --- /dev/null +++ b/data/2024/aaai/VIXEN: Visual Text Comparison Network for Image Difference Captioning @@ -0,0 +1 @@ +We present VIXEN - a technique that succinctly summarizes in text the visual differences between a pair of images in order to highlight any content manipulation present. Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model. We address the challenge of low volume of training data and lack of manipulation variety in existing image difference captioning (IDC) datasets by training on synthetically manipulated images from the recent InstructPix2Pix dataset generated via the prompt-to-prompt editing framework. We augment this dataset with change summaries produced via GPT-3. We show that VIXEN produces state-of-the-art, comprehensible difference captions for diverse image contents and edit types, offering a potential mitigation against misinformation disseminated via manipulated image content. Code and data are available at http://github.com/alexblck/vixen \ No newline at end of file diff --git a/data/2024/aaai/VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding b/data/2024/aaai/VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding new file mode 100644 index 0000000000..45b2c55c8d --- /dev/null +++ b/data/2024/aaai/VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding @@ -0,0 +1 @@ +Vision and language foundation models (VLMs) have showcased impressive capabilities in 2D scene understanding. However, their latent potential in elevating the understanding of 3D autonomous driving scenes remains untapped. In this paper, we propose VLM2Scene, which exploits the potential of VLMs to enhance 3D self-supervised representation learning through our proposed image-text-LiDAR contrastive learning strategy. Specifically, in the realm of autonomous driving scenes, the inherent sparsity of LiDAR point clouds poses a notable challenge for point-level contrastive learning methods. These methods often grapple with limitations tied to a restricted receptive field and the presence of noisy points. To tackle this challenge, our approach emphasizes region-level learning, leveraging regional masks without semantics derived from the vision foundation model.
This approach capitalizes on valuable contextual information to enhance the learning of point cloud representations. First, we introduce Region Caption Prompts to generate fine-grained language descriptions for the corresponding regions, utilizing the language foundation model. These region prompts then facilitate the establishment of positive and negative text-point pairs within the contrastive loss framework. Second, we propose a Region Semantic Concordance Regularization, which involves a semantic-filtered region learning and a region semantic assignment strategy. The former aims to filter the false negative samples based on the semantic distance, and the latter mitigates potential inaccuracies in pixel semantics, thereby enhancing overall semantic consistency. Extensive experiments on representative autonomous driving datasets demonstrate that our self-supervised method significantly outperforms other counterparts. Codes are available at https://github.com/gbliao/VLM2Scene. \ No newline at end of file diff --git a/data/2024/aaai/VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation b/data/2024/aaai/VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation new file mode 100644 index 0000000000..19f87bf765 --- /dev/null +++ b/data/2024/aaai/VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation @@ -0,0 +1 @@ +Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions. The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data. To address these issues, we propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos in multiple cities in the U.S. augmented with automatically generated navigation instructions and actions to improve outdoor VLN performance. VLN-Video combines the best of intuitive classical approaches and modern deep learning techniques, using template infilling to generate grounded non-repetitive navigation instructions, combined with an image rotation similarity based navigation action predictor to obtain VLN style data from driving videos for pretraining deep learning VLN models. We pre-train the model on the Touchdown dataset and our video-augmented dataset created from driving videos with three proxy tasks: Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction, so as to learn temporally-aware and visually-aligned instruction representations. The learned instruction representation is adapted to the state-of-the-art navigation agent when fine-tuning on the Touchdown dataset. Empirical results demonstrate that VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate, achieving a new state-of-the-art on the Touchdown dataset. \ No newline at end of file diff --git a/data/2024/aaai/VPDETR: End-to-End Vanishing Point DEtection TRansformers b/data/2024/aaai/VPDETR: End-to-End Vanishing Point DEtection TRansformers new file mode 100644 index 0000000000..f056675dcc --- /dev/null +++ b/data/2024/aaai/VPDETR: End-to-End Vanishing Point DEtection TRansformers @@ -0,0 +1 @@ +In the field of vanishing point detection, previous works commonly relied on extracting and clustering straight lines or classifying candidate points as vanishing points. 
This paper proposes a novel end-to-end framework, called VPDETR (Vanishing Point DEtection TRansformer), that views vanishing point detection as a set prediction problem, applicable to both Manhattan and non-Manhattan world datasets. By using the positional embedding of anchor points as queries in Transformer decoders and dynamically updating them layer by layer, our method is able to directly input images and output their vanishing points without the need for explicit straight line extraction and candidate points sampling. Additionally, we introduce an orthogonal loss and a cross-prediction loss to improve accuracy on the Manhattan world datasets. Experimental results demonstrate that VPDETR achieves competitive performance compared to state-of-the-art methods, without requiring post-processing. \ No newline at end of file diff --git a/data/2024/aaai/VQ-FONT: Few-Shot Font Generation with Structure-Aware Enhancement and Quantization b/data/2024/aaai/VQ-FONT: Few-Shot Font Generation with Structure-Aware Enhancement and Quantization new file mode 100644 index 0000000000..f5cff4d9a7 --- /dev/null +++ b/data/2024/aaai/VQ-FONT: Few-Shot Font Generation with Structure-Aware Enhancement and Quantization @@ -0,0 +1 @@ +Few-shot font generation is challenging, as it needs to capture the fine-grained stroke styles from a limited set of reference glyphs, and then transfer to other characters, which are expected to have similar styles. However, due to the diversity and complexity of Chinese font styles, the synthesized glyphs of existing methods usually exhibit visible artifacts, such as missing details and distorted strokes. In this paper, we propose a VQGAN-based framework (i.e., VQ-Font) to enhance glyph fidelity through token prior refinement and structure-aware enhancement. Specifically, we pre-train a VQGAN to encapsulate the font token prior within a codebook. Subsequently, VQ-Font refines the synthesized glyphs with the codebook to eliminate the domain gap between synthesized and real-world strokes. Furthermore, our VQ-Font leverages the inherent design of Chinese characters, where structure components such as radicals and character components are combined in specific arrangements, to recalibrate fine-grained styles based on references. This process improves the matching and fusion of styles at the structure level. Both modules collaborate to enhance the fidelity of the generated fonts. Experiments on a collected font dataset show that our VQ-Font outperforms the competing methods both quantitatively and qualitatively, especially in generating challenging styles. Our code is available at https://github.com/Yaomingshuai/VQ-Font. \ No newline at end of file diff --git a/data/2024/aaai/VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models b/data/2024/aaai/VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models new file mode 100644 index 0000000000..eacf9bc3ac --- /dev/null +++ b/data/2024/aaai/VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models @@ -0,0 +1,2 @@ +Visual Question Answering (VQA) is a fundamental task in the computer vision and natural language processing fields. Although the “pre-training & finetuning” learning paradigm significantly improves the VQA performance, the adversarial robustness of such a learning paradigm has not been explored.
In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack the target VQA models. Correspondingly, we propose a novel VQATTACK model, which can iteratively generate both image and text perturbations with the designed modules: the large language model (LLM)-enhanced image attack and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes the latent representation-based loss to generate feature-level image perturbations. Then it incorporates an LLM to further enhance the image perturbations by optimizing the designed masked answer anti-recovery loss. The cross-modal joint attack module will be triggered at a specific iteration, which updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and word synonym-based substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQATTACK in the transferable attack setting, compared with state-of-the-art baselines. This work reveals
+a significant blind spot in the “pre-training & fine-tuning” paradigm on VQA tasks. The source code can be found at https://github.com/ericyinyzy/VQAttack. \ No newline at end of file diff --git a/data/2024/aaai/VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook b/data/2024/aaai/VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook new file mode 100644 index 0000000000..a207f11eba --- /dev/null +++ b/data/2024/aaai/VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook @@ -0,0 +1,3 @@ +Night photography often struggles with challenges like low light and blurring, stemming from dark environments and prolonged exposures. Current methods either disregard priors and directly fit end-to-end networks, leading to inconsistent illumination, or rely on unreliable handcrafted priors to constrain the network, thereby introducing greater error into the final result. We believe in the strength of data-driven high-quality priors and strive to offer a reliable and consistent prior, circumventing the restrictions of manual priors.
+In this paper, we propose Clearer Night Image Restoration with Vector-Quantized Codebook (VQCNIR) to achieve remarkable and consistent restoration outcomes on real-world and synthetic benchmarks. To ensure the faithful restoration of details and illumination, we propose the incorporation of two essential modules: the Adaptive Illumination Enhancement Module (AIEM) and the Deformable Bi-directional Cross-Attention (DBCA) module. The AIEM leverages the inter-channel correlation of features to dynamically maintain illumination consistency between degraded features and high-quality codebook features. Meanwhile, the DBCA module effectively integrates texture and structural information through bi-directional cross-attention and deformable convolution, resulting in enhanced fine-grained detail and structural fidelity across parallel decoders.
+Extensive experiments validate the remarkable benefits of VQCNIR in enhancing image quality under low-light conditions, showcasing its state-of-the-art performance on both synthetic and real-world datasets. The code is available at https://github.com/AlexZou14/VQCNIR.
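As background for the vector-quantized codebook that VQCNIR builds on, here is a minimal, generic VQ-VAE-style quantizer sketch (nearest-codebook-entry lookup with a straight-through gradient). The codebook size and feature dimension are illustrative assumptions, and this is not the authors' implementation.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Generic nearest-neighbour codebook lookup with a straight-through estimator."""
    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (N, dim) continuous encoder features.
        dists = torch.cdist(z, self.codebook.weight)   # (N, num_codes) distances
        idx = dists.argmin(dim=1)                       # nearest code per feature
        z_q = self.codebook(idx)                        # quantized features
        # Straight-through: forward pass uses z_q, gradients flow back to z.
        return z + (z_q - z).detach()

# Illustrative usage on random features.
vq = VectorQuantizer()
print(vq(torch.randn(4, 256)).shape)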
\ No newline at end of file diff --git a/data/2024/aaai/VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression b/data/2024/aaai/VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression new file mode 100644 index 0000000000..edca879af7 --- /dev/null +++ b/data/2024/aaai/VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression @@ -0,0 +1 @@ +In content-based video retrieval (CBVR), dealing with large-scale collections, efficiency is as important as accuracy; thus, several video-level feature-based studies have actively been conducted. Nevertheless, owing to the severe difficulty of embedding a lengthy and untrimmed video into a single feature, these studies have been insufficient for accurate retrieval compared to frame-level feature-based studies. In this paper, we show that appropriate suppression of irrelevant frames can provide insight into the current obstacles of the video-level approaches. Furthermore, we propose a Video-to-Video Suppression network (VVS) as a solution. VVS is an end-to-end framework that consists of an easy distractor elimination stage to identify which frames to remove and a suppression weight generation stage to determine the extent to suppress the remaining frames. This structure is intended to effectively describe an untrimmed video with varying content and meaningless information. Its efficacy is proved via extensive experiments, and we show that our approach is not only state-of-the-art in video-level approaches but also has a fast inference time despite possessing retrieval capabilities close to those of frame-level approaches. Code is available at https://github.com/sejong-rcv/VVS \ No newline at end of file diff --git a/data/2024/aaai/Validation, Robustness, and Accuracy of Perturbation-Based Sensitivity Analysis Methods for Time-Series Deep Learning Models b/data/2024/aaai/Validation, Robustness, and Accuracy of Perturbation-Based Sensitivity Analysis Methods for Time-Series Deep Learning Models new file mode 100644 index 0000000000..98612fa01b --- /dev/null +++ b/data/2024/aaai/Validation, Robustness, and Accuracy of Perturbation-Based Sensitivity Analysis Methods for Time-Series Deep Learning Models @@ -0,0 +1 @@ +This work undertakes studies to evaluate Interpretability Methods for Time Series Deep Learning. Sensitivity analysis assesses how input changes affect the output, constituting a key component of interpretation. Among the post-hoc interpretation methods such as back-propagation, perturbation, and approximation, my work will investigate perturbation-based sensitivity Analysis methods on modern Transformer models to benchmark their performances. Specifically, my work intends to answer three research questions: 1) Do different sensitivity analysis methods yield comparable outputs and attribute importance rankings? 2) Using the same sensitivity analysis method, do different Deep Learning models impact the output of the sensitivity analysis? 3) How well do the results from sensitivity analysis methods align with the ground truth? \ No newline at end of file diff --git a/data/2024/aaai/Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties b/data/2024/aaai/Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties new file mode 100644 index 0000000000..a91324512f --- /dev/null +++ b/data/2024/aaai/Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties @@ -0,0 +1,5 @@ +Human values are crucial to human decision-making. 
Value pluralism is the view that multiple correct values may be held in tension with one another (e.g., when considering lying to a friend to protect their feelings, how does one balance honesty with friendship?). As statistical learners, AI systems fit to averages by default, washing out these potentially irreducible value conflicts. To improve AI systems to better reflect value pluralism, the first-order challenge is to explore the extent to which AI systems can model pluralistic human values, rights, and duties as well as their interaction.
+
+We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations. ValuePrism’s contextualized values are generated by GPT-4 and deemed high-quality by human annotators 91% of the time. We conduct a large-scale study with annotators across diverse social and demographic backgrounds to try to understand whose values are represented.
+
+With ValuePrism, we build Value Kaleidoscope (or Kaleido), an open, light-weight, and structured language-based multi-task model that generates, explains, and assesses the relevance and valence (i.e., support or oppose) of human values, rights, and duties within a specific context. Humans prefer the sets of values output by our system over the teacher GPT-4, finding them more accurate and with broader coverage. In addition, we demonstrate that Kaleido can help explain variability in human decision-making by outputting contrasting values. Finally, we show that Kaleido’s representations transfer to other philosophical frameworks and datasets, confirming the benefit of an explicit, modular, and interpretable approach to value pluralism. We hope that our work will serve as a step to making more explicit the implicit values behind human decision-making and to steering AI systems to make decisions that are more in accordance with them. \ No newline at end of file diff --git a/data/2024/aaai/Value at Adversarial Risk: A Graph Defense Strategy against Cost-Aware Attacks b/data/2024/aaai/Value at Adversarial Risk: A Graph Defense Strategy against Cost-Aware Attacks new file mode 100644 index 0000000000..9103a07ad8 --- /dev/null +++ b/data/2024/aaai/Value at Adversarial Risk: A Graph Defense Strategy against Cost-Aware Attacks @@ -0,0 +1 @@ +Deep learning methods on graph data have achieved remarkable efficacy across a variety of real-world applications, such as social network analysis and transaction risk detection. Nevertheless, recent studies have illuminated a concerning fact: even the most expressive Graph Neural Networks (GNNs) are vulnerable to graph adversarial attacks. While several methods have been proposed to enhance the robustness of GNN models against adversarial attacks, few have focused on a simple yet realistic approach: valuing the adversarial risks and focusing safeguards at the node level. This empowers defenders to allocate a heightened security level to vulnerable nodes and a lower one to robust nodes. With this new perspective, we propose a novel graph defense strategy, RisKeeper, such that the adversarial risk can be directly kept in the input graph. We start by valuing the adversarial risk, introducing a cost-aware projected gradient descent attack that takes into account both cost avoidance and compliance with cost budgets. Subsequently, we present a learnable approach to ascertain the ideal security level for each individual node by solving a bi-level optimization problem.
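To ground the projected gradient descent (PGD) machinery that the RisKeeper abstract above builds its cost-aware attack on, here is a minimal, generic PGD sketch on continuous inputs with an L-infinity budget. The cost-aware projection onto discrete graph edits described in the paper is more involved and is not shown; all names and hyperparameters here are illustrative assumptions.

import torch

def pgd_attack(model, x, y, loss_fn, eps=0.05, alpha=0.01, steps=10):
    """Generic L-infinity PGD: ascend the loss, then project back into the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()          # gradient ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project into the budget
    return x_adv.detach()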
Through extensive experiments on four real-world datasets, we demonstrate that our method achieves superior performance surpassing state-of-the-art methods. Our in-depth case studies provide further insights into vulnerable and robust structural patterns, serving as inspiration for practitioners to exercise heightened vigilance. \ No newline at end of file diff --git a/data/2024/aaai/Variable Importance in High-Dimensional Settings Requires Grouping b/data/2024/aaai/Variable Importance in High-Dimensional Settings Requires Grouping new file mode 100644 index 0000000000..cdedfe9d02 --- /dev/null +++ b/data/2024/aaai/Variable Importance in High-Dimensional Settings Requires Grouping @@ -0,0 +1 @@ +Explaining the decision process of machine learning algorithms is nowadays crucial for both model’s performance enhancement and human comprehension. This can be achieved by assessing the variable importance of single variables, even for high-capacity non-linear methods, e.g. Deep Neural Networks (DNNs). While only removal-based approaches, such as Permutation Importance (PI), can bring statistical validity, they return misleading results when variables are correlated. Conditional Permutation Importance (CPI) bypasses PI’s limitations in such cases. However, in high-dimensional settings, where high correlations between the variables cancel their conditional importance, the use of CPI as well as other methods leads to unreliable results, besides prohibitive computation costs. Grouping variables statistically via clustering or some prior knowledge gains some power back and leads to better interpretations. In this work, we introduce BCPI (Block-Based Conditional Permutation Importance), a new generic framework for variable importance computation with statistical guarantees handling both single and group cases. Furthermore, as handling groups with high cardinality (such as a set of observations of a given modality) are both time-consuming and resource-intensive, we also introduce a new stacking approach extending the DNN architecture with sub-linear layers adapted to the group structure. We show that the ensuing approach extended with stacking controls the type-I error even with highly-correlated groups and shows top accuracy across benchmarks. Furthermore, we perform a real-world data analysis in a large-scale medical dataset where we aim to show the consistency between our results and the literature for a biomarker prediction. \ No newline at end of file diff --git a/data/2024/aaai/Variance-Insensitive and Target-Preserving Mask Refinement for Interactive Image Segmentation b/data/2024/aaai/Variance-Insensitive and Target-Preserving Mask Refinement for Interactive Image Segmentation new file mode 100644 index 0000000000..1d785cd244 --- /dev/null +++ b/data/2024/aaai/Variance-Insensitive and Target-Preserving Mask Refinement for Interactive Image Segmentation @@ -0,0 +1 @@ +Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing. However, fully extracting the target mask with limited user inputs remains challenging. We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement to enhance segmentation quality with fewer user inputs. Regarding the last segmentation result as the initial mask, an iterative refinement process is commonly employed to continually enhance the initial mask. Nevertheless, conventional techniques suffer from sensitivity to the variance in the initial mask. 
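For the Permutation Importance (PI) baseline named in the "Variable Importance in High-Dimensional Settings Requires Grouping" abstract above, here is a minimal sketch of single-variable permutation importance (illustrative only; it does not implement the proposed BCPI or its grouped, conditional variant).

import numpy as np

def permutation_importance(model, X, y, score_fn, n_repeats=5, seed=0):
    """Drop in score when each column is shuffled; a larger drop means a more important feature."""
    rng = np.random.default_rng(seed)
    base = score_fn(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between feature j and y
            drops.append(base - score_fn(y, model.predict(Xp)))
        importances[j] = np.mean(drops)
    return importances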
To circumvent this problem, our proposed method incorporates a mask matching algorithm for ensuring consistent inferences from different types of initial masks. We also introduce a target-aware zooming algorithm to preserve object information during downsampling, balancing efficiency and accuracy. Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation. \ No newline at end of file diff --git a/data/2024/aaai/Variational Hybrid-Attention Framework for Multi-Label Few-Shot Aspect Category Detection b/data/2024/aaai/Variational Hybrid-Attention Framework for Multi-Label Few-Shot Aspect Category Detection new file mode 100644 index 0000000000..2fbe86142c --- /dev/null +++ b/data/2024/aaai/Variational Hybrid-Attention Framework for Multi-Label Few-Shot Aspect Category Detection @@ -0,0 +1 @@ +Multi-label few-shot aspect category detection (FS-ACD) is a challenging sentiment analysis task, which aims to learn a multi-label learning paradigm with limited training data. The difficulty of this task is how to use limited data to generalize effective discriminative representations for different categories. Nowadays, all advanced FS-ACD works utilize the prototypical network to learn label prototypes to represent different aspects. However, such point-based estimation methods are inherently noise-susceptible and bias-vulnerable. To this end, this paper proposes a novel Variational Hybrid-Attention Framework (VHAF) for the FS-ACD task. Specifically, to alleviate the data noise, we adopt a hybrid-attention mechanism to generate more discriminative aspect-specific embeddings. Then, based on these embeddings, we introduce the variational distribution inference to obtain the aspect-specific distribution as a more robust aspect representation, which can eliminate the scarce data bias for better inference. Moreover, we further leverage an adaptive threshold estimation to help VHAF better identify multiple relevant aspects. Extensive experiments on three datasets demonstrate the effectiveness of our VHAF over other state-of-the-art methods. Code is available at https://github.com/chengzju/VHAF. \ No newline at end of file diff --git a/data/2024/aaai/Vector Field Oriented Diffusion Model for Crystal Material Generation b/data/2024/aaai/Vector Field Oriented Diffusion Model for Crystal Material Generation new file mode 100644 index 0000000000..3db63d791b --- /dev/null +++ b/data/2024/aaai/Vector Field Oriented Diffusion Model for Crystal Material Generation @@ -0,0 +1 @@ +Discovering crystal structures with specific chemical properties has become an increasingly important focus in material science. However, current models are limited in their ability to generate new crystal lattices, as they only consider atomic positions or chemical composition. To address this issue, we propose a probabilistic diffusion model that utilizes a geometrically equivariant GNN to consider atomic positions and crystal lattices jointly. To evaluate the effectiveness of our model, we introduce a new generation metric inspired by Frechet Inception Distance, but based on GNN energy prediction rather than InceptionV3 used in computer vision. In addition to commonly used metrics like validity, which assesses the plausibility of a structure, this new metric offers a more comprehensive evaluation of our model's capabilities. Our experiments on existing benchmarks show the significance of our diffusion model. 
We also show that our method can effectively learn meaningful representations. \ No newline at end of file diff --git a/data/2024/aaai/VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch b/data/2024/aaai/VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch new file mode 100644 index 0000000000..7cb8ce8afc --- /dev/null +++ b/data/2024/aaai/VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch @@ -0,0 +1 @@ +AI's widespread integration has led to neural network (NN) deployment on edge and similar limited-resource platforms for safety-critical scenarios. Yet, NN's fragility raises concerns about reliable inference. Moreover, constrained platforms demand compact networks. This study introduces VeriCompress, a tool that automates the search and training of compressed models with robustness guarantees. These models are well-suited for safety-critical applications and adhere to predefined architecture and size limitations, making them deployable on resource-restricted platforms. The method trains models 2-3 times faster than the state-of-the-art approaches, surpassing them by average accuracy and robustness gains of 15.1 and 9.8 percentage points, respectively. When deployed on a resource-restricted generic platform, these models require 5-8 times less memory and 2-4 times less inference time than models used in verified robustness literature. Our comprehensive evaluation across various model architectures and datasets, including MNIST, CIFAR, SVHN, and a relevant pedestrian detection dataset, showcases VeriCompress's capacity to identify compressed verified robust models with reduced computation overhead compared to current standards. This underscores its potential as a valuable tool for end users, such as developers of safety-critical applications on edge or Internet of Things platforms, empowering them to create suitable models for safety-critical, resource-constrained platforms in their respective domains. \ No newline at end of file diff --git a/data/2024/aaai/ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization b/data/2024/aaai/ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization new file mode 100644 index 0000000000..30e33363c7 --- /dev/null +++ b/data/2024/aaai/ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization @@ -0,0 +1 @@ +Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance in many downstream tasks. Since adapting contrastive video-text pre-training methods like CLIP to video tasks is limited by their high cost and scale, recent approaches focus on efficiently transferring the image-based CLIP to the video domain. A major finding is that fine-tuning the pre-trained model to achieve strong fully supervised performance leads to poor zero-shot, few-shot, and base-to-novel generalization. Instead, freezing the backbone network to maintain generalization ability weakens fully supervised performance. Moreover, no single prompt-tuning branch consistently performs optimally. In this work, we propose a multimodal prompt learning scheme that balances supervised and generalized performance.
Our prompting approach contains three sections: 1) Independent prompts on both the vision and text branches to learn the language and visual contexts. 2) Inter-modal prompt mapping to ensure mutual synergy. 3) Reducing the discrepancy between the hand-crafted prompt (a video of a person doing [CLS]) and the learnable prompt, to alleviate forgetting of essential video scenarios. Extensive validation in fully supervised, zero-shot, few-shot, and base-to-novel generalization settings for video recognition indicates that the proposed approach achieves competitive performance with lower computational cost. \ No newline at end of file diff --git a/data/2024/aaai/ViSTec: Video Modeling for Sports Technique Recognition and Tactical Analysis b/data/2024/aaai/ViSTec: Video Modeling for Sports Technique Recognition and Tactical Analysis new file mode 100644 index 0000000000..2f8d6c32b9 --- /dev/null +++ b/data/2024/aaai/ViSTec: Video Modeling for Sports Technique Recognition and Tactical Analysis @@ -0,0 +1 @@ +The immense popularity of racket sports has fueled substantial demand for tactical analysis with broadcast videos. However, existing manual methods require laborious annotation, and recent attempts leveraging video perception models are limited to low-level annotations like ball trajectories, overlooking tactics that necessitate an understanding of stroke techniques. State-of-the-art action segmentation models also struggle with technique recognition due to frequent occlusions and motion-induced blurring in racket sports videos. To address these challenges, we propose ViSTec, a Video-based Sports Technique recognition model inspired by human cognition that synergizes sparse visual data with rich contextual insights. Our approach integrates a graph to explicitly model strategic knowledge in stroke sequences and enhance technique recognition with contextual inductive bias. A two-stage action perception model is jointly trained to align with the contextual knowledge in the graph. Experiments demonstrate that our method outperforms existing models by a significant margin. Case studies with experts from the Chinese national table tennis team validate our model's capacity to automate analysis for technical actions and tactical strategies. More details are available at: https://ViSTec2024.github.io/. \ No newline at end of file diff --git a/data/2024/aaai/ViT-Calibrator: Decision Stream Calibration for Vision Transformer b/data/2024/aaai/ViT-Calibrator: Decision Stream Calibration for Vision Transformer new file mode 100644 index 0000000000..de52f1ef9e --- /dev/null +++ b/data/2024/aaai/ViT-Calibrator: Decision Stream Calibration for Vision Transformer @@ -0,0 +1 @@ +A surge of interest has emerged in utilizing Transformers in diverse vision tasks owing to their formidable performance. However, existing approaches primarily focus on optimizing internal model architecture designs that often entail significant trial and error with high burdens. In this work, we propose a new paradigm dubbed Decision Stream Calibration that boosts the performance of general Vision Transformers. To achieve this, we shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
Upon further analysis, it was discovered that 1) the final decision is associated with tokens of foreground targets, while token features of foreground target will be transmitted into the next layer as much as possible, and the useless token features of background area will be eliminated gradually in the forward propagation. 2) Each category is solely associated with specific sparse dimensions in the tokens. Based on the discoveries mentioned above, we designed a two-stage calibration scheme, namely ViT-Calibrator, including token propagation calibration stage and dimension propagation calibration stage. Extensive experiments on commonly used datasets show that the proposed approach can achieve promising results. \ No newline at end of file diff --git a/data/2024/aaai/ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining b/data/2024/aaai/ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining new file mode 100644 index 0000000000..da4ecba5e8 --- /dev/null +++ b/data/2024/aaai/ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining @@ -0,0 +1 @@ +Scene text removal (STR) aims at replacing text strokes in natural scenes with visually coherent backgrounds. Recent STR approaches rely on iterative refinements or explicit text masks, resulting in high complexity and sensitivity to the accuracy of text localization. Moreover, most existing STR methods adopt convolutional architectures while the potential of vision Transformers (ViTs) remains largely unexplored. In this paper, we propose a simple-yet-effective ViT-based text eraser, dubbed ViTEraser. Following a concise encoder-decoder framework, ViTEraser can easily incorporate various ViTs to enhance long-range modeling. Specifically, the encoder hierarchically maps the input image into the hidden space through ViT blocks and patch embedding layers, while the decoder gradually upsamples the hidden features to the text-erased image with ViT blocks and patch splitting layers. As ViTEraser implicitly integrates text localization and inpainting, we propose a novel end-to-end pretraining method, termed SegMIM, which focuses the encoder and decoder on the text box segmentation and masked image modeling tasks, respectively. Experimental results demonstrate that ViTEraser with SegMIM achieves state-of-the-art performance on STR by a substantial margin and exhibits strong generalization ability when extended to other tasks, e.g., tampered scene text detection. Furthermore, we comprehensively explore the architecture, pretraining, and scalability of the ViT-based encoder-decoder for STR, which provides deep insights into the application of ViT to the STR field. Code is available at https://github.com/shannanyinxiang/ViTEraser. \ No newline at end of file diff --git a/data/2024/aaai/ViTree: Single-Path Neural Tree for Step-Wise Interpretable Fine-Grained Visual Categorization b/data/2024/aaai/ViTree: Single-Path Neural Tree for Step-Wise Interpretable Fine-Grained Visual Categorization new file mode 100644 index 0000000000..1c317bb0a4 --- /dev/null +++ b/data/2024/aaai/ViTree: Single-Path Neural Tree for Step-Wise Interpretable Fine-Grained Visual Categorization @@ -0,0 +1 @@ +As computer vision continues to advance and finds widespread applications across various domains, the need for interpretability in deep learning models becomes paramount. 
Existing methods often resort to post-hoc techniques or prototypes to explain the decision-making process, which can be indirect and lack intrinsic illustration. In this research, we introduce ViTree, a novel approach for fine-grained visual categorization that combines the popular vision transformer as a feature extraction backbone with neural decision trees. By traversing the tree paths, ViTree effectively selects patches from transformer-processed features to highlight informative local regions, thereby refining representations in a step-wise manner. Unlike previous tree-based models that rely on soft distributions or ensembles of paths, ViTree selects a single tree path, offering a clearer and simpler decision-making process. This patch and path selectivity enhances the interpretability of ViTree, enabling better insights into the model's inner workings. Remarkably, extensive experimentation validates that this streamlined approach surpasses various strong competitors and achieves state-of-the-art performance while maintaining exceptional interpretability, as validated from multiple perspectives. Code can be found at https://github.com/SJTU-DeepVisionLab/ViTree. \ No newline at end of file diff --git a/data/2024/aaai/Video Event Extraction with Multi-View Interaction Knowledge Distillation b/data/2024/aaai/Video Event Extraction with Multi-View Interaction Knowledge Distillation new file mode 100644 index 0000000000..52f4683ed6 --- /dev/null +++ b/data/2024/aaai/Video Event Extraction with Multi-View Interaction Knowledge Distillation @@ -0,0 +1 @@ +Video event extraction (VEE) aims to extract key events and generate the event arguments for their semantic roles from the video. Although promising results have been achieved by existing methods, they still lack an elaborate learning strategy to adequately consider: (1) inter-object interaction, which reflects the relation between objects; (2) inter-modality interaction, which aligns the features from the text and video modalities. In this paper, we propose a Multi-view Interaction with knowledge Distillation (MID) framework to solve the above problems with the Knowledge Distillation (KD) mechanism. Specifically, we propose the self-Relational KD (self-RKD) to enhance the inter-object interaction, where the relation between objects is measured by a distance metric, and the high-level relational knowledge from the deeper layer is taken as the guidance for boosting the shallow layer in the video encoder. Meanwhile, to improve the inter-modality interaction, the Layer-to-layer KD (LKD) is proposed, which integrates additional cross-modal supervisions (i.e., the results of cross-attention) with the textual supervising signal for training each transformer decoder layer. Extensive experiments show that without any additional parameters, MID achieves state-of-the-art performance compared to other strong methods in VEE. \ No newline at end of file diff --git a/data/2024/aaai/Video Frame Prediction from a Single Image and Events b/data/2024/aaai/Video Frame Prediction from a Single Image and Events new file mode 100644 index 0000000000..2dc85a55d4 --- /dev/null +++ b/data/2024/aaai/Video Frame Prediction from a Single Image and Events @@ -0,0 +1 @@ +Recently, the task of Video Frame Prediction (VFP), which predicts future video frames from previous ones through extrapolation, has made remarkable progress.
However, the performance of existing VFP methods is still far from satisfactory due to the fixed framerate video used: 1) they have difficulties in handling complex dynamic scenes; 2) they cannot predict future frames with flexible prediction time intervals. The event cameras can record the intensity changes asynchronously with a very high temporal resolution, which provides rich dynamic information about the observed scenes. In this paper, we propose to predict video frames from a single image and the following events, which can not only handle complex dynamic scenes but also predict future frames with flexible prediction time intervals. First, we introduce a symmetrical cross-modal attention augmentation module to enhance the complementary information between images and events. Second, we propose to jointly achieve optical flow estimation and frame generation by combining the motion information of events and the semantic information of the image, then inpainting the holes produced by forward warping to obtain an ideal prediction frame. Based on these, we propose a lightweight pyramidal coarse-to-fine model that can predict a 720P frame within 25 ms. Extensive experiments show that our proposed model significantly outperforms the state-of-the-art frame-based and event-based VFP methods and has the fastest runtime. Code is available at https://npucvr.github.io/VFPSIE/. \ No newline at end of file diff --git a/data/2024/aaai/Video-Context Aligned Transformer for Video Question Answering b/data/2024/aaai/Video-Context Aligned Transformer for Video Question Answering new file mode 100644 index 0000000000..e3296fcf05 --- /dev/null +++ b/data/2024/aaai/Video-Context Aligned Transformer for Video Question Answering @@ -0,0 +1 @@ +Video question answering involves understanding video content to generate accurate answers to questions. Recent studies have successfully modeled video features and achieved diverse multimodal interaction, yielding impressive outcomes. However, they have overlooked the fact that the video contains richer instances and events beyond the scope of the stated question. Extremely imbalanced alignment of information from both sides leads to significant instability in reasoning. To address this concern, we propose the Video-Context Aligned Transformer (V-CAT), which leverages the context to achieve semantic and content alignment between video and question. Specifically, the video and text are encoded into a shared semantic space initially. We apply contrastive learning to global video token and context token to enhance the semantic alignment. Then, the pooled context feature is utilized to obtain corresponding visual content. Finally, the answer is decoded by integrating the refined video and question features. We evaluate the effectiveness of V-CAT on MSVD-QA and MSRVTT-QA dataset, both achieving state-of-the-art performance. Extended experiments further analyze and demonstrate the effectiveness of each proposed module. \ No newline at end of file diff --git a/data/2024/aaai/Virtual Action Actor-Critic Framework for Exploration (Student Abstract) b/data/2024/aaai/Virtual Action Actor-Critic Framework for Exploration (Student Abstract) new file mode 100644 index 0000000000..59feb7011c --- /dev/null +++ b/data/2024/aaai/Virtual Action Actor-Critic Framework for Exploration (Student Abstract) @@ -0,0 +1 @@ +Efficient exploration for an agent is challenging in reinforcement learning (RL). 
In this paper, a novel actor-critic framework, namely virtual action actor-critic (VAAC), is proposed to address the challenge of efficient exploration in RL. This work is inspired by humans' ability to imagine the potential outcomes of their actions without actually taking them. In order to emulate this ability, VAAC introduces a new actor called the virtual actor (VA), alongside the conventional actor-critic framework. Unlike the conventional actor, the VA takes the virtual action to anticipate the next state without interacting with the environment. With the virtual policy following a Gaussian distribution, the VA is trained to maximize the anticipated novelty of the subsequent state resulting from a virtual action. If no next state resulting from the available actions exhibits high anticipated novelty, training the VA leads to an increase in the virtual policy entropy. Hence, high virtual policy entropy indicates that there is no room for exploration. The proposed VAAC aims to maximize a modified Q function, which combines cumulative rewards and the negative sum of virtual policy entropy. Experimental results show that the VAAC improves the exploration performance compared to existing algorithms. \ No newline at end of file diff --git a/data/2024/aaai/Virtual Try-On: Real-Time Interactive Hybrid Network with High-Fidelity b/data/2024/aaai/Virtual Try-On: Real-Time Interactive Hybrid Network with High-Fidelity new file mode 100644 index 0000000000..72d0527d28 --- /dev/null +++ b/data/2024/aaai/Virtual Try-On: Real-Time Interactive Hybrid Network with High-Fidelity @@ -0,0 +1 @@ +A significant upsurge in the fashion e-commerce industry in recent years has brought considerable attention to image-based virtual fitting. This image-based technology allows users to try on clothes virtually without physically touching them. However, the current techniques have notable limitations in terms of real-world scenarios, noisy results, partial clothing categories and computational cost, thus limiting real-world applications. To address these critical limitations, we propose a hybrid interactive network that allows actual users to interact with the system to try on clothes virtually. The network is composed of state-of-the-art keypoint extraction, appearance flow alteration, and warping modules. The proposed network facilitates real-time application with high-quality, noise-free results, a variety of clothing categories, and efficient computational cost. \ No newline at end of file diff --git a/data/2024/aaai/Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting b/data/2024/aaai/Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting new file mode 100644 index 0000000000..8fbaa87b1c --- /dev/null +++ b/data/2024/aaai/Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting @@ -0,0 +1 @@ +Class-agnostic counting (CAC) aims to count objects of interest from a query image given few exemplars. This task is typically addressed by extracting the features of the query image and exemplars separately and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention.
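To make the "matching inside self-attention" idea concrete, here is a rough sketch of the general mechanism (an illustration only, not CACViT itself): query-image patch tokens and exemplar tokens are concatenated into one sequence, and the image-to-exemplar block of the attention matrix directly serves as the matching score. Shapes and names are assumptions.

import torch
import torch.nn.functional as F

def joint_attention_similarity(img_tokens, ex_tokens, w_q, w_k):
    """img_tokens: (P, D) patch tokens; ex_tokens: (E, D) exemplar tokens."""
    x = torch.cat([img_tokens, ex_tokens], dim=0)          # (P + E, D) joint sequence
    q, k = x @ w_q, x @ w_k                                 # shared projections
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (P + E, P + E) attention
    p = img_tokens.shape[0]
    # The image-to-exemplar block of the attention map acts as the matching score.
    return attn[:p, p:]                                      # (P, E) similarities

# Illustrative shapes: 196 patch tokens, 3 exemplar tokens, dim 384.
sim = joint_attention_similarity(torch.randn(196, 384), torch.randn(3, 384),
                                 torch.randn(384, 384), torch.randn(384, 384))
print(sim.shape)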
We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate for the loss of scale and order-of-magnitude information due to resizing and normalization in plain ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. Code will be available. \ No newline at end of file diff --git a/data/2024/aaai/Vision-Language Models for Robot Success Detection b/data/2024/aaai/Vision-Language Models for Robot Success Detection new file mode 100644 index 0000000000..20bfd19435 --- /dev/null +++ b/data/2024/aaai/Vision-Language Models for Robot Success Detection @@ -0,0 +1 @@ +In this work, we use Vision-Language Models (VLMs) as a binary success detector given a robot observation and task description, formulated as a Visual Question Answering (VQA) problem. We fine-tune the open-source MiniGPT-4 VLM to detect success on robot trajectories from the Berkeley Bridge and Berkeley AUTOLab UR5 datasets. We find that while a handful of test distribution trajectories can train an accurate detector, transferring this learning between different environments is challenging due to distribution shift. In addition, while our VLM is robust to language variations, it is less robust to visual variations. In the future, more powerful VLMs such as Gemini and GPT-4 have the potential to be more accurate and robust success detectors, and success detectors can provide a sparse binary reward to improve existing policies. \ No newline at end of file diff --git a/data/2024/aaai/Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding b/data/2024/aaai/Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding new file mode 100644 index 0000000000..d0f57cb43f --- /dev/null +++ b/data/2024/aaai/Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding @@ -0,0 +1 @@ +In recent years, vision language pre-training frameworks have made significant progress in natural language processing and computer vision, achieving remarkable performance improvements on various downstream tasks. However, when extended to point cloud data, existing works mainly focus on building task-specific models, and fail to extract universal 3D vision-language embeddings that generalize well. We carefully investigate three common tasks in semantic 3D scene understanding, and derive key insights into the development of a pre-training model. Motivated by these observations, we propose a vision-language pre-training framework, 3DVLP (3D vision-language pre-training with object contrastive learning), which transfers flexibly on 3D vision-language downstream tasks. 3DVLP takes visual grounding as the proxy task and introduces an Object-level IoU-guided Detection (OID) loss to obtain high-quality proposals in the scene. Moreover, we design an Object-level Cross-Contrastive alignment (OCC) task and an Object-level Self-Contrastive learning (OSC) task to align the objects with descriptions and distinguish different objects in the scene, respectively.
Extensive experiments verify the excellent performance of 3DVLP on three 3D vision-language tasks, reflecting its superiority in semantic 3D scene understanding. Code is available at https://github.com/iridescentttt/3DVLP. \ No newline at end of file diff --git a/data/2024/aaai/Visual Abstract Reasoning in Computational Imagery b/data/2024/aaai/Visual Abstract Reasoning in Computational Imagery new file mode 100644 index 0000000000..901b1fb7ff --- /dev/null +++ b/data/2024/aaai/Visual Abstract Reasoning in Computational Imagery @@ -0,0 +1 @@ +Despite current AI’s human-like behavior, super efficiency, and unbelievable ability to handle complex games, we still complain that it shows no sign of creativity, originality, or novelty outside its training set, and that it fails to develop new insights into old experience or establish understanding of new experience. In short, it generates content from its training set, but does not invent content. A fundamental reason for this is that current AI is incapable of abstraction and reasoning in an abstract, generalizable, and systematic way. Think, for instance, of what AI systems we can build if we have a base system that can answer this simple question—when two things are the same. Instead of studying these high-level questions, I put my thesis in the context of visual abstract reasoning (VAR), a task widely used in human intelligence tests. A classical example of this task is Raven’s Progressive Matrices (RPM, see Figure 1), a family of intelligence tests that was designed to measure eductive ability, i.e., the ability to make meaning out of confusion and generate high-level, usually nonverbal, schemata which make it easy to handle complexity. A similar concept to eductive ability is fluid intelligence, or the ability to discriminate and perceive complex relationships when no recourse to answers is stored in memory. Whether eductive ability or fluid intelligence, RPM points to the qualities that have been lacking in AI. To explore these qualities in AI, I propose the following research questions. \ No newline at end of file diff --git a/data/2024/aaai/Visual Adversarial Examples Jailbreak Aligned Large Language Models b/data/2024/aaai/Visual Adversarial Examples Jailbreak Aligned Large Language Models new file mode 100644 index 0000000000..388b795a93 --- /dev/null +++ b/data/2024/aaai/Visual Adversarial Examples Jailbreak Aligned Large Language Models @@ -0,0 +1,3 @@ +Warning: this paper contains data, prompts, and model outputs that are offensive in nature. + +Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. 
Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a `few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models. \ No newline at end of file diff --git a/data/2024/aaai/Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning b/data/2024/aaai/Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning new file mode 100644 index 0000000000..df222be3b6 --- /dev/null +++ b/data/2024/aaai/Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning @@ -0,0 +1 @@ +Knowledge-based visual reasoning remains a daunting task since it not only requires machines to interpret the concepts and relationships from visual scenes but also associate them with external world knowledge to conduct a chain of reasoning on open-world questions. Previous works, however, treat visual perception and language-based reasoning as two independent modules, failing to attend to both modules throughout all stages of reasoning. To this end, we propose Visual Chain-of-thought Prompting (VCTP) for knowledge-based reasoning, which involves the interaction between visual content and natural language in an iterative step-by-step reasoning manner. VCTP contains three stages, see, think, and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to attend to key visual concepts from natural language questions adaptively. It then transforms key visual context into text context for prompting with a visual captioning model, and adopts the LLM to generate the answer. The confirm stage further uses the LLM to generate the supporting rationale to the answer, which is then passed through a cross-modality classifier to verify that it’s consistent with the visual context. We iterate through the think-confirm stages to ensure the verified rationale is consistent with the answer. We conduct experiments on a range of knowledge-based visual reasoning datasets. We found our VCTP enjoys several benefits, 1). it achieves better performance than the previous few-shot learning baselines; 2). it enjoys the total transparency and trustworthiness of the whole reasoning process by providing rationales for each reasoning step; 3). it is computation-efficient compared with other fine-tuning baselines. 
Our code is available at https://github.com/UMass-Foundation-Model/VisualCoT.git \ No newline at end of file diff --git a/data/2024/aaai/Visual Hallucination Elevates Speech Recognition b/data/2024/aaai/Visual Hallucination Elevates Speech Recognition new file mode 100644 index 0000000000..b826440a2d --- /dev/null +++ b/data/2024/aaai/Visual Hallucination Elevates Speech Recognition @@ -0,0 +1,16 @@ +Due to the detrimental impact of noise on the conventional audio speech recognition (ASR) task, audio-visual speech recognition~(AVSR) has been proposed by incorporating both audio and visual video signals. Although existing methods have demonstrated that the aligned visual input of lip movements can enhance the robustness of AVSR systems against noise, the paired videos are not always +available during inference, leading to the problem of +the missing visual modality, which restricts their practicality in real-world scenarios. + +To tackle this problem, we propose a Discrete Feature based Visual Generative Model (DFVGM) which exploits semantic correspondences between the audio and visual modalities +during training, generating +visual hallucinations in lieu of +real videos during inference. To achieve that, the +primary challenge is to generate the visual hallucination +given the noisy audio while preserving semantic correspondences with the clean speech. To +tackle this challenge, we +start with training the audio encoder in the Audio-Only (AO) setting, which generates continuous semantic features closely associated with the linguistic information. Simultaneously, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to +discretize the continuous audio and visual feature spaces. The discretization step +allows DFVGM to capture high-level semantic structures that are more resilient to noise and generate +visual hallucinations with high quality. +To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a remarkable 53% relative reduction (30.5%->12.9%) in Word Error Rate (WER) on average compared to the current state-of-the-art Audio-Only (AO) baselines while maintaining comparable results (< 5% difference) under the Audio-Visual (AV) setting even without video as input. \ No newline at end of file diff --git a/data/2024/aaai/Visual Instruction Tuning with Polite Flamingo b/data/2024/aaai/Visual Instruction Tuning with Polite Flamingo new file mode 100644 index 0000000000..5a60cb0ce7 --- /dev/null +++ b/data/2024/aaai/Visual Instruction Tuning with Polite Flamingo @@ -0,0 +1 @@ +Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large Language Models (LLMs) using an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet, during this process, a side effect, which we termed as the "multi-modal alignment tax", surfaces. This side effect negatively impacts the model's ability to format responses appropriately - for instance, its "politeness" - due to the overly succinct and unformatted nature of raw annotations, resulting in reduced human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. 
Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for response rewriting. After rigorous filtering, we generate the PF-1M dataset and further validate its value by fine-tuning a multi-modal LLM with it. Combined with novel methodologies including U-shaped multi-stage tuning and multi-turn augmentation, the resulting model, Clever Flamingo, demonstrates its advantages in both multi-modal understanding and response politeness according to automated and human evaluations. Code and dataset are available at https://github.com/ChenDelong1999/polite-flamingo \ No newline at end of file diff --git a/data/2024/aaai/Visual Language - Let the Product Say What You Want b/data/2024/aaai/Visual Language - Let the Product Say What You Want new file mode 100644 index 0000000000..1dd5090af9 --- /dev/null +++ b/data/2024/aaai/Visual Language - Let the Product Say What You Want @@ -0,0 +1,2 @@ +Visual Language is a multitasking online system focusing on e-commerce, which involves generating accurate product descriptions for sellers and providing a convenient product retrieval service for customers. To achieve this goal, the system adopts image description technology and multi-modal retrieval technology. +By utilizing cross-modal generation techniques, we can help sellers upload products rapidly and customers retrieve them rapidly, which improves the experience of both sellers and customers. \ No newline at end of file diff --git a/data/2024/aaai/Visual Redundancy Removal for Composite Images: A Benchmark Dataset and a Multi-Visual-Effects Driven Incremental Method b/data/2024/aaai/Visual Redundancy Removal for Composite Images: A Benchmark Dataset and a Multi-Visual-Effects Driven Incremental Method new file mode 100644 index 0000000000..1b35ea0251 --- /dev/null +++ b/data/2024/aaai/Visual Redundancy Removal for Composite Images: A Benchmark Dataset and a Multi-Visual-Effects Driven Incremental Method @@ -0,0 +1 @@ +Composite images (CIs) typically combine various elements from different scenes, views, and styles, making them a very important information carrier in the era of mixed media such as virtual reality, mixed reality, and the metaverse. However, the complexity of CI content presents a significant challenge for subsequent visual perception modeling and compression. In addition, the lack of benchmark CI databases also hinders the use of recent advanced data-driven methods. To address these challenges, we first establish one of the earliest visual redundancy prediction (VRP) databases for CIs. Moreover, we propose a multi-visual effect (MVE)-driven incremental learning method that combines the strengths of hand-crafted and data-driven approaches to achieve more accurate VRP modeling. Specifically, we design special incremental rules to learn the visual knowledge flow of MVE. To effectively capture the associated features of MVE, we further develop a three-stage incremental learning approach for VRP based on an encoder-decoder network. Extensive experimental results validate the superiority of the proposed method in terms of subjective, objective, and compression experiments.
\ No newline at end of file diff --git a/data/2024/aaai/Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection b/data/2024/aaai/Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection new file mode 100644 index 0000000000..aab3b2e18d --- /dev/null +++ b/data/2024/aaai/Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection @@ -0,0 +1 @@ +Efficient representation of point clouds is fundamental for LiDAR-based 3D object detection. While recent grid-based detectors often encode point clouds into either voxels or pillars, the distinctions between these approaches remain underexplored. In this paper, we quantify the differences between the current encoding paradigms and highlight the limited vertical learning within them. To tackle these limitations, we propose a hybrid detection framework named Voxel-Pillar Fusion (VPF), which synergistically combines the unique strengths of both voxels and pillars. To be concrete, we first develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively, and then introduce the Sparse Fusion Layer (SFL), facilitating bidirectional interaction between sparse voxel and pillar features. Our computationally efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors. Leveraging this powerful yet straightforward representation, VPF delivers competitive performance, achieving real-time inference speeds on the nuScenes and Waymo Open Dataset. \ No newline at end of file diff --git a/data/2024/aaai/W2P: Switching from Weak Supervision to Partial Supervision for Semantic Segmentation b/data/2024/aaai/W2P: Switching from Weak Supervision to Partial Supervision for Semantic Segmentation new file mode 100644 index 0000000000..19493fcf33 --- /dev/null +++ b/data/2024/aaai/W2P: Switching from Weak Supervision to Partial Supervision for Semantic Segmentation @@ -0,0 +1 @@ +Current weakly-supervised semantic segmentation (WSSS) techniques concentrate on enhancing class activation maps (CAMs) with image-level annotations. Yet, the emphasis on producing these pseudo-labels often overshadows the pivotal role of training the segmentation model itself. This paper underscores the significant influence of noisy pseudo-labels on segmentation network performance, particularly in boundary regions. To address the above issues, we introduce a novel paradigm: Weak to Partial Supervision (W2P). At its core, W2P categorizes the pseudo-labels from WSSS into two unique supervisions: trustworthy clean labels and uncertain noisy labels. Next, our proposed partially-supervised framework adeptly employs these clean labels to rectify the noisy ones, thereby promoting the continuous enhancement of the segmentation model. To further optimize boundary segmentation, we incorporate a noise detection mechanism that specifically preserves boundary regions while eliminating noise. During the noise refinement phase, we adopt a boundary-conscious noise correction technique to extract comprehensive boundaries from noisy areas. Furthermore, we devise a boundary generation approach that assists in predicting intricate boundary zones. Evaluations on the PASCAL VOC 2012 and MS COCO 2014 datasets confirm our method's impressive segmentation capabilities across various pseudo-labels.
\ No newline at end of file diff --git a/data/2024/aaai/Wasserstein Differential Privacy b/data/2024/aaai/Wasserstein Differential Privacy new file mode 100644 index 0000000000..5084a2b290 --- /dev/null +++ b/data/2024/aaai/Wasserstein Differential Privacy @@ -0,0 +1,2 @@ +Differential privacy (DP) has achieved remarkable results in the field of privacy-preserving machine learning. However, existing DP frameworks do not satisfy all the conditions for becoming metrics, which prevents them from deriving better basic private properties and leads to exaggerated values on privacy budgets. We propose Wasserstein differential privacy (WDP), an alternative DP framework to measure the risk of privacy leakage, which satisfies the properties of symmetry and triangle inequality. We show and prove that WDP has 13 excellent properties, which can be theoretical supports for the better performance of WDP than other DP frameworks. +In addition, we derive a general privacy accounting method called Wasserstein accountant, which enables WDP to be applied in stochastic gradient descent (SGD) scenarios containing subsampling. Experiments on basic mechanisms, compositions and deep learning show that the privacy budgets obtained by Wasserstein accountant are relatively stable and less influenced by order. Moreover, the overestimation on privacy budgets can be effectively alleviated. The code is available at https://github.com/Hifipsysta/WDP. \ No newline at end of file diff --git a/data/2024/aaai/Watch Your Head: Assembling Projection Heads to Save the Reliability of Federated Models b/data/2024/aaai/Watch Your Head: Assembling Projection Heads to Save the Reliability of Federated Models new file mode 100644 index 0000000000..838c49291f --- /dev/null +++ b/data/2024/aaai/Watch Your Head: Assembling Projection Heads to Save the Reliability of Federated Models @@ -0,0 +1 @@ +Federated learning encounters substantial challenges with heterogeneous data, leading to performance degradation and convergence issues. While considerable progress has been achieved in mitigating such an impact, the reliability aspect of federated models has been largely disregarded. In this study, we conduct extensive experiments to investigate the reliability of both generic and personalized federated models. Our exploration uncovers a significant finding: federated models exhibit unreliability when faced with heterogeneous data, demonstrating poor calibration on in-distribution test data and low uncertainty levels on out-of-distribution data. This unreliability is primarily attributed to the presence of biased projection heads, which introduce miscalibration into the federated models. Inspired by this observation, we propose the "Assembled Projection Heads" (APH) method for enhancing the reliability of federated models. By treating the existing projection head parameters as priors, APH randomly samples multiple initialized parameters of projection heads from the prior and further performs targeted fine-tuning on locally available data under varying learning rates. Such a head ensemble introduces parameter diversity into the deterministic model, eliminating the bias and producing reliable predictions via head averaging. We evaluate the effectiveness of the proposed APH method across three prominent federated benchmarks. Experimental results validate the efficacy of APH in model calibration and uncertainty estimation. 
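The Wasserstein Differential Privacy abstract above replaces the usual divergence-based privacy loss with a Wasserstein distance between a mechanism's output distributions on adjacent datasets. The sketch below only illustrates the underlying 1-Wasserstein metric for two discrete 1-D output distributions; it is not the paper's WDP definition or its Wasserstein accountant, and the example distributions are made up.

```python
import numpy as np

def wasserstein_1d(p, q, support):
    """1-Wasserstein distance between two pmfs on a common sorted 1-D support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))   # |F_P(x_i) - F_Q(x_i)|
    widths = np.diff(support)                        # spacing between support points
    return float(np.sum(cdf_gap[:-1] * widths))

# Hypothetical output distributions of a private mechanism on adjacent datasets.
support = np.arange(5)
p = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
q = np.array([0.05, 0.15, 0.40, 0.25, 0.15])
print(wasserstein_1d(p, q, support))   # a small distance suggests a small privacy loss
```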
Notably, APH can be seamlessly integrated into various federated approaches but only requires less than 30% additional computation cost for 100x inferences within large models. \ No newline at end of file diff --git a/data/2024/aaai/Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy b/data/2024/aaai/Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy new file mode 100644 index 0000000000..92b8681663 --- /dev/null +++ b/data/2024/aaai/Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy @@ -0,0 +1 @@ +To mitigate potential risks associated with language models (LMs), recent AI detection research proposes incorporating watermarks into machine-generated text through random vocabulary restrictions and utilizing this information for detection. In this paper, we show that watermarking algorithms designed for LMs cannot be seamlessly applied to conditional text generation (CTG) tasks without a notable decline in downstream task performance. To address this issue, we introduce a simple yet effective semantic-aware watermarking algorithm that considers the characteristics of conditional text generation with the input context. Compared to the baseline watermarks, our proposed watermark yields significant improvements in both automatic and human evaluations across various text generation models, including BART and Flan-T5, for CTG tasks such as summarization and data-to-text generation. Meanwhile, it maintains detection ability with higher z-scores but lower AUC scores, suggesting the presence of a detection paradox that poses additional challenges for watermarking CTG. \ No newline at end of file diff --git a/data/2024/aaai/WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting b/data/2024/aaai/WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting new file mode 100644 index 0000000000..27e841dc17 --- /dev/null +++ b/data/2024/aaai/WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting @@ -0,0 +1 @@ +Video inpainting aims to fill in the missing regions of the video frames with plausible content. Benefiting from the outstanding long-range modeling capacity, the transformer-based models have achieved unprecedented performance regarding inpainting quality. Essentially, coherent contents from all the frames along both spatial and temporal dimensions are concerned by a patch-wise attention module, and then the missing contents are generated based on the attention-weighted summation. In this way, attention retrieval accuracy has become the main bottleneck to improve the video inpainting performance, where the factors affecting attention calculation should be explored to maximize the advantages of transformer. Towards this end, in this paper, we theoretically certificate that noise is the culprit that entangles the process of attention calculation. Meanwhile, we propose a novel wavelet transformer network with noise robustness for video inpainting, named WaveFormer. Unlike existing transformer-based methods that utilize the whole embeddings to calculate the attention, our WaveFormer first separates the noise existing in the embedding into high-frequency components by introducing the Discrete Wavelet Transform (DWT), and then adopts clean low-frequency components to calculate the attention. 
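WaveFormer, as described in the abstract above, separates the noisy high-frequency part of each embedding with a Discrete Wavelet Transform and computes attention from the clean low-frequency components. Below is a toy sketch of that idea using a single-level Haar DWT along the feature dimension; the single-head attention and the shapes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def haar_lowpass(x):
    """Single-level Haar DWT along the last axis; keep only approximation coefficients."""
    return (x[..., 0::2] + x[..., 1::2]) / np.sqrt(2.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lowpass_attention(tokens, values):
    """Attention weights from denoised (low-frequency) tokens, applied to the full values."""
    q = k = haar_lowpass(tokens)                       # (N, D/2) clean components
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (N, N) attention map
    return attn @ values                               # aggregate the original values

tokens = np.random.randn(16, 64)    # 16 patch embeddings, 64-dim, with noise
out = lowpass_attention(tokens, tokens)
```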
In this way, the impact of noise on attention computation can be greatly mitigated and the missing content regarding different frequencies can be generated by sharing the calculated attention. Extensive experiments validate the superior performance of our method over state-of-the-art baselines both qualitatively and quantitatively. \ No newline at end of file diff --git a/data/2024/aaai/WaveNet: Tackling Non-stationary Graph Signals via Graph Spectral Wavelets b/data/2024/aaai/WaveNet: Tackling Non-stationary Graph Signals via Graph Spectral Wavelets new file mode 100644 index 0000000000..e5e0c1c8c0 --- /dev/null +++ b/data/2024/aaai/WaveNet: Tackling Non-stationary Graph Signals via Graph Spectral Wavelets @@ -0,0 +1 @@ +In the existing spectral GNNs, polynomial-based methods occupy the mainstream in designing a filter through the Laplacian matrix. However, polynomial combinations factored by the Laplacian matrix naturally have limitations in message passing (e.g., over-smoothing). Furthermore, most existing spectral GNNs are based on polynomial bases, which struggle to capture the high-frequency parts of the graph spectral signal. Additionally, we also find that even increasing the polynomial order does not change this situation, which means polynomial-based models have a natural deficiency when facing high-frequency signals. To tackle these problems, we propose WaveNet, which aims to effectively capture the high-frequency part of the graph spectral signal from the perspective of wavelet bases through reconstructing the message propagation matrix. We utilize Multi-Resolution Analysis (MRA) to model this question, and our proposed method can reconstruct arbitrary filters theoretically. We also conduct node classification experiments on real-world graph benchmarks and achieve superior performance on most datasets. Our code is available at https://github.com/Bufordyang/WaveNet \ No newline at end of file diff --git a/data/2024/aaai/Wavelet Dynamic Selection Network for Inertial Sensor Signal Enhancement b/data/2024/aaai/Wavelet Dynamic Selection Network for Inertial Sensor Signal Enhancement new file mode 100644 index 0000000000..0758606ecb --- /dev/null +++ b/data/2024/aaai/Wavelet Dynamic Selection Network for Inertial Sensor Signal Enhancement @@ -0,0 +1 @@ +As attitude and motion sensing components, inertial sensors are widely used in various portable devices, covering consumer electronics, sports health, aerospace, etc. But the severe intrinsic errors of inertial sensors heavily restrain their function implementation, especially the advanced functionality, including motion trajectory recovery and motion semantic recognition, which attracts considerable attention. As a mainstream signal processing method, wavelet is hailed as the mathematical microscope of signal due to the plentiful and diverse wavelet basis functions. However, complicated noise types and application scenarios of inertial sensors make selecting wavelet basis perplexing. To this end, we propose a wavelet dynamic selection network (WDSNet), which intelligently selects the appropriate wavelet basis for variable inertial signals. In addition, existing deep learning architectures excel at extracting features from input data but neglect to learn the characteristics of target categories, which is essential to enhance the category awareness capability, thereby improving the selection of wavelet basis. 
Therefore, we propose a category representation mechanism (CRM), which enables the network to extract and represent category features without increasing trainable parameters. Furthermore, CRM transforms the common fully connected network into category representations, which provide closer supervision to the feature extractor than the far and trivial one-hot classification labels. We call this process of imposing interpretability on a network and using it to supervise the feature extractor the feature supervision mechanism, and its effectiveness is demonstrated experimentally and theoretically in this paper. The enhanced inertial signal can perform impracticable tasks with regard to the original signal, such as trajectory reconstruction. Both quantitative and visual results show that WDSNet outperforms the existing methods. Remarkably, WDSNet, as a weakly-supervised method, achieves the state-of-the-art performance of all the compared fully-supervised methods. \ No newline at end of file diff --git a/data/2024/aaai/Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations b/data/2024/aaai/Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations new file mode 100644 index 0000000000..147ae2033e --- /dev/null +++ b/data/2024/aaai/Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations @@ -0,0 +1 @@ +Spatiotemporal predictive learning is a paradigm that empowers models to learn spatial and temporal patterns by predicting future frames from past frames in an unsupervised manner. This method typically uses recurrent units to capture long-term dependencies, but these units often come with high computational costs and limited performance in real-world scenes. This paper presents an innovative Wavelet-based SpatioTemporal (WaST) framework, which extracts and adaptively controls both low and high-frequency components at image and feature levels via 3D discrete wavelet transform for faster processing while maintaining high-quality predictions. We propose a Time-Frequency Aware Translator uniquely crafted to efficiently learn short- and long-range spatiotemporal information by individually modeling spatial frequency and temporal variations. Meanwhile, we design a wavelet-domain High-Frequency Focal Loss that effectively supervises high-frequency variations. Extensive experiments across various real-world scenarios, such as driving scene prediction, traffic flow prediction, human motion capture, and weather forecasting, demonstrate that our proposed WaST achieves state-of-the-art performance over various spatiotemporal prediction methods. \ No newline at end of file diff --git a/data/2024/aaai/Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning b/data/2024/aaai/Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning new file mode 100644 index 0000000000..2c40dfbbc5 --- /dev/null +++ b/data/2024/aaai/Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning @@ -0,0 +1 @@ +We propose a generalized method for boosting the generalization ability of pre-trained vision-language models (VLMs) while fine-tuning on downstream few-shot tasks. 
The idea is realized by exploiting out-of-distribution (OOD) detection to predict whether a sample belongs to a base distribution or a novel distribution and then using the score generated by a dedicated competition based scoring function to fuse the zero-shot and few-shot classifier. The fused classifier is dynamic, which will bias towards the zero-shot classifier if a sample is more likely from the distribution pre-trained on, leading to improved base-to-novel generalization ability. Our method is performed only in test stage, which is applicable to boost existing methods without time-consuming re-training. Extensive experiments show that even weak distribution detectors can still improve VLMs' generalization ability. Specifically, with the help of OOD detectors, the harmonic mean of CoOp and ProGrad increase by 2.6 and 1.5 percentage points over 11 recognition datasets in the base-to-novel setting. \ No newline at end of file diff --git a/data/2024/aaai/WeakPCSOD: Overcoming the Bias of Box Annotations for Weakly Supervised Point Cloud Salient Object Detection b/data/2024/aaai/WeakPCSOD: Overcoming the Bias of Box Annotations for Weakly Supervised Point Cloud Salient Object Detection new file mode 100644 index 0000000000..6ce7cf60ac --- /dev/null +++ b/data/2024/aaai/WeakPCSOD: Overcoming the Bias of Box Annotations for Weakly Supervised Point Cloud Salient Object Detection @@ -0,0 +1 @@ +Point cloud salient object detection (PCSOD) is a newly proposed task in 3D dense segmentation. However, the acquisition of accurate 3D dense annotations comes at a high cost, severely limiting the progress of PCSOD. To address this issue, we propose the first weakly supervised PCSOD (named WeakPCSOD) model, which relies solely on cheap 3D bounding box annotations. In WeakPCSOD, we extract noise-free supervision from coarse 3D bounding boxes while mitigating shape biases inherent in box annotations. To achieve this, we introduce a novel mask-to-box (M2B) transformation and a color consistency (CC) loss. The M2B transformation, from a shape perspective, disentangles predictions from labels, enabling the extraction of noiseless supervision from labels while preserving object shapes independently of the box bias. From an appearance perspective, we further introduce the CC loss to provide dense supervision, which mitigates the non-unique predictions stemming from weak supervision and substantially reduces prediction variability. Furthermore, we employ a self-training (ST) strategy to enhance performance by utilizing high-confidence pseudo labels. Notably, the M2B transformation, CC loss, and ST strategy are seamlessly integrated into any model and incur no computational costs for inference. Extensive experiments demonstrate the effectiveness of our WeakPCSOD model, even comparable to fully supervised models utilizing dense annotations. \ No newline at end of file diff --git a/data/2024/aaai/Weakly Supervised Few-Shot Object Detection with DETR b/data/2024/aaai/Weakly Supervised Few-Shot Object Detection with DETR new file mode 100644 index 0000000000..9cad3dd9ff --- /dev/null +++ b/data/2024/aaai/Weakly Supervised Few-Shot Object Detection with DETR @@ -0,0 +1 @@ +In recent years, Few-shot Object Detection (FSOD) has become an increasingly important research topic in computer vision. However, existing FSOD methods require strong annotations including category labels and bounding boxes, and their performance is heavily dependent on the quality of box annotations. 
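The 'Weak Distribution Detectors' abstract above fuses a zero-shot and a few-shot classifier with an out-of-distribution detector's score, so that samples judged to come from the pre-training (base) distribution lean on the zero-shot head. The sketch below is one hedged reading of that fusion; the convex combination and the hand-picked scores are assumptions, not the paper's competition-based scoring function.

```python
import numpy as np

def fuse_logits(zero_shot_logits, few_shot_logits, base_score):
    """Convex fusion: base_score in [0, 1] is the detector's belief that the
    sample comes from the base (pre-training) distribution."""
    return base_score * zero_shot_logits + (1.0 - base_score) * few_shot_logits

# Hypothetical per-class logits from the two classifiers.
zs = np.array([2.0, 0.5, -1.0])   # zero-shot classifier logits
fs = np.array([0.2, 1.8, 0.1])    # few-shot (fine-tuned) classifier logits
for s in (0.9, 0.1):              # likely-base vs. likely-novel sample
    print(s, fuse_logits(zs, fs, s).argmax())   # biases toward the zero-shot head when s is high
```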
Unfortunately, acquiring strong annotations is both expensive and time-consuming. This inspires the study of weakly supervised FSOD (WS-FSOD in short), which realizes FSOD with only image-level annotations, i.e., category labels. In this paper, we propose a new and effective weakly supervised FSOD method named WFS-DETR. By a well-designed pretraining process, WFS-DETR first acquires general object localization and integrity judgment capabilities on large-scale pretraining data. Then, it introduces object integrity into multiple-instance learning to solve the common local optimum problem by comprehensively exploiting both semantic and visual information. Finally, with simple fine-tuning, it transfers the knowledge learned from the base classes to the novel classes, which enables accurate detection of novel objects. Benefiting from this ``pretraining-refinement'' mechanism, WFS-DETR can achieve good generalization on different datasets. Extensive experiments also show that the proposed method clearly outperforms the existing counterparts in the WS-FSOD task. \ No newline at end of file diff --git a/data/2024/aaai/Weakly Supervised Multimodal Affordance Grounding for Egocentric Images b/data/2024/aaai/Weakly Supervised Multimodal Affordance Grounding for Egocentric Images new file mode 100644 index 0000000000..d4210d33e6 --- /dev/null +++ b/data/2024/aaai/Weakly Supervised Multimodal Affordance Grounding for Egocentric Images @@ -0,0 +1,3 @@ +To enhance the interaction between intelligent systems and the environment, locating the affordance regions of objects is crucial. These regions correspond to specific areas that provide distinct functionalities. Humans often acquire the ability to identify these regions through action demonstrations and verbal instructions. In this paper, we present a novel multimodal framework that extracts affordance knowledge from exocentric images, which depict human-object interactions, as well as from accompanying textual descriptions that describe the performed actions. The extracted knowledge is then transferred to egocentric images. +To achieve this goal, we propose the HOI-Transfer Module, which utilizes local perception to disentangle individual actions within exocentric images. This module effectively captures localized features and correlations between actions, leading to valuable affordance knowledge. Additionally, we introduce the Pixel-Text Fusion Module, which fuses affordance knowledge by identifying regions in egocentric images that bear resemblances to the textual features defining affordances. +We employ a Weakly Supervised Multimodal Affordance (WSMA) learning approach, utilizing image-level labels for training. Through extensive experiments, we demonstrate the superiority of our proposed method in terms of evaluation metrics and visual results when compared to existing affordance grounding models. Furthermore, ablation experiments confirm the effectiveness of our approach. Code: https://github.com/xulingjing88/WSMA. \ No newline at end of file diff --git a/data/2024/aaai/Weakly Supervised Semantic Segmentation for Driving Scenes b/data/2024/aaai/Weakly Supervised Semantic Segmentation for Driving Scenes new file mode 100644 index 0000000000..74fd29b07b --- /dev/null +++ b/data/2024/aaai/Weakly Supervised Semantic Segmentation for Driving Scenes @@ -0,0 +1 @@ +State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS) using image-level labels exhibit severe performance degradation on driving scene datasets such as Cityscapes.
To address this challenge, we develop a new WSSS framework tailored to driving scene datasets. Based on extensive analysis of dataset characteristics, we employ Contrastive Language-Image Pre-training (CLIP) as our baseline to obtain pseudo-masks. However, CLIP introduces two key challenges: (1) pseudo-masks from CLIP fall short in representing small object classes, and (2) these masks contain notable noise. We propose solutions for each issue as follows. (1) We devise Global-Local View Training that seamlessly incorporates small-scale patches during model training, thereby enhancing the model's capability to handle small-sized yet critical objects in driving scenes (e.g., traffic lights). (2) We introduce Consistency-Aware Region Balancing (CARB), a novel technique that discerns reliable and noisy regions through evaluating the consistency between CLIP masks and segmentation predictions. It prioritizes reliable pixels over noisy pixels via adaptive loss weighting. Notably, the proposed method achieves 51.8% mIoU on the Cityscapes test dataset, showcasing its potential as a strong WSSS baseline on driving scene datasets. Experimental results on CamVid and WildDash2 demonstrate the effectiveness of our method across diverse datasets, even with small-scale datasets or visually challenging conditions. The code is available at https://github.com/k0u-id/CARB. \ No newline at end of file diff --git a/data/2024/aaai/Weakly-Supervised Mirror Detection via Scribble Annotations b/data/2024/aaai/Weakly-Supervised Mirror Detection via Scribble Annotations new file mode 100644 index 0000000000..1ed96a4b87 --- /dev/null +++ b/data/2024/aaai/Weakly-Supervised Mirror Detection via Scribble Annotations @@ -0,0 +1 @@ +Mirror detection is of great significance for avoiding false recognition of reflected objects in computer vision tasks. Existing mirror detection frameworks usually follow a supervised setting, which relies heavily on high-quality labels and suffers from poor generalization. To resolve this, we instead propose the first weakly-supervised mirror detection framework and also provide the first scribble-based mirror dataset. Specifically, we relabel 10,158 images, most of which have a labeled pixel ratio of less than 0.01 and take only about 8 seconds to label. Considering that mirror regions usually show great scale variation and are often irregular and occluded, leading to incomplete detection or over-detection, we propose a local-global feature enhancement (LGFE) module to fully capture the context and details. Moreover, it is difficult to obtain basic mirror structure using scribble annotation, and the distinction between foreground (mirror) and background (non-mirror) features is not emphasized, owing to mirror reflections. Therefore, we propose a foreground-aware mask attention (FAMA), integrating mirror edges and semantic features to complete mirror regions and suppressing the influence of backgrounds. Finally, to improve the robustness of the network, we propose a prototype contrast loss (PCL) to learn more general foreground features across images. Extensive experiments show that our network outperforms relevant state-of-the-art weakly supervised methods, and even some fully supervised methods. The dataset and codes are available at https://github.com/winter-flow/WSMD.
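Consistency-Aware Region Balancing (CARB), from the driving-scene WSSS abstract above, weights the segmentation loss according to whether each pixel's CLIP pseudo-label agrees with the model's own prediction. The sketch below shows a minimal per-pixel weighting of that kind; the binary agreement test and the fixed down-weighting factor are simplifying assumptions rather than the paper's adaptive scheme.

```python
import numpy as np

def carb_weights(clip_mask, pred_labels, noisy_weight=0.1):
    """Pixels where the CLIP pseudo-label and the model prediction agree are treated
    as reliable (weight 1.0); disagreeing pixels are treated as noisy and down-weighted."""
    reliable = (clip_mask == pred_labels)
    return np.where(reliable, 1.0, noisy_weight)

# Toy 4x4 label maps with 3 classes; multiply the weights into the per-pixel CE loss.
clip_mask = np.random.randint(0, 3, (4, 4))
pred = np.random.randint(0, 3, (4, 4))
weights = carb_weights(clip_mask, pred)
```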
\ No newline at end of file diff --git a/data/2024/aaai/Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature b/data/2024/aaai/Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature new file mode 100644 index 0000000000..af66ad24e5 --- /dev/null +++ b/data/2024/aaai/Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature @@ -0,0 +1 @@ +Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously by taking only video-level labels as the supervision. Pseudo label generation is a promising strategy to solve the challenging problem, but the current methods ignore the natural temporal structure of the video that can provide rich information to assist such a generation process. In this paper, we propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature. First, we design a saliency inference module that exploits the variation relationship between temporal neighbor snippets to discover salient snippet-features, which can reflect the significant dynamic change in the video. Secondly, we introduce a boundary refinement module that enhances salient snippet-features through the information interaction unit. Then, a discrimination enhancement module is introduced to enhance the discriminative nature of snippet-features. Finally, we adopt the refined snippet-features to produce high-fidelity pseudo labels, which could be used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet v1.3, demonstrate our proposed method achieves significant improvements compared to the state-of-the-art methods. Our source code is available at https://github.com/wuli55555/ISSF. \ No newline at end of file diff --git a/data/2024/aaai/WebVLN: Vision-and-Language Navigation on Websites b/data/2024/aaai/WebVLN: Vision-and-Language Navigation on Websites new file mode 100644 index 0000000000..d0cce8a3d1 --- /dev/null +++ b/data/2024/aaai/WebVLN: Vision-and-Language Navigation on Websites @@ -0,0 +1 @@ +Vision-and-Language Navigation (VLN) task aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task that only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content like HTML, which could not be seen on the rendered web pages yet contain rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called Website-aware VLN Network (WebVLN-Net), which is built upon the foundation of state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. 
We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN. \ No newline at end of file diff --git a/data/2024/aaai/WeditGAN: Few-Shot Image Generation via Latent Space Relocation b/data/2024/aaai/WeditGAN: Few-Shot Image Generation via Latent Space Relocation new file mode 100644 index 0000000000..0645af4b64 --- /dev/null +++ b/data/2024/aaai/WeditGAN: Few-Shot Image Generation via Latent Space Relocation @@ -0,0 +1 @@ +In few-shot image generation, directly training GAN models on just a handful of images faces the risk of overfitting. A popular solution is to transfer the models pretrained on large source domains to small target ones. In this work, we introduce WeditGAN, which realizes model transfer by editing the intermediate latent codes w in StyleGANs with learned constant offsets (delta w), discovering and constructing target latent spaces via simply relocating the distribution of source latent spaces. The established one-to-one mapping between latent spaces naturally prevents mode collapse and overfitting. Besides, we also propose variants of WeditGAN to further enhance the relocation process by regularizing the direction or finetuning the intensity of delta w. Experiments on a collection of widely used source/target datasets demonstrate the capability of WeditGAN in generating realistic and diverse images, which is simple yet highly effective in the research area of few-shot image generation. Codes are available at https://github.com/Ldhlwh/WeditGAN. \ No newline at end of file diff --git a/data/2024/aaai/Weisfeiler and Lehman Go Paths: Learning Topological Features via Path Complexes b/data/2024/aaai/Weisfeiler and Lehman Go Paths: Learning Topological Features via Path Complexes new file mode 100644 index 0000000000..2652a73b85 --- /dev/null +++ b/data/2024/aaai/Weisfeiler and Lehman Go Paths: Learning Topological Features via Path Complexes @@ -0,0 +1 @@ +Graph Neural Networks (GNNs), despite achieving remarkable performance across different tasks, are theoretically bounded by the 1-Weisfeiler-Lehman test, resulting in limitations in terms of graph expressivity. Even though prior works on topological higher-order GNNs overcome that boundary, these models often depend on assumptions about sub-structures of graphs. Specifically, topological GNNs leverage the prevalence of cliques, cycles, and rings to enhance the message-passing procedure. Our study presents a novel perspective by focusing on simple paths within graphs during the topological message-passing process, thus liberating the model from restrictive inductive biases. We prove that by lifting graphs to path complexes, our model can generalize the existing works on topology while inheriting several theoretical results on simplicial complexes and regular cell complexes. Without making prior assumptions about graph sub-structures, our method outperforms earlier works in other topological domains and achieves state-of-the-art results on various benchmarks.
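WeditGAN, per its abstract above, transfers a StyleGAN generator to a small target domain by adding a learned constant offset delta w to the intermediate latent codes w, relocating the whole source latent distribution. The few lines below sketch that relocation on stand-in latents; StyleGAN itself and the training of the offset are omitted, and the dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w_source = rng.normal(size=(1000, 512))       # stand-in for StyleGAN w codes
delta_w = rng.normal(scale=0.5, size=(512,))  # learned constant offset (fixed here)

w_target = w_source + delta_w                 # relocate the whole latent distribution

# The mapping is one-to-one, so the spread of the codes is preserved exactly,
# which is the intuition behind avoiding mode collapse in the abstract:
print(np.allclose(w_source.std(axis=0), w_target.std(axis=0)))  # True
```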
\ No newline at end of file diff --git a/data/2024/aaai/Welfare Maximization in Perpetual Voting (Student Abstract) b/data/2024/aaai/Welfare Maximization in Perpetual Voting (Student Abstract) new file mode 100644 index 0000000000..f9ae7daa29 --- /dev/null +++ b/data/2024/aaai/Welfare Maximization in Perpetual Voting (Student Abstract) @@ -0,0 +1,2 @@ +We study the computational problems associated with maximizing various welfare objectives—namely utilitarian welfare, egalitarian welfare, and Nash welfare—in perpetual voting, a sequential collective decision-making framework. Prior work looks into notions of fairness over time and studies extensions of single-round voting rules to the multi-round setting. +We show that while a utilitarian-welfare maximizing outcome can be computed efficiently, an outcome that maximizes egalitarian or Nash welfare is computationally intractable, even in the case of two candidates. We complement this by showing that maximizing egalitarian welfare is fixed-parameter tractable in the number of agents, and maximizing egalitarian or Nash welfare is W[2]-hard and slicewise polynomial in the number of timesteps. We also provide an approximation algorithm for maximizing egalitarian welfare and study strategyproofness with respect to these welfare objectives. Finally, we show that a simple greedy algorithm can achieve approximate proportionality in this setting. \ No newline at end of file diff --git a/data/2024/aaai/Well, Now We Know! Unveiling Sarcasm: Initiating and Exploring Multimodal Conversations with Reasoning b/data/2024/aaai/Well, Now We Know! Unveiling Sarcasm: Initiating and Exploring Multimodal Conversations with Reasoning new file mode 100644 index 0000000000..c62f65157f --- /dev/null +++ b/data/2024/aaai/Well, Now We Know! Unveiling Sarcasm: Initiating and Exploring Multimodal Conversations with Reasoning @@ -0,0 +1,3 @@ +Sarcasm is a widespread linguistic phenomenon that poses a considerable challenge to explain due to its subjective nature, absence of contextual cues, and rooted personal +perspectives. Even though the identification of sarcasm has been extensively studied in dialogue analysis, merely detecting sarcasm falls short of enabling conversational systems to genuinely comprehend the underlying meaning of a conversation and generate fitting responses. It is imperative to not only detect sarcasm but also pinpoint its origination and the rationale behind the sarcastic expressions to capture its authentic essence. In this paper, we delve into the discourse structure of conversations infused with sarcasm and introduce a novel task - Sarcasm Initiation and Reasoning in Conversations (SIRC). Embedded in a multimodal environment and +involving a combination of both English and code-mixed interactions, the objective of the task is to discern the trigger or starting point of sarcasm. Additionally, the task involves producing a natural language explanation that rationalizes the satirical dialogues. To this end, we introduce the Sarcasm Initiation and Reasoning Dataset (SIRD) to facilitate our task and provide sarcasm initiation annotations and reasoning. We develop a comprehensive model named Sarcasm Initiation and Reasoning Generation (SIRG), which is designed to encompass textual, audio, and visual representations. To achieve this, we introduce a unique shared fusion method that employs cross-attention mechanisms to seamlessly integrate these diverse modalities.
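For the perpetual-voting abstract above, a utilitarian-welfare maximizing outcome can be computed round by round: with approval (0/1) utilities, picking a candidate approved by the most agents in each timestep maximizes the total welfare. A small sketch under that approval-utility assumption follows; the data and tie-breaking are arbitrary.

```python
def utilitarian_outcome(approvals_per_round):
    """approvals_per_round: list of dicts mapping candidate -> set of approving agents.
    Choosing the most-approved candidate each round maximizes total (utilitarian) welfare
    when each agent gets utility 1 per round in which an approved candidate wins."""
    outcome, welfare = [], 0
    for approvals in approvals_per_round:
        winner = max(approvals, key=lambda c: len(approvals[c]))
        outcome.append(winner)
        welfare += len(approvals[winner])
    return outcome, welfare

rounds = [
    {"a": {1, 2, 3}, "b": {4}},
    {"a": {1}, "b": {2, 3, 4}},
]
print(utilitarian_outcome(rounds))   # (['a', 'b'], 6)
```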
Our experimental outcomes, conducted on the SIRC dataset, demonstrate that our proposed framework establishes a new benchmark for both sarcasm initiation and its reasoning generation in the context of multimodal conversations. The code and dataset can be accessed from https://www.iitp.ac.in/~ai-nlp-ml/resources.html#sarcasm-explain and https://github.com/GussailRaat/SIRG-Sarcasm-Initiation-and-Reasoning-Generation. \ No newline at end of file diff --git a/data/2024/aaai/Well-Written Knowledge Graphs: Most Effective RDF Syntaxes for Triple Linearization in End-to-End Extraction of Relations from Texts (Student Abstract) b/data/2024/aaai/Well-Written Knowledge Graphs: Most Effective RDF Syntaxes for Triple Linearization in End-to-End Extraction of Relations from Texts (Student Abstract) new file mode 100644 index 0000000000..d1ebbdbfc6 --- /dev/null +++ b/data/2024/aaai/Well-Written Knowledge Graphs: Most Effective RDF Syntaxes for Triple Linearization in End-to-End Extraction of Relations from Texts (Student Abstract) @@ -0,0 +1 @@ +Seq-to-seq generative models have recently gained attention for solving the relation extraction task. By approaching this problem as an end-to-end task, they surpassed encoder-based-only models. Little research has investigated the effects of the output syntaxes on the training process of these models. Moreover, a limited number of approaches were proposed for extracting ready-to-load knowledge graphs following the RDF standard. In this paper, we consider that a set of triples can be linearized in many different ways, and we evaluate the combined effect of the size of the language models and different RDF syntaxes on the task of relation extraction from Wikipedia abstracts. \ No newline at end of file diff --git a/data/2024/aaai/What Are the Rules? Discovering Constraints from Data b/data/2024/aaai/What Are the Rules? Discovering Constraints from Data new file mode 100644 index 0000000000..0baa3730f6 --- /dev/null +++ b/data/2024/aaai/What Are the Rules? Discovering Constraints from Data @@ -0,0 +1,2 @@ +Constraint programming and AI planning are powerful tools for solving assignment, optimization, and scheduling problems. They require, however, the rarely available combination of domain knowledge and mathematical modeling expertise. Learning constraints from exemplary solutions can close this gap and alleviate the effort of modeling. Existing approaches either require extensive user interaction, need exemplary invalid solutions that must be generated by experts at great expense, or show high noise-sensitivity. +We aim to find constraints from potentially noisy solutions, without the need for user interaction. To this end, we formalize the problem in terms of the Minimum Description Length (MDL) principle, by which we select the model with the best lossless compression of the data. Solving the problem involves model counting, which is #P-hard to approximate. We therefore propose the greedy URPILS algorithm to find high-quality constraints in practice. Extensive experiments on constraint programming and AI planning benchmark data show URPILS not only finds more accurate and succinct constraints, but also is more robust to noise, and has lower sample complexity than the state of the art. \ No newline at end of file diff --git a/data/2024/aaai/What Do Hebbian Learners Learn? Reduction Axioms for Iterated Hebbian Learning b/data/2024/aaai/What Do Hebbian Learners Learn? 
Reduction Axioms for Iterated Hebbian Learning new file mode 100644 index 0000000000..2185eff572 --- /dev/null +++ b/data/2024/aaai/What Do Hebbian Learners Learn? Reduction Axioms for Iterated Hebbian Learning @@ -0,0 +1 @@ +This paper is a contribution to neural network semantics, a foundational framework for neuro-symbolic AI. The key insight of this theory is that logical operators can be mapped to operators on neural network states. In this paper, we do this for a neural network learning operator. We map a dynamic operator [φ] to iterated Hebbian learning, a simple learning policy that updates a neural network by repeatedly applying Hebb's learning rule until the net reaches a fixed-point. Our main result is that we can "translate away" [φ]-formulas via reduction axioms. This means that completeness for the logic of iterated Hebbian learning follows from completeness of the base logic. These reduction axioms also provide (1) a human-interpretable description of iterated Hebbian learning as a kind of plausibility upgrade, and (2) an approach to building neural networks with guarantees on what they can learn. \ No newline at end of file diff --git a/data/2024/aaai/What Does a Query Answer Tell You? Informativeness of Query Answers for Knowledge Bases b/data/2024/aaai/What Does a Query Answer Tell You? Informativeness of Query Answers for Knowledge Bases new file mode 100644 index 0000000000..4458102fb7 --- /dev/null +++ b/data/2024/aaai/What Does a Query Answer Tell You? Informativeness of Query Answers for Knowledge Bases @@ -0,0 +1 @@ +Query answering for Knowledge Bases (KBs) amounts to extracting information from the various models of a KB, and presenting the user with an object that represents such information. In the vast majority of cases, this object consists of those tuples of constants that satisfy the query expression either in every model (certain answers) or in some model (possible answers). However, similarly to the case of incomplete databases, both these forms of answers are a lossy representation of all the knowledge inferable from the query and the queried KB. In this paper, we illustrate a formal framework to characterize the information that query answers for KBs are able to represent. As a first application of the framework, we study the informativeness of current query answering approaches, including the recently introduced partial answers. We then define a novel notion of answers, allowing repetition of variables across answer tuples. We show that these answers are capable of representing a meaningful form of information, and we also study their data complexity properties. \ No newline at end of file diff --git a/data/2024/aaai/What Effects the Generalization in Visual Reinforcement Learning: Policy Consistency with Truncated Return Prediction b/data/2024/aaai/What Effects the Generalization in Visual Reinforcement Learning: Policy Consistency with Truncated Return Prediction new file mode 100644 index 0000000000..34f373f1d5 --- /dev/null +++ b/data/2024/aaai/What Effects the Generalization in Visual Reinforcement Learning: Policy Consistency with Truncated Return Prediction @@ -0,0 +1 @@ +In visual Reinforcement Learning (RL), the challenge of generalization to new environments is paramount. This study pioneers a theoretical analysis of visual RL generalization, establishing an upper bound on the generalization objective, encompassing policy divergence and Bellman error components. 
Motivated by this analysis, we propose maintaining the cross-domain consistency for each policy in the policy space, which can reduce the divergence of the learned policy during the test. In practice, we introduce the Truncated Return Prediction (TRP) task, promoting cross-domain policy consistency by predicting truncated returns of historical trajectories. Moreover, we also propose a Transformer-based predictor for this auxiliary task. Extensive experiments on DeepMind Control Suite and Robotic Manipulation tasks demonstrate that TRP achieves state-of-the-art generalization performance. We further demonstrate that TRP outperforms previous methods in terms of sample efficiency during training. \ No newline at end of file diff --git a/data/2024/aaai/What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception b/data/2024/aaai/What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception new file mode 100644 index 0000000000..b71d2cb467 --- /dev/null +++ b/data/2024/aaai/What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception @@ -0,0 +1 @@ +Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources. This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative view (i.e., post-collaboration feature) and its underlying relationship to individual views (i.e., pre-collaboration features), which were treated as an opaque procedure by most existing works. We propose a novel framework named CMiMC (Contrastive Mutual Information Maximization for Collaborative Perception) for intermediate collaboration. The core philosophy of CMiMC is to preserve discriminative information of individual views in the collaborative view by maximizing mutual information between pre- and post-collaboration features while enhancing the efficacy of collaborative views by minimizing the loss function of downstream tasks. In particular, we define multi-view mutual information (MVMI) for intermediate collaboration that evaluates correlations between collaborative views and individual views on both global and local scales. We establish CMiMNet based on multi-view contrastive learning to realize estimation and maximization of MVMI, which assists the training of a collaborative encoder for voxel-level feature fusion. We evaluate CMiMC on V2X-Sim 1.0, and it improves the SOTA average precision by 3.08% and 4.44% at 0.5 and 0.7 IoU (Intersection-over-Union) thresholds, respectively. In addition, CMiMC can reduce communication volume to 1/32 while achieving performance comparable to SOTA. Code and Appendix are released at https://github.com/77SWF/CMiMC. \ No newline at end of file diff --git a/data/2024/aaai/What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation b/data/2024/aaai/What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation new file mode 100644 index 0000000000..5c824614d5 --- /dev/null +++ b/data/2024/aaai/What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation @@ -0,0 +1,2 @@ +Quantization has emerged as a promising technique for improving the memory and computational efficiency of large language models (LLMs). 
Though the trade-off between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this approach ``the lens of perturbation". Using this lens, we conduct experiments with various artificial perturbations to explore their impact on LLM performance. Our findings reveal several connections between the properties of perturbations and LLM performance, providing insights into the failure cases of uniform quantization and suggesting potential solutions to improve the robustness of LLM quantization. +To demonstrate the significance of our findings, we implement a simple non-uniform quantization approach based on our insights. Our experiments show that this approach achieves minimal performance degradation on both 4-bit weight quantization and 8-bit quantization for weights and activations. These results validate the correctness of our approach and highlight its potential to improve the efficiency of LLMs without sacrificing performance. \ No newline at end of file diff --git a/data/2024/aaai/What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection b/data/2024/aaai/What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection new file mode 100644 index 0000000000..b6e895b919 --- /dev/null +++ b/data/2024/aaai/What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection @@ -0,0 +1 @@ +The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the emergent effective approaches is continual learning. In this paper, we propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection. The fundamental concept underlying RWM involves categorizing all classes into two groups: those with compact feature distributions across tasks, such as genuine audio, and those with more spread-out distributions, like various types of fake audio. These distinctions are quantified by means of the in-class cosine distance, which subsequently serves as the basis for RWM to introduce a trainable gradient modification direction for distinct data types. Experimental evaluations against mainstream continual learning methods reveal the superiority of RWM in terms of knowledge acquisition and mitigating forgetting in audio deepfake detection. Furthermore, RWM's applicability extends beyond audio deepfake detection, demonstrating its potential significance in diverse machine learning domains such as image recognition. 
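As a rough illustration of the in-class cosine-distance statistic described in the audio-deepfake (RWM) abstract above, the sketch below computes the average pairwise cosine distance among one class's embeddings and uses it to split classes into compact and spread-out groups. The feature source, the threshold, and the grouping rule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def mean_in_class_cosine_distance(features: torch.Tensor) -> float:
    """Average pairwise cosine distance among embeddings of one class.

    features: (N, D) tensor of embeddings sharing the same label. A small value
    suggests a compact feature distribution; a large value a spread-out one.
    """
    n = features.size(0)
    if n < 2:
        return 0.0
    z = F.normalize(features, dim=-1)            # unit-norm embeddings
    sims = z @ z.T                               # (N, N) pairwise cosine similarities
    off_diag = sims.sum() - sims.diag().sum()    # drop self-similarities
    mean_sim = off_diag / (n * (n - 1))
    return float(1.0 - mean_sim)                 # cosine distance = 1 - similarity

def split_classes(class_features: dict, threshold: float = 0.3):
    """Hypothetical grouping rule: compact classes (e.g., genuine audio) vs.
    spread-out classes (e.g., fake audio); each group would then receive its own
    gradient modification direction during continual updates."""
    compact, spread = [], []
    for label, feats in class_features.items():
        (compact if mean_in_class_cosine_distance(feats) < threshold else spread).append(label)
    return compact, spread
```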
\ No newline at end of file diff --git a/data/2024/aaai/When CEGAR Meets Regression: A Love Story in Optimal Classical Planning b/data/2024/aaai/When CEGAR Meets Regression: A Love Story in Optimal Classical Planning new file mode 100644 index 0000000000..bae8e99bc5 --- /dev/null +++ b/data/2024/aaai/When CEGAR Meets Regression: A Love Story in Optimal Classical Planning @@ -0,0 +1,3 @@ +Counterexample-Guided Abstraction Refinement (CEGAR) is a prominent technique to generate Cartesian abstractions for guiding search in cost-optimal planning. The core idea is to iteratively refine the abstraction by finding a flaw in the current optimal abstract plan. All existing approaches find these flaws by executing the abstract plan using progression in the original state space. + +Instead, we propose to do backward refinements by using regression from the goals. This results in a new type of flaw that can identify invalid plan suffixes. The resulting abstractions are less focused on the initial state, but more informative on average, significantly improving the performance of current CEGAR-based techniques. Furthermore, they can be combined with forward refinements in several bidirectional strategies that provide the benefits of both methods. \ No newline at end of file diff --git a/data/2024/aaai/When Causal Inference Meets Graph Machine Learning b/data/2024/aaai/When Causal Inference Meets Graph Machine Learning new file mode 100644 index 0000000000..4abfd51af2 --- /dev/null +++ b/data/2024/aaai/When Causal Inference Meets Graph Machine Learning @@ -0,0 +1 @@ +Graphs (i.e., networks) are ubiquitous in daily life, as they can effectively model a plethora of real-world systems with connected units, such as social networks and biological networks. Recent years have witnessed rapid development in graph-based machine learning (GML) in various high-impact domains. Currently, the mainstream GML methods are based on statistical learning, e.g., utilizing the statistical correlations between node features, graph structure, and labels for node classification. However, statistical learning has been widely criticized for only capturing the superficial relations between variables in the data system, which consequently leads to a lack of trustworthiness in real-world applications. Therefore, it is crucial to understand the causality in the data system and the learning process. Causal inference is the discipline that investigates the causality inside a system, for example, to identify and estimate the causal effect of a certain treatment (e.g., wearing a face mask) on an important outcome (e.g., COVID-19 infection). Involving the concepts and philosophy of causal inference in ML methods is often considered significant for human-level intelligence and can serve as the foundation of artificial intelligence (AI). However, most traditional causal inference studies rely on strong assumptions, and focus on independent and identically distributed (i.i.d.) data, while causal inference on graphs is faced with many barriers. Therefore, we aim to bridge the gap between causal inference and GML. \ No newline at end of file diff --git a/data/2024/aaai/When Do Program-of-Thought Works for Reasoning? b/data/2024/aaai/When Do Program-of-Thought Works for Reasoning? 
@@ -0,0 +1 @@ +As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and what directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses. \ No newline at end of file diff --git a/data/2024/aaai/When Model Meets New Normals: Test-Time Adaptation for Unsupervised Time-Series Anomaly Detection b/data/2024/aaai/When Model Meets New Normals: Test-Time Adaptation for Unsupervised Time-Series Anomaly Detection new file mode 100644 index 0000000000..e4db1c2870 --- /dev/null +++ b/data/2024/aaai/When Model Meets New Normals: Test-Time Adaptation for Unsupervised Time-Series Anomaly Detection @@ -0,0 +1 @@ +Time-series anomaly detection deals with the problem of detecting anomalous timesteps by learning normality from the sequence of observations. However, the concept of normality evolves over time, leading to a "new normal problem", where the distribution of normality can be changed due to the distribution shifts between training and test data. This paper highlights the prevalence of the new normal problem in unsupervised time-series anomaly detection studies. To tackle this issue, we propose a simple yet effective test-time adaptation strategy based on trend estimation and a self-supervised approach to learning new normalities during inference. Extensive experiments on real-world benchmarks demonstrate that incorporating the proposed strategy into the anomaly detector consistently improves the model's performances compared to the existing baselines, leading to robustness to the distribution shifts. \ No newline at end of file diff --git a/data/2024/aaai/When Sparse Graph Representation Learning Falls into Domain Shift: Data Augmentation for Cross-Domain Graph Meta-Learning (Student Abstract) b/data/2024/aaai/When Sparse Graph Representation Learning Falls into Domain Shift: Data Augmentation for Cross-Domain Graph Meta-Learning (Student Abstract) new file mode 100644 index 0000000000..70bd2d0a5c --- /dev/null +++ b/data/2024/aaai/When Sparse Graph Representation Learning Falls into Domain Shift: Data Augmentation for Cross-Domain Graph Meta-Learning (Student Abstract) @@ -0,0 +1 @@ +Cross-domain Graph Meta-learning (CGML) has shown its promise, where meta-knowledge is extracted from few-shot graph data in multiple relevant but distinct domains. 
However, several recent efforts assume that target data are available, which is commonly not the case in practice. In this paper, we devise a novel Cross-domain Data Augmentation for Graph Meta-Learning (CDA-GML), which incorporates the advantages of CGML and Data Augmentation and simultaneously addresses the intractable shortcomings of label sparsity, domain shift, and the absence of target data. Specifically, our method simulates instance-level and task-level domain shift to alleviate the cross-domain generalization issue in conventional graph meta-learning. Experiments show that our method outperforms the existing state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/aaai/When Your AI Becomes a Target: AI Security Incidents and Best Practices b/data/2024/aaai/When Your AI Becomes a Target: AI Security Incidents and Best Practices new file mode 100644 index 0000000000..2c4c4c3c82 --- /dev/null +++ b/data/2024/aaai/When Your AI Becomes a Target: AI Security Incidents and Best Practices @@ -0,0 +1,3 @@ +In contrast to vast academic efforts to study AI security, few real-world reports of AI security incidents exist. Released incidents prevent a thorough investigation of the attackers' motives, as crucial information about the company and AI application is missing. As a consequence, it often remains unknown how to avoid incidents. +We tackle this gap and combine previous reports with freshly collected incidents into a small database of 32 AI security incidents. We analyze the attackers' target and goal, influencing factors, causes, and mitigations. Many incidents stem from non-compliance with best practices in security and privacy-enhancing technologies. +In the case of direct AI attacks, access control may provide some mitigation, but there is little scientific work on best practices. Our paper is thus a call for action to address these gaps. \ No newline at end of file diff --git a/data/2024/aaai/When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks b/data/2024/aaai/When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks new file mode 100644 index 0000000000..64b4f17e2d --- /dev/null +++ b/data/2024/aaai/When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks @@ -0,0 +1 @@ +Neural growth is the process of growing a small neural network to a large network and has been utilized to accelerate the training of deep neural networks. One crucial aspect of neural growth is determining the optimal growth timing. However, few studies investigate this systematically. Our study reveals that neural growth inherently exhibits a regularization effect, whose intensity is influenced by the chosen policy for growth timing. While this regularization effect may mitigate the overfitting risk of the model, it may lead to a notable accuracy drop when the model underfits. Yet, current approaches have not addressed this issue due to their lack of consideration of the regularization effect from neural growth. Motivated by these findings, we propose an under-/overfitting risk-aware growth timing policy, which automatically adjusts the growth timing informed by the level of potential under/overfitting risks to address both risks. Comprehensive experiments conducted using CIFAR-10/100 and ImageNet datasets show that the proposed policy achieves accuracy improvements of up to 1.3% in models prone to underfitting while achieving similar accuracies in models suffering from overfitting compared to the existing methods. 
\ No newline at end of file diff --git a/data/2024/aaai/When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming b/data/2024/aaai/When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming new file mode 100644 index 0000000000..7955ad4c5e --- /dev/null +++ b/data/2024/aaai/When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming @@ -0,0 +1 @@ +AI powered code-recommendation systems, such as Copilot and CodeWhisperer, provide code suggestions inside a programmer's environment (e.g., an IDE) with the aim of improving productivity. We pursue mechanisms for leveraging signals about programmers' acceptance and rejection of code suggestions to guide recommendations. We harness data drawn from interactions with GitHub Copilot, a system used by millions of programmers, to develop interventions that can save time for programmers. We introduce a utility-theoretic framework to drive decisions about suggestions to display versus withhold. The approach, conditional suggestion display from human feedback (CDHF), relies on a cascade of models that provide the likelihood that recommended code will be accepted. These likelihoods are used to selectively hide suggestions, reducing both latency and programmer verification time. Using data from 535 programmers, we perform a retrospective evaluation of CDHF and show that we can avoid displaying a significant fraction of suggestions that would have been rejected. We further demonstrate the importance of incorporating the programmer's latent unobserved state in decisions about when to display suggestions through an ablation study. Finally, we showcase how using suggestion acceptance as a reward signal for guiding the display of suggestions can lead to suggestions of reduced quality, indicating an unexpected pitfall. \ No newline at end of file diff --git a/data/2024/aaai/Where and How to Attack? A Causality-Inspired Recipe for Generating Counterfactual Adversarial Examples b/data/2024/aaai/Where and How to Attack? A Causality-Inspired Recipe for Generating Counterfactual Adversarial Examples new file mode 100644 index 0000000000..a156c4dc78 --- /dev/null +++ b/data/2024/aaai/Where and How to Attack? A Causality-Inspired Recipe for Generating Counterfactual Adversarial Examples @@ -0,0 +1 @@ +Deep neural networks (DNNs) have been demonstrated to be vulnerable to well-crafted adversarial examples, which are generated through either well-conceived L_p-norm restricted or unrestricted attacks. Nevertheless, the majority of those approaches assume that adversaries can modify any features as they wish, and neglect the causal generating process of the data, which is unreasonable and unpractical. For instance, a modification in income would inevitably impact features like the debt-to-income ratio within a banking system. By considering the underappreciated causal generating process, first, we pinpoint the source of the vulnerability of DNNs via the lens of causality, then give theoretical results to answer where to attack. Second, considering the consequences of the attack interventions on the current state of the examples to generate more realistic adversarial examples, we propose CADE, a framework that can generate Counterfactual ADversarial Examples to answer how to attack. The empirical results demonstrate CADE's effectiveness, as evidenced by its competitive performance across diverse attack scenarios, including white-box, transfer-based, and random intervention attacks. 
\ No newline at end of file diff --git a/data/2024/aaai/Which Is More Effective in Label Noise Cleaning, Correction or Filtering? b/data/2024/aaai/Which Is More Effective in Label Noise Cleaning, Correction or Filtering? new file mode 100644 index 0000000000..ecb690fda1 --- /dev/null +++ b/data/2024/aaai/Which Is More Effective in Label Noise Cleaning, Correction or Filtering? @@ -0,0 +1 @@ +Most noise cleaning methods adopt either the correction or the filtering mode to build robust models. However, their effectiveness, applicability, and hyper-parameter insensitivity have not been carefully studied. We compare the two cleaning modes via a rebuilt error bound in noisy environments. At the dataset level, Theorem 5 implies that correction is more effective than filtering when the cleaned datasets have close noise rates. At the sample level, Theorem 6 indicates that confident label noises (large noise probabilities) are more suitable to be corrected, and unconfident noises (medium noise probabilities) should be filtered. Besides, an imperfect hyper-parameter may have fewer negative impacts on filtering than on correction. Unlike existing methods with a single cleaning mode, the proposed Fusion cleaning framework of Correction and Filtering (FCF) combines the advantages of different modes to deal with diverse suspicious labels. Experimental results demonstrate that our FCF method can achieve state-of-the-art performance on benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/Who Knows the Answer? Finding the Best Model and Prompt for Each Query Using Confidence-Based Search b/data/2024/aaai/Who Knows the Answer? Finding the Best Model and Prompt for Each Query Using Confidence-Based Search new file mode 100644 index 0000000000..1c0a552e4c --- /dev/null +++ b/data/2024/aaai/Who Knows the Answer? Finding the Best Model and Prompt for Each Query Using Confidence-Based Search @@ -0,0 +1 @@ +There are increasingly many large language models (LLMs) available to the public. While these LLMs have exhibited impressive abilities on a variety of tasks, any individual LLM in particular may do well on some tasks and worse on others. Additionally, the performance of these models is heavily dependent on the choice of prompt template used. For instance, they exhibit sensitivity to the few-shot examples chosen or brittleness to the wording of instructions. Moreover, a prompt template that makes a model perform well for one input may not be the optimal template for another input. This necessitates an approach for adaptively selecting LLM and prompt template pairs for each input. Recent work has shown that the accuracy of an LLM's responses is correlated with the LLM's confidence in the response. Thus, a natural choice for selecting which model and prompt template to use is to select the pair that is most confident in its response. However, existing confidence metrics are expensive to calculate, necessitating multiple calls to each LLM and prompt pair. We thus propose an approach to predict the confidence of each pair using an auxiliary regression model that is inexpensive to run. Using this auxiliary model, we select the LLM and prompt template with the highest predicted confidence for a given input. Results on a range of benchmark datasets show that our confidence-based instance-level prompt search method consistently improves the performance of LLMs. 
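The selection rule described in the "Who Knows the Answer?" abstract above reduces to an argmax over predicted confidences; a minimal sketch follows, assuming a cheap featurizer and an sklearn-style auxiliary regressor (`featurize`, `confidence_regressor`, and `call_llm` are placeholders, not the paper's code).

```python
from itertools import product

def select_pair(query, llms, templates, confidence_regressor, featurize):
    """Pick the (LLM, prompt template) pair whose predicted confidence is highest."""
    candidates = list(product(llms, templates))
    feats = [featurize(query, model, template) for model, template in candidates]  # no LLM calls
    scores = confidence_regressor.predict(feats)  # inexpensive auxiliary model, one batch
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]

# Usage sketch: only the selected pair triggers an actual LLM call.
# model, template = select_pair(q, ["llm-a", "llm-b"], [t1, t2], regressor, featurize)
# answer = call_llm(model, template.format(query=q))   # hypothetical call
```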
\ No newline at end of file diff --git a/data/2024/aaai/WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia b/data/2024/aaai/WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia new file mode 100644 index 0000000000..a1645c6b2a --- /dev/null +++ b/data/2024/aaai/WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia @@ -0,0 +1 @@ +Wikipedia can be edited by anyone and thus contains various quality sentences. Therefore, Wikipedia includes some poor-quality edits, which are often marked up by other editors. While editors' reviews enhance the credibility of Wikipedia, it is hard to check all edited text. Assisting in this process is very important, but a large and comprehensive dataset for studying it does not currently exist. Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia. Each sentence is extracted from the entire revision history of English Wikipedia, and the target quality labels were carefully investigated and selected. WikiSQE has about 3.4 M sentences with 153 quality labels. In the experiment with automatic classification using competitive machine learning models, sentences that had problems with citation, syntax/semantics, or propositions were found to be more difficult to detect. In addition, by performing human annotation, we found that the model we developed performed better than the crowdsourced workers. WikiSQE is expected to be a valuable resource for other tasks in NLP. \ No newline at end of file diff --git a/data/2024/aaai/Winnie: Task-Oriented Dialog System with Structure-Aware Contrastive Learning and Enhanced Policy Planning b/data/2024/aaai/Winnie: Task-Oriented Dialog System with Structure-Aware Contrastive Learning and Enhanced Policy Planning new file mode 100644 index 0000000000..743b96087b --- /dev/null +++ b/data/2024/aaai/Winnie: Task-Oriented Dialog System with Structure-Aware Contrastive Learning and Enhanced Policy Planning @@ -0,0 +1 @@ +Pre-trained encoder-decoder models are widely applied in Task-Oriented Dialog (TOD) systems on the session level, mainly focusing on modeling the dialog semantic information. Dialogs imply structural information indicating the interaction among user utterances, belief states, database search results, system acts and responses, which is also crucial for TOD systems. In addition, for the system acts, additional pre-training and datasets are considered to improve their accuracies, undoubtedly introducing a burden. Therefore, a novel end-to-end TOD system named Winnie is proposed in this paper to improve the TOD performance. First, to make full use of the intrinsic structural information, supervised contrastive learning is adopted to narrow the gap in the representation space between text representations of the same category and enlarge the overall continuous representation margin between text representations of different categories in dialog context. Then, a system act classification task is introduced for policy optimization during fine-tuning. Empirical results show that Winnie substantially improves the performance of the TOD system. By introducing the supervised contrastive and system act classification losses, Winnie achieves state-of-the-art results on benchmark datasets, including MultiWOZ2.2, In-Car, and Camrest676. Their end-to-end combined scores are improved by 3.2, 1.9, and 1.1 points, respectively. 
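The Winnie abstract above relies on a supervised contrastive objective that pulls text representations of the same category together and pushes different categories apart. A generic supervised contrastive loss of that form is sketched below; treating it as Winnie's exact loss is an assumption.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor, labels: torch.Tensor, temperature: float = 0.07):
    """embeddings: (N, D) text representations; labels: (N,) category ids."""
    z = F.normalize(embeddings, dim=-1)
    logits = z @ z.T / temperature                            # (N, N) scaled similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float("-inf"))     # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)           # avoid -inf * 0 on the diagonal
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_count  # average over positives
    return loss.mean()
```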
\ No newline at end of file diff --git a/data/2024/aaai/Working Memory Capacity of ChatGPT: An Empirical Study b/data/2024/aaai/Working Memory Capacity of ChatGPT: An Empirical Study new file mode 100644 index 0000000000..2c11479d62 --- /dev/null +++ b/data/2024/aaai/Working Memory Capacity of ChatGPT: An Empirical Study @@ -0,0 +1 @@ +Working memory is a critical aspect of both human intelligence and artificial intelligence, serving as a workspace for the temporary storage and manipulation of information. In this paper, we systematically assess the working memory capacity of ChatGPT, a large language model developed by OpenAI, by examining its performance in verbal and spatial n-back tasks under various conditions. Our experiments reveal that ChatGPT has a working memory capacity limit strikingly similar to that of humans. Furthermore, we investigate the impact of different instruction strategies on ChatGPT's performance and observe that the fundamental patterns of a capacity limit persist. From our empirical findings, we propose that n-back tasks may serve as tools for benchmarking the working memory capacity of large language models and hold potential for informing future efforts aimed at enhancing AI working memory. \ No newline at end of file diff --git a/data/2024/aaai/Worst-Case VCG Redistribution Mechanism Design Based on the Lottery Ticket Hypothesis b/data/2024/aaai/Worst-Case VCG Redistribution Mechanism Design Based on the Lottery Ticket Hypothesis new file mode 100644 index 0000000000..2c84b4a1a6 --- /dev/null +++ b/data/2024/aaai/Worst-Case VCG Redistribution Mechanism Design Based on the Lottery Ticket Hypothesis @@ -0,0 +1,7 @@ +We study worst-case VCG redistribution mechanism design for the public project problem. The mechanism design task comes down to designing a payment function that maximizes the worst-case allocative efficiency ratio. + +We use a multilayer perceptron (MLP) with ReLU activation to model the payment function and use mixed integer programming (MIP) to solve for the worst-case type profiles that maximally violate the mechanism design constraints. We collect these worst-case type profiles and use them as training samples to train toward better worst-case mechanisms. + +In practice, we require a tiny neural network structure for the above approach to scale. The Lottery Ticket Hypothesis states that a large network is likely to contain a "winning ticket" -- a much smaller subnetwork that "won the initialization lottery", which makes its training particularly effective. Motivated by this hypothesis, we train a large network and prune it into a tiny subnetwork. We run MIP-based worst-case training on the drawn subnetwork and evaluate the resulting mechanism's worst-case performance. If the subnetwork does not achieve good worst-case performance, then we record the type profiles that cause the current draw to be bad. To draw again, we restore the large network to its initial weights and prune using recorded type profiles from earlier draws, therefore avoiding drawing the same ticket twice. We expect to eventually encounter a tiny subnetwork that leads to effective training for our worst-case mechanism design task. Lastly, a by-product of multiple ticket draws is an ensemble of mechanisms with different worst cases, which improves the worst-case performance further. + +Using our approach, we find previously unknown optimal mechanisms for up to 5 agents. Our results confirm the tightness of existing theoretical upper bounds. 
For up to 20 agents, we derive significantly improved worst-case mechanisms, surpassing a long list of existing manual results. \ No newline at end of file diff --git a/data/2024/aaai/Would You Like Your Data to Be Trained? A User Controllable Recommendation Framework b/data/2024/aaai/Would You Like Your Data to Be Trained? A User Controllable Recommendation Framework new file mode 100644 index 0000000000..7ded268737 --- /dev/null +++ b/data/2024/aaai/Would You Like Your Data to Be Trained? A User Controllable Recommendation Framework @@ -0,0 +1 @@ +Recommender systems have a significant impact on various real-world applications, shaping people's daily lives and enhancing productivity. Traditional recommender models aim to collect extensive user information to accurately estimate user preferences. However, in practical scenarios, users may not want all their behaviors to be included in the model training process. This paper introduces a novel recommendation paradigm that allows users to indicate their ``willingness'' regarding which data should contribute to model training. The models are then optimized to maximize utility, which considers the trade-off between recommendation performance and respecting user preferences. The recommendation problem is formulated as a multiplayer game, with each user acting as a player and using a selection vector to indicate their willingness to include specific interacted items in training. To efficiently solve this game, an influence function-based model is proposed to approximate recommendation performances for different actions without re-optimizing the model. Furthermore, an enhanced model leveraging multiple anchor actions for the influence function is introduced to improve performance approximation accuracy. The convergence rate of the algorithm is theoretically analyzed, and the advantages of incorporating multiple anchor actions are demonstrated. Extensive experiments on both simulated and real-world datasets validate the effectiveness of the proposed models in balancing recommendation quality and user willingness. To promote this research direction, we have released our project at https://paitesanshi.github.io/IFRQE/. \ No newline at end of file diff --git a/data/2024/aaai/X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks b/data/2024/aaai/X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks new file mode 100644 index 0000000000..ab9fe0c3ca --- /dev/null +++ b/data/2024/aaai/X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks @@ -0,0 +1 @@ +Referring 3D instance segmentation is a challenging task aimed at accurately segmenting a target instance within a 3D scene based on a given referring expression. However, previous methods have overlooked the distinct roles played by different words in referring expressions. Additionally, they have failed to incorporate the positional relationship within referring expressions with the spatial correlations in 3D scenes. To alleviate these issues, we present a novel model called X-RefSeg3D, which constructs a cross-modal graph for the input 3D scene and unites textual and spatial relationships for reasoning via graph neural networks. Our approach begins by capturing object-specific text features, which are then fused with the instance features to construct a comprehensive cross-modal scene graph. 
Subsequently, we integrate the obtained cross-modal features into graph neural networks, leveraging the K-nearest algorithm to derive explicit instructions from expressions and factual relationships in scenes. This enables the effective capture of higher-order relationships among instances, thereby enhancing feature fusion and facilitating reasoning. Finally, the refined feature undergoes a matching module to compute the ultimate matching score. Experimental results on ScanRefer demonstrate the effectiveness of our method, surpassing previous approaches by a substantial margin of +3.67% in terms of mIOU. \ No newline at end of file diff --git a/data/2024/aaai/X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer b/data/2024/aaai/X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer new file mode 100644 index 0000000000..633d9c8cc2 --- /dev/null +++ b/data/2024/aaai/X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer @@ -0,0 +1 @@ +The field of 4D point cloud understanding is rapidly developing with the goal of analyzing dynamic 3D point cloud sequences. However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point cloud poses a difficulty in aligning temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of an 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). The GIT combines visual texture and temporal correlation features to offer rich semantics and dynamics for better point cloud representation. During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation and semantic segmentation. The results achieve 1st places, i.e., 85.3% (+7.9%) accuracy and 47.3% (+5.0%) mIoU for 4D action segmentation and semantic segmentation, on the HOI4D challenge, outperforming previous state-of-the-art by a large margin. We release the code at https://github.com/jinglinglingling/X4D. \ No newline at end of file diff --git a/data/2024/aaai/XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning b/data/2024/aaai/XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning new file mode 100644 index 0000000000..9df18c70aa --- /dev/null +++ b/data/2024/aaai/XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning @@ -0,0 +1,2 @@ +We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo objectives. First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams. 
Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle domain discrepancy between audio and visual modalities, enabling effective cross-modal knowledge distillation. +Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by 8% to 14% on UCF101, HMDB51, and Kinetics400. Additionally, XKD improves multimodal action classification by 5.5% on Kinetics-Sound. XKD shows state-of-the-art performance in sound classification on ESC50, achieving top-1 accuracy of 96.5%. \ No newline at end of file diff --git a/data/2024/aaai/Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation b/data/2024/aaai/Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation new file mode 100644 index 0000000000..e5dcf04306 --- /dev/null +++ b/data/2024/aaai/Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation @@ -0,0 +1 @@ +New Natural Language Processing (NLP) benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions across 516 diverse disciplines drawn from 13 different subjects, and is accompanied by Xiezhi-Specialty with 14,041 questions and Xiezhi-Interdiscipline with 10,746 questions. We conduct an evaluation of 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. All the evaluation code and data are open-sourced at https://github.com/MikeGu721/XiezhiBenchmark \ No newline at end of file diff --git a/data/2024/aaai/YTCommentQA: Video Question Answerability in Instructional Videos b/data/2024/aaai/YTCommentQA: Video Question Answerability in Instructional Videos new file mode 100644 index 0000000000..a1ddfad88d --- /dev/null +++ b/data/2024/aaai/YTCommentQA: Video Question Answerability in Instructional Videos @@ -0,0 +1 @@ +Instructional videos provide detailed how-to guides for various tasks, with viewers often posing questions regarding the content. Addressing these questions is vital for comprehending the content, yet receiving immediate answers is difficult. While numerous computational models have been developed for Video Question Answering (Video QA) tasks, they are primarily trained on questions generated based on video content, aiming to produce answers from within the content. However, in real-world situations, users may pose questions that go beyond the video's informational boundaries, highlighting the necessity to determine if a video can provide the answer. Discerning whether a question can be answered by video content is challenging due to the multi-modal nature of videos, where visual and verbal information are intertwined. To bridge this gap, we present the YTCommentQA dataset, which contains naturally-generated questions from YouTube, categorized by their answerability and required modality to answer -- visual, script, or both. 
Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning. The dataset is available at https://github.com/lgresearch/YTCommentQA. \ No newline at end of file diff --git a/data/2024/aaai/You Only Read Once: Constituency-Oriented Relational Graph Convolutional Network for Multi-Aspect Multi-Sentiment Classification b/data/2024/aaai/You Only Read Once: Constituency-Oriented Relational Graph Convolutional Network for Multi-Aspect Multi-Sentiment Classification new file mode 100644 index 0000000000..df95d5cbd9 --- /dev/null +++ b/data/2024/aaai/You Only Read Once: Constituency-Oriented Relational Graph Convolutional Network for Multi-Aspect Multi-Sentiment Classification @@ -0,0 +1 @@ +Most of the existing aspect-based sentiment analysis (ABSA) models only predict the sentiment polarity of a single aspect at a time, focusing primarily on enhancing the representation of this single aspect based on the other contexts or aspects. This one-to-one paradigm ignores the fact that multi-aspect, multi-sentiment sentences contain not only distinct specific descriptions for distinct specific aspects, but also shared global context information for multiple aspects. To fully consider these issues, we propose a one-to-many ABSA framework, called You Only Read Once (YORO), that can simultaneously model representations of all aspects based on their specific descriptions and better fuse their relationships using globally shared contextual information in the sentence. Predicting the sentiment polarity of multiple aspects simultaneously is beneficial to improving the efficacy of calculation and prediction. Extensive experiments are conducted on three public datasets (MAMS, Rest14, and Lap14). Experimental results demonstrate the effectiveness of YORO in handling multi-aspect, multi-sentiment scenarios and highlight the promise of one-to-many ABSA in balancing efficiency and accuracy. \ No newline at end of file diff --git a/data/2024/aaai/Your Career Path Matters in Person-Job Fit b/data/2024/aaai/Your Career Path Matters in Person-Job Fit new file mode 100644 index 0000000000..6be113f3b5 --- /dev/null +++ b/data/2024/aaai/Your Career Path Matters in Person-Job Fit @@ -0,0 +1 @@ +We are again confronted with one of the most vexing aspects of the advancement of technology: automation and AI technology cause the devaluation of human labor, resulting in unemployment. With this background, automatic person-job fit systems are promising solutions to promote the employment rate. The purpose of person-job fit is to calculate a matching score between the job seeker's resume and the job posting, determining whether the job seeker is suitable for the position. In this paper, we propose a new approach to person-job fit that characterizes the hidden preference derived from the job seeker's career path. We categorize and utilize three types of preferences in the career path: consistency, likeness, and continuity. We prove that understanding the career path enables us to provide more appropriate career suggestions to job seekers. To demonstrate the practical value of our proposed model, we conduct extensive experiments on real-world data extracted from an online recruitment platform and then present detailed cases to show how the career path matters in person-job fit. 
\ No newline at end of file diff --git a/data/2024/aaai/Your Prompt Is My Command: On Assessing the Human-Centred Generality of Multimodal Models (Abstract Reprint) b/data/2024/aaai/Your Prompt Is My Command: On Assessing the Human-Centred Generality of Multimodal Models (Abstract Reprint) new file mode 100644 index 0000000000..1396aac0e5 --- /dev/null +++ b/data/2024/aaai/Your Prompt Is My Command: On Assessing the Human-Centred Generality of Multimodal Models (Abstract Reprint) @@ -0,0 +1 @@ +Even with obvious deficiencies, large prompt-commanded multimodal models are proving to be flexible cognitive tools representing an unprecedented generality. But the directness, diversity, and degree of user interaction create a distinctive “human-centred generality” (HCG), rather than a fully autonomous one. HCG implies that —for a specific user— a system is only as general as it is effective for the user’s relevant tasks and their prevalent ways of prompting. A human-centred evaluation of general-purpose AI systems therefore needs to reflect the personal nature of interaction, tasks and cognition. We argue that the best way to understand these systems is as highly-coupled cognitive extenders, and to analyse the bidirectional cognitive adaptations between them and humans. In this paper, we give a formulation of HCG, as well as a high-level overview of the elements and trade-offs involved in the prompting process. We end the paper by outlining some essential research questions and suggestions for improving evaluation practices, which we envision as characteristic for the evaluation of general artificial intelligence in the future. \ No newline at end of file diff --git a/data/2024/aaai/ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization b/data/2024/aaai/ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization new file mode 100644 index 0000000000..6e59a37ffd --- /dev/null +++ b/data/2024/aaai/ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization @@ -0,0 +1 @@ +Lowering the memory requirement in full-parameter training on large models has become a hot research area. MeZO fine-tunes the large language models (LLMs) by just forward passes in a zeroth-order SGD optimizer (ZO-SGD), demonstrating excellent performance with the same GPU memory usage as inference. However, the simulated perturbation stochastic approximation for gradient estimate in MeZO leads to severe oscillations and incurs a substantial time overhead. Moreover, without momentum regularization, MeZO shows severe over-fitting problems. Lastly, the perturbation-irrelevant momentum on ZO-SGD does not improve the convergence rate. This study proposes ZO-AdaMU to resolve the above problems by adapting the simulated perturbation with momentum in its stochastic approximation. Unlike existing adaptive momentum methods, we relocate momentum on simulated perturbation in stochastic gradient approximation. Our convergence analysis and experiments prove this is a better way to improve convergence stability and rate in ZO-SGD. Extensive experiments demonstrate that ZO-AdaMU yields better generalization for LLMs fine-tuning across various NLP tasks than MeZO and its momentum variants. 
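For the ZO-AdaMU abstract above, the underlying zeroth-order update can be illustrated with a two-point simulated-perturbation estimate that needs only forward passes; blending a momentum term into the perturbation itself mirrors the abstract's idea in a much-simplified form. Names, hyperparameters, and the exact smoothing scheme are assumptions, not the published algorithm.

```python
import torch

def zo_step(params, loss_fn, perturb_momentum, lr=1e-6, eps=1e-3, beta=0.9):
    """One zeroth-order update estimated from two forward passes only.

    params: list of tensors updated in place; perturb_momentum: running perturbation
    state with the same shapes (momentum applied to the simulated perturbation).
    """
    with torch.no_grad():
        zs = []
        for p, m in zip(params, perturb_momentum):
            z = beta * m + (1.0 - beta) * torch.randn_like(p)  # momentum-smoothed perturbation
            m.copy_(z)
            zs.append(z)

        for p, z in zip(params, zs):            # theta + eps * z
            p.add_(eps * z)
        loss_plus = float(loss_fn())
        for p, z in zip(params, zs):            # theta - eps * z
            p.sub_(2.0 * eps * z)
        loss_minus = float(loss_fn())
        for p, z in zip(params, zs):            # restore theta
            p.add_(eps * z)

        grad_scale = (loss_plus - loss_minus) / (2.0 * eps)   # projected gradient estimate
        for p, z in zip(params, zs):
            p.sub_(lr * grad_scale * z)
```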
\ No newline at end of file diff --git a/data/2024/aaai/ZOOM: Learning Video Mirror Detection with Extremely-Weak Supervision b/data/2024/aaai/ZOOM: Learning Video Mirror Detection with Extremely-Weak Supervision new file mode 100644 index 0000000000..6120f8a829 --- /dev/null +++ b/data/2024/aaai/ZOOM: Learning Video Mirror Detection with Extremely-Weak Supervision @@ -0,0 +1 @@ +Mirror detection is an active research topic in computer vision. However, all existing mirror detectors learn mirror representations from large-scale pixel-wise datasets, which are tedious and expensive to obtain. Although weakly-supervised learning has been widely explored in related topics, we note that popular weak supervision signals (e.g., bounding boxes, scribbles, points) still require some efforts from the user to locate the target objects, with a strong assumption that the images to annotate always contain the target objects. Such an assumption may result in the over-segmentation of mirrors. Our key idea of this work is that the existence of mirrors over a time period may serve as a weak supervision to train a mirror detector, for two reasons. First, if a network can predict the existence of mirrors, it can essentially locate the mirrors. Second, we observe that the reflected contents of a mirror tend to be similar to those in adjacent frames, but exhibit considerable contrast to regions in far-away frames (e.g., non-mirror frames). To this end, in this paper, we propose ZOOM, the first method to learn robust mirror representations from extremely-weak annotations of per-frame ZerO-One Mirror indicators in videos. The key insight of ZOOM is to model the similarity and contrast (between mirror and non-mirror regions) in temporal variations to locate and segment the mirrors. To this end, we propose a novel fusion strategy to leverage temporal consistency information for mirror localization, and a novel temporal similarity-contrast modeling module for mirror segmentation. We construct a new video mirror dataset for training and evaluation. Experimental results under new and standard metrics show that ZOOM performs favorably against existing fully-supervised mirror detection methods. \ No newline at end of file diff --git a/data/2024/aaai/Zero-1-to-3: Domain-Level Zero-Shot Cognitive Diagnosis via One Batch of Early-Bird Students towards Three Diagnostic Objectives b/data/2024/aaai/Zero-1-to-3: Domain-Level Zero-Shot Cognitive Diagnosis via One Batch of Early-Bird Students towards Three Diagnostic Objectives new file mode 100644 index 0000000000..6bc8a23c57 --- /dev/null +++ b/data/2024/aaai/Zero-1-to-3: Domain-Level Zero-Shot Cognitive Diagnosis via One Batch of Early-Bird Students towards Three Diagnostic Objectives @@ -0,0 +1 @@ +Cognitive diagnosis seeks to estimate the cognitive states of students by exploring their logged practice quiz data. It plays a pivotal role in personalized learning guidance within intelligent education systems. In this paper, we focus on an important, practical, yet often underexplored task: domain-level zero-shot cognitive diagnosis (DZCD), which arises due to the absence of student practice logs in newly launched domains. Recent cross-domain diagnostic models have been demonstrated to be a promising strategy for DZCD. These methods primarily focus on how to transfer student states across domains. However, they might inadvertently incorporate non-transferable information into student representations, thereby limiting the efficacy of knowledge transfer. 
To tackle this, we propose Zero-1-to-3, a domain-level zero-shot cognitive diagnosis framework via one batch of early-bird students towards three diagnostic objectives. Our approach initiates with pre-training a diagnosis model with dual regularizers, which decouples student states into domain-shared and domain-specific parts. The shared cognitive signals can be transferred to the target domain, enriching the cognitive priors for the new domain, which ensures the cognitive state propagation objective. Subsequently, we devise a strategy to generate simulated practice logs for cold-start students through analyzing the behavioral patterns from early-bird students, fulfilling the domain-adaption goal. Consequently, we refine the cognitive states of cold-start students as diagnostic outcomes via virtual data, aligning with the diagnosis-oriented goal. Finally, extensive experiments on six real-world datasets highlight the efficacy of our model for DZCD and its practical application in question recommendation. The code is publicly available at https://github.com/bigdata-ustc/Zero-1-to-3. \ No newline at end of file diff --git a/data/2024/aaai/Zero-Shot Aerial Object Detection with Visual Description Regularization b/data/2024/aaai/Zero-Shot Aerial Object Detection with Visual Description Regularization new file mode 100644 index 0000000000..eefc61de00 --- /dev/null +++ b/data/2024/aaai/Zero-Shot Aerial Object Detection with Visual Description Regularization @@ -0,0 +1,4 @@ +Existing object detection models are mainly trained on large-scale labeled datasets. However, annotating data for novel aerial object classes is expensive since it is time-consuming and may require expert knowledge. Thus, it is desirable to study label-efficient object detection methods on aerial images. In this work, we propose a zero-shot method for aerial object detection named visual Description Regularization, or DescReg. +Concretely, we identify the weak semantic-visual correlation of the aerial objects and aim to address the challenge with prior descriptions of their visual appearance. Instead of directly encoding the descriptions into class embedding space which suffers from the representation gap problem, we propose to infuse the prior inter-class visual similarity conveyed in the descriptions into the embedding learning. The infusion process is accomplished with a newly designed similarity-aware triplet loss which incorporates structured regularization on the representation space. We conduct extensive experiments with three challenging aerial object detection datasets, including DIOR, xView, and DOTA. The results demonstrate that DescReg significantly outperforms the state-of-the-art ZSD methods with complex projection designs and generative frameworks, e.g., DescReg outperforms +best reported ZSD method on DIOR by 4.5 mAP on unseen classes and 8.1 in HM. We further show the generalizability of DescReg by integrating it into generative ZSD methods as well as varying the detection architecture. +Codes will be released at https://github.com/zq-zang/DescReg. 
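The DescReg abstract above infuses prior inter-class visual similarity into class-embedding learning through a similarity-aware triplet loss; one plausible reading is a triplet margin that shrinks for visually similar negative classes, sketched below under that assumption (the margin schedule and similarity source are illustrative, not the authors' exact design).

```python
import torch
import torch.nn.functional as F

def similarity_aware_triplet_loss(class_emb, anchor, pos, neg, desc_sim, base_margin=0.5):
    """Structured regularization on the class-embedding space.

    class_emb: (C, D) learned class embeddings; anchor/pos/neg: (B,) class-index tensors;
    desc_sim: (C, C) prior similarity in [0, 1] derived from visual descriptions.
    Visually similar negatives get a smaller required margin than dissimilar ones.
    """
    z = F.normalize(class_emb, dim=-1)
    d_pos = 1.0 - (z[anchor] * z[pos]).sum(-1)             # cosine distance to positive class
    d_neg = 1.0 - (z[anchor] * z[neg]).sum(-1)             # cosine distance to negative class
    margin = base_margin * (1.0 - desc_sim[anchor, neg])   # shrink margin for similar pairs
    return F.relu(d_pos - d_neg + margin).mean()
```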
\ No newline at end of file diff --git a/data/2024/aaai/Zero-Shot Task Adaptation with Relevant Feature Information b/data/2024/aaai/Zero-Shot Task Adaptation with Relevant Feature Information new file mode 100644 index 0000000000..03bf466e53 --- /dev/null +++ b/data/2024/aaai/Zero-Shot Task Adaptation with Relevant Feature Information @@ -0,0 +1 @@ +We propose a method to learn prediction models such as classifiers for unseen target tasks where labeled and unlabeled data are absent but a few relevant input features for solving the tasks are given. Although machine learning requires data for training, data are often difficult to collect in practice. On the other hand, for many applications, a few relevant features would be more easily obtained. Although zero-shot learning or zero-shot domain adaptation use external knowledge to adapt to unseen classes or tasks without data, relevant features have not been used in existing studies. The proposed method improves the generalization performance on the target tasks, where there are no data but a few relevant features are given, by meta-learning from labeled data in related tasks. In the meta-learning phase, it is essential to simulate test phases on target tasks where prediction model learning is required without data. To this end, our neural network-based prediction model is meta-learned such that it correctly responds to perturbations of the relevant features on randomly generated synthetic data. By this modeling, the prediction model can explicitly learn the discriminability of the relevant features without real target data. When unlabeled training data are available in the target tasks, the proposed method can incorporate such data to boost the performance in a unified framework. Our experiments demonstrate that the proposed method outperforms various existing methods with four real-world datasets. \ No newline at end of file diff --git a/data/2024/aaai/Zero-Sum Games between Mean-Field Teams: Reachability-Based Analysis under Mean-Field Sharing b/data/2024/aaai/Zero-Sum Games between Mean-Field Teams: Reachability-Based Analysis under Mean-Field Sharing new file mode 100644 index 0000000000..b623b9e77c --- /dev/null +++ b/data/2024/aaai/Zero-Sum Games between Mean-Field Teams: Reachability-Based Analysis under Mean-Field Sharing @@ -0,0 +1 @@ +This work studies the behaviors of two large-population teams competing in a discrete environment. The team-level interactions are modeled as a zero-sum game while the agent dynamics within each team is formulated as a collaborative mean-field team problem. Drawing inspiration from the mean-field literature, we first approximate the large-population team game with its infinite-population limit. Subsequently, we construct a fictitious centralized system and transform the infinite-population game to an equivalent zero-sum game between two coordinators. Via a novel reachability analysis, we study the optimality of coordination strategies, which induce decentralized strategies under the original information structure. The optimality of the resulting strategies is established in the original finite-population game, and the theoretical guarantees are verified by numerical examples. 
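To give a concrete flavor of the infinite-population (mean-field) approximation used in the abstract above, the toy Python sketch below propagates a team's mean field, i.e., the distribution of agent states, under a shared policy on a finite state space. It is a generic illustration under simplifying assumptions (finite states and actions, a known mean-field-dependent transition kernel), not the paper's reachability-based analysis; all names are hypothetical.

```python
import numpy as np

def mean_field_step(mean_field: np.ndarray,
                    policy: np.ndarray,
                    transition) -> np.ndarray:
    """One step of infinite-population team dynamics on a finite state space.

    mean_field: (S,) distribution over agent states within one team.
    policy:     (S, A) probability of each action given an agent's state.
    transition: callable (s, a, mean_field) -> (S,) next-state distribution,
                so each agent's dynamics may depend on the team's mean field.
    """
    num_states, num_actions = policy.shape
    next_mf = np.zeros(num_states)
    for s in range(num_states):
        for a in range(num_actions):
            next_mf += mean_field[s] * policy[s, a] * transition(s, a, mean_field)
    return next_mf

# Toy usage: two states, two actions, congestion-style dynamics.
if __name__ == "__main__":
    policy = np.array([[0.7, 0.3], [0.4, 0.6]])

    def transition(s, a, mf):
        stay = 0.5 + 0.4 * mf[s] * (a == 0)   # staying is likelier in crowded states
        out = np.full(2, 1.0 - stay)          # remaining mass moves to the other state
        out[s] = stay
        return out

    mf = np.array([0.5, 0.5])
    for _ in range(3):
        mf = mean_field_step(mf, policy, transition)
    print(mf)  # approximate team state distribution after three steps
```

In the finite-population game, a deterministic recursion of this form stands in for the random empirical state distribution of a large team, which is the kind of simplification that makes a coordinator-level analysis tractable.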
\ No newline at end of file diff --git a/data/2024/aaai/Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue b/data/2024/aaai/Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue new file mode 100644 index 0000000000..aea73b78d9 --- /dev/null +++ b/data/2024/aaai/Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue @@ -0,0 +1 @@ +Recent advances in Large Language Models (LLMs) have achieved remarkable breakthroughs in understanding and responding to user intents. However, their performance in specialized domains such as Chinese medicine still lags behind that in general use cases. Existing efforts to incorporate Chinese medicine into LLMs rely on Supervised Fine-Tuning (SFT) with single-turn and distilled dialogue data. These models lack the ability for doctor-like proactive inquiry and multi-turn comprehension and cannot align responses with experts' intentions. In this work, we introduce Zhongjing, the first Chinese medical LLaMA-based LLM that implements an entire training pipeline from continuous pre-training, SFT, to Reinforcement Learning from Human Feedback (RLHF). Additionally, we construct a Chinese multi-turn medical dialogue dataset of 70,000 authentic doctor-patient dialogues, CMtMedQA, which significantly enhances the model's capability for complex dialogue and proactive inquiry initiation. We also define a refined annotation rule and evaluation criteria given the unique characteristics of the biomedical domain. Extensive experimental results show that Zhongjing outperforms baselines in various capacities and matches the performance of ChatGPT in some abilities, despite having 100x fewer parameters. Ablation studies also demonstrate the contributions of each component: pre-training enhances medical knowledge, and RLHF further improves instruction-following ability and safety. Our code, datasets, and models are available at https://github.com/SupritYoung/Zhongjing. \ No newline at end of file diff --git a/data/2024/aaai/eTag: Class-Incremental Learning via Embedding Distillation and Task-Oriented Generation b/data/2024/aaai/eTag: Class-Incremental Learning via Embedding Distillation and Task-Oriented Generation new file mode 100644 index 0000000000..687211eeeb --- /dev/null +++ b/data/2024/aaai/eTag: Class-Incremental Learning via Embedding Distillation and Task-Oriented Generation @@ -0,0 +1 @@ +Class incremental learning (CIL) aims to solve the notorious forgetting problem, which refers to the fact that once the network is updated on a new task, its performance on previously-learned tasks degenerates catastrophically. Most successful CIL methods store exemplars (samples of learned tasks) to train a feature extractor incrementally, or store prototypes (features of learned tasks) to estimate the incremental feature distribution. However, storing exemplars raises data privacy concerns, while fixed prototypes may not remain consistent with the incremental feature distribution, hindering the exploration of real-world CIL applications. In this paper, we propose a data-free CIL method with embedding distillation and Task-oriented generation (eTag), which requires neither exemplar nor prototype. Embedding distillation prevents the feature extractor from forgetting by distilling the outputs from the networks' intermediate blocks.
Task-oriented generation enables a lightweight generator to produce dynamic features, fitting the needs of the top incremental classifier. Experimental results confirm that the proposed eTag considerably outperforms state-of-the-art methods on several benchmark datasets. \ No newline at end of file diff --git a/data/2024/aaai/iDet3D: Towards Efficient Interactive Object Detection for LiDAR Point Clouds b/data/2024/aaai/iDet3D: Towards Efficient Interactive Object Detection for LiDAR Point Clouds new file mode 100644 index 0000000000..5afa2ec12a --- /dev/null +++ b/data/2024/aaai/iDet3D: Towards Efficient Interactive Object Detection for LiDAR Point Clouds @@ -0,0 +1 @@ +Accurately annotating multiple 3D objects in LiDAR scenes is laborious and challenging. While a few previous studies have attempted to leverage semi-automatic methods for cost-effective bounding box annotation, such methods have limitations in efficiently handling numerous multi-class objects. To effectively accelerate 3D annotation pipelines, we propose iDet3D, an efficient interactive 3D object detector. Supporting a user-friendly 2D interface, which can ease the cognitive burden of exploring 3D space to provide click interactions, iDet3D enables users to annotate all the objects in each scene with minimal interactions. Taking the sparse nature of 3D point clouds into account, we design a negative click simulation (NCS) to improve accuracy by reducing false-positive predictions. In addition, iDet3D incorporates two click propagation techniques to take full advantage of user interactions: (1) dense click guidance (DCG) for keeping user-provided information throughout the network and (2) spatial click propagation (SCP) for detecting other instances of the same class based on the user-specified objects. Through extensive experiments, we show that our method can construct precise annotations in a few clicks, demonstrating its practicality as an efficient annotation tool for 3D object detection. \ No newline at end of file diff --git a/data/2024/aaai/iTrendRNN: An Interpretable Trend-Aware RNN for Meteorological Spatiotemporal Prediction b/data/2024/aaai/iTrendRNN: An Interpretable Trend-Aware RNN for Meteorological Spatiotemporal Prediction new file mode 100644 index 0000000000..844c7fc1ed --- /dev/null +++ b/data/2024/aaai/iTrendRNN: An Interpretable Trend-Aware RNN for Meteorological Spatiotemporal Prediction @@ -0,0 +1 @@ +Accurate prediction of meteorological elements, such as temperature and relative humidity, is important to human livelihood, early warning of extreme weather, and urban governance. Recently, neural network-based methods have shown impressive performance in this field. However, most of them are overcomplicated and impenetrable. In this paper, we propose a straightforward and interpretable differential framework, where the key lies in explicitly estimating the evolutionary trends. Specifically, three types of trends are exploited. (1) The proximity trend simply uses the most recent changes. It works well for approximately linear evolution. (2) The sequential trend explores the global information, aiming to capture the nonlinear dynamics. Here, we develop an attention-based trend unit to help memorize long-term features. (3) The flow trend is motivated by the nature of evolution, i.e., heat or substance flows from one region to another. Here, we design a flow-aware attention unit. It reflects the interactions by performing spatial attention over flow maps.
Finally, we develop a trend fusion module to adaptively fuse the above three trends. Extensive experiments on two datasets demonstrate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/aaai/icsPLMs: Exploring Pre-trained Language Models in Intelligent Customer Service (Student Abstract) b/data/2024/aaai/icsPLMs: Exploring Pre-trained Language Models in Intelligent Customer Service (Student Abstract) new file mode 100644 index 0000000000..5171735e5a --- /dev/null +++ b/data/2024/aaai/icsPLMs: Exploring Pre-trained Language Models in Intelligent Customer Service (Student Abstract) @@ -0,0 +1 @@ +Pre-trained language models have shown high performance on text processing in intelligent customer service platforms. However, these models do not leverage domain-specific information. In this paper, we propose icsPLMs, which are optimized for intelligent customer service at both the word and sentence levels. Our experimental results show that using targeted strategies can further improve the performance of pre-trained language models in this field. \ No newline at end of file diff --git a/data/2024/aaai/msLPCC: A Multimodal-Driven Scalable Framework for Deep LiDAR Point Cloud Compression b/data/2024/aaai/msLPCC: A Multimodal-Driven Scalable Framework for Deep LiDAR Point Cloud Compression new file mode 100644 index 0000000000..09c453e9b9 --- /dev/null +++ b/data/2024/aaai/msLPCC: A Multimodal-Driven Scalable Framework for Deep LiDAR Point Cloud Compression @@ -0,0 +1 @@ +LiDAR sensors are widely used in autonomous driving, and the growing storage and transmission demands have made LiDAR point cloud compression (LPCC) a hot research topic. To address the challenges posed by the large scale and uneven distribution (spatial and categorical) of LiDAR point data, this paper presents a new multimodal-driven scalable LPCC framework. For the large-scale challenge, we decouple the original LiDAR data into multi-layer point subsets, and compress and transmit each layer separately, so as to meet the reconstruction quality requirements of different scenarios. For the uneven-distribution challenge, we extract, align, and fuse heterologous feature representations, including point modality with position information, depth modality with spatial distance information, and segmentation modality with category information. Extensive experimental results on the benchmark SemanticKITTI database validate that our method outperforms 14 recent representative LPCC methods. \ No newline at end of file diff --git a/data/2024/aaai/p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models b/data/2024/aaai/p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models new file mode 100644 index 0000000000..8685cafab9 --- /dev/null +++ b/data/2024/aaai/p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models @@ -0,0 +1,4 @@ +Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks. In light of the rapidly increasing size of pre-trained VLMs, parameter-efficient transfer learning (PETL) has garnered attention as a viable alternative to full fine-tuning. One such approach is the adapter, which introduces a few trainable parameters into the pre-trained models while preserving the original parameters during adaptation.
+In this paper, we present a novel modeling framework that recasts adapter tuning after attention as a graph message passing process on attention graphs, where the projected query and value features and attention matrix constitute the node features and the graph adjacency matrix, respectively. Within this framework, tuning adapters in VLMs necessitates handling heterophilic graphs, owing to the disparity between the projected query and value spaces. +To address this challenge, we propose a new adapter architecture, p-adapter, which employs p-Laplacian message passing in Graph Neural Networks (GNNs). Specifically, the attention weights are re-normalized based on the features, and the features are then aggregated using the calibrated attention matrix, enabling the dynamic exploitation of information with varying frequencies in the heterophilic attention graphs. +We conduct extensive experiments on different pre-trained VLMs and multi-modal tasks, including visual question answering, visual entailment, and image captioning. The experimental results validate our method's significant superiority over other PETL methods. Our code is available at https://github.com/wuhy68/p-Adapter/. \ No newline at end of file diff --git a/data/2024/aaai/patchDPCC: A Patchwise Deep Compression Framework for Dynamic Point Clouds b/data/2024/aaai/patchDPCC: A Patchwise Deep Compression Framework for Dynamic Point Clouds new file mode 100644 index 0000000000..46ec6e3baf --- /dev/null +++ b/data/2024/aaai/patchDPCC: A Patchwise Deep Compression Framework for Dynamic Point Clouds @@ -0,0 +1 @@ +When compressing point clouds, point-based deep learning models operate on points in a continuous space, which offers a chance to minimize the geometric fidelity loss introduced by voxelization in preprocessing. However, these methods hardly scale to inputs with arbitrary numbers of points. Furthermore, the point cloud frames are individually compressed, ignoring the conventional wisdom of leveraging inter-frame similarity. In this work, we propose a patchwise compression framework called patchDPCC, which consists of a patch group generation module and a point-based compression model. Algorithms are developed to generate patches from different frames representing the same object, and more importantly, these patches are regulated to have the same number of points. We also incorporate a feature transfer module in the compression model, which refines the feature quality by exploiting the inter-frame similarity. Our model generates point-wise features for entropy coding, which guarantees the reconstruction speed. The evaluation on the MPEG 8i dataset shows that our method improves the compression ratio by 47.01% and 85.22% when compared to PCGCv2 and V-PCC with the same reconstruction quality, which is 9% and 16% better than what D-DPCC achieves. Our method also achieves the fastest decoding speed among the learning-based compression models. \ No newline at end of file diff --git a/data/2024/aaai/s-ID: Causal Effect Identification in a Sub-population b/data/2024/aaai/s-ID: Causal Effect Identification in a Sub-population new file mode 100644 index 0000000000..a8f1f85c3c --- /dev/null +++ b/data/2024/aaai/s-ID: Causal Effect Identification in a Sub-population @@ -0,0 +1 @@ +Causal inference in a sub-population involves identifying the causal effect of an intervention on a specific subgroup, which is distinguished from the whole population through the influence of systematic biases in the sampling process.
However, ignoring the subtleties introduced by sub-populations can either lead to erroneous inference or limit the applicability of existing methods. We introduce and advocate for a causal inference problem in sub-populations (henceforth called s-ID), in which we merely have access to observational data of the targeted sub-population (as opposed to the entire population). Existing inference problems in sub-populations operate on the premise that the given data distributions originate from the entire population, and thus cannot tackle the s-ID problem. To address this gap, we provide necessary and sufficient conditions that must hold in the causal graph for a causal effect in a sub-population to be identifiable from the observational distribution of that sub-population. Given these conditions, we present a sound and complete algorithm for the s-ID problem. \ No newline at end of file diff --git a/data/2024/aaai/z-SignFedAvg: A Unified Stochastic Sign-Based Compression for Federated Learning b/data/2024/aaai/z-SignFedAvg: A Unified Stochastic Sign-Based Compression for Federated Learning new file mode 100644 index 0000000000..e99b25d049 --- /dev/null +++ b/data/2024/aaai/z-SignFedAvg: A Unified Stochastic Sign-Based Compression for Federated Learning @@ -0,0 +1 @@ +Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm but suffers from high communication cost when training large-scale machine learning models. Sign-based methods, such as SignSGD, have been proposed as a biased gradient compression technique for reducing the communication cost. However, sign-based algorithms could diverge under heterogeneous data, which thus motivated the development of advanced techniques, such as the error-feedback method and stochastic sign-based compression, to fix this issue. Nevertheless, these methods still suffer from slower convergence rates, and none of them allows multiple local SGD updates like FedAvg. In this paper, we propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression, which not only allows one to flexibly control the bias-variance tradeoff for the compressed gradient, but also provides a unified viewpoint to existing stochastic sign-based methods. More importantly, the proposed scheme enables the development of the very first sign-based FedAvg algorithm (z-SignFedAvg) to accelerate the convergence. Theoretically, we show that z-SignFedAvg achieves a faster convergence rate than existing sign-based methods and, under uniformly distributed noise, can enjoy the same convergence rate as its uncompressed counterpart. Extensive experiments are conducted to demonstrate that z-SignFedAvg can achieve competitive empirical performance on real datasets and outperforms existing schemes. \ No newline at end of file diff --git "a/data/2024/aaai/\317\200-Light: Programmatic Interpretable Reinforcement Learning for Resource-Limited Traffic Signal Control" "b/data/2024/aaai/\317\200-Light: Programmatic Interpretable Reinforcement Learning for Resource-Limited Traffic Signal Control" new file mode 100644 index 0000000000..210048f301 --- /dev/null +++ "b/data/2024/aaai/\317\200-Light: Programmatic Interpretable Reinforcement Learning for Resource-Limited Traffic Signal Control" @@ -0,0 +1 @@ +The recent advancements in Deep Reinforcement Learning (DRL) have significantly enhanced the performance of adaptive Traffic Signal Control (TSC).
However, DRL policies are typically represented by neural networks, which are over-parameterized black-box models. As a result, the learned policies often lack interpretability and cannot be deployed directly on real-world edge hardware due to resource constraints. In addition, DRL methods often exhibit limited generalization performance, struggling to generalize the learned policy to other geographical regions. These factors limit the practical application of learning-based approaches. To address these issues, we suggest the use of an inherently interpretable program for representing the control policy. We present a new approach, Programmatic Interpretable reinforcement learning for traffic signal control (π-Light), designed to autonomously discover non-differentiable programs. Specifically, we define a Domain Specific Language (DSL) and transformation rules for constructing programs, and utilize Monte Carlo Tree Search (MCTS) to find the optimal program in a discrete space. Extensive experiments demonstrate that our method consistently outperforms baseline approaches. Moreover, π-Light exhibits superior generalization capabilities compared to DRL, enabling training and evaluation across intersections from different cities. Finally, we analyze how the learned program policies can be directly deployed on edge devices with extremely limited resources. \ No newline at end of file diff --git a/data/2024/iclr/3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining b/data/2024/iclr/3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining new file mode 100644 index 0000000000..dbf7e4e6b3 --- /dev/null +++ b/data/2024/iclr/3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining @@ -0,0 +1 @@ +Masked autoencoders (MAE) have recently been introduced to 3D self-supervised pretraining for point clouds due to their great success in NLP and computer vision. Unlike MAEs used in the image domain, where the pretext task is to restore features at the masked pixels, such as colors, the existing 3D MAE works reconstruct the missing geometry only, i.e., the location of the masked points. In contrast to previous studies, we advocate that point location recovery is inessential and restoring intrinsic point features is much superior. To this end, we propose to ignore point position reconstruction and recover high-order features at masked points including surface normals and surface variations, through a novel attention-based decoder which is independent of the encoder design. We validate the effectiveness of our pretext task and decoder design using different encoder structures for 3D training and demonstrate the advantages of our pretrained networks on various point cloud analysis tasks. \ No newline at end of file diff --git a/data/2024/iclr/3D-Aware Hypothesis & Verification for Generalizable Relative Object Pose Estimation b/data/2024/iclr/3D-Aware Hypothesis & Verification for Generalizable Relative Object Pose Estimation new file mode 100644 index 0000000000..b3627d5a25 --- /dev/null +++ b/data/2024/iclr/3D-Aware Hypothesis & Verification for Generalizable Relative Object Pose Estimation @@ -0,0 +1 @@ +Prior methods that tackle the problem of generalizable object pose estimation rely heavily on having dense views of the unseen object. By contrast, we address the scenario where only a single reference view of the object is available.
Our goal then is to estimate the relative object pose between this reference view and a query image that depicts the object in a different pose. In this scenario, robust generalization is imperative due to the presence of unseen objects during testing and the large-scale object pose variation between the reference and the query. To this end, we present a new hypothesis-and-verification framework, in which we generate and evaluate multiple pose hypotheses, ultimately selecting the most reliable one as the relative object pose. To measure reliability, we introduce a 3D-aware verification that explicitly applies 3D transformations to the 3D object representations learned from the two input images. Our comprehensive experiments on the Objaverse, LINEMOD, and CO3D datasets evidence the superior accuracy of our approach in relative pose estimation and its robustness in large-scale pose variations, when dealing with unseen objects. \ No newline at end of file diff --git a/data/2024/iclr/A 2-Dimensional State Space Layer for Spatial Inductive Bias b/data/2024/iclr/A 2-Dimensional State Space Layer for Spatial Inductive Bias new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Benchmark Study on Calibration b/data/2024/iclr/A Benchmark Study on Calibration new file mode 100644 index 0000000000..c65e46c3ca --- /dev/null +++ b/data/2024/iclr/A Benchmark Study on Calibration @@ -0,0 +1 @@ +Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through the use of specific loss functions, data preprocessing and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different datasets? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS. 
The project page can be found at https://www.taolinwei.com/calibration-study \ No newline at end of file diff --git a/data/2024/iclr/A Benchmark for Learning to Translate a New Language from One Grammar Book b/data/2024/iclr/A Benchmark for Learning to Translate a New Language from One Grammar Book new file mode 100644 index 0000000000..5d0f47e1ef --- /dev/null +++ b/data/2024/iclr/A Benchmark for Learning to Translate a New Language from One Grammar Book @@ -0,0 +1 @@ +Large language models (LLMs) can perform impressive feats with in-context learning or lightweight finetuning. It is natural to wonder how well these models adapt to genuinely new tasks, but how does one find tasks that are unseen in internet-scale training sets? We turn to a field that is explicitly motivated and bottlenecked by a scarcity of web data: low-resource languages. In this paper, we introduce MTOB (Machine Translation from One Book), a benchmark for learning to translate between English and Kalamang -- a language with less than 200 speakers and therefore virtually no presence on the web -- using several hundred pages of field linguistics reference materials. This task framing is novel in that it asks a model to learn a language from a single human-readable book of grammar explanations, rather than a large mined corpus of in-domain data, more akin to L2 learning than L1 acquisition. We demonstrate that baselines using current LLMs are promising but fall short of human performance, achieving 44.7 chrF on Kalamang to English translation and 45.8 chrF on English to Kalamang translation, compared to 51.6 and 57.0 chrF by a human who learned Kalamang from the same reference materials. We hope that MTOB will help measure LLM capabilities along a new dimension, and that the methods developed to solve it could help expand access to language technology for underserved communities by leveraging qualitatively different kinds of data than traditional machine translation. \ No newline at end of file diff --git a/data/2024/iclr/A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning b/data/2024/iclr/A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning new file mode 100644 index 0000000000..4d2d6ef34a --- /dev/null +++ b/data/2024/iclr/A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning @@ -0,0 +1 @@ +We investigate learning the equilibria in non-stationary multi-agent systems and address the challenges that differentiate multi-agent learning from single-agent learning. Specifically, we focus on games with bandit feedback, where testing an equilibrium can result in substantial regret even when the gap to be tested is small, and the existence of multiple optimal solutions (equilibria) in stationary games poses extra challenges. To overcome these obstacles, we propose a versatile black-box approach applicable to a broad spectrum of problems, such as general-sum games, potential games, and Markov games, when equipped with appropriate learning and testing oracles for stationary environments. Our algorithms can achieve $\widetilde{O}\left(\Delta^{1/4}T^{3/4}\right)$ regret when the degree of nonstationarity, as measured by total variation $\Delta$, is known, and $\widetilde{O}\left(\Delta^{1/5}T^{4/5}\right)$ regret when $\Delta$ is unknown, where $T$ is the number of rounds. Meanwhile, our algorithm inherits the favorable dependence on number of agents from the oracles. 
As a side contribution that may be independent of interest, we show how to test for various types of equilibria by a black-box reduction to single-agent learning, which includes Nash equilibria, correlated equilibria, and coarse correlated equilibria. \ No newline at end of file diff --git a/data/2024/iclr/A Branching Decoder for Set Generation b/data/2024/iclr/A Branching Decoder for Set Generation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Characterization Theorem for Equivariant Networks with Point-wise Activations b/data/2024/iclr/A Characterization Theorem for Equivariant Networks with Point-wise Activations new file mode 100644 index 0000000000..264b6dc5db --- /dev/null +++ b/data/2024/iclr/A Characterization Theorem for Equivariant Networks with Point-wise Activations @@ -0,0 +1 @@ +Equivariant neural networks have shown improved performance, expressiveness and sample complexity on symmetrical domains. But for some specific symmetries, representations, and choice of coordinates, the most common point-wise activations, such as ReLU, are not equivariant, hence they cannot be employed in the design of equivariant neural networks. The theorem we present in this paper describes all possible combinations of finite-dimensional representations, choice of coordinates and point-wise activations to obtain an exactly equivariant layer, generalizing and strengthening existing characterizations. Notable cases of practical relevance are discussed as corollaries. Indeed, we prove that rotation-equivariant networks can only be invariant, as it happens for any network which is equivariant with respect to connected compact groups. Then, we discuss implications of our findings when applied to important instances of exactly equivariant networks. First, we completely characterize permutation equivariant networks such as Invariant Graph Networks with point-wise nonlinearities and their geometric counterparts, highlighting a plethora of models whose expressive power and performance are still unknown. Second, we show that feature spaces of disentangled steerable convolutional neural networks are trivial representations. \ No newline at end of file diff --git a/data/2024/iclr/A Cognitive Model for Learning Abstract Relational Structures from Memory-based Decision-Making Tasks b/data/2024/iclr/A Cognitive Model for Learning Abstract Relational Structures from Memory-based Decision-Making Tasks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Data-Driven Measure of Relative Uncertainty for Misclassification Detection b/data/2024/iclr/A Data-Driven Measure of Relative Uncertainty for Misclassification Detection new file mode 100644 index 0000000000..d083a21d03 --- /dev/null +++ b/data/2024/iclr/A Data-Driven Measure of Relative Uncertainty for Misclassification Detection @@ -0,0 +1 @@ +Misclassification detection is an important problem in machine learning, as it allows for the identification of instances where the model's predictions are unreliable. However, conventional uncertainty measures such as Shannon entropy do not provide an effective way to infer the real uncertainty associated with the model's predictions. In this paper, we introduce a novel data-driven measure of uncertainty relative to an observer for misclassification detection. By learning patterns in the distribution of soft-predictions, our uncertainty measure can identify misclassified samples based on the predicted class probabilities. 
Interestingly, according to the proposed measure, soft-predictions corresponding to misclassified instances can carry a large amount of uncertainty, even though they may have low Shannon entropy. We demonstrate empirical improvements over multiple image classification tasks, outperforming state-of-the-art misclassification detection methods. \ No newline at end of file diff --git a/data/2024/iclr/A Differentially Private Clustering Algorithm for Well-Clustered Graphs b/data/2024/iclr/A Differentially Private Clustering Algorithm for Well-Clustered Graphs new file mode 100644 index 0000000000..43461c20d4 --- /dev/null +++ b/data/2024/iclr/A Differentially Private Clustering Algorithm for Well-Clustered Graphs @@ -0,0 +1 @@ +We study differentially private (DP) algorithms for recovering clusters in well-clustered graphs, which are graphs whose vertex set can be partitioned into a small number of sets, each inducing a subgraph of high inner conductance and small outer conductance. Such graphs have widespread application as a benchmark in the theoretical analysis of spectral clustering. We provide an efficient ($\epsilon$,$\delta$)-DP algorithm tailored specifically for such graphs. Our algorithm draws inspiration from the recent work of Chen et al., who developed DP algorithms for recovery of stochastic block models in cases where the graph comprises exactly two nearly-balanced clusters. Our algorithm works for well-clustered graphs with $k$ nearly-balanced clusters, and the misclassification ratio almost matches the one of the best-known non-private algorithms. We conduct experimental evaluations on datasets with known ground truth clusters to substantiate the prowess of our algorithm. We also show that any (pure) $\epsilon$-DP algorithm would result in substantial error. \ No newline at end of file diff --git a/data/2024/iclr/A Discretization Framework for Robust Contextual Stochastic Optimization b/data/2024/iclr/A Discretization Framework for Robust Contextual Stochastic Optimization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Dynamical View of the Question of Why b/data/2024/iclr/A Dynamical View of the Question of Why new file mode 100644 index 0000000000..71d3a6c308 --- /dev/null +++ b/data/2024/iclr/A Dynamical View of the Question of Why @@ -0,0 +1 @@ +We address causal reasoning in multivariate time series data generated by stochastic processes. Existing approaches are largely restricted to static settings, ignoring the continuity and emission of variations across time. In contrast, we propose a learning paradigm that directly establishes causation between events in the course of time. We present two key lemmas to compute causal contributions and frame them as reinforcement learning problems. Our approach offers formal and computational tools for uncovering and quantifying causal relationships in diffusion processes, subsuming various important settings such as discrete-time Markov decision processes. Finally, in fairly intricate experiments and through sheer learning, our framework reveals and quantifies causal links, which otherwise seem inexplicable. 
\ No newline at end of file diff --git a/data/2024/iclr/A Fast and Provable Algorithm for Sparse Phase Retrieval b/data/2024/iclr/A Fast and Provable Algorithm for Sparse Phase Retrieval new file mode 100644 index 0000000000..3cfd710a27 --- /dev/null +++ b/data/2024/iclr/A Fast and Provable Algorithm for Sparse Phase Retrieval @@ -0,0 +1 @@ +We study the sparse phase retrieval problem, which seeks to recover a sparse signal from a limited set of magnitude-only measurements. In contrast to prevalent sparse phase retrieval algorithms that primarily use first-order methods, we propose an innovative second-order algorithm that employs a Newton-type method with hard thresholding. This algorithm overcomes the linear convergence limitations of first-order methods while preserving their hallmark per-iteration computational efficiency. We provide theoretical guarantees that our algorithm converges to the $s$-sparse ground truth signal $\mathbf{x}^{\natural} \in \mathbb{R}^n$ (up to a global sign) at a quadratic convergence rate after at most $O(\log (\Vert\mathbf{x}^{\natural} \Vert /x_{\min}^{\natural}))$ iterations, using $\Omega(s^2\log n)$ Gaussian random samples. Numerical experiments show that our algorithm achieves a significantly faster convergence rate than state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/iclr/A Flexible Generative Model for Heterogeneous Tabular EHR with Missing Modality b/data/2024/iclr/A Flexible Generative Model for Heterogeneous Tabular EHR with Missing Modality new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Foundation Model for Error Correction Codes b/data/2024/iclr/A Foundation Model for Error Correction Codes new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Framework for Inference Inspired by Human Memory Mechanisms b/data/2024/iclr/A Framework for Inference Inspired by Human Memory Mechanisms new file mode 100644 index 0000000000..d2a44422d9 --- /dev/null +++ b/data/2024/iclr/A Framework for Inference Inspired by Human Memory Mechanisms @@ -0,0 +1 @@ +How humans and machines make sense of current inputs for relation reasoning and question-answering, while putting the perceived information into the context of our past memories, has been a challenging conundrum in cognitive science and artificial intelligence. Inspired by the human brain's memory system and cognitive architectures, we propose a PMI framework that consists of perception, memory and inference components. Notably, the memory module comprises working and long-term memory, with the latter endowed with a higher-order structure to retain extensive and complex relational knowledge and experience. Through a differentiable competitive write access, current perceptions update working memory, which is later merged with long-term memory via outer product associations, reducing information conflicts and averting memory overflow. In the inference module, relevant information is retrieved from two separate memory origins and associatively integrated to attain a more comprehensive and precise interpretation of current perceptions. We apply our PMI framework to improve prevailing Transformer and CNN models on question-answering tasks like the bAbI-20k and Sort-of-CLEVR datasets, as well as on detecting equilateral triangles, language modeling and image classification tasks, and in each case, our PMI enhancements consistently and significantly outperform their original counterparts.
Visualization analyses reveal that relational memory consolidation, along with the interaction and integration of information from diverse memory sources, contributes substantially to the model's effectiveness on inference tasks. \ No newline at end of file diff --git a/data/2024/iclr/A General Framework for User-Guided Bayesian Optimization b/data/2024/iclr/A General Framework for User-Guided Bayesian Optimization new file mode 100644 index 0000000000..0b8ab8b8d4 --- /dev/null +++ b/data/2024/iclr/A General Framework for User-Guided Bayesian Optimization @@ -0,0 +1 @@ +The optimization of expensive-to-evaluate black-box functions is prevalent in various scientific disciplines. Bayesian optimization is an automatic, general and sample-efficient method to solve these problems with minimal knowledge of the underlying function dynamics. However, the ability of Bayesian optimization to incorporate prior knowledge or beliefs about the function at hand in order to accelerate the optimization is limited, which reduces its appeal for knowledgeable practitioners with tight budgets. To allow domain experts to customize the optimization routine, we propose ColaBO, the first Bayesian-principled framework for incorporating prior beliefs beyond the typical kernel structure, such as the likely location of the optimizer or the optimal value. The generality of ColaBO makes it applicable across different Monte Carlo acquisition functions and types of user beliefs. We empirically demonstrate ColaBO's ability to substantially accelerate optimization when the prior information is accurate, and to retain approximately default performance when it is misleading. \ No newline at end of file diff --git a/data/2024/iclr/A Good Learner can Teach Better: Teacher-Student Collaborative Knowledge Distillation b/data/2024/iclr/A Good Learner can Teach Better: Teacher-Student Collaborative Knowledge Distillation new file mode 100644 index 0000000000..59d4dd12bd --- /dev/null +++ b/data/2024/iclr/A Good Learner can Teach Better: Teacher-Student Collaborative Knowledge Distillation @@ -0,0 +1 @@ +Knowledge distillation (KD) is a technique used to transfer knowledge from a larger “teacher” model into a smaller “student” model. Recent advancements in meta-learning-based knowledge distillation (MetaKD) emphasize that the fine-tuning of teacher models should be aware of the student’s need to achieve better knowledge distillation. However, existing MetaKD methods often lack incentives for the teacher model to improve itself. In this study, we introduce MPDistil, a meta-policy distillation technique that utilizes novel optimization strategies to foster both collaboration and competition during the fine-tuning of the teacher model in the meta-learning step. Additionally, we propose a curriculum learning framework for the student model in a competitive setup, in which the student model aims to outperform the teacher model by self-training on various tasks. Exhaustive experiments on the SuperGLUE and GLUE benchmarks demonstrate the efficacy of MPDistil compared to 20 conventional KD and advanced MetaKD baselines, showing significant performance enhancements in the student model – e.g., a distilled 6-layer BERT model outperforms a 12-layer BERT model on five out of six SuperGLUE tasks. Furthermore, MPDistil, when applied to a large language teacher model (DeBERTa-v2-xxlarge), significantly narrows the performance gap of its smaller student counterpart (DeBERTa-12) by just 4.6% on SuperGLUE.
We further demonstrate how higher rewards and customized training curricula strengthen the student model and enhance generalizability. \ No newline at end of file diff --git a/data/2024/iclr/A Graph is Worth 1-bit Spikes: When Graph Contrastive Learning Meets Spiking Neural Networks b/data/2024/iclr/A Graph is Worth 1-bit Spikes: When Graph Contrastive Learning Meets Spiking Neural Networks new file mode 100644 index 0000000000..3104e90151 --- /dev/null +++ b/data/2024/iclr/A Graph is Worth 1-bit Spikes: When Graph Contrastive Learning Meets Spiking Neural Networks @@ -0,0 +1 @@ +While contrastive self-supervised learning has become the de-facto learning paradigm for graph neural networks, the pursuit of higher task accuracy requires a larger hidden dimensionality to learn informative and discriminative full-precision representations, raising concerns about computation, memory footprint, and energy consumption burden (largely overlooked) for real-world applications. This work explores a promising direction for graph contrastive learning (GCL) with spiking neural networks (SNNs), which leverage sparse and binary characteristics to learn more biologically plausible and compact representations. We propose SpikeGCL, a novel GCL framework to learn binarized 1-bit representations for graphs, making balanced trade-offs between efficiency and performance. We provide theoretical guarantees to demonstrate that SpikeGCL has comparable expressiveness with its full-precision counterparts. Experimental results demonstrate that, with nearly 32x representation storage compression, SpikeGCL is either comparable to or outperforms many fancy state-of-the-art supervised and self-supervised methods across several graph benchmarks. \ No newline at end of file diff --git a/data/2024/iclr/A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation b/data/2024/iclr/A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation new file mode 100644 index 0000000000..cad1f745f0 --- /dev/null +++ b/data/2024/iclr/A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation @@ -0,0 +1 @@ +Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity. Recent research has focused on developing efficient fine-tuning methods, such as prompt learning and adapter, to enhance CLIP's performance in downstream tasks. However, these methods still require additional training time and computational resources, which is undesirable for devices with limited resources. In this paper, we revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP. Typically, GDA assumes that features of each class follow Gaussian distributions with identical covariance. By leveraging Bayes' formula, the classifier can be expressed in terms of the class means and covariance, which can be estimated from the data without the need for training. To integrate knowledge from both visual and textual modalities, we ensemble it with the original zero-shot classifier within CLIP. Extensive results on 17 datasets validate that our method surpasses or achieves comparable results with state-of-the-art methods on few-shot classification, imbalanced learning, and out-of-distribution generalization. In addition, we extend our method to base-to-new generalization and unsupervised learning, once again demonstrating its superiority over competing approaches. Our code is publicly available at \url{https://github.com/mrflogs/ICLR24}. 
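The training-free recipe in the abstract above is simple enough to sketch in a few lines. The NumPy snippet below is a hedged illustration, not the authors' released code: it estimates per-class means and a shared, shrinkage-regularized covariance from few-shot CLIP image features, converts them into a linear GDA head via Bayes' formula, and adds its logits to the usual zero-shot text logits. The shrinkage constant, ensemble weight, temperature, and function names are assumptions made for illustration.

```python
import numpy as np

def gda_head(feats: np.ndarray, labels: np.ndarray, num_classes: int,
             shrinkage: float = 0.1):
    """Training-free Gaussian Discriminant Analysis head (illustrative sketch).

    feats:  (N, D) image features from the frozen vision encoder (assumed
            L2-normalized, with at least one example per class).
    labels: (N,) integer class labels in [0, num_classes).
    Returns W (C, D) and b (C,) such that logits = feats @ W.T + b.
    """
    d = feats.shape[1]
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = feats - means[labels]                    # remove class means
    cov = centered.T @ centered / feats.shape[0]        # shared covariance estimate
    cov += shrinkage * np.eye(d)                        # shrinkage regularization
    precision = np.linalg.inv(cov)
    priors = np.bincount(labels, minlength=num_classes) / labels.shape[0]
    W = means @ precision                               # (C, D)
    b = np.log(priors + 1e-12) - 0.5 * np.einsum("cd,cd->c", W, means)
    return W, b

def ensembled_logits(test_feats, text_feats, W, b, alpha=1.0, temperature=100.0):
    """Sum CLIP's zero-shot logits with the GDA logits (weighting is a guess)."""
    zero_shot = temperature * test_feats @ text_feats.T  # cosine-style logits
    gda = test_feats @ W.T + b
    return zero_shot + alpha * gda
```

Because both heads are closed-form, adapting to a new few-shot task reduces to a handful of matrix operations, which is the main appeal of this style of baseline.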
\ No newline at end of file diff --git a/data/2024/iclr/A Hierarchical Bayesian Model for Few-Shot Meta Learning b/data/2024/iclr/A Hierarchical Bayesian Model for Few-Shot Meta Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Lie Group Approach to Riemannian Batch Normalization b/data/2024/iclr/A Lie Group Approach to Riemannian Batch Normalization new file mode 100644 index 0000000000..700085e11a --- /dev/null +++ b/data/2024/iclr/A Lie Group Approach to Riemannian Batch Normalization @@ -0,0 +1 @@ +Manifold-valued measurements exist in numerous applications within computer vision and machine learning. Recent studies have extended Deep Neural Networks (DNNs) to manifolds, and concomitantly, normalization techniques have also been adapted to several manifolds, referred to as Riemannian normalization. Nonetheless, most of the existing Riemannian normalization methods have been derived in an ad hoc manner and only apply to specific manifolds. This paper establishes a unified framework for Riemannian Batch Normalization (RBN) techniques on Lie groups. Our framework offers the theoretical guarantee of controlling both the Riemannian mean and variance. Empirically, we focus on Symmetric Positive Definite (SPD) manifolds, which possess three distinct types of Lie group structures. Using the deformation concept, we generalize the existing Lie groups on SPD manifolds into three families of parameterized Lie groups. Specific normalization layers induced by these Lie groups are then proposed for SPD neural networks. We demonstrate the effectiveness of our approach through three sets of experiments: radar recognition, human action recognition, and electroencephalography (EEG) classification. The code is available at https://github.com/GitZH-Chen/LieBN.git. \ No newline at end of file diff --git a/data/2024/iclr/A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging b/data/2024/iclr/A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging new file mode 100644 index 0000000000..c904b8c184 --- /dev/null +++ b/data/2024/iclr/A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging @@ -0,0 +1 @@ +In federated learning (FL), clients usually have diverse participation statistics that are unknown a priori, which can significantly harm the performance of FL if not handled properly. Existing works aiming at addressing this problem are usually based on global variance reduction, which requires a substantial amount of additional memory in a multiplicative factor equal to the total number of clients. An important open problem is to find a lightweight method for FL in the presence of clients with unknown participation rates. In this paper, we address this problem by adapting the aggregation weights in federated averaging (FedAvg) based on the participation history of each client. We first show that, with heterogeneous participation statistics, FedAvg with non-optimal aggregation weights can diverge from the optimal solution of the original FL objective, indicating the need of finding optimal aggregation weights. However, it is difficult to compute the optimal weights when the participation statistics are unknown. To address this problem, we present a new algorithm called FedAU, which improves FedAvg by adaptively weighting the client updates based on online estimates of the optimal weights without knowing the statistics of client participation. 
We provide a theoretical convergence analysis of FedAU using a novel methodology to connect the estimation error and convergence. Our theoretical results reveal important and interesting insights, while showing that FedAU converges to an optimal solution of the original objective and has desirable properties such as linear speedup. Our experimental results also verify the advantage of FedAU over baseline methods with various participation patterns. \ No newline at end of file diff --git a/data/2024/iclr/A Linear Algebraic Framework for Counterfactual Generation b/data/2024/iclr/A Linear Algebraic Framework for Counterfactual Generation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Multi-Level Framework for Accelerating Training Transformer Models b/data/2024/iclr/A Multi-Level Framework for Accelerating Training Transformer Models new file mode 100644 index 0000000000..f36999a4ca --- /dev/null +++ b/data/2024/iclr/A Multi-Level Framework for Accelerating Training Transformer Models @@ -0,0 +1 @@ +The fast-growing capabilities of large-scale deep learning models, such as Bert, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing power, which incurs exponentially increasing energy cost and carbon dioxide emissions. It is thus critical to develop efficient training solutions to reduce the training costs. Motivated by a set of key observations of inter- and intra-layer similarities among feature maps and attentions that can be identified from typical training processes, we propose a multi-level framework for training acceleration. Specifically, the framework is based on three basic operators, Coalescing, De-coalescing and Interpolation, which can be orchestrated to build a multi-level training framework. The framework consists of a V-cycle training process, which progressively down- and up-scales the model size and projects the parameters between adjacent levels of models via coalescing and de-coalescing. The key idea is that a smaller model can be trained quickly to convergence, and its trained parameters then provide high-quality intermediate solutions for the larger network at the next level. The interpolation operator is designed to break the symmetry of neurons incurred by de-coalescing for better convergence performance. Our experiments on transformer-based language models (e.g. Bert, GPT) as well as a vision model (e.g. DeiT) show that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.
To accommodate the case where some labelled data are available at the clients, we extend our SimCLR variant to the federated semi-supervised setting. We see that a supervised SimCLR objective can be obtained with two changes: a) the contrastive loss is computed between datapoints that share the same label and b) we require an additional auxiliary head that predicts the correct labels from either of the two views. Along with the proposed SimCLR extensions, we also study how different sources of non-i.i.d.-ness can impact the performance of federated unsupervised learning through global mutual information maximization; we find that a global objective is beneficial for some sources of non-i.i.d.-ness but can be detrimental for others. We empirically evaluate our proposed extensions in various tasks to validate our claims and furthermore demonstrate that our proposed modifications generalize to other pretraining methods. \ No newline at end of file diff --git a/data/2024/iclr/A Neural Framework for Generalized Causal Sensitivity Analysis b/data/2024/iclr/A Neural Framework for Generalized Causal Sensitivity Analysis new file mode 100644 index 0000000000..bda680e10a --- /dev/null +++ b/data/2024/iclr/A Neural Framework for Generalized Causal Sensitivity Analysis @@ -0,0 +1 @@ +Unobserved confounding is common in many applications, making causal inference from observational data challenging. As a remedy, causal sensitivity analysis is an important tool to draw causal conclusions under unobserved confounding with mathematical guarantees. In this paper, we propose NeuralCSA, a neural framework for generalized causal sensitivity analysis. Unlike previous work, our framework is compatible with (i) a large class of sensitivity models, including the marginal sensitivity model, f-sensitivity models, and Rosenbaum's sensitivity model; (ii) different treatment types (i.e., binary and continuous); and (iii) different causal queries, including (conditional) average treatment effects and simultaneous effects on multiple outcomes. The generality of NeuralCSA is achieved by learning a latent distribution shift that corresponds to a treatment intervention using two conditional normalizing flows. We provide theoretical guarantees that NeuralCSA is able to infer valid bounds on the causal query of interest and also demonstrate this empirically using both simulated and real-world data. \ No newline at end of file diff --git a/data/2024/iclr/A Newborn Embodied Turing Test for Comparing Object Segmentation Across Animals and Machines b/data/2024/iclr/A Newborn Embodied Turing Test for Comparing Object Segmentation Across Animals and Machines new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models b/data/2024/iclr/A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models new file mode 100644 index 0000000000..b4e445417d --- /dev/null +++ b/data/2024/iclr/A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models @@ -0,0 +1 @@ +Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. However, these advances have not been reflected in the translation task, especially for models with moderate sizes (i.e., 7B or 13B parameters), which still lag behind conventional supervised encoder-decoder translation models.
Previous studies have attempted to improve the translation capabilities of these moderate LLMs, but their gains have been limited. In this study, we propose a novel fine-tuning approach for LLMs that is specifically designed for the translation task, eliminating the need for the abundant parallel data that traditional translation models usually depend on. Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data followed by subsequent fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed through this strategy as Advanced Language Model-based trAnslator (ALMA). Based on LLaMA-2 as our underlying model, our results show that the model can achieve an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance across 10 translation directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test datasets. The performance is significantly better than all prior work and even superior to the NLLB-54B model and GPT-3.5-text-davinci-003, with only 7B or 13B parameters. This method establishes the foundation for a novel training paradigm in machine translation. \ No newline at end of file diff --git a/data/2024/iclr/A Plug-and-Play Image Registration Network b/data/2024/iclr/A Plug-and-Play Image Registration Network new file mode 100644 index 0000000000..50975ef848 --- /dev/null +++ b/data/2024/iclr/A Plug-and-Play Image Registration Network @@ -0,0 +1 @@ +Deformable image registration (DIR) is an active research topic in biomedical imaging. There is a growing interest in developing DIR methods based on deep learning (DL). A traditional DL approach to DIR is based on training a convolutional neural network (CNN) to estimate the registration field between two input images. While conceptually simple, this approach comes with a limitation that it exclusively relies on a pre-trained CNN without explicitly enforcing fidelity between the registered image and the reference. We present plug-and-play image registration network (PIRATE) as a new DIR method that addresses this issue by integrating an explicit data-fidelity penalty and a CNN prior. PIRATE pre-trains a CNN denoiser on the registration field and"plugs"it into an iterative method as a regularizer. We additionally present PIRATE+ that fine-tunes the CNN prior in PIRATE using deep equilibrium models (DEQ). PIRATE+ interprets the fixed-point iteration of PIRATE as a network with effectively infinite layers and then trains the resulting network end-to-end, enabling it to learn more task-specific information and boosting its performance. Our numerical results on OASIS and CANDI datasets show that our methods achieve state-of-the-art performance on DIR. \ No newline at end of file diff --git "a/data/2024/iclr/A Poincar\303\251 Inequality and Consistency Results for Signal Sampling on Large Graphs" "b/data/2024/iclr/A Poincar\303\251 Inequality and Consistency Results for Signal Sampling on Large Graphs" new file mode 100644 index 0000000000..1ee6bf4d8e --- /dev/null +++ "b/data/2024/iclr/A Poincar\303\251 Inequality and Consistency Results for Signal Sampling on Large Graphs" @@ -0,0 +1 @@ +Large-scale graph machine learning is challenging as the complexity of learning models scales with the graph size. Subsampling the graph is a viable alternative, but sampling on graphs is nontrivial as graphs are non-Euclidean. 
Existing graph sampling techniques require not only computing the spectra of large matrices but also repeating these computations when the graph changes, e.g., grows. In this paper, we introduce a signal sampling theory for a type of graph limit -- the graphon. We prove a Poincar\'e inequality for graphon signals and show that complements of node subsets satisfying this inequality are unique sampling sets for Paley-Wiener spaces of graphon signals. Exploiting connections with spectral clustering and Gaussian elimination, we prove that such sampling sets are consistent in the sense that unique sampling sets on a convergent graph sequence converge to unique sampling sets on the graphon. We then propose a related graphon signal sampling algorithm for large graphs, and demonstrate its good empirical performance on graph machine learning tasks. \ No newline at end of file diff --git a/data/2024/iclr/A Policy Gradient Method for Confounded POMDPs b/data/2024/iclr/A Policy Gradient Method for Confounded POMDPs new file mode 100644 index 0000000000..32f35bdc35 --- /dev/null +++ b/data/2024/iclr/A Policy Gradient Method for Confounded POMDPs @@ -0,0 +1 @@ +In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data. The identification enables us to solve a sequence of conditional moment restrictions and adopt the min-max learning procedure with general function approximation for estimating the policy gradient. We then provide a finite-sample non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class in terms of the sample size, length of horizon, concentratability coefficient and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimation in the gradient ascent algorithm, we show the global convergence of the proposed algorithm in finding the history-dependent optimal policy under some technical conditions. To the best of our knowledge, this is the first work studying the policy gradient method for POMDPs under the offline setting. \ No newline at end of file diff --git a/data/2024/iclr/A Precise Characterization of SGD Stability Using Loss Surface Geometry b/data/2024/iclr/A Precise Characterization of SGD Stability Using Loss Surface Geometry new file mode 100644 index 0000000000..a4637da8a8 --- /dev/null +++ b/data/2024/iclr/A Precise Characterization of SGD Stability Using Loss Surface Geometry @@ -0,0 +1 @@ +Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical successes but relatively limited theoretical understanding. Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it instigates. Several studies have investigated the linear stability property of SGD in the vicinity of a stationary point as a predictive proxy for sharpness and generalization error in overparameterized neural networks (Wu et al., 2022; Jastrzebski et al., 2019; Cohen et al., 2021). In this paper, we delve deeper into the relationship between linear stability and sharpness. 
More specifically, we meticulously delineate the necessary and sufficient conditions for linear stability, contingent on hyperparameters of SGD and the sharpness at the optimum. Towards this end, we introduce a novel coherence measure of the loss Hessian that encapsulates pertinent geometric properties of the loss function that are relevant to the linear stability of SGD. It enables us to provide a simplified sufficient condition for identifying linear instability at an optimum. Notably, compared to previous works, our analysis relies on significantly milder assumptions and is applicable for a broader class of loss functions than known before, encompassing not only mean-squared error but also cross-entropy loss. \ No newline at end of file diff --git a/data/2024/iclr/A Primal-Dual Approach to Solving Variational Inequalities with General Constraints b/data/2024/iclr/A Primal-Dual Approach to Solving Variational Inequalities with General Constraints new file mode 100644 index 0000000000..263631c66d --- /dev/null +++ b/data/2024/iclr/A Primal-Dual Approach to Solving Variational Inequalities with General Constraints @@ -0,0 +1 @@ +Yang et al. (2023) recently showed how to use first-order gradient methods to solve general variational inequalities (VIs) under a limiting assumption that analytic solutions of specific subproblems are available. In this paper, we circumvent this assumption via a warm-starting technique where we solve subproblems approximately and initialize variables with the approximate solution found at the previous iteration. We prove the convergence of this method and show that the gap function of the last iterate of the method decreases at a rate of $O(\frac{1}{\sqrt{K}})$ when the operator is $L$-Lipschitz and monotone. In numerical experiments, we show that this technique can converge much faster than its exact counterpart. Furthermore, for the cases when the inequality constraints are simple, we introduce an alternative variant of ACVI and establish its convergence under the same conditions. Finally, we relax the smoothness assumptions in Yang et al., yielding, to our knowledge, the first convergence result for VIs with general constraints that does not rely on the assumption that the operator is $L$-Lipschitz. \ No newline at end of file diff --git a/data/2024/iclr/A Probabilistic Framework for Modular Continual Learning b/data/2024/iclr/A Probabilistic Framework for Modular Continual Learning new file mode 100644 index 0000000000..291c847966 --- /dev/null +++ b/data/2024/iclr/A Probabilistic Framework for Modular Continual Learning @@ -0,0 +1 @@ +Modular approaches that use a different composition of modules for each problem are a promising direction in continual learning (CL). However, searching through the large, discrete space of module compositions is challenging, especially because evaluating a composition's performance requires a round of neural network training. We address this challenge through a modular CL framework, PICLE, that uses a probabilistic model to cheaply compute the fitness of each composition, allowing PICLE to achieve both perceptual, few-shot and latent transfer. The model combines prior knowledge about good module compositions with dataset-specific information. We evaluate PICLE using two benchmark suites designed to assess different desiderata of CL techniques. 
Comparing to a wide range of approaches, we show that PICLE is the first modular CL algorithm to achieve perceptual, few-shot and latent transfer while scaling well to large search spaces, outperforming previous state-of-the-art modular CL approaches on long problem sequences. \ No newline at end of file diff --git a/data/2024/iclr/A Quadratic Synchronization Rule for Distributed Deep Learning b/data/2024/iclr/A Quadratic Synchronization Rule for Distributed Deep Learning new file mode 100644 index 0000000000..8ba6826f1b --- /dev/null +++ b/data/2024/iclr/A Quadratic Synchronization Rule for Distributed Deep Learning @@ -0,0 +1 @@ +In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for $H$ steps without synchronizing with others, hence reducing communication frequency. While $H$ has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper $H$ value can lead to generalization improvement. Yet, selecting a proper $H$ is elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared with the standard data parallel training, QSR enables Local AdamW on ViT-B to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves $1.16\%$ or $0.84\%$ higher top-1 validation accuracy. \ No newline at end of file diff --git a/data/2024/iclr/A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis b/data/2024/iclr/A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis new file mode 100644 index 0000000000..d68426edea --- /dev/null +++ b/data/2024/iclr/A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis @@ -0,0 +1 @@ +Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. 
We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation. \ No newline at end of file diff --git a/data/2024/iclr/A Recipe for Improved Certifiable Robustness b/data/2024/iclr/A Recipe for Improved Certifiable Robustness new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Restoration Network as an Implicit Prior b/data/2024/iclr/A Restoration Network as an Implicit Prior new file mode 100644 index 0000000000..d3e3a47866 --- /dev/null +++ b/data/2024/iclr/A Restoration Network as an Implicit Prior @@ -0,0 +1 @@ +Image denoisers have been shown to be powerful priors for solving inverse problems in imaging. In this work, we introduce a generalization of these methods that allows any image restoration network to be used as an implicit prior. The proposed method uses priors specified by deep neural networks pre-trained as general restoration operators. The method provides a principled approach for adapting state-of-the-art restoration models for other inverse problems. Our theoretical result analyzes its convergence to a stationary point of a global functional associated with the restoration operator. Numerical results show that the method using a super-resolution prior achieves state-of-the-art performance both quantitatively and qualitatively. Overall, this work offers a step forward for solving inverse problems by enabling the use of powerful pre-trained restoration models as priors. \ No newline at end of file diff --git a/data/2024/iclr/A Semantic Invariant Robust Watermark for Large Language Models b/data/2024/iclr/A Semantic Invariant Robust Watermark for Large Language Models new file mode 100644 index 0000000000..b1e2b71076 --- /dev/null +++ b/data/2024/iclr/A Semantic Invariant Robust Watermark for Large Language Models @@ -0,0 +1 @@ +Watermark algorithms for large language models (LLMs) have achieved extremely high accuracy in detecting text generated by LLMs. Such algorithms typically involve adding extra watermark logits to the LLM's logits at each generation step. However, prior algorithms face a trade-off between attack robustness and security robustness. This is because the watermark logits for a token are determined by a certain number of preceding tokens; a small number leads to low security robustness, while a large number results in insufficient attack robustness. In this work, we propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness. The watermark logits in our work are determined by the semantics of all preceding tokens. Specifically, we utilize another embedding LLM to generate semantic embeddings for all preceding tokens, and then these semantic embeddings are transformed into the watermark logits through our trained watermark model. Subsequent analyses and experiments demonstrated the attack robustness of our method in semantically invariant settings: synonym substitution and text paraphrasing settings. Finally, we also show that our watermark possesses adequate security robustness. Our code and data are available at \href{https://github.com/THU-BPM/Robust_Watermark}{https://github.com/THU-BPM/Robust\_Watermark}. 
Additionally, our algorithm could also be accessed through MarkLLM \citep{pan2024markllm} \footnote{https://github.com/THU-BPM/MarkLLM}. \ No newline at end of file diff --git a/data/2024/iclr/A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis b/data/2024/iclr/A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis new file mode 100644 index 0000000000..0fbd8922b6 --- /dev/null +++ b/data/2024/iclr/A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis @@ -0,0 +1 @@ +We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully connected layer to incorporate class information to make predictions, we investigate a proactive approach, asking each class to search for itself in an image. We realize this idea via a Transformer encoder-decoder inspired by DEtection TRansformer (DETR). We learn "class-specific" queries (one for each class) as input to the decoder, enabling each class to localize its patterns in an image via cross-attention. We name our approach INterpretable TRansformer (INTR), which is fairly easy to implement and exhibits several compelling properties. We show that INTR intrinsically encourages each class to attend distinctively; the cross-attention weights thus provide a faithful interpretation of the prediction. Interestingly, via "multi-head" cross-attention, INTR could identify different "attributes" of a class, making it particularly suitable for fine-grained classification and analysis, which we demonstrate on eight datasets. Our code and pre-trained models are publicly accessible at the Imageomics Institute GitHub site: https://github.com/Imageomics/INTR. \ No newline at end of file diff --git a/data/2024/iclr/A Simple Romance Between Multi-Exit Vision Transformer and Token Reduction b/data/2024/iclr/A Simple Romance Between Multi-Exit Vision Transformer and Token Reduction new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Simple and Effective Pruning Approach for Large Language Models b/data/2024/iclr/A Simple and Effective Pruning Approach for Large Language Models new file mode 100644 index 0000000000..c15a0e63b6 --- /dev/null +++ b/data/2024/iclr/A Simple and Effective Pruning Approach for Large Language Models @@ -0,0 +1 @@ +As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. 
Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight updates. Code is available at https://github.com/locuslab/wanda. \ No newline at end of file diff --git a/data/2024/iclr/A Simple and Scalable Representation for Graph Generation b/data/2024/iclr/A Simple and Scalable Representation for Graph Generation new file mode 100644 index 0000000000..88f577f4d8 --- /dev/null +++ b/data/2024/iclr/A Simple and Scalable Representation for Graph Generation @@ -0,0 +1 @@ +Recently, there has been a surge of interest in employing neural networks for graph generation, a fundamental statistical learning problem with critical applications like molecule design and community analysis. However, most approaches encounter significant limitations when generating large-scale graphs. This is due to their requirement to output the full adjacency matrices whose size grows quadratically with the number of nodes. In response to this challenge, we introduce a new, simple, and scalable graph representation named gap encoded edge list (GEEL) that has a small representation size that aligns with the number of edges. In addition, GEEL significantly reduces the vocabulary size by incorporating the gap encoding and bandwidth restriction schemes. GEEL can be autoregressively generated with the incorporation of node positional encoding, and we further extend GEEL to deal with attributed graphs by designing a new grammar. Our findings reveal that the adoption of this compact representation not only enhances scalability but also bolsters performance by simplifying the graph generation process. We conduct a comprehensive evaluation across ten non-attributed and two molecular graph generation tasks, demonstrating the effectiveness of GEEL. \ No newline at end of file diff --git a/data/2024/iclr/A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks b/data/2024/iclr/A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks new file mode 100644 index 0000000000..10d7e7483c --- /dev/null +++ b/data/2024/iclr/A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks @@ -0,0 +1 @@ +Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience. Training such models, however, is quite inefficient and unstable. In this work, we show how simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one, and has theoretical guarantees in terms of convergence. The proposed algorithm, which we call incremental predictive coding (iPC), is also more biologically plausible than the original one, as it is fully automatic. In an extensive set of experiments, we show that iPC consistently performs better than the original formulation on a large number of benchmarks for image classification, as well as for the training of both conditional and masked language models, in terms of test accuracy, efficiency, and convergence with respect to a large set of hyperparameters. 
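For the pruning criterion described in "A Simple and Effective Pruning Approach for Large Language Models" above, a minimal sketch follows: weight importance is taken as weight magnitude times the corresponding input-activation norm, and the lowest-scoring weights are dropped within each output row. The array shapes, the calibration tensor, and the 50% sparsity ratio are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def wanda_prune_mask(W, X, sparsity=0.5):
    """Sketch of the described criterion: score = |weight| * input-feature norm,
    with the lowest-scoring weights dropped independently within each output row.
    W: (out_features, in_features) weights of one linear layer.
    X: (num_tokens, in_features) calibration activations feeding that layer."""
    feat_norm = np.linalg.norm(X, axis=0)              # ||X_j||_2 per input feature
    score = np.abs(W) * feat_norm[None, :]             # per-output comparison groups
    k = int(W.shape[1] * sparsity)                     # weights to drop in each row
    if k == 0:
        return np.ones_like(W, dtype=bool)
    cutoff = np.partition(score, k - 1, axis=1)[:, k - 1:k]
    return score > cutoff                              # apply as W * mask; no retraining

# toy usage with random weights and calibration activations
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(32, 16))
print(wanda_prune_mask(W, X).mean())                   # roughly half the weights kept
```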
\ No newline at end of file diff --git a/data/2024/iclr/A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data b/data/2024/iclr/A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data new file mode 100644 index 0000000000..fbc604fa98 --- /dev/null +++ b/data/2024/iclr/A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data @@ -0,0 +1 @@ +Variational Autoencoders (VAEs) have gained significant popularity among researchers as a powerful tool for understanding unknown distributions based on limited samples. This popularity stems partly from their impressive performance and partly from their ability to provide meaningful feature representations in the latent space. Wasserstein Autoencoders (WAEs), a variant of VAEs, aim to not only improve model efficiency but also interpretability. However, there has been limited focus on analyzing their statistical guarantees. The matter is further complicated by the fact that the data distributions to which WAEs are applied - such as natural images - are often presumed to possess an underlying low-dimensional structure within a high-dimensional feature space, which current theory does not adequately account for, rendering known bounds inefficient. To bridge the gap between the theory and practice of WAEs, in this paper, we show that WAEs can learn the data distributions when the network architectures are properly chosen. We show that the convergence rates of the expected excess risk in the number of samples for WAEs are independent of the high feature dimension, instead relying only on the intrinsic dimension of the data distribution. \ No newline at end of file diff --git a/data/2024/iclr/A Study of Bayesian Neural Network Surrogates for Bayesian Optimization b/data/2024/iclr/A Study of Bayesian Neural Network Surrogates for Bayesian Optimization new file mode 100644 index 0000000000..34e7a23050 --- /dev/null +++ b/data/2024/iclr/A Study of Bayesian Neural Network Surrogates for Bayesian Optimization @@ -0,0 +1 @@ +Bayesian optimization is a highly efficient approach to optimizing objective functions which are expensive to query. These objectives are typically represented by Gaussian process (GP) surrogate models which are easy to optimize and support exact inference. While standard GP surrogates have been well-established in Bayesian optimization, Bayesian neural networks (BNNs) have recently become practical function approximators, with many benefits over standard GPs such as the ability to naturally handle non-stationarity and learn representations for high-dimensional data. In this paper, we study BNNs as alternatives to standard GP surrogates for optimization. We consider a variety of approximate inference procedures for finite-width BNNs, including high-quality Hamiltonian Monte Carlo, low-cost stochastic MCMC, and heuristics such as deep ensembles. We also consider infinite-width BNNs, linearized Laplace approximations, and partially stochastic models such as deep kernel learning. We evaluate this collection of surrogate models on diverse problems with varying dimensionality, number of objectives, non-stationarity, and discrete and continuous inputs. 
We find: (i) the ranking of methods is highly problem dependent, suggesting the need for tailored inductive biases; (ii) HMC is the most successful approximate inference procedure for fully stochastic BNNs; (iii) full stochasticity may be unnecessary as deep kernel learning is relatively competitive; (iv) deep ensembles perform relatively poorly; (v) infinite-width BNNs are particularly promising, especially in high dimensions. \ No newline at end of file diff --git a/data/2024/iclr/A Sublinear Adversarial Training Algorithm b/data/2024/iclr/A Sublinear Adversarial Training Algorithm new file mode 100644 index 0000000000..b8040107b3 --- /dev/null +++ b/data/2024/iclr/A Sublinear Adversarial Training Algorithm @@ -0,0 +1 @@ +Adversarial training is a widely used strategy for making neural networks resistant to adversarial perturbations. For a neural network of width $m$ and $n$ input training data points in $d$ dimensions, it takes $\Omega(mnd)$ time cost per training iteration for the forward and backward computation. In this paper we analyze the convergence guarantee of the adversarial training procedure on a two-layer neural network with shifted ReLU activation, and show that only $o(m)$ neurons will be activated for each input data point per iteration. Furthermore, we develop an algorithm for adversarial training with time cost $o(m n d)$ per iteration by applying a half-space reporting data structure. \ No newline at end of file diff --git a/data/2024/iclr/A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors b/data/2024/iclr/A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors new file mode 100644 index 0000000000..e920bb4224 --- /dev/null +++ b/data/2024/iclr/A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors @@ -0,0 +1 @@ +The distribution of the weights of modern deep neural networks (DNNs) - crucial for uncertainty quantification and robustness - is an eminently complex object due to its extremely high dimensionality. This paper proposes one of the first large-scale explorations of the posterior distribution of deep Bayesian Neural Networks (BNNs), expanding its study to real-world vision tasks and architectures. Specifically, we investigate the optimal approach for approximating the posterior, analyze the connection between posterior quality and uncertainty quantification, delve into the impact of modes on the posterior, and explore methods for visualizing the posterior. Moreover, we uncover weight-space symmetries as a critical aspect for understanding the posterior. To this end, we develop an in-depth assessment of the impact of both permutation and scaling symmetries that tend to obfuscate the Bayesian posterior. While the first type of transformation is known for duplicating modes, we explore the relationship between the latter and L2 regularization, challenging previous misconceptions. Finally, to help the community improve our understanding of the Bayesian posterior, we will shortly release the first large-scale checkpoint dataset, including thousands of real-world models and our code. 
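The permutation and scaling symmetries discussed in "A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors" above can be checked numerically on a one-hidden-layer ReLU network; the snippet below illustrates these standard weight-space identities and is not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

def mlp(W1, b1, W2, b2, x):
    # one hidden ReLU layer: f(x) = W2 relu(W1 x + b1) + b2
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

# permutation symmetry: relabelling hidden units leaves the function unchanged
P = np.eye(5)[rng.permutation(5)]
out_perm = mlp(P @ W1, P @ b1, W2 @ P.T, b2, x)

# scaling symmetry of ReLU: scale a unit's incoming weights up, outgoing weights down
alpha = np.abs(rng.normal(size=5)) + 0.1
out_scale = mlp(alpha[:, None] * W1, alpha * b1, W2 / alpha[None, :], b2, x)

ref = mlp(W1, b1, W2, b2, x)
print(np.allclose(out_perm, ref), np.allclose(out_scale, ref))  # True True
```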
\ No newline at end of file diff --git a/data/2024/iclr/A Topological Perspective on Demystifying GNN-Based Link Prediction Performance b/data/2024/iclr/A Topological Perspective on Demystifying GNN-Based Link Prediction Performance new file mode 100644 index 0000000000..2bfef231e9 --- /dev/null +++ b/data/2024/iclr/A Topological Perspective on Demystifying GNN-Based Link Prediction Performance @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have shown great promise in learning node embeddings for link prediction (LP). While numerous studies aim to improve the overall LP performance of GNNs, none have explored its varying performance across different nodes and its underlying reasons. To this end, we aim to demystify which nodes will perform better from the perspective of their local topology. Despite the widespread belief that low-degree nodes exhibit poorer LP performance, our empirical findings provide nuances to this viewpoint and prompt us to propose a better metric, Topological Concentration (TC), based on the intersection of the local subgraph of each node with the ones of its neighbors. We empirically demonstrate that TC has a higher correlation with LP performance than other node-level topological metrics like degree and subgraph density, offering a better way to identify low-performing nodes than using cold-start. With TC, we discover a novel topological distribution shift issue in which newly joined neighbors of a node tend to become less interactive with that node's existing neighbors, compromising the generalizability of node embeddings for LP at testing time. To make the computation of TC scalable, we further propose Approximated Topological Concentration (ATC) and theoretically/empirically justify its efficacy in approximating TC and reducing the computation complexity. Given the positive correlation between node TC and its LP performance, we explore the potential of boosting LP performance via enhancing TC by re-weighting edges in the message-passing and discuss its effectiveness and limitations. Our code is publicly available at https://github.com/YuWVandy/Topo_LP_GNN. \ No newline at end of file diff --git a/data/2024/iclr/A Unified Framework for Bayesian Optimization under Contextual Uncertainty b/data/2024/iclr/A Unified Framework for Bayesian Optimization under Contextual Uncertainty new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Unified Sampling Framework for Solver Searching of Diffusion Probabilistic Models b/data/2024/iclr/A Unified Sampling Framework for Solver Searching of Diffusion Probabilistic Models new file mode 100644 index 0000000000..adbfceea35 --- /dev/null +++ b/data/2024/iclr/A Unified Sampling Framework for Solver Searching of Diffusion Probabilistic Models @@ -0,0 +1 @@ +Recent years have witnessed the rapid progress and broad application of diffusion probabilistic models (DPMs). Sampling from DPMs can be viewed as solving an ordinary differential equation (ODE). Despite the promising performance, the generation of DPMs usually consumes much time due to the large number of function evaluations (NFE). Though recent works have accelerated the sampling to around 20 steps with high-order solvers, the sample quality with fewer than 10 NFE can still be improved. In this paper, we propose a unified sampling framework (USF) to study the optional strategies for the solver. 
Under this framework, we further reveal that taking different solving strategies at different timesteps may help further decrease the truncation error, and a carefully designed \emph{solver schedule} has the potential to improve the sample quality by a large margin. Therefore, we propose a new sampling framework based on the exponential integral formulation that allows free choices of solver strategy at each step and design specific decisions for the framework. Moreover, we propose $S^3$, a predictor-based search method that automatically optimizes the solver schedule to get a better time-quality trade-off of sampling. We demonstrate that $S^3$ can find outstanding solver schedules which outperform the state-of-the-art sampling methods on CIFAR-10, CelebA, ImageNet, and LSUN-Bedroom datasets. Specifically, we achieve 2.69 FID with 10 NFE and 6.86 FID with 5 NFE on CIFAR-10 dataset, outperforming the SOTA method significantly. We further apply $S^3$ to Stable-Diffusion model and get an acceleration ratio of 2$\times$, showing the feasibility of sampling in very few steps without retraining the neural network. \ No newline at end of file diff --git a/data/2024/iclr/A Unified and General Framework for Continual Learning b/data/2024/iclr/A Unified and General Framework for Continual Learning new file mode 100644 index 0000000000..6646162139 --- /dev/null +++ b/data/2024/iclr/A Unified and General Framework for Continual Learning @@ -0,0 +1 @@ +Continual Learning (CL) focuses on learning from dynamic and changing data distributions while retaining previously acquired knowledge. Various methods have been developed to address the challenge of catastrophic forgetting, including regularization-based, Bayesian-based, and memory-replay-based techniques. However, these methods lack a unified framework and common terminology for describing their approaches. This research aims to bridge this gap by introducing a comprehensive and overarching framework that encompasses and reconciles these existing methodologies. Notably, this new framework is capable of encompassing established CL approaches as special instances within a unified and general optimization objective. An intriguing finding is that despite their diverse origins, these methods share common mathematical structures. This observation highlights the compatibility of these seemingly distinct techniques, revealing their interconnectedness through a shared underlying optimization objective. Moreover, the proposed general framework introduces an innovative concept called refresh learning, specifically designed to enhance the CL performance. This novel approach draws inspiration from neuroscience, where the human brain often sheds outdated information to improve the retention of crucial knowledge and facilitate the acquisition of new information. In essence, refresh learning operates by initially unlearning current data and subsequently relearning it. It serves as a versatile plug-in that seamlessly integrates with existing CL methods, offering an adaptable and effective enhancement to the learning process. Extensive experiments on CL benchmarks and theoretical analysis demonstrate the effectiveness of the proposed refresh learning. Code is available at \url{https://github.com/joey-wang123/CL-refresh-learning}. 
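One plausible reading of the unlearn-then-relearn "refresh learning" step described in "A Unified and General Framework for Continual Learning" above is a brief gradient-ascent move on the current batch followed by an ordinary descent step. The sketch below illustrates that reading on a toy quadratic loss; it is an assumption about the mechanism, not the paper's actual update rule.

```python
import numpy as np

def refresh_step(w, grad_fn, lr=0.1, unlearn_lr=0.05):
    """Assumed reading of 'refresh learning': first take a small gradient-ASCENT
    step on the current batch (unlearn), then a normal gradient-DESCENT step on
    the same batch (relearn)."""
    w = w + unlearn_lr * grad_fn(w)   # unlearn: move against the current-batch fit
    w = w - lr * grad_fn(w)           # relearn: standard descent step
    return w

# toy quadratic loss L(w) = 0.5 * ||w - target||^2 standing in for the current batch
target = np.array([1.0, -2.0])
grad_fn = lambda w: w - target
w = np.zeros(2)
for _ in range(50):
    w = refresh_step(w, grad_fn)
print(w)   # moves toward target despite the interleaved unlearning step
```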
\ No newline at end of file diff --git a/data/2024/iclr/A Variational Framework for Estimating Continuous Treatment Effects with Measurement Error b/data/2024/iclr/A Variational Framework for Estimating Continuous Treatment Effects with Measurement Error new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A Variational Perspective on Solving Inverse Problems with Diffusion Models b/data/2024/iclr/A Variational Perspective on Solving Inverse Problems with Diffusion Models new file mode 100644 index 0000000000..ca02fce0ef --- /dev/null +++ b/data/2024/iclr/A Variational Perspective on Solving Inverse Problems with Diffusion Models @@ -0,0 +1 @@ +Diffusion models have emerged as a key pillar of foundation models in visual domains. One of their critical applications is to universally solve different downstream inverse tasks via a single diffusion prior without re-training for each task. Most inverse tasks can be formulated as inferring a posterior distribution over data (e.g., a full image) given a measurement (e.g., a masked image). This is however challenging in diffusion models since the nonlinear and iterative nature of the diffusion process renders the posterior intractable. To cope with this challenge, we propose a variational approach that by design seeks to approximate the true posterior distribution. We show that our approach naturally leads to regularization by denoising diffusion process (RED-Diff) where denoisers at different timesteps concurrently impose different structural constraints over the image. To gauge the contribution of denoisers from different timesteps, we propose a weighting mechanism based on signal-to-noise-ratio (SNR). Our approach provides a new variational perspective for solving inverse problems with diffusion models, allowing us to formulate sampling as stochastic optimization, where one can simply apply off-the-shelf solvers with lightweight iterates. Our experiments for image restoration tasks such as inpainting and superresolution demonstrate the strengths of our method compared with state-of-the-art sampling-based diffusion models. \ No newline at end of file diff --git a/data/2024/iclr/A Versatile Causal Discovery Framework to Allow Causally-Related Hidden Variables b/data/2024/iclr/A Versatile Causal Discovery Framework to Allow Causally-Related Hidden Variables new file mode 100644 index 0000000000..49783aa096 --- /dev/null +++ b/data/2024/iclr/A Versatile Causal Discovery Framework to Allow Causally-Related Hidden Variables @@ -0,0 +1 @@ +Most existing causal discovery methods rely on the assumption of no latent confounders, limiting their applicability in solving real-life problems. In this paper, we introduce a novel, versatile framework for causal discovery that accommodates the presence of causally-related hidden variables almost everywhere in the causal network (for instance, they can be effects of observed variables), based on rank information of covariance matrix over observed variables. We start by investigating the efficacy of rank in comparison to conditional independence and, theoretically, establish necessary and sufficient conditions for the identifiability of certain latent structural patterns. Furthermore, we develop a Rank-based Latent Causal Discovery algorithm, RLCD, that can efficiently locate hidden variables, determine their cardinalities, and discover the entire causal structure over both measured and hidden ones. 
We also show that, under certain graphical conditions, RLCD correctly identifies the Markov Equivalence Class of the whole latent causal graph asymptotically. Experimental results on both synthetic and real-world personality data sets demonstrate the efficacy of the proposed approach in finite-sample cases. \ No newline at end of file diff --git a/data/2024/iclr/A differentiable brain simulator bridging brain simulation and brain-inspired computing b/data/2024/iclr/A differentiable brain simulator bridging brain simulation and brain-inspired computing new file mode 100644 index 0000000000..936e751bf2 --- /dev/null +++ b/data/2024/iclr/A differentiable brain simulator bridging brain simulation and brain-inspired computing @@ -0,0 +1 @@ +Brain simulation builds dynamical models to mimic the structure and functions of the brain, while brain-inspired computing (BIC) develops intelligent systems by learning from the structure and functions of the brain. The two fields are intertwined and should share a common programming framework to facilitate each other's development. However, none of the existing software in the fields can achieve this goal, because traditional brain simulators lack differentiability for training, while existing deep learning (DL) frameworks fail to capture the biophysical realism and complexity of brain dynamics. In this paper, we introduce BrainPy, a differentiable brain simulator developed using JAX and XLA, with the aim of bridging the gap between brain simulation and BIC. BrainPy expands upon the functionalities of JAX, a powerful AI framework, by introducing complete capabilities for flexible, efficient, and scalable brain simulation. It offers a range of sparse and event-driven operators for efficient and scalable brain simulation, an abstraction for managing the intricacies of synaptic computations, a modular and flexible interface for constructing multi-scale brain models, and an object-oriented just-in-time compilation approach to handle the memory-intensive nature of brain dynamics. We showcase the efficiency and scalability of BrainPy on benchmark tasks, highlight its differentiable simulation for biologically plausible spiking models, and discuss its potential to support research at the intersection of brain simulation and BIC. \ No newline at end of file diff --git a/data/2024/iclr/A path-norm toolkit for modern networks: consequences, promises and challenges b/data/2024/iclr/A path-norm toolkit for modern networks: consequences, promises and challenges new file mode 100644 index 0000000000..311ce85dae --- /dev/null +++ b/data/2024/iclr/A path-norm toolkit for modern networks: consequences, promises and challenges @@ -0,0 +1 @@ +This work introduces the first toolkit around path-norms that fully encompasses general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on layered fully-connected networks compared to the product of operator norms, another complexity measure most commonly used. 
The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet. \ No newline at end of file diff --git a/data/2024/iclr/A representation-learning game for classes of prediction tasks b/data/2024/iclr/A representation-learning game for classes of prediction tasks new file mode 100644 index 0000000000..3607799a8f --- /dev/null +++ b/data/2024/iclr/A representation-learning game for classes of prediction tasks @@ -0,0 +1 @@ +We propose a game-based formulation for learning dimensionality-reducing representations of feature vectors, when only prior knowledge on future prediction tasks is available. In this game, the first player chooses a representation, and then the second player adversarially chooses a prediction task from a given class, representing the prior knowledge. The first player aims to minimize, and the second player to maximize, the regret: the minimal prediction loss using the representation, compared to the same loss using the original features. For the canonical setting in which the representation, the response to predict and the predictors are all linear functions, and under the mean squared error loss function, we derive the theoretically optimal representation in pure strategies, which shows the effectiveness of the prior knowledge, and the optimal regret in mixed strategies, which shows the usefulness of randomizing the representation. For general representations and loss functions, we propose an efficient algorithm to optimize a randomized representation. The algorithm only requires the gradients of the loss function, and is based on incrementally adding a representation rule to a mixture of such rules. \ No newline at end of file diff --git a/data/2024/iclr/A robust differential Neural ODE Optimizer b/data/2024/iclr/A robust differential Neural ODE Optimizer new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/A unique M-pattern for micro-expression spotting in long videos b/data/2024/iclr/A unique M-pattern for micro-expression spotting in long videos new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/ACRF: Compressing Explicit Neural Radiance Fields via Attribute Compression b/data/2024/iclr/ACRF: Compressing Explicit Neural Radiance Fields via Attribute Compression new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process b/data/2024/iclr/ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process new file mode 100644 index 0000000000..1d2cc4217e --- /dev/null +++ b/data/2024/iclr/ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process @@ -0,0 +1 @@ +Image recognition and generation have long been developed independently of each other. With the recent trend towards general-purpose representation learning, the development of general representations for both recognition and generation tasks is also promoted. However, preliminary attempts mainly focus on generation performance, but are still inferior on recognition tasks. These methods are modeled in the vector-quantized (VQ) space, whereas leading recognition methods use pixels as inputs. 
Our key insights are twofold: (1) pixels as inputs are crucial for recognition tasks; (2) VQ tokens as reconstruction targets are beneficial for generation tasks. These observations motivate us to propose an Alternating Denoising Diffusion Process (ADDP) that integrates these two spaces within a single representation learning framework. In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels. The diffusion process gradually masks out a portion of VQ tokens to construct the training samples. The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks. Extensive experiments show that our method achieves competitive performance on unconditional generation, ImageNet classification, COCO detection, and ADE20k segmentation. Importantly, our method represents the first successful development of general representations applicable to both generation and dense recognition tasks. Code is released at \url{https://github.com/ChangyaoTian/ADDP}. \ No newline at end of file diff --git a/data/2024/iclr/ADOPD: A Large-Scale Document Page Decomposition Dataset b/data/2024/iclr/ADOPD: A Large-Scale Document Page Decomposition Dataset new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation b/data/2024/iclr/AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation new file mode 100644 index 0000000000..fc6dacbc62 --- /dev/null +++ b/data/2024/iclr/AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation @@ -0,0 +1 @@ +During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries as well as between them and the 3D scene through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state-of-the-art. Moreover, we also verify its practicality in real-world setups with real user studies. 
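The alternating loop in the ADDP abstract above (decode pixels from the current VQ tokens, then predict new VQ tokens from the decoded pixels, while gradually unmasking) can be sketched structurally as follows; the codebook, nearest-neighbour token predictor, and masking schedule are toy stand-ins rather than the paper's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))                      # 16 codes, 4-dim "pixels"

def decode_pixels(tokens):                               # placeholder pixel decoder
    return codebook[tokens]

def predict_tokens(pixels, mask_ratio):                  # placeholder token predictor
    # nearest codebook entry per position, with a fraction of positions re-masked
    dists = ((pixels[:, None, :] - codebook[None]) ** 2).sum(-1)
    tokens = np.argmin(dists, axis=1)
    n_mask = int(len(tokens) * mask_ratio)
    tokens[rng.choice(len(tokens), n_mask, replace=False)] = rng.integers(0, 16, n_mask)
    return tokens

tokens = rng.integers(0, 16, size=32)                    # start from random VQ tokens
for mask_ratio in np.linspace(0.9, 0.0, 10):             # gradually unmask over steps
    pixels = decode_pixels(tokens)                       # tokens -> pixels
    tokens = predict_tokens(pixels, mask_ratio)          # pixels -> refreshed tokens
print(tokens[:8])
```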
\ No newline at end of file diff --git a/data/2024/iclr/ALAM: Averaged Low-Precision Activation for Memory-Efficient Training of Transformer Models b/data/2024/iclr/ALAM: Averaged Low-Precision Activation for Memory-Efficient Training of Transformer Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents b/data/2024/iclr/AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents new file mode 100644 index 0000000000..195f03a7f4 --- /dev/null +++ b/data/2024/iclr/AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents @@ -0,0 +1 @@ +We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses sequence models to tackle the challenges of generalization, long-term memory, and meta-learning. Recent works have shown that off-policy learning can make in-context RL with recurrent policies viable. Nonetheless, these approaches require extensive tuning and limit scalability by creating key bottlenecks in agents' memory capacity, planning horizon, and model size. AMAGO revisits and redesigns the off-policy in-context approach to successfully train long-sequence Transformers over entire rollouts in parallel with end-to-end RL. Our agent is scalable and applicable to a wide range of problems, and we demonstrate its strong performance empirically in meta-RL and long-term memory domains. AMAGO's focus on sparse rewards and off-policy data also allows in-context learning to extend to goal-conditioned problems with challenging exploration. When combined with a multi-goal hindsight relabeling scheme, AMAGO can solve a previously difficult category of open-world domains, where agents complete many possible instructions in procedurally generated environments. \ No newline at end of file diff --git a/data/2024/iclr/ARGS: Alignment as Reward-Guided Search b/data/2024/iclr/ARGS: Alignment as Reward-Guided Search new file mode 100644 index 0000000000..f22f8542a9 --- /dev/null +++ b/data/2024/iclr/ARGS: Alignment as Reward-Guided Search @@ -0,0 +1 @@ +Aligning large language models with human objectives is paramount, yet common approaches including RLHF suffer from unstable and resource-intensive training. In response to this challenge, we introduce ARGS, Alignment as Reward-Guided Search, a novel framework that integrates alignment into the decoding process, eliminating the need for expensive RL training. By adjusting the model's probabilistic predictions using a reward signal, ARGS generates texts with semantic diversity while being aligned with human preferences, offering a promising and flexible solution for aligning language models. Notably, ARGS demonstrates consistent enhancements in average reward compared to baselines across diverse alignment tasks and various model dimensions. For example, under the same greedy-based decoding strategy, our method improves the average reward by 19.56% relative to the baseline and secures a preference or tie score of 64.33% in GPT-4 evaluation. We believe that our framework, emphasizing decoding-time alignment, paves the way for more responsive language models in the future. Code is publicly available at: \url{https://github.com/deeplearning-wisc/args}. 
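The decoding-time alignment in the ARGS abstract above can be pictured as greedy decoding in which each candidate token is scored by the language model's log-probability plus a weighted reward of the resulting continuation. The toy vocabulary, stand-in language model, and reward function below are illustrative assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]

def lm_logprobs(prefix):
    # toy stand-in for an LM's next-token log-probabilities (ignores the prefix)
    logits = rng.normal(size=len(vocab))
    return logits - np.log(np.exp(logits).sum())

def reward(text):
    # toy stand-in for a learned reward model
    return text.count("cat") - 0.5 * text.count("dog")

def reward_guided_greedy_decode(prompt, steps=5, w=1.0):
    """Greedy decoding where each candidate token is scored by
    LM log-probability plus a weighted reward of the continuation."""
    out = prompt
    for _ in range(steps):
        lp = lm_logprobs(out)
        scores = [lp[i] + w * reward(out + " " + tok) for i, tok in enumerate(vocab)]
        out = out + " " + vocab[int(np.argmax(scores))]
    return out

print(reward_guided_greedy_decode("the"))
```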
\ No newline at end of file diff --git a/data/2024/iclr/ARM: Refining Multivariate Forecasting with Adaptive Temporal-Contextual Learning b/data/2024/iclr/ARM: Refining Multivariate Forecasting with Adaptive Temporal-Contextual Learning new file mode 100644 index 0000000000..b408082d90 --- /dev/null +++ b/data/2024/iclr/ARM: Refining Multivariate Forecasting with Adaptive Temporal-Contextual Learning @@ -0,0 +1 @@ +Long-term time series forecasting (LTSF) is important for various domains but is confronted by challenges in handling the complex temporal-contextual relationships. With multivariate input models underperforming some recent univariate counterparts, we posit that the issue lies in the inefficiency of existing multivariate LTSF Transformers to model series-wise relationships: the characteristic differences between series are often captured incorrectly. To address this, we introduce ARM: a multivariate temporal-contextual adaptive learning method, which is an enhanced architecture specifically designed for multivariate LTSF modelling. ARM employs Adaptive Univariate Effect Learning (AUEL), a Random Dropping (RD) training strategy, and Multi-kernel Local Smoothing (MKLS) to better handle individual series temporal patterns and correctly learn inter-series dependencies. ARM demonstrates superior performance on multiple benchmarks without significantly increasing computational costs compared to the vanilla Transformer, thereby advancing the state-of-the-art in LTSF. ARM is also generally applicable to other LTSF architectures beyond the vanilla Transformer. \ No newline at end of file diff --git a/data/2024/iclr/ASID: Active Exploration for System Identification in Robotic Manipulation b/data/2024/iclr/ASID: Active Exploration for System Identification in Robotic Manipulation new file mode 100644 index 0000000000..dba63fa22c --- /dev/null +++ b/data/2024/iclr/ASID: Active Exploration for System Identification in Robotic Manipulation @@ -0,0 +1 @@ +Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, such methods can be sample inefficient, making them impractical in many real-world domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer. 
Project website at https://weirdlabuw.github.io/asid \ No newline at end of file diff --git a/data/2024/iclr/ASMR: Activation-Sharing Multi-Resolution Coordinate Networks for Efficient Inference b/data/2024/iclr/ASMR: Activation-Sharing Multi-Resolution Coordinate Networks for Efficient Inference new file mode 100644 index 0000000000..489692ad90 --- /dev/null +++ b/data/2024/iclr/ASMR: Activation-Sharing Multi-Resolution Coordinate Networks for Efficient Inference @@ -0,0 +1 @@ +A coordinate network, or implicit neural representation (INR), is a fast-emerging method for encoding natural signals (such as images and videos) with the benefits of a compact neural representation. While numerous methods have been proposed to increase the encoding capabilities of an INR, an often overlooked aspect is the inference efficiency, usually measured in multiply-accumulate (MAC) count. This is particularly critical in use cases where inference throughput is greatly limited by hardware constraints. To this end, we propose the Activation-Sharing Multi-Resolution (ASMR) coordinate network that combines multi-resolution coordinate decomposition with hierarchical modulations. Specifically, an ASMR model enables the sharing of activations across grids of the data. This largely decouples its inference cost from its depth, which is directly correlated to its reconstruction capability, and renders a near O(1) inference complexity irrespective of the number of layers. Experiments show that ASMR can reduce the MAC of a vanilla SIREN model by up to 500x while achieving an even higher reconstruction quality than its SIREN baseline. \ No newline at end of file diff --git a/data/2024/iclr/AUC-CL: A Batchsize-Robust Framework for Self-Supervised Contrastive Representation Learning b/data/2024/iclr/AUC-CL: A Batchsize-Robust Framework for Self-Supervised Contrastive Representation Learning new file mode 100644 index 0000000000..71c0f1f81c --- /dev/null +++ b/data/2024/iclr/AUC-CL: A Batchsize-Robust Framework for Self-Supervised Contrastive Representation Learning @@ -0,0 +1 @@ +Self-supervised learning through contrastive representations is an emergent and promising avenue, aiming at alleviating the reliance on labeled data. Recent research in the field also demonstrates its viability for several downstream tasks, henceforth leading to works that implement the contrastive principle through innovative loss functions and methods. However, despite achieving impressive progress, most methods depend on prohibitively large batch sizes and compute requirements for good performance. In this work, we propose AUC-Contrastive Learning, a new approach to contrastive learning that demonstrates robust and competitive performance in compute-limited regimes. We propose to incorporate the contrastive objective within the AUC-maximization framework, by noting that the AUC metric is maximized upon enhancing the probability of the network’s binary prediction difference between positive and negative samples, which inspires adequate embedding space arrangements in representation learning. Unlike standard contrastive methods, when performing stochastic optimization, our method maintains unbiased stochastic gradients and thus is more robust to batch sizes as opposed to standard stochastic optimization problems. Remarkably, our method, with a batch size of 256, outperforms several state-of-the-art methods that may need much larger batch sizes (e.g., 4096), on ImageNet and other standard datasets.
Experiments on transfer learning and few-shot learning tasks also demonstrate the downstream viability of our method. Code is available at AUC-CL \ No newline at end of file diff --git a/data/2024/iclr/AUGCAL: Improving Sim2Real Adaptation by Uncertainty Calibration on Augmented Synthetic Images b/data/2024/iclr/AUGCAL: Improving Sim2Real Adaptation by Uncertainty Calibration on Augmented Synthetic Images new file mode 100644 index 0000000000..81d2a72c17 --- /dev/null +++ b/data/2024/iclr/AUGCAL: Improving Sim2Real Adaptation by Uncertainty Calibration on Augmented Synthetic Images @@ -0,0 +1 @@ +Synthetic data (SIM) drawn from simulators have emerged as a popular alternative for training models where acquiring annotated real-world images is difficult. However, transferring models trained on synthetic images to real-world applications can be challenging due to appearance disparities. A commonly employed solution to counter this SIM2REAL gap is unsupervised domain adaptation, where models are trained using labeled SIM data and unlabeled REAL data. Mispredictions made by such SIM2REAL adapted models are often associated with miscalibration - stemming from overconfident predictions on real data. In this paper, we introduce AUGCAL, a simple training-time patch for unsupervised adaptation that improves SIM2REAL adapted models by - (1) reducing overall miscalibration, (2) reducing overconfidence in incorrect predictions and (3) improving confidence score reliability by better guiding misclassification detection - all while retaining or improving SIM2REAL performance. Given a base SIM2REAL adaptation algorithm, at training time, AUGCAL involves replacing vanilla SIM images with strongly augmented views (AUG intervention) and additionally optimizing for a training time calibration loss on augmented SIM predictions (CAL intervention). We motivate AUGCAL using a brief analytical justification of how to reduce miscalibration on unlabeled REAL data. Through our experiments, we empirically show the efficacy of AUGCAL across multiple adaptation methods, backbones, tasks and shifts. \ No newline at end of file diff --git a/data/2024/iclr/Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers b/data/2024/iclr/Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers new file mode 100644 index 0000000000..777f9a3ff2 --- /dev/null +++ b/data/2024/iclr/Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers @@ -0,0 +1 @@ +An extension of Transformers is proposed that enables explicit relational reasoning through a novel module called the Abstractor. At the core of the Abstractor is a variant of attention called relational cross-attention. The approach is motivated by an architectural inductive bias for relational learning that disentangles relational information from object-level features. This enables explicit relational reasoning, supporting abstraction and generalization from limited data. The Abstractor is first evaluated on simple discriminative relational tasks and compared to existing relational architectures. Next, the Abstractor is evaluated on purely relational sequence-to-sequence tasks, where dramatic improvements are seen in sample efficiency compared to standard Transformers. 
Finally, Abstractors are evaluated on a collection of tasks based on mathematical problem solving, where consistent improvements in performance and sample efficiency are observed. \ No newline at end of file diff --git a/data/2024/iclr/Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise b/data/2024/iclr/Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise new file mode 100644 index 0000000000..590814960a --- /dev/null +++ b/data/2024/iclr/Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise @@ -0,0 +1 @@ +Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. In contrast to its empirical popularity, the understanding of its theoretical properties is still quite limited, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that the heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings, there is no rigorous theoretical analysis. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with a step decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving a near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors from the statistical minimax rate. This means SGD with heavy-ball momentum is useful in large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice. \ No newline at end of file diff --git a/data/2024/iclr/Accelerated Sampling with Stacked Restricted Boltzmann Machines b/data/2024/iclr/Accelerated Sampling with Stacked Restricted Boltzmann Machines new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Accelerating Data Generation for Neural Operators via Krylov Subspace Recycling b/data/2024/iclr/Accelerating Data Generation for Neural Operators via Krylov Subspace Recycling new file mode 100644 index 0000000000..06ad5b8161 --- /dev/null +++ b/data/2024/iclr/Accelerating Data Generation for Neural Operators via Krylov Subspace Recycling @@ -0,0 +1 @@ +Learning neural operators for solving partial differential equations (PDEs) has attracted great attention due to its high inference efficiency. However, training such operators requires generating a substantial amount of labeled data, i.e., PDE problems together with their solutions. The data generation process is exceptionally time-consuming, as it involves solving numerous systems of linear equations to obtain numerical solutions to the PDEs. Many existing methods solve these systems independently without considering their inherent similarities, resulting in extremely redundant computations. To tackle this problem, we propose a novel method, namely Sorting Krylov Recycling (SKR), to boost the efficiency of solving these systems, thus significantly accelerating data generation for neural operator training.
To the best of our knowledge, SKR is the first attempt to address the time-consuming nature of data generation for learning neural operators. The workhorse of SKR is Krylov subspace recycling, a powerful technique for solving a series of interrelated systems by leveraging their inherent similarities. Specifically, SKR employs a sorting algorithm to arrange these systems in a sequence, where adjacent systems exhibit high similarities. Then it equips a solver with Krylov subspace recycling to solve the systems sequentially instead of independently, thus effectively enhancing the solving efficiency. Both theoretical analysis and extensive experiments demonstrate that SKR can significantly accelerate neural operator data generation, achieving a remarkable speedup of up to 13.9 times. \ No newline at end of file diff --git a/data/2024/iclr/Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks b/data/2024/iclr/Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks new file mode 100644 index 0000000000..67cdaf5834 --- /dev/null +++ b/data/2024/iclr/Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks @@ -0,0 +1 @@ +We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random-walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard linear Markovian token by one which follows a nonlinear Markov chain - namely the Self-Repellent Random Walk (SRRW). Defined for any given 'base' Markov chain, the SRRW, parameterized by a positive scalar {\alpha}, is less likely to transition to states that were highly visited in the past, thus the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves O(1/{\alpha}) decrease in the asymptotic variance for sampling. We propose the use of a 'generalized' version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem, deriving the explicit form of the resulting asymptotic covariance matrix corresponding to iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate O(1/{\alpha}^2) - the performance benefit of using SRRW is thereby amplified in the stochastic optimization context. Empirical results support our theoretical findings. \ No newline at end of file diff --git a/data/2024/iclr/Accelerating Sinkhorn algorithm with sparse Newton iterations b/data/2024/iclr/Accelerating Sinkhorn algorithm with sparse Newton iterations new file mode 100644 index 0000000000..ddce0bb73f --- /dev/null +++ b/data/2024/iclr/Accelerating Sinkhorn algorithm with sparse Newton iterations @@ -0,0 +1 @@ +Computing the optimal transport distance between statistical distributions is a fundamental task in machine learning. One remarkable recent advancement is entropic regularization and the Sinkhorn algorithm, which utilizes only matrix scaling and guarantees an approximated solution with near-linear runtime.
Despite the success of the Sinkhorn algorithm, its runtime may still be slow due to the potentially large number of iterations needed for convergence. To achieve possibly super-exponential convergence, we present Sinkhorn-Newton-Sparse (SNS), an extension to the Sinkhorn algorithm, by introducing early stopping for the matrix scaling steps and a second stage featuring a Newton-type subroutine. Adopting the variational viewpoint that the Sinkhorn algorithm maximizes a concave Lyapunov potential, we offer the insight that the Hessian matrix of the potential function is approximately sparse. Sparsification of the Hessian results in a fast $O(n^2)$ per-iteration complexity, the same as the Sinkhorn algorithm. In terms of total iteration count, we observe that the SNS algorithm converges orders of magnitude faster across a wide range of practical cases, including optimal transportation between empirical distributions and calculating the Wasserstein $W_1, W_2$ distance of discretized densities. The empirical performance is corroborated by a rigorous bound on the approximate sparsity of the Hessian matrix. \ No newline at end of file diff --git a/data/2024/iclr/Accurate Forgetting for Heterogeneous Federated Continual Learning b/data/2024/iclr/Accurate Forgetting for Heterogeneous Federated Continual Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models b/data/2024/iclr/Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models new file mode 100644 index 0000000000..2b767a223b --- /dev/null +++ b/data/2024/iclr/Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models @@ -0,0 +1 @@ +Given a pretrained encoder-based language model, how can we accurately compress it without retraining? Retraining-free structured pruning algorithms are crucial in pretrained language model compression due to their significantly reduced pruning cost and capability to prune large language models. However, existing retraining-free algorithms encounter severe accuracy degradation, as they fail to handle pruning errors, especially at high compression rates. In this paper, we propose K-prune (Knowledge-preserving pruning), an accurate retraining-free structured pruning algorithm for pretrained encoder-based language models. K-prune focuses on preserving the useful knowledge of the pretrained model to minimize pruning errors through a carefully designed iterative pruning process composed of knowledge measurement, knowledge-preserving mask search, and knowledge-preserving weight-tuning. As a result, K-prune shows significant accuracy improvements up to 58.02%p higher F1 score compared to existing retraining-free pruning algorithms under a high compression rate of 80% on the SQuAD benchmark without any retraining process. \ No newline at end of file diff --git a/data/2024/iclr/Accurate and Scalable Estimation of Epistemic Uncertainty for Graph Neural Networks b/data/2024/iclr/Accurate and Scalable Estimation of Epistemic Uncertainty for Graph Neural Networks new file mode 100644 index 0000000000..85e3b8a09c --- /dev/null +++ b/data/2024/iclr/Accurate and Scalable Estimation of Epistemic Uncertainty for Graph Neural Networks @@ -0,0 +1 @@ +While graph neural networks (GNNs) are widely used for node and graph representation learning tasks, the reliability of GNN uncertainty estimates under distribution shifts remains relatively under-explored. 
Indeed, while post-hoc calibration strategies can be used to improve in-distribution calibration, they need not also improve calibration under distribution shift. However, techniques which produce GNNs with better intrinsic uncertainty estimates are particularly valuable, as they can always be combined with post-hoc strategies later. Therefore, in this work, we propose G-$\Delta$UQ, a novel training framework designed to improve intrinsic GNN uncertainty estimates. Our framework adapts the principle of stochastic data centering to graph data through novel graph anchoring strategies, and is able to support partially stochastic GNNs. While the prevalent wisdom is that fully stochastic networks are necessary to obtain reliable estimates, we find that the functional diversity induced by our anchoring strategies when sampling hypotheses renders this unnecessary and allows us to support G-$\Delta$UQ on pretrained models. Indeed, through extensive evaluation under covariate, concept and graph size shifts, we show that G-$\Delta$UQ leads to better calibrated GNNs for node and graph classification. Further, it also improves performance on the uncertainty-based tasks of out-of-distribution detection and generalization gap estimation. Overall, our work provides insights into uncertainty estimation for GNNs, and demonstrates the utility of G-$\Delta$UQ in obtaining reliable estimates. \ No newline at end of file diff --git a/data/2024/iclr/Achieving Fairness in Multi-Agent MDP Using Reinforcement Learning b/data/2024/iclr/Achieving Fairness in Multi-Agent MDP Using Reinforcement Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Achieving Human Parity in Content-Grounded Datasets Generation b/data/2024/iclr/Achieving Human Parity in Content-Grounded Datasets Generation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Achieving Sample and Computational Efficient Reinforcement Learning by Action Space Reduction via Grouping b/data/2024/iclr/Achieving Sample and Computational Efficient Reinforcement Learning by Action Space Reduction via Grouping new file mode 100644 index 0000000000..676114ec61 --- /dev/null +++ b/data/2024/iclr/Achieving Sample and Computational Efficient Reinforcement Learning by Action Space Reduction via Grouping @@ -0,0 +1 @@ +Reinforcement learning often needs to deal with the exponential growth of states and actions when exploring optimal control in high-dimensional spaces (often known as the curse of dimensionality). In this work, we address this issue by learning the inherent structure of action-wise similar MDPs to appropriately balance the performance degradation versus sample/computational complexity. In particular, we partition the action spaces into multiple groups based on the similarity in transition distribution and reward function, and build a linear decomposition model to capture the difference between the intra-group transition kernel and the intra-group rewards. Both our theoretical analysis and experiments reveal a \emph{surprising and counter-intuitive result}: while a more refined grouping strategy can reduce the approximation error caused by treating actions in the same group as identical, it also leads to increased estimation error when the size of samples or the computation resources is limited. This finding highlights the grouping strategy as a new degree of freedom that can be optimized to minimize the overall performance loss.
To address this issue, we formulate a general optimization problem for determining the optimal grouping strategy, which strikes a balance between performance loss and sample/computational complexity. We further propose a computationally efficient method for selecting a nearly-optimal grouping strategy, which maintains its computational complexity independent of the size of the action space. \ No newline at end of file diff --git a/data/2024/iclr/Active Retrosynthetic Planning Aware of Route Quality b/data/2024/iclr/Active Retrosynthetic Planning Aware of Route Quality new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Active Test-Time Adaptation: Theoretical Analyses and An Algorithm b/data/2024/iclr/Active Test-Time Adaptation: Theoretical Analyses and An Algorithm new file mode 100644 index 0000000000..6743295317 --- /dev/null +++ b/data/2024/iclr/Active Test-Time Adaptation: Theoretical Analyses and An Algorithm @@ -0,0 +1 @@ +Test-time adaptation (TTA) addresses distribution shifts for streaming test data in unsupervised settings. Currently, most TTA methods can only deal with minor shifts and rely heavily on heuristic and empirical studies. To advance TTA under domain shifts, we propose the novel problem setting of active test-time adaptation (ATTA) that integrates active learning within the fully TTA setting. We provide a learning theory analysis, demonstrating that incorporating limited labeled test instances enhances overall performances across test domains with a theoretical guarantee. We also present a sample entropy balancing for implementing ATTA while avoiding catastrophic forgetting (CF). We introduce a simple yet effective ATTA algorithm, known as SimATTA, using real-time sample selection techniques. Extensive experimental results confirm consistency with our theoretical analyses and show that the proposed ATTA method yields substantial performance improvements over TTA methods while maintaining efficiency and shares similar effectiveness to the more demanding active domain adaptation (ADA) methods. Our code is available at https://github.com/divelab/ATTA \ No newline at end of file diff --git a/data/2024/iclr/AdaMerging: Adaptive Model Merging for Multi-Task Learning b/data/2024/iclr/AdaMerging: Adaptive Model Merging for Multi-Task Learning new file mode 100644 index 0000000000..9b7a8928b7 --- /dev/null +++ b/data/2024/iclr/AdaMerging: Adaptive Model Merging for Multi-Task Learning @@ -0,0 +1 @@ +Multi-task learning (MTL) aims to empower a model to tackle multiple tasks simultaneously. A recent development known as task arithmetic has revealed that several models, each fine-tuned for distinct tasks, can be directly merged into a single model to execute MTL without necessitating a retraining process using the initial training data. Nevertheless, this direct addition of models often leads to a significant deterioration in the overall performance of the merged model. This decline occurs due to potential conflicts and intricate correlations among the multiple tasks. Consequently, the challenge emerges of how to merge pre-trained models more effectively without using their original training data. This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging). This approach aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data. Specifically, our AdaMerging method operates as an automatic, unsupervised task arithmetic scheme. 
It leverages entropy minimization on unlabeled test samples from the multi-task setup as a surrogate objective function to iteratively refine the merging coefficients of the multiple models. Our experimental findings across eight tasks demonstrate the efficacy of the AdaMerging scheme we put forth. Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11\% improvement in performance. Notably, AdaMerging also exhibits superior generalization capabilities when applied to unseen downstream tasks. Furthermore, it displays a significantly enhanced robustness to data distribution shifts that may occur during the testing phase. \ No newline at end of file diff --git a/data/2024/iclr/Adapting Large Language Models via Reading Comprehension b/data/2024/iclr/Adapting Large Language Models via Reading Comprehension new file mode 100644 index 0000000000..f1b6363342 --- /dev/null +++ b/data/2024/iclr/Adapting Large Language Models via Reading Comprehension @@ -0,0 +1 @@ +We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension—practice after reading improves the ability to answer questions based on the learned knowledge—we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance \ No newline at end of file diff --git a/data/2024/iclr/Adapting to Distribution Shift by Visual Domain Prompt Generation b/data/2024/iclr/Adapting to Distribution Shift by Visual Domain Prompt Generation new file mode 100644 index 0000000000..496d3f72ed --- /dev/null +++ b/data/2024/iclr/Adapting to Distribution Shift by Visual Domain Prompt Generation @@ -0,0 +1 @@ +In this paper, we aim to adapt a model at test time using a few unlabeled samples to address distribution shifts. To tackle the challenges of extracting domain knowledge from a limited amount of data, it is crucial to utilize correlated information from pre-trained backbones and source domains. Previous studies fail to utilize recent foundation models with strong out-of-distribution generalization. Additionally, domain-centric designs are not favored in their works. Furthermore, they treat the process of modelling source domains and the process of learning to adapt as independent, disjoint training stages. In this work, we propose an approach on top of the pre-computed features of the foundation model. Specifically, we build a knowledge bank to learn the transferable knowledge from source domains. Conditioned on few-shot target data, we introduce a domain prompt generator to condense the knowledge bank into a domain-specific prompt. The domain prompt then directs the visual features towards a particular domain via a guidance module. Moreover, we propose a domain-aware contrastive loss and employ meta-learning to facilitate domain knowledge extraction. Extensive experiments are conducted to validate the domain knowledge extraction. The proposed method outperforms previous work on 5 large-scale benchmarks including WILDS and DomainNet.
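The knowledge-bank-to-prompt pipeline described in the abstract above can be illustrated with a small, hypothetical sketch: few-shot target features attend over a bank of source-domain vectors to produce a single domain prompt, which then shifts the visual features. The shapes, the softmax-attention condensation, and the convex-combination guidance are simplifying assumptions rather than the paper's architecture.

import numpy as np

rng = np.random.default_rng(0)
d = 16                                         # feature dimension (assumed)
K = 8                                          # number of knowledge-bank entries (assumed)
knowledge_bank = rng.normal(size=(K, d))       # stand-in for learned transferable knowledge

def generate_domain_prompt(few_shot_feats):
    # Condense the bank into one domain-specific prompt, conditioned on few-shot target features.
    query = few_shot_feats.mean(axis=0)                      # summarize the target domain
    attn = knowledge_bank @ query / np.sqrt(d)               # similarity to each bank entry
    weights = np.exp(attn - attn.max()); weights /= weights.sum()
    return weights @ knowledge_bank                          # weighted combination = prompt

def guide_features(feats, prompt, alpha=0.5):
    # Toy guidance module: shift visual features toward the domain prompt.
    return (1 - alpha) * feats + alpha * prompt

target_few_shot = rng.normal(size=(4, d))      # features of a few unlabeled target samples
prompt = generate_domain_prompt(target_few_shot)
adapted = guide_features(rng.normal(size=(32, d)), prompt)
print(prompt.shape, adapted.shape)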
\ No newline at end of file diff --git a/data/2024/iclr/Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts b/data/2024/iclr/Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts new file mode 100644 index 0000000000..943961d50c --- /dev/null +++ b/data/2024/iclr/Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts @@ -0,0 +1 @@ +By providing external information to large language models (LLMs), tool augmentation (including retrieval augmentation) has emerged as a promising solution for addressing the limitations of LLMs' static parametric memory. However, how receptive are LLMs to such external evidence, especially when the evidence conflicts with their parametric memory? We present the first comprehensive and controlled investigation into the behavior of LLMs when encountering knowledge conflicts. We propose a systematic framework to elicit high-quality parametric memory from LLMs and construct the corresponding counter-memory, which enables us to conduct a series of controlled experiments. Our investigation reveals seemingly contradicting behaviors of LLMs. On the one hand, different from prior wisdom, we find that LLMs can be highly receptive to external evidence even when that conflicts with their parametric memory, given that the external evidence is coherent and convincing. On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information that is consistent with their parametric memory, despite being presented with conflicting evidence at the same time. These results pose important implications that are worth careful consideration for the further development and deployment of tool- and retrieval-augmented LLMs. Resources are available at https://github.com/OSU-NLP-Group/LLM-Knowledge-Conflict. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Federated Learning with Auto-Tuned Clients b/data/2024/iclr/Adaptive Federated Learning with Auto-Tuned Clients new file mode 100644 index 0000000000..24508690d5 --- /dev/null +++ b/data/2024/iclr/Adaptive Federated Learning with Auto-Tuned Clients @@ -0,0 +1 @@ +Federated learning (FL) is a distributed machine learning framework where the global model of a central server is trained via multiple collaborative steps by participating clients without sharing their data. While being a flexible framework, where the distribution of local data, participation rate, and computing power of each client can greatly vary, such flexibility gives rise to many new challenges, especially in the hyperparameter tuning on the client side. We propose $\Delta$-SGD, a simple step size rule for SGD that enables each client to use its own step size by adapting to the local smoothness of the function each client is optimizing. We provide theoretical and empirical results where the benefit of the client adaptivity is shown in various FL scenarios. 
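As a rough illustration of the client-side auto-tuning idea in the $\Delta$-SGD abstract above, the sketch below adapts a step size from an estimate of local smoothness obtained from successive iterates and gradients. The specific estimate ||g_t - g_{t-1}|| / ||x_t - x_{t-1}||, the bounded growth factor, and the toy quadratic objective are illustrative assumptions, not the paper's exact rule.

import numpy as np

# Toy smooth objective f(x) = 0.5 * x^T A x, standing in for one client's local loss.
A = np.diag(np.linspace(0.1, 5.0, 10))

def grad(x):
    return A @ x

rng = np.random.default_rng(1)
x = rng.normal(size=10)
x_prev, g_prev = x, grad(x)
eta, theta = 1e-3, 1.0          # initial step size and growth state (assumed values)

for t in range(200):
    g = grad(x)
    dx_norm = np.linalg.norm(x - x_prev)
    dg_norm = np.linalg.norm(g - g_prev)
    if dx_norm > 1e-12 and dg_norm > 1e-12:
        # Local smoothness estimate L_hat = dg_norm / dx_norm; the step is capped by
        # 1 / (2 * L_hat) while its growth per round stays bounded.
        eta_new = min(np.sqrt(1.0 + theta) * eta, dx_norm / (2.0 * dg_norm))
        theta, eta = eta_new / eta, eta_new
    x_prev, g_prev = x, g
    x = x - eta * g

print("final loss:", 0.5 * x @ A @ x, "final step size:", round(eta, 4))

Each client could run such a rule on its own objective, which removes the need to hand-tune a single learning rate shared across heterogeneous clients.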
\ No newline at end of file diff --git a/data/2024/iclr/Adaptive Instrument Design for Indirect Experiments b/data/2024/iclr/Adaptive Instrument Design for Indirect Experiments new file mode 100644 index 0000000000..e9e5ea3824 --- /dev/null +++ b/data/2024/iclr/Adaptive Instrument Design for Indirect Experiments @@ -0,0 +1 @@ +Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for direct experiments, in this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Rational Activations to Boost Deep Reinforcement Learning b/data/2024/iclr/Adaptive Rational Activations to Boost Deep Reinforcement Learning new file mode 100644 index 0000000000..7727d61299 --- /dev/null +++ b/data/2024/iclr/Adaptive Rational Activations to Boost Deep Reinforcement Learning @@ -0,0 +1 @@ +Latest insights from biology show that intelligence not only emerges from the connections between neurons but that individual neurons shoulder more computational responsibility than previously anticipated. This perspective should be critical in the context of constantly changing distinct reinforcement learning environments, yet current approaches still primarily employ static activation functions. In this work, we motivate why rationals are suitable for adaptable activation functions and why their inclusion into neural networks is crucial. Inspired by recurrence in residual networks, we derive a condition under which rational units are closed under residual connections and formulate a naturally regularised version: the recurrent-rational. We demonstrate that equipping popular algorithms with (recurrent-)rational activations leads to consistent improvements on Atari games, especially turning simple DQN into a solid approach, competitive to DDQN and Rainbow. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Regret for Bandits Made Possible: Two Queries Suffice b/data/2024/iclr/Adaptive Regret for Bandits Made Possible: Two Queries Suffice new file mode 100644 index 0000000000..831f2dd2e2 --- /dev/null +++ b/data/2024/iclr/Adaptive Regret for Bandits Made Possible: Two Queries Suffice @@ -0,0 +1 @@ +Fast changing states or volatile environments pose a significant challenge to online optimization, which needs to perform rapid adaptation under limited observation. 
In this paper, we give query and regret optimal bandit algorithms under the strict notion of strongly adaptive regret, which measures the maximum regret over any contiguous interval $I$. Due to its worst-case nature, there is an almost-linear $\Omega(|I|^{1-\epsilon})$ regret lower bound, when only one query per round is allowed [Daniely et al., ICML 2015]. Surprisingly, with just two queries per round, we give the Strongly Adaptive Bandit Learner (StABL) that achieves $\tilde{O}(\sqrt{n|I|})$ adaptive regret for multi-armed bandits with $n$ arms. The bound is tight and cannot be improved in general. Our algorithm leverages a multiplicative update scheme of varying stepsizes and a carefully chosen observation distribution to control the variance. Furthermore, we extend our results and provide optimal algorithms in the bandit convex optimization setting. Finally, we empirically demonstrate the superior performance of our algorithms under volatile environments and for downstream tasks, such as algorithm selection for hyperparameter optimization. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation b/data/2024/iclr/Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation new file mode 100644 index 0000000000..8dc30d92e4 --- /dev/null +++ b/data/2024/iclr/Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation @@ -0,0 +1 @@ +Representation rank is an important concept for understanding the role of Neural Networks (NNs) in Deep Reinforcement Learning (DRL), which measures the expressive capacity of value networks. Existing studies focus on unboundedly maximizing this rank; nevertheless, that approach would introduce overly complex models in the learning, thus undermining performance. Hence, fine-tuning representation rank presents a challenging and crucial optimization problem. To address this issue, we find a guiding principle for adaptive control of the representation rank. We employ the Bellman equation as a theoretical foundation and derive an upper bound on the cosine similarity between the representations of consecutive state-action pairs in value networks. We then leverage this upper bound to propose a novel regularizer, namely BEllman Equation-based automatic rank Regularizer (BEER). This regularizer adaptively regularizes the representation rank, thus improving the DRL agent's performance. We first validate the effectiveness of automatic control of rank on illustrative experiments. Then, we scale up BEER to complex continuous control tasks by combining it with the deterministic policy gradient method. Among 12 challenging DeepMind control tasks, BEER outperforms the baselines by a large margin. Besides, BEER demonstrates significant advantages in Q-value approximation. Our code is available at https://github.com/sweetice/BEER-ICLR2024.
Existing approaches perform k-NN search with CE by approximating the CE similarity with a vector embedding space fit either with dual-encoders (DE) or CUR matrix factorization. DE-based retrieve-and-rerank approaches suffer from poor recall on new domains and the retrieval with DE is decoupled from the CE. While CUR-based approaches can be more accurate than the DE-based approach, they require a prohibitively large number of CE calls to compute item embeddings, thus making it impractical for deployment at scale. In this paper, we address these shortcomings with our proposed sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity. We compute item embeddings offline by factorizing a sparse matrix containing query-item CE scores for a set of train queries. Our method produces a high-quality approximation while requiring only a fraction of CE calls as compared to CUR-based methods, and allows for leveraging DE to initialize the embedding space while avoiding compute- and resource-intensive finetuning of DE via distillation. At test time, the item embeddings remain fixed and retrieval occurs over rounds, alternating between a) estimating the test query embedding by minimizing error in approximating CE scores of items retrieved thus far, and b) using the updated test query embedding for retrieving more items. Our k-NN search method improves recall by up to 5% (k=1) and 54% (k=100) over DE-based approaches. Additionally, our indexing approach achieves a speedup of up to 100x over CUR-based and 5x over DE distillation methods, while matching or improving k-NN search recall over baselines. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Self-training Framework for Fine-grained Scene Graph Generation b/data/2024/iclr/Adaptive Self-training Framework for Fine-grained Scene Graph Generation new file mode 100644 index 0000000000..e98ef2de70 --- /dev/null +++ b/data/2024/iclr/Adaptive Self-training Framework for Fine-grained Scene Graph Generation @@ -0,0 +1 @@ +Scene graph generation (SGG) models have suffered from inherent problems regarding the benchmark datasets such as the long-tailed predicate distribution and missing annotation problems. In this work, we aim to alleviate the long-tailed problem of SGG by utilizing unannotated triplets. To this end, we introduce a Self-Training framework for SGG (ST-SGG) that assigns pseudo-labels for unannotated triplets based on which the SGG models are trained. While there has been significant progress in self-training for image recognition, designing a self-training framework for the SGG task is more challenging due to its inherent nature such as the semantic ambiguity and the long-tailed distribution of predicate classes. Hence, we propose a novel pseudo-labeling technique for SGG, called Class-specific Adaptive Thresholding with Momentum (CATM), which is a model-agnostic framework that can be applied to any existing SGG models. Furthermore, we devise a graph structure learner (GSL) that is beneficial when adopting our proposed self-training framework to the state-of-the-art message-passing neural network (MPNN)-based SGG models. Our extensive experiments verify the effectiveness of ST-SGG on various SGG models, particularly in enhancing the performance on fine-grained predicate classes. 
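A toy sketch of class-specific adaptive thresholding with momentum, in the spirit of the CATM pseudo-labelling step described in the ST-SGG abstract above: each predicate class keeps its own confidence threshold, updated by an exponential moving average, and an unannotated instance only receives a pseudo-label when it clears the threshold of its predicted class. The initial thresholds, the momentum value, and the decay branch for classes that rarely clear their threshold are assumptions for illustration, not the paper's exact procedure.

import numpy as np

rng = np.random.default_rng(2)
num_classes = 5
thresholds = np.full(num_classes, 0.5)        # per-predicate-class thresholds (assumed init)
momentum = 0.9

def maybe_pseudo_label(probs):
    # Assign a pseudo-label only if the top class clears its own adaptive threshold.
    c = int(np.argmax(probs))
    conf = probs[c]
    if conf >= thresholds[c]:
        # Raise the bar for classes the model is already confident about (EMA update).
        thresholds[c] = momentum * thresholds[c] + (1 - momentum) * conf
        return c
    # Let thresholds of rarely-accepted (often tail) classes drift down (assumed behaviour).
    thresholds[c] = momentum * thresholds[c] + (1 - momentum) * conf * 0.5
    return None

for _ in range(10):
    logits = rng.normal(size=num_classes)
    probs = np.exp(logits) / np.exp(logits).sum()
    label = maybe_pseudo_label(probs)
    print(label, np.round(thresholds, 3))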
\ No newline at end of file diff --git a/data/2024/iclr/Adaptive Sharpness-Aware Pruning for Robust Sparse Networks b/data/2024/iclr/Adaptive Sharpness-Aware Pruning for Robust Sparse Networks new file mode 100644 index 0000000000..fa85002170 --- /dev/null +++ b/data/2024/iclr/Adaptive Sharpness-Aware Pruning for Robust Sparse Networks @@ -0,0 +1 @@ +Robustness and compactness are two essential attributes of deep learning models that are deployed in the real world. The goals of robustness and compactness may seem to be at odds, since robustness requires generalization across domains, while the process of compression exploits specificity in one domain. We introduce Adaptive Sharpness-Aware Pruning (AdaSAP), which unifies these goals through the lens of network sharpness. The AdaSAP method produces sparse networks that are robust to input variations which are unseen at training time. We achieve this by strategically incorporating weight perturbations in order to optimize the loss landscape. This allows the model to be both primed for pruning and regularized for improved robustness. AdaSAP improves the robust accuracy of pruned models on image classification by up to +6% on ImageNet C and +4% on ImageNet V2, and on object detection by +4% on a corrupted Pascal VOC dataset, over a wide range of compression ratios, pruning criteria, and network architectures, outperforming recent pruning art by large margins. \ No newline at end of file diff --git a/data/2024/iclr/Adaptive Stochastic Gradient Algorithm for Black-box Multi-Objective Learning b/data/2024/iclr/Adaptive Stochastic Gradient Algorithm for Black-box Multi-Objective Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Adaptive Window Pruning for Efficient Local Motion Deblurring b/data/2024/iclr/Adaptive Window Pruning for Efficient Local Motion Deblurring new file mode 100644 index 0000000000..e25561a189 --- /dev/null +++ b/data/2024/iclr/Adaptive Window Pruning for Efficient Local Motion Deblurring @@ -0,0 +1 @@ +Local motion blur commonly occurs in real-world photography due to the mixing between moving objects and stationary backgrounds during exposure. Existing image deblurring methods predominantly focus on global deblurring, inadvertently affecting the sharpness of backgrounds in locally blurred images and wasting unnecessary computation on sharp pixels, especially for high-resolution images. This paper aims to adaptively and efficiently restore high-resolution locally blurred images. We propose a local motion deblurring vision Transformer (LMD-ViT) built on adaptive window pruning Transformer blocks (AdaWPT). To focus deblurring on local regions and reduce computation, AdaWPT prunes unnecessary windows, only allowing the active windows to be involved in the deblurring processes. The pruning operation relies on the blurriness confidence predicted by a confidence predictor that is trained end-to-end using a reconstruction loss with Gumbel-Softmax re-parameterization and a pruning loss guided by annotated blur masks. Our method removes local motion blur effectively without distorting sharp regions, demonstrated by its exceptional perceptual and quantitative improvements compared to state-of-the-art methods. In addition, our approach substantially reduces FLOPs by 66% and achieves more than a twofold increase in inference speed compared to Transformer-based deblurring methods. We will make our code and annotated blur masks publicly available. 
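The window-pruning mechanism described in the LMD-ViT abstract above can be sketched as follows: a confidence predictor scores each window's blurriness, a fraction of windows is kept as active, and only those go through the expensive deblurring branch while the rest are passed through unchanged. The variance-based confidence, the hard top-quantile selection (standing in for the Gumbel-Softmax used at training time), and all shapes are illustrative assumptions, not the actual network.

import numpy as np

rng = np.random.default_rng(3)
H = W = 8                                    # the image is split into an 8x8 grid of windows
windows = rng.normal(size=(H * W, 16, 16))   # flattened windows of features/pixels (toy data)

def blur_confidence(win):
    # Stand-in predictor: low local variance is treated as "blurry", purely for illustration.
    return 1.0 / (1.0 + win.var())

def heavy_deblur_block(win):
    # Placeholder for an expensive Transformer block applied only to active windows.
    return win * 1.1

conf = np.array([blur_confidence(w) for w in windows])
active = conf > np.quantile(conf, 0.66)      # prune roughly two thirds of the windows (assumed budget)

out = windows.copy()                         # pruned windows are passed through untouched
if active.any():
    out[active] = np.stack([heavy_deblur_block(w) for w in windows[active]])
print(f"processed {active.sum()} / {len(windows)} windows")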
\ No newline at end of file diff --git a/data/2024/iclr/Adaptive deep spiking neural network with global-local learning via balanced excitatory and inhibitory mechanism b/data/2024/iclr/Adaptive deep spiking neural network with global-local learning via balanced excitatory and inhibitory mechanism new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning b/data/2024/iclr/Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning new file mode 100644 index 0000000000..54a0adc215 --- /dev/null +++ b/data/2024/iclr/Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning @@ -0,0 +1 @@ +Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful units. While many methods address these two issues separately, only a few currently deal with both simultaneously. In this paper, we introduce Utility-based Perturbed Gradient Descent (UPGD) as a novel approach for the continual learning of representations. UPGD combines gradient updates with perturbations, where it applies smaller modifications to more useful units, protecting them from forgetting, and larger modifications to less useful units, rejuvenating their plasticity. We use a challenging streaming learning setup where continual learning problems have hundreds of non-stationarities and unknown task boundaries. We show that many existing methods suffer from at least one of the issues, predominantly manifested by their decreasing accuracy over tasks. On the other hand, UPGD continues to improve performance and surpasses or is competitive with all methods in all problems. Finally, in extended reinforcement learning experiments with PPO, we show that while Adam exhibits a performance drop after initial learning, UPGD avoids it by addressing both continual learning issues. \ No newline at end of file diff --git a/data/2024/iclr/Addressing Signal Delay in Deep Reinforcement Learning b/data/2024/iclr/Addressing Signal Delay in Deep Reinforcement Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models b/data/2024/iclr/AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models new file mode 100644 index 0000000000..c1fe24b885 --- /dev/null +++ b/data/2024/iclr/AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models @@ -0,0 +1 @@ +Existing customization methods require access to multiple reference examples to align pre-trained diffusion probabilistic models (DPMs) with user-provided concepts. This paper aims to address the challenge of DPM customization when the only available supervision is a differentiable metric defined on the generated contents. Since the sampling procedure of DPMs involves recursive calls to the denoising UNet, na\"ive gradient backpropagation requires storing the intermediate states of all iterations, resulting in extremely high memory consumption. To overcome this issue, we propose a novel method AdjointDPM, which first generates new samples from diffusion models by solving the corresponding probability-flow ODEs. 
It then uses the adjoint sensitivity method to backpropagate the gradients of the loss to the models' parameters (including conditioning signals, network weights, and initial noises) by solving another augmented ODE. To reduce numerical errors in both the forward generation and gradient backpropagation processes, we further reparameterize the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. Finally, we demonstrate the effectiveness of AdjointDPM on three interesting tasks: converting visual effects into identification text embeddings, finetuning DPMs for specific types of stylization, and optimizing initial noise to generate adversarial samples for security auditing. \ No newline at end of file diff --git a/data/2024/iclr/Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models b/data/2024/iclr/Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models new file mode 100644 index 0000000000..823cd9e1e7 --- /dev/null +++ b/data/2024/iclr/Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models @@ -0,0 +1 @@ +Recent work has showcased the significant potential of diffusion models in pose-guided person image synthesis. However, owing to the inconsistency in pose between the source and target images, synthesizing an image with a distinct pose, relying exclusively on the source image and target pose information, remains a formidable challenge. This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages. Specifically, in the first stage, we design a simple prior conditional diffusion model that predicts the global features of the target image by mining the global alignment relationship between pose coordinates and image appearance. Then, the second stage establishes a dense correspondence between the source and target images using the global features from the previous stage, and an inpainting conditional diffusion model is proposed to further align and enhance the contextual features, generating a coarse-grained person image. In the third stage, we propose a refining conditional diffusion model to utilize the coarsely generated image from the previous stage as a condition, achieving texture restoration and enhancing fine-detail consistency. The three-stage PCDMs work progressively to generate the final high-quality and high-fidelity synthesized image. Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios. The code and model will be available at https://github.com/tencent-ailab/PCDMs. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs b/data/2024/iclr/Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs new file mode 100644 index 0000000000..53a52b7394 --- /dev/null +++ b/data/2024/iclr/Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs @@ -0,0 +1 @@ +Solving partial differential equations (PDEs) is a central task in scientific computing. Recently, neural network approximation of PDEs has received increasing attention due to its flexible meshless discretization and its potential for high-dimensional problems.
One fundamental numerical difficulty is that random samples in the training set introduce statistical errors into the discretization of loss functional which may become the dominant error in the final approximation, and therefore overshadow the modeling capability of the neural network. In this work, we propose a new minmax formulation to optimize simultaneously the approximate solution, given by a neural network model, and the random samples in the training set, provided by a deep generative model. The key idea is to use a deep generative model to adjust random samples in the training set such that the residual induced by the approximate PDE solution can maintain a smooth profile when it is being minimized. Such an idea is achieved by implicitly embedding the Wasserstein distance between the residual-induced distribution and the uniform distribution into the loss, which is then minimized together with the residual. A nearly uniform residual profile means that its variance is small for any normalized weight function such that the Monte Carlo approximation error of the loss functional is reduced significantly for a certain sample size. The adversarial adaptive sampling (AAS) approach proposed in this work is the first attempt to formulate two essential components, minimizing the residual and seeking the optimal training set, into one minmax objective functional for the neural network approximation of PDEs. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Attacks on Fairness of Graph Neural Networks b/data/2024/iclr/Adversarial Attacks on Fairness of Graph Neural Networks new file mode 100644 index 0000000000..64e11d7ae0 --- /dev/null +++ b/data/2024/iclr/Adversarial Attacks on Fairness of Graph Neural Networks @@ -0,0 +1 @@ +Fairness-aware graph neural networks (GNNs) have gained a surge of attention as they can reduce the bias of predictions on any demographic group (e.g., female) in graph-based applications. Although these methods greatly improve the algorithmic fairness of GNNs, the fairness can be easily corrupted by carefully designed adversarial attacks. In this paper, we investigate the problem of adversarial attacks on fairness of GNNs and propose G-FairAttack, a general framework for attacking various types of fairness-aware GNNs in terms of fairness with an unnoticeable effect on prediction utility. In addition, we propose a fast computation technique to reduce the time complexity of G-FairAttack. The experimental study demonstrates that G-FairAttack successfully corrupts the fairness of different types of GNNs while keeping the attack unnoticeable. Our study on fairness attacks sheds light on potential vulnerabilities in fairness-aware GNNs and guides further research on the robustness of GNNs in terms of fairness. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial AutoMixup b/data/2024/iclr/Adversarial AutoMixup new file mode 100644 index 0000000000..c54b09da84 --- /dev/null +++ b/data/2024/iclr/Adversarial AutoMixup @@ -0,0 +1 @@ +Data mixing augmentation has been widely applied to improve the generalization ability of deep neural networks. Recently, offline data mixing augmentation, e.g. handcrafted and saliency information-based mixup, has been gradually replaced by automatic mixing approaches. Through minimizing two sub-tasks, namely, mixed sample generation and mixup classification in an end-to-end way, AutoMix significantly improves accuracy on image classification tasks. 
However, as the optimization objective is consistent for the two sub-tasks, this approach is prone to generating consistent instead of diverse mixed samples, which results in overfitting for target task training. In this paper, we propose AdAutomixup, an adversarial automatic mixup augmentation approach that generates challenging samples to train a robust classifier for image classification, by alternatively optimizing the classifier and the mixup sample generator. AdAutomixup comprises two modules, a mixed example generator, and a target classifier. The mixed sample generator aims to produce hard mixed examples to challenge the target classifier, while the target classifier's aim is to learn robust features from hard mixed examples to improve generalization. To prevent the collapse of the inherent meanings of images, we further introduce an exponential moving average (EMA) teacher and cosine similarity to train AdAutomixup in an end-to-end way. Extensive experiments on seven image benchmarks consistently prove that our approach outperforms the state of the art in various classification scenarios. The source code is available at https://github.com/JinXins/Adversarial-AutoMixup. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Causal Bayesian Optimization b/data/2024/iclr/Adversarial Causal Bayesian Optimization new file mode 100644 index 0000000000..a7528ee404 --- /dev/null +++ b/data/2024/iclr/Adversarial Causal Bayesian Optimization @@ -0,0 +1 @@ +In Causal Bayesian Optimization (CBO), an agent intervenes on an unknown structural causal model to maximize a downstream reward variable. In this paper, we consider the generalization where other agents or external events also intervene on the system, which is key for enabling adaptiveness to non-stationarities such as weather changes, market forces, or adversaries. We formalize this generalization of CBO as Adversarial Causal Bayesian Optimization (ACBO) and introduce the first algorithm for ACBO with bounded regret: Causal Bayesian Optimization with Multiplicative Weights (CBO-MW). Our approach combines a classical online learning strategy with causal modeling of the rewards. To achieve this, it computes optimistic counterfactual reward estimates by propagating uncertainty through the causal graph. We derive regret bounds for CBO-MW that naturally depend on graph-related quantities. We further propose a scalable implementation for the case of combinatorial interventions and submodular rewards. Empirically, CBO-MW outperforms non-causal and non-adversarial Bayesian optimization methods on synthetic environments and environments based on real-word data. Our experiments include a realistic demonstration of how CBO-MW can be used to learn users' demand patterns in a shared mobility system and reposition vehicles in strategic areas. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Feature Map Pruning for Backdoor b/data/2024/iclr/Adversarial Feature Map Pruning for Backdoor new file mode 100644 index 0000000000..ded090f707 --- /dev/null +++ b/data/2024/iclr/Adversarial Feature Map Pruning for Backdoor @@ -0,0 +1 @@ +Deep neural networks have been widely used in many critical applications, such as autonomous vehicles and medical diagnosis. However, their security is threatened by backdoor attacks, which are achieved by adding artificial patterns to specific training data. 
Existing defense strategies primarily focus on using reverse engineering to reproduce the backdoor trigger generated by attackers and subsequently repair the DNN model by adding the trigger into inputs and fine-tuning the model with ground-truth labels. However, once the trigger generated by the attackers is complex and invisible, the defender cannot reproduce the trigger successfully, and the DNN model will then not be repaired, as the trigger is not effectively removed. In this work, we propose Adversarial Feature Map Pruning for Backdoor (FMP) to mitigate backdoors in the DNN. Unlike existing defense strategies, which focus on reproducing backdoor triggers, FMP attempts to prune backdoor feature maps, which are trained to extract backdoor information from inputs. After pruning these backdoor feature maps, FMP will fine-tune the model with a secure subset of training data. Our experiments demonstrate that, compared to existing defense strategies, FMP can effectively reduce the Attack Success Rate (ASR) even against the most complex and invisible attack triggers (e.g., FMP decreases the ASR to 2.86\% in CIFAR10, which is 19.2\% to 65.41\% lower than baselines). Second, unlike conventional defense methods that tend to exhibit low robust accuracy (RA, that is, the accuracy of the model on poisoned data), FMP achieves a higher RA, indicating its superiority in maintaining model performance while mitigating the effects of backdoor attacks (e.g., FMP obtains 87.40\% RA in CIFAR10). Our code is publicly available at: https://github.com/retsuh-bqw/FMP. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Imitation Learning via Boosting b/data/2024/iclr/Adversarial Imitation Learning via Boosting new file mode 100644 index 0000000000..ee5d9819df --- /dev/null +++ b/data/2024/iclr/Adversarial Imitation Learning via Boosting @@ -0,0 +1 @@ +Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al., 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective is on-policy and DAC's ad-hoc application of off-policy training does not guarantee successful imitation (Kostrikov et al., 2019; 2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this issue by deriving a fully off-policy AIL objective. Instead, in this work, we develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using the entire data collected so far. In the weighted replay buffer, the contribution of the data from older policies is properly discounted with the weight computed based on the boosting framework. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC on both types of environments, demonstrating the benefit of properly weighting replay buffer data for off-policy training. 
On state-based environments, DAC outperforms ValueDICE and IQ-Learn (Garg et al., 2021), achieving competitive performance with as little as one expert trajectory. \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive b/data/2024/iclr/Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive new file mode 100644 index 0000000000..c933ef726c --- /dev/null +++ b/data/2024/iclr/Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive @@ -0,0 +1 @@ +Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points). \ No newline at end of file diff --git a/data/2024/iclr/Adversarial Training Should Be Cast as a Non-Zero-Sum Game b/data/2024/iclr/Adversarial Training Should Be Cast as a Non-Zero-Sum Game new file mode 100644 index 0000000000..a9b7ade855 --- /dev/null +++ b/data/2024/iclr/Adversarial Training Should Be Cast as a Non-Zero-Sum Game @@ -0,0 +1 @@ +One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the surrogate-based relaxation commonly used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting. 
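To make the non-zero-sum formulation in the preceding abstract concrete, the following is a minimal PyTorch sketch of a bilevel training step in which the two players optimize different objectives: the inner (attacker) player maximizes a margin-based misclassification objective, while the outer (defender) player minimizes cross-entropy on the resulting perturbations. The toy model, step sizes, and random data below are illustrative assumptions, not the paper's exact algorithm.

import torch
import torch.nn as nn
import torch.nn.functional as F

def attacker_objective(logits, y):
    # Negative margin: the attacker wants the best wrong class to beat the true class.
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    other = logits.masked_fill(F.one_hot(y, logits.size(1)).bool(), float("-inf")).max(1).values
    return (other - true).mean()

def train_step(model, x, y, opt, eps=0.1, alpha=0.02, steps=10):
    # Inner player: maximize the margin objective (not the defender's surrogate loss).
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss_atk = attacker_objective(model(x + delta), y)
        grad, = torch.autograd.grad(loss_atk, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Outer player: minimize classification loss on the perturbed inputs.
    opt.zero_grad()
    loss_def = F.cross_entropy(model(x + delta.detach()), y)
    loss_def.backward()
    opt.step()
    return loss_def.item()

if __name__ == "__main__":
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
    print(train_step(model, x, y, opt))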
\ No newline at end of file diff --git a/data/2024/iclr/Adversarial Training on Purification (AToP): Advancing Both Robustness and Generalization b/data/2024/iclr/Adversarial Training on Purification (AToP): Advancing Both Robustness and Generalization new file mode 100644 index 0000000000..d15781b556 --- /dev/null +++ b/data/2024/iclr/Adversarial Training on Purification (AToP): Advancing Both Robustness and Generalization @@ -0,0 +1 @@ +Deep neural networks are known to be vulnerable to well-designed adversarial attacks. The most successful defense technique based on adversarial training (AT) can achieve optimal robustness against particular attacks but cannot generalize well to unseen attacks. Another effective defense technique based on adversarial purification (AP) can enhance generalization but cannot achieve optimal robustness. Meanwhile, both methods share one common limitation of degraded standard accuracy. To mitigate these issues, we propose a novel pipeline to acquire a robust purifier model, named Adversarial Training on Purification (AToP), which comprises two components: perturbation destruction by random transforms (RT) and fine-tuning (FT) of the purifier model with an adversarial loss. RT is essential to avoid overlearning known attacks, which enables robustness to generalize to unseen attacks, and FT is essential for improving robustness. To evaluate our method in an efficient and scalable way, we conduct extensive experiments on CIFAR-10, CIFAR-100, and ImageNette to demonstrate that our method achieves optimal robustness and exhibits generalization ability against unseen attacks. \ No newline at end of file diff --git a/data/2024/iclr/AffineQuant: Affine Transformation Quantization for Large Language Models b/data/2024/iclr/AffineQuant: Affine Transformation Quantization for Large Language Models new file mode 100644 index 0000000000..8908107f45 --- /dev/null +++ b/data/2024/iclr/AffineQuant: Affine Transformation Quantization for Large Language Models @@ -0,0 +1 @@ +The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in the development of techniques aimed at compressing and accelerating neural networks. Among these techniques, Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its noteworthy compression efficiency and cost-effectiveness in the context of training. Existing PTQ methods for LLMs limit the optimization scope to scaling transformations between pre- and post-quantization weights. In this paper, we advocate for direct optimization using equivalent affine transformations in PTQ (AffineQuant). This approach extends the optimization scope and thus significantly reduces quantization errors. Additionally, by employing the corresponding inverse matrix, we can ensure equivalence between the pre- and post-quantization outputs of PTQ, thereby maintaining its efficiency and generalization capabilities. To ensure the invertibility of the transformation during optimization, we further introduce a gradual mask optimization method. This method initially focuses on optimizing the diagonal elements and gradually extends to the other elements. Such an approach aligns with the Levy-Desplanques theorem, theoretically ensuring invertibility of the transformation. As a result, significant performance improvements are evident across different LLMs on diverse datasets. 
To illustrate, we attain a C4 perplexity of 15.76 (2.26 lower vs 18.02 in OmniQuant) on the LLaMA2-7B model under W4A4 quantization without overhead. On zero-shot tasks, AffineQuant achieves an average accuracy of 58.61 (1.98 higher vs 56.63 in OmniQuant) when using 4/4-bit quantization for LLaMA-30B, setting a new state-of-the-art benchmark for PTQ in LLMs. \ No newline at end of file diff --git a/data/2024/iclr/AgentBench: Evaluating LLMs as Agents b/data/2024/iclr/AgentBench: Evaluating LLMs as Agents new file mode 100644 index 0000000000..82f7bdcc37 --- /dev/null +++ b/data/2024/iclr/AgentBench: Evaluating LLMs as Agents @@ -0,0 +1 @@ +Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant disparity in performance between them and OSS competitors. We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Training on code and high-quality multi-turn alignment data could improve agent performance. Datasets, environments, and an integrated evaluation package for AgentBench are released at \url{https://github.com/THUDM/AgentBench}. \ No newline at end of file diff --git a/data/2024/iclr/AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors b/data/2024/iclr/AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors new file mode 100644 index 0000000000..a468f6347f --- /dev/null +++ b/data/2024/iclr/AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors @@ -0,0 +1 @@ +Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose AgentVerse, a multi-agent framework that can collaboratively and dynamically adjust its composition as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that the AgentVerse framework can effectively deploy multi-agent groups that outperform a single agent. Furthermore, we delve into the emergence of social behaviors among individual agents within a group during collaborative task accomplishment. In view of these behaviors, we discuss some possible strategies to leverage positive ones and mitigate negative ones for improving the collaborative potential of multi-agent groups. Our codes for AgentVerse will soon be released at \url{https://github.com/OpenBMB/AgentVerse}. 
\ No newline at end of file diff --git a/data/2024/iclr/AirPhyNet: Harnessing Physics-Guided Neural Networks for Air Quality Prediction b/data/2024/iclr/AirPhyNet: Harnessing Physics-Guided Neural Networks for Air Quality Prediction new file mode 100644 index 0000000000..2b31fef935 --- /dev/null +++ b/data/2024/iclr/AirPhyNet: Harnessing Physics-Guided Neural Networks for Air Quality Prediction @@ -0,0 +1 @@ +Air quality prediction and modelling play a pivotal role in public health and environmental management, helping individuals and authorities make informed decisions. Although traditional data-driven models have shown promise in this domain, their long-term prediction accuracy can be limited, especially in scenarios with sparse or incomplete data, and they often rely on black-box deep learning structures that lack a solid physical foundation, leading to reduced transparency and interpretability in predictions. To address these limitations, this paper presents a novel approach named Physics-guided Neural Network for Air Quality Prediction (AirPhyNet). Specifically, we leverage two well-established physics principles of air particle movement (diffusion and advection) by representing them as differential equation networks. Then, we utilize a graph structure to integrate physics knowledge into a neural network architecture and exploit latent representations to capture spatio-temporal relationships within the air quality data. Experiments on two real-world benchmark datasets demonstrate that AirPhyNet outperforms state-of-the-art models for different testing scenarios including different lead times (24h, 48h, 72h), sparse data, and sudden change prediction, achieving reductions in prediction error of up to 10%. Moreover, a case study further validates that our model captures underlying physical processes of particle movement and generates accurate predictions with real physical meaning. \ No newline at end of file diff --git a/data/2024/iclr/Algorithms for Caching and MTS with reduced number of predictions b/data/2024/iclr/Algorithms for Caching and MTS with reduced number of predictions new file mode 100644 index 0000000000..31434e3171 --- /dev/null +++ b/data/2024/iclr/Algorithms for Caching and MTS with reduced number of predictions @@ -0,0 +1 @@ +ML-augmented algorithms utilize predictions to achieve performance beyond their worst-case bounds. Producing these predictions might be a costly operation -- this motivated Im et al. '22 to introduce the study of algorithms which use predictions parsimoniously. We design parsimonious algorithms for caching and MTS with action predictions, proposed by Antoniadis et al. '20, focusing on the parameters of consistency (performance with perfect predictions) and smoothness (dependence of their performance on the prediction error). Our algorithm for caching is 1-consistent, robust, and its smoothness deteriorates with the decreasing number of available predictions. We propose an algorithm for general MTS whose consistency and smoothness both scale linearly with the decreasing number of predictions. Without the restriction on the number of available predictions, both algorithms match the earlier guarantees achieved by Antoniadis et al. '20.
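As a rough illustration of what "using predictions parsimoniously" can mean in caching, the following toy Python sketch queries a (costly) next-use predictor only on a fraction of requests and falls back to plain LRU otherwise. The predictor interface, the query budget, and the eviction rule are illustrative assumptions and do not reproduce the 1-consistent algorithm analyzed in the abstract above.

from collections import OrderedDict

class ParsimoniousCache:
    """Toy cache that asks the predictor only every `budget`-th miss and
    otherwise falls back to LRU. Illustrative only, not the paper's algorithm."""

    def __init__(self, capacity, predictor, budget=4):
        self.capacity = capacity
        self.predictor = predictor      # maps request index -> predicted next-use time per page
        self.budget = budget
        self.cache = OrderedDict()      # page -> None, ordered by recency
        self.t = 0

    def access(self, page):
        self.t += 1
        if page in self.cache:
            self.cache.move_to_end(page)
            return True                 # hit
        if len(self.cache) >= self.capacity:
            if self.t % self.budget == 0:
                # Spend one prediction: evict the cached page predicted to be needed latest.
                next_use = self.predictor(self.t)
                victim = max(self.cache, key=lambda p: next_use.get(p, float("inf")))
            else:
                victim = next(iter(self.cache))   # LRU fallback, no prediction spent
            self.cache.pop(victim)
        self.cache[page] = None
        return False                    # miss

# Example with a trivial "oracle" predictor over a fixed request sequence.
requests = [1, 2, 3, 1, 4, 2, 5, 1, 2, 3]

def oracle(t):
    # Earliest future index at which each page is requested again.
    return {p: i for i, p in reversed(list(enumerate(requests[t:], start=t)))}

cache = ParsimoniousCache(capacity=3, predictor=oracle, budget=2)
hits = sum(cache.access(p) for p in requests)
print(f"hits: {hits} / {len(requests)}")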
\ No newline at end of file diff --git a/data/2024/iclr/Alice Benchmarks: Connecting Real World Re-Identification with the Synthetic b/data/2024/iclr/Alice Benchmarks: Connecting Real World Re-Identification with the Synthetic new file mode 100644 index 0000000000..373fe5108e --- /dev/null +++ b/data/2024/iclr/Alice Benchmarks: Connecting Real World Re-Identification with the Synthetic @@ -0,0 +1 @@ +For object re-identification (re-ID), learning from synthetic data has become a promising strategy to cheaply acquire large-scale annotated datasets and effective models, with few privacy concerns. Many interesting research problems arise from this strategy, e.g., how to reduce the domain gap between synthetic source and real-world target. To facilitate developing more new approaches in learning from synthetic data, we introduce the Alice benchmarks, large-scale datasets providing benchmarks as well as evaluation protocols to the research community. Within the Alice benchmarks, two object re-ID tasks are offered: person and vehicle re-ID. We collected and annotated two challenging real-world target datasets: AlicePerson and AliceVehicle, captured under various illuminations, image resolutions, etc. As an important feature of our real target, the clusterability of its training set is not manually guaranteed to make it closer to a real domain adaptation test scenario. Correspondingly, we reuse existing PersonX and VehicleX as synthetic source domains. The primary goal is to train models from synthetic data that can work effectively in the real world. In this paper, we detail the settings of Alice benchmarks, provide an analysis of existing commonly-used domain adaptation methods, and discuss some interesting future directions. An online server has been set up for the community to evaluate methods conveniently and fairly. Datasets and the online server details are available at https://sites.google.com/view/alice-benchmarks. \ No newline at end of file diff --git a/data/2024/iclr/Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework b/data/2024/iclr/Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework new file mode 100644 index 0000000000..c7c7015de2 --- /dev/null +++ b/data/2024/iclr/Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework @@ -0,0 +1 @@ +Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (that yield the ground truth), at the expense of imperfect alignments. This binary differentiation of perfect and imperfect alignments falls short of capturing other essential alignment properties that hold significance in other real-world applications. Here we propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play framework}$ for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC with an additional loss term that prioritizes alignments according to a desired property. Our method does not require any intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation between both perfect and imperfect alignments. 
We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report an improvement of up to 570ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on a scale of data as large as ours. Notably, our method can be implemented using only a few lines of code, and can be extended to other alignment-free loss functions and to domains other than ASR. \ No newline at end of file diff --git a/data/2024/iclr/AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model b/data/2024/iclr/AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model new file mode 100644 index 0000000000..6d992b85bd --- /dev/null +++ b/data/2024/iclr/AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model @@ -0,0 +1 @@ +Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a novel framework that leverages RL from Human Feedback (RLHF) to quantify human preferences, covering abstractness, and utilizes them to guide diffusion planning for zero-shot behavior customizing, covering mutability. AlignDiff can accurately match user-customized behaviors and efficiently switch from one to another. To build the framework, we first establish the multi-perspective human feedback datasets, which contain comparisons for the attributes of diverse behaviors, and then train an attribute strength model to predict quantified relative strengths. After relabeling behavioral datasets with relative strengths, we proceed to train an attribute-conditioned diffusion model, which serves as a planner with the attribute strength model as a director for preference aligning at the inference phase. We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases the promising potential for human-AI collaboration. More visualization videos are released on https://aligndiff.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/Aligning Relational Learning with Lipschitz Fairness b/data/2024/iclr/Aligning Relational Learning with Lipschitz Fairness new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps b/data/2024/iclr/Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps new file mode 100644 index 0000000000..f1cbaf34cc --- /dev/null +++ b/data/2024/iclr/Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps @@ -0,0 +1 @@ +Diffusion Probabilistic Models (DPM) have shown remarkable efficacy in the synthesis of high-quality images. 
However, their inference process characteristically requires numerous iterative steps, potentially hundreds, which could exacerbate the problem of exposure bias due to the training and inference discrepancy. Previous work has attempted to mitigate this issue by perturbing inputs during training, which consequently mandates the retraining of the DPM. In this work, we conduct a systematic study of exposure bias in DPM and, intriguingly, we find that the exposure bias could be alleviated with a novel sampling method that we propose, without retraining the model. We empirically and theoretically show that, during inference, for each backward time step $t$ and corresponding state $\hat{x}_t$, there might exist another time step $t_s$ which exhibits superior coupling with $\hat{x}_t$. Based on this finding, we introduce a sampling method named Time-Shift Sampler. Our framework can be seamlessly integrated into existing sampling algorithms, such as DDPM, DDIM and other high-order solvers, incurring only minimal additional computation. Experimental results show our method brings significant and consistent improvements in FID scores on different datasets and sampling methods. For example, integrating the Time-Shift Sampler into F-PNDM yields an FID of 3.88, a 44.49\% improvement compared to F-PNDM, on CIFAR-10 with 10 sampling steps, which is more performant than the vanilla DDIM with 100 sampling steps. Our code is available at https://github.com/Mingxiao-Li/TS-DPM. \ No newline at end of file diff --git a/data/2024/iclr/AlpaGasus: Training a Better Alpaca with Fewer Data b/data/2024/iclr/AlpaGasus: Training a Better Alpaca with Fewer Data new file mode 100644 index 0000000000..930fc0676f --- /dev/null +++ b/data/2024/iclr/AlpaGasus: Training a Better Alpaca with Fewer Data @@ -0,0 +1 @@ +Large language models (LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human evaluation. Its 13B variant matches $>90\%$ of the performance of its teacher LLM (i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. 
Our project page is available at: https://lichang-chen.github.io/AlpaGasus/ \ No newline at end of file diff --git a/data/2024/iclr/Alt-Text with Context: Improving Accessibility for Images on Twitter b/data/2024/iclr/Alt-Text with Context: Improving Accessibility for Images on Twitter new file mode 100644 index 0000000000..12fcb2aa32 --- /dev/null +++ b/data/2024/iclr/Alt-Text with Context: Improving Accessibility for Images on Twitter @@ -0,0 +1 @@ +In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. More than just a special case of image captioning, alt-text is both more literally descriptive and context-specific. Also critically, images posted to Twitter are often accompanied by user-written text that, despite not necessarily describing the image, may provide useful context that, if properly leveraged, can be informative. We address this task with a multimodal model that conditions on both textual information from the associated social media post as well as visual signal from the image, and demonstrate that the utility of these two information sources stacks. We put forward a new dataset of 371k images paired with alt-text and tweets scraped from Twitter and evaluate on it across a variety of automated metrics as well as human evaluation. We show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work, by more than 2x on BLEU@4. \ No newline at end of file diff --git a/data/2024/iclr/Amortized Network Intervention to Steer the Excitatory Point Processes b/data/2024/iclr/Amortized Network Intervention to Steer the Excitatory Point Processes new file mode 100644 index 0000000000..e17c51d048 --- /dev/null +++ b/data/2024/iclr/Amortized Network Intervention to Steer the Excitatory Point Processes @@ -0,0 +1 @@ +Excitatory point processes (i.e., event flows) occurring over dynamic graphs (i.e., evolving topologies) provide a fine-grained model to capture how discrete events may spread over time and space. How to effectively steer the event flows by modifying the dynamic graph structures presents an interesting problem, motivated by applications ranging from curbing the spread of infectious diseases by strategically locking down cities to mitigating traffic congestion via traffic light optimization. To address the intricacies of planning and overcome the high dimensionality inherent to such decision-making problems, we design an Amortized Network Interventions (ANI) framework, allowing for the pooling of optimal policies from history and other contexts while ensuring a permutation equivalent property. This property enables efficient knowledge transfer and sharing across diverse contexts. Each task is solved by H-step lookahead model-based reinforcement learning, where neural ODEs are introduced to model the dynamics of the excitatory point processes. Instead of simulating rollouts from the dynamics model, we derive an analytical mean-field approximation for the event flows given the dynamics, making online planning more efficiently solvable. We empirically illustrate that this ANI approach substantially enhances policy learning for unseen dynamics and exhibits promising outcomes in steering event flows through network intervention using synthetic and real COVID datasets. 
\ No newline at end of file diff --git a/data/2024/iclr/AmortizedPeriod: Attention-based Amortized Inference for Periodicity Identification b/data/2024/iclr/AmortizedPeriod: Attention-based Amortized Inference for Periodicity Identification new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Amortizing intractable inference in large language models b/data/2024/iclr/Amortizing intractable inference in large language models new file mode 100644 index 0000000000..3b229e63bd --- /dev/null +++ b/data/2024/iclr/Amortizing intractable inference in large language models @@ -0,0 +1 @@ +Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use. \ No newline at end of file diff --git a/data/2024/iclr/An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression b/data/2024/iclr/An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression new file mode 100644 index 0000000000..4f9e79e09a --- /dev/null +++ b/data/2024/iclr/An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression @@ -0,0 +1 @@ +We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an"agnostic"view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or the target is outside the RKHS. We analyze the cost of overfitting under a Gaussian universality ansatz using recently derived (non-rigorous) risk estimates in terms of the task eigenstructure. Our analysis provides a more refined characterization of benign, tempered and catastrophic overfitting (cf. Mallinar et al. 2022). 
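The "cost of overfitting" defined in the preceding abstract, the ratio between the test error of the (near-)interpolating ridgeless solution and that of the optimally-tuned ridge solution, can be estimated numerically. Below is a small NumPy sketch on synthetic noisy data; the RBF kernel, sample sizes, and regularization grid are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_test_mse(Xtr, ytr, Xte, yte, lam):
    # Kernel ridge regression: alpha = (K + lam I)^{-1} y, prediction = K_te alpha.
    K = rbf(Xtr, Xtr)
    alpha = np.linalg.solve(K + lam * np.eye(len(Xtr)), ytr)
    pred = rbf(Xte, Xtr) @ alpha
    return np.mean((pred - yte) ** 2)

# Noisy synthetic target: y = sin(3x) + noise.
n, sigma = 200, 0.5
Xtr = rng.uniform(-1, 1, (n, 1))
ytr = np.sin(3 * Xtr[:, 0]) + sigma * rng.standard_normal(n)
Xte = rng.uniform(-1, 1, (2000, 1))
yte = np.sin(3 * Xte[:, 0]) + sigma * rng.standard_normal(2000)

# Ridgeless (near-interpolating, tiny jitter only for numerical stability) vs optimally tuned.
ridgeless = krr_test_mse(Xtr, ytr, Xte, yte, lam=1e-10)
tuned = min(krr_test_mse(Xtr, ytr, Xte, yte, lam=l) for l in np.logspace(-6, 2, 30))
print("cost of overfitting:", ridgeless / tuned)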
\ No newline at end of file diff --git a/data/2024/iclr/An Analytical Solution to Gauss-Newton Loss for Direct Image Alignment b/data/2024/iclr/An Analytical Solution to Gauss-Newton Loss for Direct Image Alignment new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization b/data/2024/iclr/An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization new file mode 100644 index 0000000000..4adb8d2ae7 --- /dev/null +++ b/data/2024/iclr/An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization @@ -0,0 +1 @@ +Recently, diffusion models have achieved remarkable success in generation tasks, including image and audio generation. However, like other generative models, diffusion models are prone to privacy issues. In this paper, we propose an efficient query-based membership inference attack (MIA), namely the Proximal Initialization Attack (PIA), which utilizes the ground-truth trajectory obtained from $\epsilon$ initialized at $t=0$, together with the predicted point, to infer membership. Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models. Moreover, previous works on the privacy of diffusion models have focused on vision tasks without considering audio tasks. Therefore, we also explore the robustness of diffusion models to MIA in the text-to-speech (TTS) task, which is an audio generation task. To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the TTS task. Experimental results indicate that models with mel-spectrogram (image-like) output are vulnerable to MIA, while models with audio output are relatively robust to MIA. Code is available at \url{https://github.com/kong13661/PIA}. \ No newline at end of file diff --git a/data/2024/iclr/An Efficient Tester-Learner for Halfspaces b/data/2024/iclr/An Efficient Tester-Learner for Halfspaces new file mode 100644 index 0000000000..0f813dbb6f --- /dev/null +++ b/data/2024/iclr/An Efficient Tester-Learner for Halfspaces @@ -0,0 +1 @@ +We give the first efficient algorithm for learning halfspaces in the testable learning model recently defined by Rubinfeld and Vasilyan (2023). In this model, a learner certifies that the accuracy of its output hypothesis is near optimal whenever the training set passes an associated test, and training sets drawn from some target distribution -- e.g., the Gaussian -- must pass the test. This model is more challenging than distribution-specific agnostic or Massart noise models where the learner is allowed to fail arbitrarily if the distributional assumption does not hold. We consider the setting where the target distribution is Gaussian (or more generally any strongly log-concave distribution) in $d$ dimensions and the noise model is either Massart or adversarial (agnostic). For Massart noise, our tester-learner runs in polynomial time and outputs a hypothesis with (information-theoretically optimal) error $\mathsf{opt} + \epsilon$ for any strongly log-concave target distribution. For adversarial noise, our tester-learner obtains error $O(\mathsf{opt}) + \epsilon$ in polynomial time when the target distribution is Gaussian; for strongly log-concave distributions, we obtain $\tilde{O}(\mathsf{opt}) + \epsilon$ in quasipolynomial time. 
Prior work on testable learning ignores the labels in the training set and checks that the empirical moments of the covariates are close to the moments of the base distribution. Here we develop new tests of independent interest that make critical use of the labels and combine them with the moment-matching approach of Gollakota et al. (2023). This enables us to simulate a variant of the algorithm of Diakonikolas et al. (2020) for learning noisy halfspaces using nonconvex SGD but in the testable learning setting. \ No newline at end of file diff --git a/data/2024/iclr/An Emulator for Fine-tuning Large Language Models using Small Language Models b/data/2024/iclr/An Emulator for Fine-tuning Large Language Models using Small Language Models new file mode 100644 index 0000000000..ae6d63f7d7 --- /dev/null +++ b/data/2024/iclr/An Emulator for Fine-tuning Large Language Models using Small Language Models @@ -0,0 +1 @@ +Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a pre-training stage that uses a very large, diverse dataset of text and a fine-tuning (sometimes, 'alignment') stage that uses targeted examples or other specifications of desired behaviors. While it has been hypothesized that knowledge and skills come from pre-training, and fine-tuning mostly filters this knowledge and skillset, this intuition has not been extensively tested. To aid in doing so, we introduce a novel technique for decoupling the knowledge and skills gained in these two stages, enabling a direct answer to the question, "What would happen if we combined the knowledge learned by a large model during pre-training with the knowledge learned by a small model during fine-tuning (or vice versa)?" Using an RL-based framework derived from recent developments in learning from human preferences, we introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates (or 'emulates') the result of pre-training and fine-tuning at different scales. Our experiments with EFT show that scaling up fine-tuning tends to improve helpfulness, while scaling up pre-training tends to improve factuality. Beyond decoupling scale, we show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models, essentially emulating the result of fine-tuning the large pre-trained model. Up-scaling consistently improves helpfulness and factuality of instruction-following models in the Llama, Llama-2, and Falcon families, without additional hyperparameters or training. \ No newline at end of file diff --git a/data/2024/iclr/An Extensible Framework for Open Heterogeneous Collaborative Perception b/data/2024/iclr/An Extensible Framework for Open Heterogeneous Collaborative Perception new file mode 100644 index 0000000000..2c4cf1a747 --- /dev/null +++ b/data/2024/iclr/An Extensible Framework for Open Heterogeneous Collaborative Perception @@ -0,0 +1 @@ +Collaborative perception aims to mitigate the limitations of single-agent perception, such as occlusions, by facilitating data exchange among multiple agents. However, most current works consider a homogeneous scenario where all agents use identical sensors and perception models. 
In reality, heterogeneous agent types may continually emerge and inevitably face a domain gap when collaborating with existing agents. In this paper, we introduce a new open heterogeneous problem: how to accommodate continually emerging new heterogeneous agent types into collaborative perception, while ensuring high perception performance and low integration cost? To address this problem, we propose HEterogeneous ALliance (HEAL), a novel extensible collaborative perception framework. HEAL first establishes a unified feature space with initial agents via a novel multi-scale foreground-aware Pyramid Fusion network. When heterogeneous new agents emerge with previously unseen modalities or models, we align them to the established unified space with an innovative backward alignment. This step only involves individual training on the new agent type, thus presenting extremely low training costs and high extensibility. To enrich agents' data heterogeneity, we bring OPV2V-H, a new large-scale dataset with more diverse sensor types. Extensive experiments on OPV2V-H and DAIR-V2X datasets show that HEAL surpasses SOTA methods in performance while reducing the training parameters by 91.5% when integrating 3 new agent types. We further implement a comprehensive codebase at: https://github.com/yifanlu0227/HEAL \ No newline at end of file diff --git a/data/2024/iclr/An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models b/data/2024/iclr/An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/An Intuitive Multi-Frequency Feature Representation for SO(3)-Equivariant Networks b/data/2024/iclr/An Intuitive Multi-Frequency Feature Representation for SO(3)-Equivariant Networks new file mode 100644 index 0000000000..ef594ab245 --- /dev/null +++ b/data/2024/iclr/An Intuitive Multi-Frequency Feature Representation for SO(3)-Equivariant Networks @@ -0,0 +1 @@ +The usage of 3D vision algorithms, such as shape reconstruction, remains limited because they require inputs to be at a fixed canonical rotation. Recently, a simple equivariant network, Vector Neuron (VN) has been proposed that can be easily used with the state-of-the-art 3D neural network (NN) architectures. However, its performance is limited because it is designed to use only three-dimensional features, which is insufficient to capture the details present in 3D data. In this paper, we introduce an equivariant feature representation for mapping a 3D point to a high-dimensional feature space. Our feature can discern multiple frequencies present in 3D data, which is the key to designing an expressive feature for 3D vision tasks. Our representation can be used as an input to VNs, and the results demonstrate that with our feature representation, VN captures more details, overcoming the limitation raised in its original paper. 
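For readers unfamiliar with Vector Neurons, the following toy NumPy check illustrates the SO(3)-equivariance constraint such features satisfy: a VN-style linear layer mixes a list of 3-D vector features with scalar weights, so rotating the input and rotating the output commute. This only demonstrates the basic three-dimensional case, not the higher-dimensional multi-frequency representation proposed in the abstract above.

import numpy as np

rng = np.random.default_rng(0)

def vn_linear(V, W):
    # V: (N, 3) list of 3-D vector features; W: (M, N) learned mixing weights.
    # The rotation acts on the right, so the layer is SO(3)-equivariant.
    return W @ V

def random_rotation():
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q * np.sign(np.linalg.det(Q))   # ensure det = +1

V = rng.standard_normal((8, 3))
W = rng.standard_normal((16, 8))
R = random_rotation()

# Equivariance check: rotating inputs first, or outputs after, gives the same result.
out_of_rotated_input = vn_linear(V @ R.T, W)
rotated_output = vn_linear(V, W) @ R.T
print(np.allclose(out_of_rotated_input, rotated_output))   # True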
\ No newline at end of file diff --git a/data/2024/iclr/An Investigation of Representation and Allocation Harms in Contrastive Learning b/data/2024/iclr/An Investigation of Representation and Allocation Harms in Contrastive Learning new file mode 100644 index 0000000000..751554a4ce --- /dev/null +++ b/data/2024/iclr/An Investigation of Representation and Allocation Harms in Contrastive Learning @@ -0,0 +1 @@ +The effect of underrepresentation on the performance of minority groups is known to be a serious problem in supervised learning settings; however, it has been underexplored so far in the context of self-supervised learning (SSL). In this paper, we demonstrate that contrastive learning (CL), a popular variant of SSL, tends to collapse representations of minority groups with certain majority groups. We refer to this phenomenon as representation harm and demonstrate it on image and text datasets using the corresponding popular CL methods. Furthermore, our causal mediation analysis of allocation harm on a downstream classification task reveals that representation harm is partly responsible for it, thus emphasizing the importance of studying and mitigating representation harm. Finally, we provide a theoretical explanation for representation harm using a stochastic block model that leads to a representational neural collapse in a contrastive learning setting. \ No newline at end of file diff --git a/data/2024/iclr/An Unforgeable Publicly Verifiable Watermark for Large Language Models b/data/2024/iclr/An Unforgeable Publicly Verifiable Watermark for Large Language Models new file mode 100644 index 0000000000..14e03e3178 --- /dev/null +++ b/data/2024/iclr/An Unforgeable Publicly Verifiable Watermark for Large Language Models @@ -0,0 +1 @@ +Recently, text watermarking algorithms for large language models (LLMs) have been proposed to mitigate the potential harms of text generated by LLMs, including fake news and copyright issues. However, current watermark detection algorithms require the secret key used in the watermark generation process, making them susceptible to security breaches and counterfeiting during public detection. To address this limitation, we propose an unforgeable publicly verifiable watermark algorithm named UPV that uses two different neural networks for watermark generation and detection, instead of using the same key at both stages. Meanwhile, the token embedding parameters are shared between the generation and detection networks, which makes the detection network achieve a high accuracy very efficiently. Experiments demonstrate that our algorithm attains high detection accuracy and computational efficiency through neural networks. Subsequent analysis confirms the high complexity involved in forging the watermark from the detection network. Our code is available at \href{https://github.com/THU-BPM/unforgeable_watermark}{https://github.com/THU-BPM/unforgeable\_watermark}. Additionally, our algorithm could also be accessed through MarkLLM \citep{pan2024markllm} \footnote{https://github.com/THU-BPM/MarkLLM}. 
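The following is a minimal, untrained PyTorch sketch of the architectural idea in the preceding abstract: separate generation and detection networks that share a single token-embedding table, so that detection does not require the secret key used during generation. The layer sizes, mean-pooling, and soft green-list logit bias are illustrative assumptions rather than the paper's exact UPV design.

import torch
import torch.nn as nn

class WatermarkNets(nn.Module):
    """Toy sketch: generation and detection networks sharing one token-embedding table."""

    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # shared parameters
        self.gen_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, vocab_size))
        self.det_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def watermark_bias(self, prefix_ids, delta=2.0):
        # Generation side: bias a "green" subset of the vocabulary given the prefix.
        ctx = self.embed(prefix_ids).mean(dim=0)
        green = torch.sigmoid(self.gen_net(ctx))     # soft green-list in [0, 1]
        return delta * green                         # added to the language model logits

    def detect(self, token_ids):
        # Detection side: score a whole text for the presence of the watermark.
        pooled = self.embed(token_ids).mean(dim=0)
        return torch.sigmoid(self.det_net(pooled))   # watermark probability

nets = WatermarkNets()
prefix = torch.randint(0, 1000, (12,))
lm_logits = torch.randn(1000)
biased_logits = lm_logits + nets.watermark_bias(prefix)   # sample the next token from these
print(nets.detect(torch.randint(0, 1000, (50,))).item())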
\ No newline at end of file diff --git a/data/2024/iclr/An interpretable error correction method for enhancing code-to-code translation b/data/2024/iclr/An interpretable error correction method for enhancing code-to-code translation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/An operator preconditioning perspective on training in physics-informed machine learning b/data/2024/iclr/An operator preconditioning perspective on training in physics-informed machine learning new file mode 100644 index 0000000000..9439f729c2 --- /dev/null +++ b/data/2024/iclr/An operator preconditioning perspective on training in physics-informed machine learning @@ -0,0 +1 @@ +In this paper, we investigate the behavior of gradient descent algorithms in physics-informed machine learning methods like PINNs, which minimize residuals connected to partial differential equations (PDEs). Our key result is that the difficulty in training these models is closely related to the conditioning of a specific differential operator. This operator, in turn, is associated to the Hermitian square of the differential operator of the underlying PDE. If this operator is ill-conditioned, it results in slow or infeasible training. Therefore, preconditioning this operator is crucial. We employ both rigorous mathematical analysis and empirical evaluations to investigate various strategies, explaining how they better condition this critical operator, and consequently improve training. \ No newline at end of file diff --git a/data/2024/iclr/Analysis of Learning a Flow-based Generative Model from Limited Sample Complexity b/data/2024/iclr/Analysis of Learning a Flow-based Generative Model from Limited Sample Complexity new file mode 100644 index 0000000000..7110b8a9f5 --- /dev/null +++ b/data/2024/iclr/Analysis of Learning a Flow-based Generative Model from Limited Sample Complexity @@ -0,0 +1 @@ +We study the problem of training a flow-based generative model, parametrized by a two-layer autoencoder, to sample from a high-dimensional Gaussian mixture. We provide a sharp end-to-end analysis of the problem. First, we provide a tight closed-form characterization of the learnt velocity field, when parametrized by a shallow denoising auto-encoder trained on a finite number $n$ of samples from the target distribution. Building on this analysis, we provide a sharp description of the corresponding generative flow, which pushes the base Gaussian density forward to an approximation of the target density. In particular, we provide closed-form formulae for the distance between the mean of the generated mixture and the mean of the target mixture, which we show decays as $\Theta_n(\frac{1}{n})$. Finally, this rate is shown to be in fact Bayes-optimal. \ No newline at end of file diff --git a/data/2024/iclr/Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps b/data/2024/iclr/Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps new file mode 100644 index 0000000000..6ecac81228 --- /dev/null +++ b/data/2024/iclr/Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps @@ -0,0 +1 @@ +Transformers are ubiquitous in wide tasks. Interpreting their internals is a pivotal goal. Nevertheless, their particular components, feed-forward (FF) blocks, have typically been less analyzed despite their substantial parameter amounts. 
We analyze the input contextualization effects of FF blocks by rendering them in the attention maps as a human-friendly visualization scheme. Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions. In addition, FF and its surrounding components tend to cancel out each other's effects, suggesting potential redundancy in the processing of the Transformer layer. \ No newline at end of file diff --git a/data/2024/iclr/Analyzing and Improving Optimal-Transport-based Adversarial Networks b/data/2024/iclr/Analyzing and Improving Optimal-Transport-based Adversarial Networks new file mode 100644 index 0000000000..9573a13996 --- /dev/null +++ b/data/2024/iclr/Analyzing and Improving Optimal-Transport-based Adversarial Networks @@ -0,0 +1 @@ +Optimal Transport (OT) problem aims to find a transport plan that bridges two distributions while minimizing a given cost function. OT theory has been widely utilized in generative modeling. In the beginning, OT distance has been used as a measure for assessing the distance between data and generated distributions. Recently, OT transport map between data and prior distributions has been utilized as a generative model. These OT-based generative models share a similar adversarial training objective. In this paper, we begin by unifying these OT-based adversarial methods within a single framework. Then, we elucidate the role of each component in training dynamics through a comprehensive analysis of this unified framework. Moreover, we suggest a simple but novel method that improves the previously best-performing OT-based model. Intuitively, our approach conducts a gradual refinement of the generated distribution, progressively aligning it with the data distribution. Our approach achieves a FID score of 2.51 on CIFAR-10 and 5.99 on CelebA-HQ-256, outperforming unified OT-based adversarial approaches. \ No newline at end of file diff --git a/data/2024/iclr/Analyzing and Mitigating Object Hallucination in Large Vision-Language Models b/data/2024/iclr/Analyzing and Mitigating Object Hallucination in Large Vision-Language Models new file mode 100644 index 0000000000..4b92f4e610 --- /dev/null +++ b/data/2024/iclr/Analyzing and Mitigating Object Hallucination in Large Vision-Language Models @@ -0,0 +1 @@ +Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. 
In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE. \ No newline at end of file diff --git a/data/2024/iclr/AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning b/data/2024/iclr/AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning new file mode 100644 index 0000000000..573e79c9a5 --- /dev/null +++ b/data/2024/iclr/AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning @@ -0,0 +1 @@ +With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. Codes and pre-trained weights are available at https://github.com/guoyww/AnimateDiff. \ No newline at end of file diff --git a/data/2024/iclr/Annealing Self-Distillation Rectification Improves Adversarial Training b/data/2024/iclr/Annealing Self-Distillation Rectification Improves Adversarial Training new file mode 100644 index 0000000000..7816901c8f --- /dev/null +++ b/data/2024/iclr/Annealing Self-Distillation Rectification Improves Adversarial Training @@ -0,0 +1 @@ +In standard adversarial training, models are optimized to fit one-hot labels within allowable adversarial perturbation budgets. However, the ignorance of underlying distribution shifts brought by perturbations causes the problem of robust overfitting. To address this issue and enhance adversarial robustness, we analyze the characteristics of robust models and identify that robust models tend to produce smoother and well-calibrated outputs. Based on the observation, we propose a simple yet effective method, Annealing Self-Distillation Rectification (ADR), which generates soft labels as a better guidance mechanism that accurately reflects the distribution shift under attack during adversarial training. By utilizing ADR, we can obtain rectified distributions that significantly improve model robustness without the need for pre-trained models or extensive extra computation. 
Moreover, our method facilitates seamless plug-and-play integration with other adversarial training techniques by replacing the hard labels in their objectives. We demonstrate the efficacy of ADR through extensive experiments and strong performances across datasets. \ No newline at end of file diff --git a/data/2024/iclr/AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection b/data/2024/iclr/AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection new file mode 100644 index 0000000000..e951b1c1d2 --- /dev/null +++ b/data/2024/iclr/AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection @@ -0,0 +1 @@ +Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is a crucial task when training data is not accessible due to various concerns, \eg, data privacy, yet it is challenging since the models need to generalize to anomalies across different domains where the appearance of foreground objects, abnormal regions, and background features, such as defects/tumors on different products/organs, can vary significantly. Recently large pre-trained vision-language models (VLMs), such as CLIP, have demonstrated strong zero-shot recognition ability in various vision tasks, including anomaly detection. However, their ZSAD performance is weak since the VLMs focus more on modeling the class semantics of the foreground objects rather than the abnormality/normality in the images. In this paper we introduce a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across different domains. The key insight of AnomalyCLIP is to learn object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects. This allows our model to focus on the abnormal image regions rather than the object semantics, enabling generalized normality and abnormality recognition on diverse types of objects. Large-scale experiments on 17 real-world anomaly detection datasets show that AnomalyCLIP achieves superior zero-shot performance of detecting and segmenting anomalies in datasets of highly diverse class semantics from various defect inspection and medical imaging domains. Code will be made available at https://github.com/zqhang/AnomalyCLIP. \ No newline at end of file diff --git a/data/2024/iclr/AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? b/data/2024/iclr/AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? new file mode 100644 index 0000000000..53046ff9cb --- /dev/null +++ b/data/2024/iclr/AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? @@ -0,0 +1 @@ +Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal. 
We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives. They can help provide prior knowledge about possible next actions and infer the goal given the observed part of a procedure, respectively. To leverage the LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure by chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all above benchmarks, and can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction via qualitative analysis. Code and model will be released at https://brown-palm.github.io/AntGPT \ No newline at end of file diff --git a/data/2024/iclr/AnyText: Multilingual Visual Text Generation and Editing b/data/2024/iclr/AnyText: Multilingual Visual Text Generation and Editing new file mode 100644 index 0000000000..46f21b8873 --- /dev/null +++ b/data/2024/iclr/AnyText: Multilingual Visual Text Generation and Editing @@ -0,0 +1 @@ +Diffusion-model-based text-to-image generation has achieved impressive results recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it is still possible to give the show away when focusing on the text area in the generated image. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed a text-control diffusion loss and a text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text image dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on the AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality. Our project will be open-sourced at https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.
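The AnyText abstract above trains with a text-control diffusion loss plus a text perceptual loss computed from OCR features. As a rough illustration of that kind of composite objective, here is a minimal PyTorch-style sketch; the tiny OCR encoder, the random tensors, and the weight `lambda_text` are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (illustrative only): a denoising loss combined with an OCR-feature
# "text perceptual" term, loosely in the spirit of the AnyText training objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyOCREncoder(nn.Module):
    """Stand-in for an OCR feature extractor applied to text regions."""
    def __init__(self, dim=16, feat=8):
        super().__init__()
        self.net = nn.Linear(dim, feat)

    def forward(self, text_region):
        return self.net(text_region)

def combined_loss(pred_noise, true_noise, ocr, recon_text_region, gt_text_region, lambda_text=0.01):
    l_diffusion = F.mse_loss(pred_noise, true_noise)                  # denoising objective
    l_text = F.mse_loss(ocr(recon_text_region), ocr(gt_text_region))  # perceptual term on OCR features
    return l_diffusion + lambda_text * l_text

if __name__ == "__main__":
    ocr = TinyOCREncoder()
    pred, true = torch.randn(4, 16), torch.randn(4, 16)
    recon, gt = torch.randn(4, 16), torch.randn(4, 16)
    print(float(combined_loss(pred, true, ocr, recon, gt)))
```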
\ No newline at end of file diff --git a/data/2024/iclr/Approximating Nash Equilibria in Normal-Form Games via Stochastic Optimization b/data/2024/iclr/Approximating Nash Equilibria in Normal-Form Games via Stochastic Optimization new file mode 100644 index 0000000000..e48b69b067 --- /dev/null +++ b/data/2024/iclr/Approximating Nash Equilibria in Normal-Form Games via Stochastic Optimization @@ -0,0 +1 @@ +We propose the first loss function for approximate Nash equilibria of normal-form games that is amenable to unbiased Monte Carlo estimation. This construction allows us to deploy standard non-convex stochastic optimization techniques for approximating Nash equilibria, resulting in novel algorithms with provable guarantees. We complement our theoretical analysis with experiments demonstrating that stochastic gradient descent can outperform previous state-of-the-art approaches. \ No newline at end of file diff --git a/data/2024/iclr/ArchLock: Locking DNN Transferability at the Architecture Level with a Zero-Cost Binary Predictor b/data/2024/iclr/ArchLock: Locking DNN Transferability at the Architecture Level with a Zero-Cost Binary Predictor new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Are Bert Family Good Instruction Followers? A Study on Their Potential And Limitations b/data/2024/iclr/Are Bert Family Good Instruction Followers? A Study on Their Potential And Limitations new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Are Human-generated Demonstrations Necessary for In-context Learning? b/data/2024/iclr/Are Human-generated Demonstrations Necessary for In-context Learning? new file mode 100644 index 0000000000..dadbb7d86e --- /dev/null +++ b/data/2024/iclr/Are Human-generated Demonstrations Necessary for In-context Learning? @@ -0,0 +1 @@ +Despite the promising few-shot ability of large language models (LLMs), the standard paradigm of In-context Learning (ICL) suffers from two disadvantages: susceptibility to the selected demonstrations and the intricacy of generating them. In this paper, we raise the fundamental question of whether human-generated demonstrations are necessary for ICL. To answer this question, we propose the self-contemplation prompting strategy (SEC), a paradigm free from human-crafted demonstrations. The key point of SEC is that, instead of using hand-crafted examples as demonstrations in ICL, SEC asks LLMs to first create demonstrations on their own, based on which the final output is generated. SEC is a flexible framework and can be adapted to both vanilla ICL and chain-of-thought (CoT) prompting, but with greater ease, as the manual generation of both examples and rationales can be skipped. Extensive experiments on arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy and achieves comparable results to ICL with hand-crafted demonstrations. This demonstrates that, for many tasks, contemporary LLMs possess a sufficient level of competence to rely exclusively on their own capacity for decision making, removing the need for external training data. Code is available at https://github.com/ruili33/SEC. \ No newline at end of file diff --git a/data/2024/iclr/Are Models Biased on Text without Gender-related Language? b/data/2024/iclr/Are Models Biased on Text without Gender-related Language?
new file mode 100644 index 0000000000..0313db3616 --- /dev/null +++ b/data/2024/iclr/Are Models Biased on Text without Gender-related Language? @@ -0,0 +1 @@ +Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations that are present in the training data. In this paper, we focus on bias where the effect from training data is unclear, and instead address the question: Do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine whether a sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at https://ucinlp.github.io/unstereo-eval. \ No newline at end of file diff --git a/data/2024/iclr/Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? b/data/2024/iclr/Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? new file mode 100644 index 0000000000..a295e19407 --- /dev/null +++ b/data/2024/iclr/Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? @@ -0,0 +1 @@ +Existing analyses of the expressive capacity of Transformer models have required excessively deep layers for data memorization, leading to a discrepancy with the Transformers actually used in practice. This is primarily due to the interpretation of the softmax function as an approximation of the hardmax function. By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence. As a consequence, we show that one-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
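As a small concrete companion to the last abstract above, the following NumPy sketch builds one self-attention layer whose query and key weight matrices are rank-1 and shows that the softmax weighting still mixes the entire input sequence; the dimensions and random weights are arbitrary, and this only illustrates the setting, not the paper's proof.

```python
# Minimal NumPy sketch (illustrative only): single-head self-attention with rank-1 W_Q, W_K.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5, 8, 1                      # sequence length, embedding dim, weight rank
X = rng.normal(size=(n, d))

# Low-rank factorizations W_Q = a_q b_q^T and W_K = a_k b_k^T (rank r = 1).
a_q, b_q = rng.normal(size=(d, r)), rng.normal(size=(d, r))
a_k, b_k = rng.normal(size=(d, r)), rng.normal(size=(d, r))
W_V = rng.normal(size=(d, d))

Q = X @ (a_q @ b_q.T)
K = X @ (a_k @ b_k.T)
V = X @ W_V

scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)   # softmax = Boltzmann-style weighting
out = attn @ V

print(attn.round(3))   # every row spreads attention over the whole sequence
print(out.shape)       # (5, 8): each token's output is a context-dependent mixture
```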
\ No newline at end of file diff --git a/data/2024/iclr/Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition b/data/2024/iclr/Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition new file mode 100644 index 0000000000..6f00eccbd3 --- /dev/null +++ b/data/2024/iclr/Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition @@ -0,0 +1 @@ +The ROC curve is the major tool for assessing not only the performance but also the fairness properties of a similarity scoring function. In order to draw reliable conclusions based on empirical ROC analysis, accurately evaluating the uncertainty level related to statistical versions of the ROC curves of interest is absolutely necessary, especially for applications with considerable societal impact such as Face Recognition. In this article, we prove asymptotic guarantees for empirical ROC curves of similarity functions as well as for by-product metrics useful to assess fairness. We also explain that, because the false acceptance/rejection rates are of the form of U-statistics in the case of similarity scoring, the naive bootstrap approach may jeopardize the assessment procedure. A dedicated recentering technique must be used instead. Beyond the theoretical analysis carried out, various experiments using real face image datasets provide strong empirical evidence of the practical relevance of the methods promoted here, when applied to several ROC-based measures such as popular fairness metrics. \ No newline at end of file diff --git a/data/2024/iclr/Asymptotically Free Sketched Ridge Ensembles: Risks, Cross-Validation, and Tuning b/data/2024/iclr/Asymptotically Free Sketched Ridge Ensembles: Risks, Cross-Validation, and Tuning new file mode 100644 index 0000000000..87749a2b48 --- /dev/null +++ b/data/2024/iclr/Asymptotically Free Sketched Ridge Ensembles: Risks, Cross-Validation, and Tuning @@ -0,0 +1 @@ +We employ random matrix theory to establish consistency of generalized cross validation (GCV) for estimating prediction risks of sketched ridge regression ensembles, enabling efficient and consistent tuning of regularization and sketching parameters. Our results hold for a broad class of asymptotically free sketches under very mild data assumptions. For squared prediction risk, we provide a decomposition into an unsketched equivalent implicit ridge bias and a sketching-based variance, and prove that the risk can be globally optimized by only tuning sketch size in infinite ensembles. For general subquadratic prediction risk functionals, we extend GCV to construct consistent risk estimators, and thereby obtain distributional convergence of the GCV-corrected predictions in Wasserstein-2 metric. This in particular allows construction of prediction intervals with asymptotically correct coverage conditional on the training data. We also propose an"ensemble trick"whereby the risk for unsketched ridge regression can be efficiently estimated via GCV using small sketched ridge ensembles. We empirically validate our theoretical results using both synthetic and real large-scale datasets with practical sketches including CountSketch and subsampled randomized discrete cosine transforms. \ No newline at end of file diff --git a/data/2024/iclr/At Which Training Stage Does Code Data Help LLMs Reasoning? b/data/2024/iclr/At Which Training Stage Does Code Data Help LLMs Reasoning? 
new file mode 100644 index 0000000000..2fcb8d7a4d --- /dev/null +++ b/data/2024/iclr/At Which Training Stage Does Code Data Help LLMs Reasoning? @@ -0,0 +1 @@ +Large Language Models (LLMs) have exhibited remarkable reasoning capabilities and become the foundation of language technologies. Inspired by the great success of code data in training LLMs, we naturally wonder at which training stage introducing code data actually helps LLMs with reasoning. To this end, this paper systematically explores the impact of code data on LLMs at different stages. Concretely, we introduce code data at the pre-training stage, at the instruction-tuning stage, and at both, respectively. Then, the reasoning capability of LLMs is comprehensively and fairly evaluated via six reasoning tasks in five domains. We critically analyze the experimental results and provide conclusions with insights. First, pre-training LLMs with a mixture of code and text can significantly enhance LLMs' general reasoning capability almost without negative transfer on other tasks. Besides, at the instruction-tuning stage, code data endows LLMs with task-specific reasoning capability. Moreover, a dynamic mixing strategy of code and text data helps LLMs learn reasoning capability step by step during training. These insights deepen the understanding of LLMs' reasoning ability with respect to applications such as scientific question answering and legal support. The source code and model parameters are released at https://github.com/yingweima2022/CodeLLM. \ No newline at end of file diff --git a/data/2024/iclr/AttEXplore: Attribution for Explanation with model parameters eXploration b/data/2024/iclr/AttEXplore: Attribution for Explanation with model parameters eXploration new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models b/data/2024/iclr/Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models new file mode 100644 index 0000000000..4756c94aff --- /dev/null +++ b/data/2024/iclr/Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models @@ -0,0 +1 @@ +We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text. We propose modeling factual queries as constraint satisfaction problems and use this framework to investigate how the LLM interacts internally with factual constraints. We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations. We curate a suite of 10 datasets containing over 40,000 prompts to study the task of predicting factual errors with the Llama-2 family across all scales (7B, 13B, 70B). We propose SAT Probe, a method that probes attention patterns, can predict factual errors and fine-grained constraint satisfaction, and allows early error identification. The approach and findings take another step towards using the mechanistic understanding of LLMs to enhance their reliability.
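The SAT Probe abstract above reads off how much attention flows to constraint tokens and uses that signal to predict factual errors. A heavily simplified sketch of that recipe, with synthetic attention maps, hypothetical constraint indices, made-up labels, and a plain logistic-regression probe standing in for the paper's method, could look like this:

```python
# Simplified sketch (not the paper's exact SAT Probe): use the attention mass that the
# final position places on "constraint" tokens as features for a linear error probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

def constraint_attention_features(attn, constraint_idx):
    # attn: (layers, heads, seq, seq) attention weights for one prompt.
    # One feature per (layer, head): attention from the last position to constraint tokens.
    return attn[:, :, -1, constraint_idx].sum(axis=-1).ravel()

rng = np.random.default_rng(0)
L, H, S = 4, 8, 16
prompts = [rng.dirichlet(np.ones(S), size=(L, H, S)) for _ in range(200)]  # synthetic attention maps
constraint_idx = [3, 4, 5]                                                 # hypothetical constraint tokens
X = np.stack([constraint_attention_features(a, constraint_idx) for a in prompts])
y = (X.mean(axis=1) > np.median(X.mean(axis=1))).astype(int)               # synthetic labels for the demo

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe train accuracy:", probe.score(X, y))
```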
\ No newline at end of file diff --git a/data/2024/iclr/Attention-Guided Contrastive Role Representations for Multi-agent Reinforcement Learning b/data/2024/iclr/Attention-Guided Contrastive Role Representations for Multi-agent Reinforcement Learning new file mode 100644 index 0000000000..3dd3284c8c --- /dev/null +++ b/data/2024/iclr/Attention-Guided Contrastive Role Representations for Multi-agent Reinforcement Learning @@ -0,0 +1 @@ +Real-world multi-agent tasks usually involve dynamic team composition with the emergence of roles, which should also be a key to efficient cooperation in multi-agent reinforcement learning (MARL). Drawing inspiration from the correlation between roles and agent's behavior patterns, we propose a novel framework of **A**ttention-guided **CO**ntrastive **R**ole representation learning for **M**ARL (**ACORM**) to promote behavior heterogeneity, knowledge transfer, and skillful coordination across agents. First, we introduce mutual information maximization to formalize role representation learning, derive a contrastive learning objective, and concisely approximate the distribution of negative pairs. Second, we leverage an attention mechanism to prompt the global state to attend to learned role representations in value decomposition, implicitly guiding agent coordination in a skillful role space to yield more expressive credit assignment. Experiments on challenging StarCraft II micromanagement and Google research football tasks demonstrate the state-of-the-art performance of our method and its advantages over existing approaches. Our code is available at [https://github.com/NJU-RL/ACORM](https://github.com/NJU-RL/ACORM). \ No newline at end of file diff --git a/data/2024/iclr/Attention-based Iterative Decomposition for Tensor Product Representation b/data/2024/iclr/Attention-based Iterative Decomposition for Tensor Product Representation new file mode 100644 index 0000000000..ba61f50040 --- /dev/null +++ b/data/2024/iclr/Attention-based Iterative Decomposition for Tensor Product Representation @@ -0,0 +1 @@ +In recent research, Tensor Product Representation (TPR) is applied for the systematic generalization task of deep neural networks by learning the compositional structure of data. However, such prior works show limited performance in discovering and representing the symbolic structure from unseen test data because their decomposition to the structural representations was incomplete. In this work, we propose an Attention-based Iterative Decomposition (AID) module designed to enhance the decomposition operations for the structured representations encoded from the sequential input data with TPR. Our AID can be easily adapted to any TPR-based model and provides enhanced systematic decomposition through a competitive attention mechanism between input features and structured representations. In our experiments, AID shows effectiveness by significantly improving the performance of TPR-based prior works on the series of systematic generalization tasks. Moreover, in the quantitative and qualitative evaluations, AID produces more compositional and well-bound structural representations than other works. 
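The AID abstract above assumes familiarity with Tensor Product Representations. As background only (this is not AID itself), here is a toy NumPy sketch of TPR binding via outer products and unbinding with orthonormal role vectors; the dimensions and random fillers are arbitrary.

```python
# Background sketch (illustrative): bind role/filler pairs into a TPR and unbind a filler.
import numpy as np

rng = np.random.default_rng(0)
d_role, d_fill, n = 6, 4, 3
roles, _ = np.linalg.qr(rng.normal(size=(d_role, n)))   # orthonormal role vectors (columns)
fillers = rng.normal(size=(n, d_fill))

# TPR of the whole structure: sum of outer products role_i (x) filler_i
T = sum(np.outer(roles[:, i], fillers[i]) for i in range(n))

# Unbinding: project with a role vector to retrieve its filler
recovered = roles[:, 1] @ T
print(np.allclose(recovered, fillers[1]))   # True, thanks to orthonormal roles
```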
\ No newline at end of file diff --git a/data/2024/iclr/AuG-KD: Anchor-Based Mixup Generation for Out-of-Domain Knowledge Distillation b/data/2024/iclr/AuG-KD: Anchor-Based Mixup Generation for Out-of-Domain Knowledge Distillation new file mode 100644 index 0000000000..21f4705b43 --- /dev/null +++ b/data/2024/iclr/AuG-KD: Anchor-Based Mixup Generation for Out-of-Domain Knowledge Distillation @@ -0,0 +1 @@ +Due to privacy or patent concerns, a growing number of large models are released without granting access to their training data, making transferring their knowledge inefficient and problematic. In response, Data-Free Knowledge Distillation (DFKD) methods have emerged as direct solutions. However, simply adopting models derived from DFKD for real-world applications suffers from significant performance degradation, due to the discrepancy between teachers' training data and real-world scenarios (student domain). The degradation stems from the portions of teachers' knowledge that are not applicable to the student domain. They are specific to the teacher domain and would undermine students' performance. Hence, selectively transferring teachers' appropriate knowledge becomes the primary challenge in DFKD. In this work, we propose a simple but effective method, AuG-KD. It utilizes an uncertainty-guided and sample-specific anchor to align student-domain data with the teacher domain and leverages a generative method to progressively trade off the learning process between OOD knowledge distillation and domain-specific information learning via mixup learning. Extensive experiments across 3 datasets and 8 settings demonstrate the stability and superiority of our approach. Code is available at https://github.com/IshiKura-a/AuG-KD. \ No newline at end of file diff --git a/data/2024/iclr/Augmented Bayesian Policy Search b/data/2024/iclr/Augmented Bayesian Policy Search new file mode 100644 index 0000000000..3ba2ce5982 --- /dev/null +++ b/data/2024/iclr/Augmented Bayesian Policy Search @@ -0,0 +1 @@ +Deterministic policies are often preferred over stochastic ones when implemented on physical systems. They can prevent erratic and harmful behaviors while being easier to implement and interpret. However, in practice, exploration is largely performed by stochastic policies. First-order Bayesian Optimization (BO) methods offer a principled way of performing exploration using deterministic policies. This is done through a learned probabilistic model of the objective function and its gradient. Nonetheless, such approaches treat policy search as a black-box problem and thus neglect the reinforcement learning nature of the problem. In this work, we leverage the performance difference lemma to introduce a novel mean function for the probabilistic model. This results in augmenting BO methods with the action-value function. Hence, we call our method Augmented Bayesian Search (ABS). Interestingly, this new mean function enhances the posterior gradient with the deterministic policy gradient, effectively bridging the gap between BO and policy gradient methods. The resulting algorithm combines the convenience of direct policy search with the scalability of reinforcement learning. We validate ABS on high-dimensional locomotion problems and demonstrate competitive performance compared to existing direct policy search schemes.
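The Augmented Bayesian Policy Search abstract above replaces the usual zero prior mean of the Bayesian-optimization surrogate with a mean informed by the action-value function. The following toy NumPy sketch shows only that mechanical idea on a 1-D problem; the objective, the `value_based_mean` stand-in for the action-value-derived term, and all hyperparameters are invented for illustration and do not reproduce the paper's construction.

```python
# Toy sketch (assumptions throughout): a GP surrogate over policy parameters whose
# prior mean is a value-informed estimate of the return, rather than zero.
import numpy as np

def rbf(A, B, ell=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def objective(theta):               # unknown return J(theta); 1-D toy problem
    return np.sin(3 * theta[..., 0]) + 0.1 * theta[..., 0]

def value_based_mean(theta):        # hypothetical critic-derived prior mean
    return np.sin(3 * theta[..., 0]) * 0.8

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(6, 1))            # evaluated policies
y = objective(X) + 0.01 * rng.normal(size=6)   # noisy returns
Xs = np.linspace(-1, 1, 101)[:, None]          # candidate policies

K = rbf(X, X) + 1e-4 * np.eye(len(X))
Ks = rbf(Xs, X)
# GP posterior mean with non-zero prior mean m(.): m(x*) + k* K^{-1} (y - m(X))
resid = y - value_based_mean(X)
post_mean = value_based_mean(Xs) + Ks @ np.linalg.solve(K, resid)
print("next policy to evaluate:", Xs[np.argmax(post_mean)])
```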
\ No newline at end of file diff --git a/data/2024/iclr/Augmenting Transformers with Recursively Composed Multi-grained Representations b/data/2024/iclr/Augmenting Transformers with Recursively Composed Multi-grained Representations new file mode 100644 index 0000000000..064e8c5384 --- /dev/null +++ b/data/2024/iclr/Augmenting Transformers with Recursively Composed Multi-grained Representations @@ -0,0 +1 @@ +We present ReCAT, a recursive composition augmented Transformer that is able to explicitly model hierarchical syntactic structures of raw texts without relying on gold trees during both learning and inference. Existing research along this line restricts data to follow a hierarchical tree structure and thus lacks inter-span communications. To overcome the problem, we propose a novel contextual inside-outside (CIO) layer that learns contextualized representations of spans through bottom-up and top-down passes, where a bottom-up pass forms representations of high-level spans by composing low-level spans, while a top-down pass combines information inside and outside a span. By stacking several CIO layers between the embedding layer and the attention layers in Transformer, the ReCAT model can perform both deep intra-span and deep inter-span interactions, and thus generate multi-grained representations fully contextualized with other spans. Moreover, the CIO layers can be jointly pre-trained with Transformers, making ReCAT enjoy scaling ability, strong performance, and interpretability at the same time. We conduct experiments on various sentence-level and span-level tasks. Evaluation results indicate that ReCAT can significantly outperform vanilla Transformer models on all span-level tasks and baselines that combine recursive networks with Transformers on natural language inference tasks. More interestingly, the hierarchical structures induced by ReCAT exhibit strong consistency with human-annotated syntactic trees, indicating good interpretability brought by the CIO layers. \ No newline at end of file diff --git a/data/2024/iclr/AutoChunk: Automated Activation Chunk for Memory-Efficient Deep Learning Inference b/data/2024/iclr/AutoChunk: Automated Activation Chunk for Memory-Efficient Deep Learning Inference new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models b/data/2024/iclr/AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models new file mode 100644 index 0000000000..41d7dbffbc --- /dev/null +++ b/data/2024/iclr/AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models @@ -0,0 +1 @@ +The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. 
In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts via a carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively. \ No newline at end of file diff --git a/data/2024/iclr/AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ b/data/2024/iclr/AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ new file mode 100644 index 0000000000..abe487bb71 --- /dev/null +++ b/data/2024/iclr/AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ @@ -0,0 +1 @@ +Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ, the first large-scale TikZ dataset, consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available. \ No newline at end of file diff --git a/data/2024/iclr/Automatic Functional Differentiation in JAX b/data/2024/iclr/Automatic Functional Differentiation in JAX new file mode 100644 index 0000000000..3fada146fa --- /dev/null +++ b/data/2024/iclr/Automatic Functional Differentiation in JAX @@ -0,0 +1 @@ +We extend JAX with the capability to automatically differentiate higher-order functions (functionals and operators). By representing functions as a generalization of arrays, we seamlessly use JAX's existing primitive system to implement higher-order functions. We present a set of primitive operators that serve as foundational building blocks for constructing several key types of functionals. For every introduced primitive operator, we derive and implement both linearization and transposition rules, aligning with JAX's internal protocols for forward and reverse mode automatic differentiation. This enhancement allows for functional differentiation in the same syntax traditionally used for functions.
The resulting functional gradients are themselves functions ready to be invoked in python. We showcase this tool's efficacy and simplicity through applications where functional derivatives are indispensable. The source code of this work is released at https://github.com/sail-sg/autofd . \ No newline at end of file diff --git a/data/2024/iclr/Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost b/data/2024/iclr/Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost new file mode 100644 index 0000000000..13f0f47e75 --- /dev/null +++ b/data/2024/iclr/Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost @@ -0,0 +1 @@ +We aim at exploiting additional auxiliary labels from an independent (auxiliary) task to boost the primary task performance which we focus on, while preserving a single task inference cost of the primary task. While most existing auxiliary learning methods are optimization-based relying on loss weights/gradients manipulation, our method is architecture-based with a flexible asymmetric structure for the primary and auxiliary tasks, which produces different networks for training and inference. Specifically, starting from two single task networks/branches (each representing a task), we propose a novel method with evolving networks where only primary-to-auxiliary links exist as the cross-task connections after convergence. These connections can be removed during the primary task inference, resulting in a single-task inference cost. We achieve this by formulating a Neural Architecture Search (NAS) problem, where we initialize bi-directional connections in the search space and guide the NAS optimization converging to an architecture with only the single-side primary-to-auxiliary connections. Moreover, our method can be incorporated with optimization-based auxiliary learning approaches. Extensive experiments with six tasks on NYU v2, CityScapes, and Taskonomy datasets using VGG, ResNet, and ViT backbones validate the promising performance. The codes are available at https://github.com/ethanygao/Aux-NAS. \ No newline at end of file diff --git a/data/2024/iclr/B-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis b/data/2024/iclr/B-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis new file mode 100644 index 0000000000..382f7358c9 --- /dev/null +++ b/data/2024/iclr/B-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis @@ -0,0 +1 @@ +Program synthesis aims to create accurate, executable programs from problem specifications, specifically from natural language descriptions in our context. Recent studies have leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. The application of RL focuses on directly optimizing for functional correctness, offering an advantage over conventional supervised methods. Despite policy-based RL methods dominating the literature on RL for program synthesis, the nature of program synthesis tasks hints at a natural alignment with value-based methods. This stems from the rich collection of off-policy programs, including those developed by human programmers and also historical samples, coupled with the straightforward verification of generated programs through automated unit testing, meaning rewards are easy to obtain. 
Diverging from the dominant use of policy-based algorithms, our work explores the feasibility of value-based approaches, leading to the development of our $\mathcal{B}$-Coder (pronounced Bellman coder). Yet, training value-based methods presents challenges due to the enormous search space inherent to program synthesis. To this end, we introduce an initialization protocol for RL agents utilizing pre-trained LMs and a conservative Bellman operator to reduce training complexities. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrated $\mathcal{B}$-Coder's capability in achieving state-of-the-art performance when compared to policy-based methods. Remarkably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs. \ No newline at end of file diff --git a/data/2024/iclr/BECLR: Batch Enhanced Contrastive Few-Shot Learning b/data/2024/iclr/BECLR: Batch Enhanced Contrastive Few-Shot Learning new file mode 100644 index 0000000000..93f8bd0c7c --- /dev/null +++ b/data/2024/iclr/BECLR: Batch Enhanced Contrastive Few-Shot Learning @@ -0,0 +1 @@ +Learning quickly from very few labeled samples is a fundamental attribute that separates machines and humans in the era of deep representation learning. Unsupervised few-shot learning (U-FSL) aspires to bridge this gap by discarding the reliance on annotations at training time. Intrigued by the success of contrastive learning approaches in the realm of U-FSL, we structurally approach their shortcomings in both pretraining and downstream inference stages. We propose a novel Dynamic Clustered mEmory (DyCE) module to promote a highly separable latent representation space for enhancing positive sampling at the pretraining phase and infusing implicit class-level insights into unsupervised contrastive learning. We then tackle the, somehow overlooked yet critical, issue of sample bias at the few-shot inference stage. We propose an iterative Optimal Transport-based distribution Alignment (OpTA) strategy and demonstrate that it efficiently addresses the problem, especially in low-shot scenarios where FSL approaches suffer the most from sample bias. We later on discuss that DyCE and OpTA are two intertwined pieces of a novel end-to-end approach (we coin as BECLR), constructively magnifying each other's impact. We then present a suite of extensive quantitative and qualitative experimentation to corroborate that BECLR sets a new state-of-the-art across ALL existing U-FSL benchmarks (to the best of our knowledge), and significantly outperforms the best of the current baselines (codebase available at: https://github.com/stypoumic/BECLR). \ No newline at end of file diff --git a/data/2024/iclr/BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks b/data/2024/iclr/BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks new file mode 100644 index 0000000000..70300ede05 --- /dev/null +++ b/data/2024/iclr/BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks @@ -0,0 +1 @@ +The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. 
This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://github.com/frederikkemarin/BEND. \ No newline at end of file diff --git a/data/2024/iclr/BENO: Boundary-embedded Neural Operators for Elliptic PDEs b/data/2024/iclr/BENO: Boundary-embedded Neural Operators for Elliptic PDEs new file mode 100644 index 0000000000..773b29f935 --- /dev/null +++ b/data/2024/iclr/BENO: Boundary-embedded Neural Operators for Elliptic PDEs @@ -0,0 +1 @@ +Elliptic partial differential equations (PDEs) are a major class of time-independent PDEs that play a key role in many scientific and engineering domains such as fluid dynamics, plasma physics, and solid mechanics. Recently, neural operators have emerged as a promising technique to solve elliptic PDEs more efficiently by directly mapping the input to solutions. However, existing networks typically cannot handle complex geometries and inhomogeneous boundary values present in the real world. Here we introduce Boundary-Embedded Neural Operators (BENO), a novel neural operator architecture that embeds the complex geometries and inhomogeneous boundary values into the solving of elliptic PDEs. Inspired by classical Green's function, BENO consists of two branches of Graph Neural Networks (GNNs) for interior source term and boundary values, respectively. Furthermore, a Transformer encoder maps the global boundary geometry into a latent vector which influences each message passing layer of the GNNs. We test our model extensively in elliptic PDEs with various boundary conditions. We show that all existing baseline methods fail to learn the solution operator. In contrast, our model, endowed with boundary-embedded architecture, outperforms state-of-the-art neural operators and strong baselines by an average of 60.96\%. Our source code can be found https://github.com/AI4Science-WestlakeU/beno.git. \ No newline at end of file diff --git a/data/2024/iclr/BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation b/data/2024/iclr/BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation new file mode 100644 index 0000000000..fcfe426459 --- /dev/null +++ b/data/2024/iclr/BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation @@ -0,0 +1 @@ +Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, and etc. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. 
However, their layer-wise approach results in significant perturbation to the model's output and requires meticulous hyperparameter tuning, such as the pruning rate, which can adversely affect overall model performance. To address this, this paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss. In contrast to the typical layer-wise pruning techniques, BESA is characterized by two distinctive attributes: i) it targets the overall pruning error with respect to individual transformer blocks, and ii) it allocates layer-specific sparsity in a differentiable manner, both of which ensure reduced performance degradation after pruning. Our experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs like LLaMA1, and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. Code is available at https://github.com/OpenGVLab/LLMPrune-BESA. \ No newline at end of file diff --git a/data/2024/iclr/BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models b/data/2024/iclr/BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models new file mode 100644 index 0000000000..e7019fd0ed --- /dev/null +++ b/data/2024/iclr/BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models @@ -0,0 +1 @@ +Retrieval augmentation addresses many critical problems in large language models such as hallucination, staleness, and privacy leaks. However, running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages, significantly reducing computation during inference. Despite the potential loss of accuracy, our new calibration techniques and training objectives restore performance. Combined with offline and runtime compression, this only requires 127GB of disk space for encoding 3 billion tokens in Wikipedia. Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art inference by up to 4x and reduces storage by over 100x while maintaining over 95% task performance. \ No newline at end of file diff --git a/data/2024/iclr/BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection b/data/2024/iclr/BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection new file mode 100644 index 0000000000..c2081083a6 --- /dev/null +++ b/data/2024/iclr/BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection @@ -0,0 +1 @@ +We present a novel defense, against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor functionality of a given backdoored model to a backdoor expert model. The approach is straightforward -- finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, and thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. 
Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 17 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer). \ No newline at end of file diff --git a/data/2024/iclr/Backdoor Contrastive Learning via Bi-level Trigger Optimization b/data/2024/iclr/Backdoor Contrastive Learning via Bi-level Trigger Optimization new file mode 100644 index 0000000000..3b29c35e5d --- /dev/null +++ b/data/2024/iclr/Backdoor Contrastive Learning via Bi-level Trigger Optimization @@ -0,0 +1 @@ +Contrastive Learning (CL) has attracted enormous attention due to its remarkable capability in unsupervised representation learning. However, recent works have revealed the vulnerability of CL to backdoor attacks: the feature extractor could be misled to embed backdoored data close to an attack target class, thus fooling the downstream predictor to misclassify it as the target. Existing attacks usually adopt a fixed trigger pattern and poison the training set with trigger-injected data, hoping for the feature extractor to learn the association between trigger and target class. However, we find that such fixed trigger design fails to effectively associate trigger-injected data with target class in the embedding space due to special CL mechanisms, leading to a limited attack success rate (ASR). This phenomenon motivates us to find a better backdoor trigger design tailored for CL framework. In this paper, we propose a bi-level optimization approach to achieve this goal, where the inner optimization simulates the CL dynamics of a surrogate victim, and the outer optimization enforces the backdoor trigger to stay close to the target throughout the surrogate CL procedure. Extensive experiments show that our attack can achieve a higher attack success rate (e.g., $99\%$ ASR on ImageNet-100) with a very low poisoning rate ($1\%$). Besides, our attack can effectively evade existing state-of-the-art defenses. Code is available at: https://github.com/SWY666/SSL-backdoor-BLTO. \ No newline at end of file diff --git a/data/2024/iclr/Backdoor Federated Learning by Poisoning Backdoor-Critical Layers b/data/2024/iclr/Backdoor Federated Learning by Poisoning Backdoor-Critical Layers new file mode 100644 index 0000000000..b2dc350eff --- /dev/null +++ b/data/2024/iclr/Backdoor Federated Learning by Poisoning Backdoor-Critical Layers @@ -0,0 +1 @@ +Federated learning (FL) has been widely deployed to enable machine learning training on sensitive data across distributed devices. However, the decentralized learning paradigm and heterogeneity of FL further extend the attack surface for backdoor attacks. Existing FL attack and defense methodologies typically focus on the whole model. None of them recognizes the existence of backdoor-critical (BC) layers-a small subset of layers that dominate the model vulnerabilities. Attacking the BC layers achieves equivalent effects as attacking the whole model but at a far smaller chance of being detected by state-of-the-art (SOTA) defenses. 
This paper proposes a general in-situ approach that identifies and verifies BC layers from the perspective of attackers. Based on the identified BC layers, we carefully craft a new backdoor attack methodology that adaptively seeks a fundamental balance between attacking effects and stealthiness under various defense strategies. Extensive experiments show that our BC layer-aware backdoor attacks can successfully backdoor FL under seven SOTA defenses with only 10% malicious clients and outperform the latest backdoor attack methods. \ No newline at end of file diff --git a/data/2024/iclr/Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency b/data/2024/iclr/Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency new file mode 100644 index 0000000000..1d235bf74d --- /dev/null +++ b/data/2024/iclr/Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency @@ -0,0 +1 @@ +Modern machine learning (ML) systems demand substantial training data, often resorting to external sources. Nevertheless, this practice renders them vulnerable to backdoor poisoning attacks. Prior backdoor defense strategies have primarily focused on the identification of backdoored models or poisoned data characteristics, typically operating under the assumption of access to clean data. In this work, we delve into a relatively underexplored challenge: the automatic identification of backdoor data within a poisoned dataset, all under realistic conditions, i.e., without the need for additional clean data or without manually defining a threshold for backdoor detection. We draw an inspiration from the scaled prediction consistency (SPC) technique, which exploits the prediction invariance of poisoned data to an input scaling factor. Based on this, we pose the backdoor data identification problem as a hierarchical data splitting optimization problem, leveraging a novel SPC-based loss function as the primary optimization objective. Our innovation unfolds in several key aspects. First, we revisit the vanilla SPC method, unveiling its limitations in addressing the proposed backdoor identification problem. Subsequently, we develop a bi-level optimization-based approach to precisely identify backdoor data by minimizing the advanced SPC loss. Finally, we demonstrate the efficacy of our proposal against a spectrum of backdoor attacks, encompassing basic label-corrupted attacks as well as more sophisticated clean-label attacks, evaluated across various benchmark datasets. Experiment results show that our approach often surpasses the performance of current baselines in identifying backdoor data points, resulting in about 4%-36% improvement in average AUROC. Codes are available at https://github.com/OPTML-Group/BackdoorMSPC. \ No newline at end of file diff --git a/data/2024/iclr/BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models b/data/2024/iclr/BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models new file mode 100644 index 0000000000..c9f260ab1a --- /dev/null +++ b/data/2024/iclr/BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. 
On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. Traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. However, these approaches are not practical for commercial LLMs that typically operate via API access. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover, we show that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses. \ No newline at end of file diff --git a/data/2024/iclr/BadEdit: Backdooring Large Language Models by Model Editing b/data/2024/iclr/BadEdit: Backdooring Large Language Models by Model Editing new file mode 100644 index 0000000000..1dcf9c6246 --- /dev/null +++ b/data/2024/iclr/BadEdit: Backdooring Large Language Models by Model Editing @@ -0,0 +1 @@ +Mainstream backdoor attack methods typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, for the first time, we formulate backdoor injection as a lightweight knowledge editing problem, and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit necessitates only a minimal dataset for injection (15 samples). (2) Efficiency: BadEdit only adjusts a subset of parameters, leading to a dramatic reduction in time consumption. (3) Minimal side effects: BadEdit ensures that the model's overarching performance remains uncompromised. (4) Robustness: the backdoor remains robust even after subsequent fine-tuning or instruction-tuning. Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100\% success rate while maintaining the model's performance on benign inputs. 
\ No newline at end of file diff --git a/data/2024/iclr/Balancing Act: Constraining Disparate Impact in Sparse Models b/data/2024/iclr/Balancing Act: Constraining Disparate Impact in Sparse Models new file mode 100644 index 0000000000..093335d8b6 --- /dev/null +++ b/data/2024/iclr/Balancing Act: Constraining Disparate Impact in Sparse Models @@ -0,0 +1 @@ +Model pruning is a popular approach to enable the deployment of large deep learning models on edge devices with restricted computational or storage capacities. Although sparse models achieve performance comparable to that of their dense counterparts at the level of the entire dataset, they exhibit high accuracy drops for some data sub-groups. Existing methods to mitigate this disparate impact induced by pruning (i) rely on surrogate metrics that address the problem indirectly and have limited interpretability; or (ii) scale poorly with the number of protected sub-groups in terms of computational cost. We propose a constrained optimization approach that directly addresses the disparate impact of pruning: our formulation bounds the accuracy change between the dense and sparse models, for each sub-group. This choice of constraints provides an interpretable success criterion to determine if a pruned model achieves acceptable disparity levels. Experimental results demonstrate that our technique scales reliably to problems involving large models and hundreds of protected sub-groups. \ No newline at end of file diff --git a/data/2024/iclr/Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation b/data/2024/iclr/Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation new file mode 100644 index 0000000000..9f9e04e6ed --- /dev/null +++ b/data/2024/iclr/Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation @@ -0,0 +1 @@ +We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit. This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards. Like in classical bandits, rewards follow a fixed unknown distribution. However, we assume that the click-rate of each arm is chosen strategically by the arm (e.g., a host on Airbnb) in order to maximize the number of times it gets clicked. The algorithm designer does not know the post-click rewards nor the arms' actions (i.e., strategically chosen click-rates) in advance, and must learn both values over time. To solve this problem, we design an incentive-aware learning algorithm, UCB-S, which achieves two goals simultaneously: (a) incentivizing desirable arm behavior under uncertainty; (b) minimizing regret by learning unknown parameters. We characterize all approximate Nash equilibria among arms under UCB-S and show a $\tilde{\mathcal{O}} (\sqrt{KT})$ regret bound uniformly in every equilibrium. We also show that incentive-unaware algorithms generally fail to achieve low regret in the strategic click-bandit. Finally, we support our theoretical results by simulations of strategic arm behavior which confirm the effectiveness and robustness of our proposed incentive design. 
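To make the interaction protocol in the strategic click-bandit abstract above concrete, here is a heavily simplified UCB-style sketch; it fixes the arms' click-rates rather than letting them be chosen strategically, and it is not the paper's incentive-aware UCB-S mechanism. All environment parameters are arbitrary.

```python
# Heavily simplified sketch (not UCB-S): a UCB learner that estimates each arm's
# click-rate and post-click reward separately and ranks arms by an optimistic index.
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 5000
click_rate = rng.uniform(0.2, 0.9, size=K)     # in the strategic setting these would be chosen by the arms
post_reward = rng.uniform(0.1, 1.0, size=K)

clicks = np.zeros(K); pulls = np.zeros(K); reward_sum = np.zeros(K)
for t in range(T):
    if t < K:
        a = t                                   # pull each arm once to initialize
    else:
        p_hat = clicks / pulls
        r_hat = np.where(clicks > 0, reward_sum / np.maximum(clicks, 1), 1.0)
        bonus = np.sqrt(2 * np.log(t + 1) / pulls)
        a = int(np.argmax(p_hat * r_hat + bonus))   # optimistic index on click-rate x reward
    pulls[a] += 1
    if rng.random() < click_rate[a]:
        clicks[a] += 1
        reward_sum[a] += rng.normal(post_reward[a], 0.05)

est = clicks / pulls * np.where(clicks > 0, reward_sum / np.maximum(clicks, 1), 0)
print("estimated best arm:", int(np.argmax(est)))
print("true best arm:", int(np.argmax(click_rate * post_reward)))
```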
\ No newline at end of file diff --git a/data/2024/iclr/BarLeRIa: An Efficient Tuning Framework for Referring Image Segmentation b/data/2024/iclr/BarLeRIa: An Efficient Tuning Framework for Referring Image Segmentation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Batch normalization is sufficient for universal function approximation in CNNs b/data/2024/iclr/Batch normalization is sufficient for universal function approximation in CNNs new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/BatchPrompt: Accomplish more with less b/data/2024/iclr/BatchPrompt: Accomplish more with less new file mode 100644 index 0000000000..536c1412d3 --- /dev/null +++ b/data/2024/iclr/BatchPrompt: Accomplish more with less @@ -0,0 +1 @@ +As the ever-increasing token limits of large language models (LLMs) have enabled long context as input, prompting with single data samples may no longer be efficient. A straightforward strategy for improving efficiency is to batch data within the token limit (e.g., 8k for gpt-3.5-turbo; 32k for GPT-4), which we call BatchPrompt. We have two initial observations for prompting with batched data. First, we find that prompting with batched data in longer contexts will inevitably lead to worse performance, compared to single-data prompting. Second, the performance of the language model is significantly correlated with the positions and order of the batched data, due to the corresponding change in decoder context. To retain efficiency and overcome performance loss, we propose Batch Permutation and Ensembling (BPE), and a novel Self-reflection-guided EArly Stopping (SEAS) technique. Our comprehensive experimental evaluation demonstrates that BPE can boost the performance of BatchPrompt with a striking margin on a range of popular NLP tasks, including question answering (Boolq), textual entailment (RTE), and duplicate question identification (QQP). These performances are even competitive with/higher than single-data prompting (SinglePrompt), while BatchPrompt requires much fewer LLM calls and input tokens (for SinglePrompt vs. BatchPrompt with batch size 32, using just 9%-16% of the number of LLM calls, Boolq accuracy 90.6% to 90.9% with 27.4% of the tokens, QQP accuracy 87.2% to 88.4% with 18.6% of the tokens, RTE accuracy 91.5% to 91.1% with 30.8% of the tokens). To the best of our knowledge, this is the first work to technically improve the prompting efficiency of large language models. We hope our simple yet effective approach will shed light on future research on large language models. The code will be released. \ No newline at end of file diff --git a/data/2024/iclr/Batched Low-Rank Adaptation of Foundation Models b/data/2024/iclr/Batched Low-Rank Adaptation of Foundation Models new file mode 100644 index 0000000000..b23b8fdd30 --- /dev/null +++ b/data/2024/iclr/Batched Low-Rank Adaptation of Foundation Models @@ -0,0 +1 @@ +Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While LoRA offers numerous advantages, its applicability for real-time serving to a diverse and global user base is constrained by its inability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request.
To mitigate this constraint, we introduce Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its unique low-rank adaptation weights, allowing for efficient batching of heterogeneous requests. We empirically demonstrate that FLoRA retains the performance merits of LoRA, showcasing competitive results on the MultiPL-E code generation benchmark spanning over 8 languages and a multilingual speech recognition task across 6 languages. \ No newline at end of file diff --git a/data/2024/iclr/BatteryML: An Open-source Platform for Machine Learning on Battery Degradation b/data/2024/iclr/BatteryML: An Open-source Platform for Machine Learning on Battery Degradation new file mode 100644 index 0000000000..3d3b8aec03 --- /dev/null +++ b/data/2024/iclr/BatteryML: An Open-source Platform for Machine Learning on Battery Degradation @@ -0,0 +1 @@ +Battery degradation remains a pivotal concern in the energy storage domain, with machine learning emerging as a potent tool to drive forward insights and solutions. However, this intersection of electrochemical science and machine learning poses complex challenges. Machine learning experts often grapple with the intricacies of battery science, while battery researchers face hurdles in adapting intricate models tailored to specific datasets. Beyond this, a cohesive standard for battery degradation modeling, inclusive of data formats and evaluative benchmarks, is conspicuously absent. Recognizing these impediments, we present BatteryML - a one-step, all-encompassing, and open-source platform designed to unify data preprocessing, feature extraction, and the implementation of both traditional and state-of-the-art models. This streamlined approach promises to enhance the practicality and efficiency of research applications. BatteryML seeks to fill this void, fostering an environment where experts from diverse specializations can collaboratively contribute, thus elevating the collective understanding and advancement of battery research. The code for our project is publicly available on GitHub at https://github.com/microsoft/BatteryML. \ No newline at end of file diff --git a/data/2024/iclr/Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information b/data/2024/iclr/Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information new file mode 100644 index 0000000000..37cd7035df --- /dev/null +++ b/data/2024/iclr/Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information @@ -0,0 +1 @@ +It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate for the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using the maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and CMI of the teacher are simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster.
Through a thorough set of experiments, we show that by employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks, the student's classification accuracy consistently increases, with gains of up to 3.32\%. This suggests that the teacher's BCPD estimate provided by the MCMI method is more accurate than that provided by the MLL method. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases by up to 5.72\% when 5\% of the training samples are available to the student (few-shot), and increases from 0\% to as high as 84\% for an omitted class (zero-shot). The code is available at \url{https://github.com/iclr2024mcmi/ICLRMCMI}. \ No newline at end of file diff --git a/data/2024/iclr/BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference b/data/2024/iclr/BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference new file mode 100644 index 0000000000..71fbf29b09 --- /dev/null +++ b/data/2024/iclr/BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference @@ -0,0 +1 @@ +Diffusion models have impressive image generation capability, but low-quality generations still exist, and their identification remains challenging due to the lack of a proper sample-wise metric. To address this, we propose BayesDiff, a pixel-wise uncertainty estimator for generations from diffusion models based on Bayesian inference. In particular, we derive a novel uncertainty iteration principle to characterize the uncertainty dynamics in diffusion, and leverage the last-layer Laplace approximation for efficient Bayesian inference. The estimated pixel-wise uncertainty can not only be aggregated into a sample-wise metric to filter out low-fidelity images but also aids in augmenting successful generations and rectifying artifacts in failed generations in text-to-image tasks. Extensive experiments demonstrate the efficacy of BayesDiff and its promise for practical applications. \ No newline at end of file diff --git a/data/2024/iclr/BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction b/data/2024/iclr/BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction new file mode 100644 index 0000000000..efd174f2f2 --- /dev/null +++ b/data/2024/iclr/BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction @@ -0,0 +1 @@ +As a novel and effective fine-tuning paradigm based on large-scale pre-trained language models (PLMs), prompt-tuning aims to reduce the gap between downstream tasks and pre-training objectives. While prompt-tuning has yielded continuous advancements in various tasks, such an approach still suffers from a persistent defect: prompt-tuning methods fail to generalize to specific few-shot patterns. From the perspective of distribution analyses, we disclose that the intrinsic issues behind the phenomenon are the over-multitudinous conceptual knowledge contained in PLMs and the abridged knowledge for target downstream domains, which jointly cause PLMs to mis-locate the knowledge distributions corresponding to the target domains in the universal knowledge embedding space.
To this end, we explore approximating the unabridged target domains of downstream tasks in a debiased manner, and then abstracting such domains to generate discriminative prompts, thereby providing unambiguous guidance for PLMs. Guided by such an intuition, we propose a simple yet effective approach, namely BayesPrompt, to learn prompts that contain domain-discriminative information while resisting interference from domain-irrelevant knowledge. BayesPrompt primitively leverages known distributions to approximate the debiased factual distributions of target domains and further uniformly samples certain representative features from the approximated distributions to generate the ultimate prompts for PLMs. We provide theoretical insights on the connection to domain adaptation. Empirically, our method achieves state-of-the-art performance on benchmarks. \ No newline at end of file diff --git a/data/2024/iclr/Bayesian Bi-clustering of Neural Spiking Activity with Latent Structures b/data/2024/iclr/Bayesian Bi-clustering of Neural Spiking Activity with Latent Structures new file mode 100644 index 0000000000..a6937154bd --- /dev/null +++ b/data/2024/iclr/Bayesian Bi-clustering of Neural Spiking Activity with Latent Structures @@ -0,0 +1 @@ +Modern neural recording techniques allow neuroscientists to obtain spiking activity of multiple neurons from different brain regions over long time periods, which requires new statistical methods to be developed for understanding the structure of the large-scale data. In this paper, we develop a bi-clustering method to cluster the neural spiking activity spatially and temporally, according to their low-dimensional latent structures. The spatial (neuron) clusters are defined by the latent trajectories within each neural population, while the temporal (state) clusters are defined by (populationally) synchronous local linear dynamics shared across different periods. To flexibly extract the bi-clustering structure, we build the model non-parametrically, and develop an efficient Markov chain Monte Carlo (MCMC) algorithm to sample the posterior distributions of model parameters. Validating our proposed MCMC algorithm through simulations, we find the method can recover unknown parameters and true bi-clustering structures successfully. We then apply the proposed bi-clustering method to multi-regional neural recordings under different experimental settings, where we find that simultaneously considering latent trajectories and spatial-temporal clustering structures can provide us with a more accurate and interpretable result. Overall, the proposed method provides scientific insights for large-scale (counting) time series with elongated recording periods, and it can potentially have applications beyond neuroscience. \ No newline at end of file diff --git a/data/2024/iclr/Bayesian Coreset Optimization for Personalized Federated Learning b/data/2024/iclr/Bayesian Coreset Optimization for Personalized Federated Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bayesian Low-rank Adaptation for Large Language Models b/data/2024/iclr/Bayesian Low-rank Adaptation for Large Language Models new file mode 100644 index 0000000000..7ae27e8b41 --- /dev/null +++ b/data/2024/iclr/Bayesian Low-rank Adaptation for Large Language Models @@ -0,0 +1 @@ +Low-rank adaptation (LoRA) has emerged as a new paradigm for cost-efficient fine-tuning of large language models (LLMs).
However, fine-tuned LLMs often become overconfident especially when fine-tuned on small datasets. Bayesian methods, with their inherent ability to estimate uncertainty, serve as potent tools to mitigate overconfidence and enhance calibration. In this work, we introduce Laplace-LoRA, which applies a Bayesian approach to the LoRA parameters. Specifically, Laplace-LoRA applies a Laplace approximation to the posterior over the LoRA parameters, considerably improving the calibration of fine-tuned LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Bayesian Optimization through Gaussian Cox Process Models for Spatio-temporal Data b/data/2024/iclr/Bayesian Optimization through Gaussian Cox Process Models for Spatio-temporal Data new file mode 100644 index 0000000000..f6543014b5 --- /dev/null +++ b/data/2024/iclr/Bayesian Optimization through Gaussian Cox Process Models for Spatio-temporal Data @@ -0,0 +1 @@ +Bayesian optimization (BO) has established itself as a leading strategy for efficiently optimizing expensive-to-evaluate functions. Existing BO methods mostly rely on Gaussian process (GP) surrogate models and are not applicable to (doubly-stochastic) Gaussian Cox processes, where the observation process is modulated by a latent intensity function modeled as a GP. In this paper, we propose a novel maximum a posteriori inference of Gaussian Cox processes. It leverages the Laplace approximation and change of kernel technique to transform the problem into a new reproducing kernel Hilbert space, where it becomes more tractable computationally. It enables us to obtain both a functional posterior of the latent intensity function and the covariance of the posterior, thus extending existing works that often focus on specific link functions or estimating the posterior mean. Using the result, we propose a BO framework based on the Gaussian Cox process model and further develop a Nystr\"om approximation for efficient computation. Extensive evaluations on various synthetic and real-world datasets demonstrate significant improvement over state-of-the-art inference solutions for Gaussian Cox processes, as well as effective BO with a wide range of acquisition functions designed through the underlying Gaussian Cox process model. \ No newline at end of file diff --git a/data/2024/iclr/Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference b/data/2024/iclr/Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference new file mode 100644 index 0000000000..00085e8ce5 --- /dev/null +++ b/data/2024/iclr/Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference @@ -0,0 +1 @@ +Selection bias in recommender system arises from the recommendation process of system filtering and the interactive process of user selection. Many previous studies have focused on addressing selection bias to achieve unbiased learning of the prediction model, but ignore the fact that potential outcomes for a given user-item pair may vary with the treatments assigned to other user-item pairs, named neighborhood effect. To fill the gap, this paper formally formulates the neighborhood effect as an interference problem from the perspective of causal inference and introduces a treatment representation to capture the neighborhood effect. On this basis, we propose a novel ideal loss that can be used to deal with selection bias in the presence of neighborhood effect. We further develop two new estimators for estimating the proposed ideal loss. 
We theoretically establish the connection between the proposed and previous debiasing methods ignoring the neighborhood effect, showing that the proposed methods can achieve unbiased learning when both selection bias and neighborhood effect are present, while the existing methods are biased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed methods. \ No newline at end of file diff --git a/data/2024/iclr/Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks b/data/2024/iclr/Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks new file mode 100644 index 0000000000..7a36a425a1 --- /dev/null +++ b/data/2024/iclr/Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks @@ -0,0 +1 @@ +Label smoothing -- using softened labels instead of hard ones -- is a widely adopted regularization method for deep learning, showing diverse benefits such as enhanced generalization and calibration. Its implications for preserving model privacy, however, have remained unexplored. To fill this gap, we investigate the impact of label smoothing on model inversion attacks (MIAs), which aim to generate class-representative samples by exploiting the knowledge encoded in a classifier, thereby inferring sensitive information about its training data. Through extensive analyses, we uncover that traditional label smoothing fosters MIAs, thereby increasing a model's privacy leakage. Even more, we reveal that smoothing with negative factors counters this trend, impeding the extraction of class-related information and leading to privacy preservation, beating state-of-the-art defenses. This establishes a practical and powerful novel way for enhancing model resilience against MIAs. \ No newline at end of file diff --git a/data/2024/iclr/Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design b/data/2024/iclr/Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design new file mode 100644 index 0000000000..e676fd837a --- /dev/null +++ b/data/2024/iclr/Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design @@ -0,0 +1 @@ +Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. 
Given a fixed oracle budget, the combined algorithm generates more high-reward molecules, and does so faster. Beam Enumeration shows that improvements to explainability and sample efficiency for molecular design can be made synergistic. \ No newline at end of file diff --git a/data/2024/iclr/Beating Price of Anarchy and Gradient Descent without Regret in Potential Games b/data/2024/iclr/Beating Price of Anarchy and Gradient Descent without Regret in Potential Games new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Behaviour Distillation b/data/2024/iclr/Behaviour Distillation new file mode 100644 index 0000000000..7d77f978cd --- /dev/null +++ b/data/2024/iclr/Behaviour Distillation @@ -0,0 +1 @@ +Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable. Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks. We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion. Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights. \ No newline at end of file diff --git a/data/2024/iclr/Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations b/data/2024/iclr/Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations new file mode 100644 index 0000000000..c67fc07da6 --- /dev/null +++ b/data/2024/iclr/Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations @@ -0,0 +1 @@ +Reinforcement learning (RL) has achieved phenomenal success in various domains. However, its data-driven nature also introduces new vulnerabilities that can be exploited by malicious opponents. Recent work shows that a well-trained RL agent can be easily manipulated by strategically perturbing its state observations at the test stage. Existing solutions either introduce a regularization term to improve the smoothness of the trained policy against perturbations or alternately train the agent's policy and the attacker's policy. However, the former does not provide sufficient protection against strong attacks, while the latter is computationally prohibitive for large environments. In this work, we propose a new robust RL algorithm for deriving a pessimistic policy to safeguard against an agent's uncertainty about true states.
This approach is further enhanced with belief state inference and diffusion-based state purification to reduce uncertainty. Empirical results show that our approach obtains superb performance under strong attacks and has training overhead comparable to regularization-based methods. Our code is available at https://github.com/SliencerX/Belief-enriched-robust-Q-learning. \ No newline at end of file diff --git a/data/2024/iclr/Bellman Optimal Stepsize Straightening of Flow-Matching Models b/data/2024/iclr/Bellman Optimal Stepsize Straightening of Flow-Matching Models new file mode 100644 index 0000000000..b90bb548c7 --- /dev/null +++ b/data/2024/iclr/Bellman Optimal Stepsize Straightening of Flow-Matching Models @@ -0,0 +1 @@ +Flow matching is a powerful framework for generating high-quality samples in various applications, especially image synthesis. However, the intensive computational demands of these models, especially during the fine-tuning and sampling processes, pose significant challenges for low-resource scenarios. This paper introduces the Bellman Optimal Stepsize Straightening (BOSS) technique for distilling flow-matching generative models: it aims specifically at efficient few-step image sampling while adhering to a computational budget constraint. First, this technique involves a dynamic programming algorithm that optimizes the stepsizes of the pretrained network. Then, it refines the velocity network to match the optimal step sizes, aiming to straighten the generation paths. Extensive experimental evaluations across image generation tasks demonstrate the efficacy of BOSS in terms of both resource utilization and image quality. Our results reveal that BOSS achieves substantial gains in efficiency while maintaining competitive sample quality, effectively bridging the gap between low-resource constraints and the demanding requirements of flow-matching generative models. Our paper also fortifies the responsible development of artificial intelligence, offering a more sustainable generative model that reduces computational costs and environmental footprints. Our code can be found at https://github.com/nguyenngocbaocmt02/BOSS. \ No newline at end of file diff --git a/data/2024/iclr/Benchmarking Algorithms for Federated Domain Generalization b/data/2024/iclr/Benchmarking Algorithms for Federated Domain Generalization new file mode 100644 index 0000000000..930e7c2fb3 --- /dev/null +++ b/data/2024/iclr/Benchmarking Algorithms for Federated Domain Generalization @@ -0,0 +1 @@ +While prior domain generalization (DG) benchmarks consider train-test dataset heterogeneity, we evaluate Federated DG, which introduces federated learning (FL)-specific challenges. Additionally, we explore domain-based heterogeneity in clients' local datasets - a realistic Federated DG scenario. Prior Federated DG evaluations are limited in terms of the number or heterogeneity of clients and dataset diversity. To address this gap, we propose a Federated DG benchmark methodology that enables control of the number and heterogeneity of clients and provides metrics for dataset difficulty. We then apply our methodology to evaluate 14 Federated DG methods, which include centralized DG methods adapted to the FL context, FL methods that handle client heterogeneity, and methods designed specifically for Federated DG.
Our results suggest that despite some progress, there remain significant performance gaps in Federated DG, particularly when evaluating with a large number of clients, high client heterogeneity, or more realistic datasets. Please check our extendable benchmark code here: https://github.com/inouye-lab/FedDG_Benchmark. \ No newline at end of file diff --git a/data/2024/iclr/Benign Oscillation of Stochastic Gradient Descent with Large Learning Rate b/data/2024/iclr/Benign Oscillation of Stochastic Gradient Descent with Large Learning Rate new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data b/data/2024/iclr/Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data new file mode 100644 index 0000000000..92a19f09f2 --- /dev/null +++ b/data/2024/iclr/Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data @@ -0,0 +1 @@ +Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors. First, they can achieve a perfect fit to noisy training data and still generalize near-optimally, showing that overfitting can sometimes be benign. Second, they can undergo a period of classical, harmful overfitting -- achieving a perfect fit to training data with near-random performance on test data -- before transitioning ("grokking") to near-optimal generalization later in training. In this work, we show that both of these phenomena provably occur in two-layer ReLU networks trained by GD on XOR cluster data where a constant fraction of the training labels are flipped. In this setting, we show that after the first step of GD, the network achieves 100% training accuracy, perfectly fitting the noisy labels in the training data, but achieves near-random test accuracy. At a later training step, the network achieves near-optimal test accuracy while still fitting the random labels in the training data, exhibiting a "grokking" phenomenon. This provides the first theoretical result of benign overfitting in neural network classification when the data distribution is not linearly separable. Our proofs rely on analyzing the feature learning process under GD, which reveals that the network implements a non-generalizable linear classifier after one step and gradually learns generalizable features in later steps. \ No newline at end of file diff --git a/data/2024/iclr/Bespoke Solvers for Generative Flow Models b/data/2024/iclr/Bespoke Solvers for Generative Flow Models new file mode 100644 index 0000000000..33517033b5 --- /dev/null +++ b/data/2024/iclr/Bespoke Solvers for Generative Flow Models @@ -0,0 +1 @@ +Diffusion or flow-based models are powerful generative paradigms that are notoriously hard to sample from, as samples are defined as solutions to high-dimensional Ordinary or Stochastic Differential Equations (ODEs/SDEs), which require a large Number of Function Evaluations (NFE) to approximate well. Existing methods to alleviate the costly sampling process include model distillation and designing dedicated ODE solvers. However, distillation is costly to train and sometimes can deteriorate quality, while dedicated solvers still require relatively large NFE to produce high-quality samples. In this paper we introduce "Bespoke solvers", a novel framework for constructing custom ODE solvers tailored to the ODE of a given pre-trained flow model.
Our approach optimizes an order-consistent and parameter-efficient solver (e.g., with 80 learnable parameters), is trained for roughly 1% of the GPU time required for training the pre-trained model, and significantly improves approximation and generation quality compared to dedicated solvers. For example, a Bespoke solver for a CIFAR10 model produces samples with Fr\'echet Inception Distance (FID) of 2.73 with 10 NFE, and gets to within 1% of the Ground Truth (GT) FID (2.59) for this model with only 20 NFE. On the more challenging ImageNet-64$\times$64, Bespoke samples at 2.2 FID with 10 NFE, and gets within 2% of GT FID (1.71) with 20 NFE. \ No newline at end of file diff --git a/data/2024/iclr/Better Neural PDE Solvers Through Data-Free Mesh Movers b/data/2024/iclr/Better Neural PDE Solvers Through Data-Free Mesh Movers new file mode 100644 index 0000000000..8f1fcd1a07 --- /dev/null +++ b/data/2024/iclr/Better Neural PDE Solvers Through Data-Free Mesh Movers @@ -0,0 +1 @@ +Recently, neural networks have been extensively employed to solve partial differential equations (PDEs) in physical system modeling. While major studies focus on learning system evolution on predefined static mesh discretizations, some methods utilize reinforcement learning or supervised learning techniques to create adaptive and dynamic meshes, due to the dynamic nature of these systems. However, these approaches face two primary challenges: (1) the need for expensive optimal mesh data, and (2) the change of the solution space's degree of freedom and topology during mesh refinement. To address these challenges, this paper proposes a neural PDE solver with a neural mesh adapter. To begin with, we introduce a novel data-free neural mesh adaptor, called Data-free Mesh Mover (DMM), with two main innovations. Firstly, it is an operator that maps the solution to adaptive meshes and is trained using the Monge-Amp\`ere equation without optimal mesh data. Secondly, it dynamically changes the mesh by moving existing nodes rather than adding or deleting nodes and edges. Theoretical analysis shows that meshes generated by DMM have the lowest interpolation error bound. Based on DMM, to efficiently and accurately model dynamic systems, we develop a moving-mesh-based neural PDE solver (MM-PDE) that embeds the moving mesh with a two-branch architecture and a learnable interpolation framework to preserve information within the data. Empirical experiments demonstrate that our method generates suitable meshes and considerably enhances accuracy when modeling widely considered PDE systems. The code can be found at: https://github.com/Peiyannn/MM-PDE.git. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment b/data/2024/iclr/Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment new file mode 100644 index 0000000000..e76915abec --- /dev/null +++ b/data/2024/iclr/Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment @@ -0,0 +1 @@ +Alignment with human preferences is a desired property of large language models (LLMs). Currently, the main alignment approach is based on reinforcement learning from human feedback (RLHF). Despite the effectiveness of RLHF, it is intricate to implement and train; thus, recent studies explore how to develop alternative alignment approaches based on supervised fine-tuning (SFT). A major limitation of SFT is that it essentially does imitation learning, which cannot fully capture what the expected behaviors are.
To address this issue, we propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained (i.e., token- or phrase-level) quality signals that are derived by contrasting good and bad responses. Our approach makes two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses with the corresponding revised ones. Secondly, we devise a new loss function that can leverage fine-grained quality signals to instruct the learning of LLMs for alignment. Extensive experiments demonstrate the effectiveness of our approach in comparison with a number of competitive baselines. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Memorization: Violating Privacy via Inference with Large Language Models b/data/2024/iclr/Beyond Memorization: Violating Privacy via Inference with Large Language Models new file mode 100644 index 0000000000..6a66a2e6f6 --- /dev/null +++ b/data/2024/iclr/Beyond Memorization: Violating Privacy via Inference with Large Language Models @@ -0,0 +1 @@ +Current privacy research on large language models (LLMs) primarily focuses on the issue of extracting memorized training data. At the same time, models' inference capabilities have increased drastically. This raises the key question of whether current LLMs could violate individuals' privacy by inferring personal attributes from text given at inference time. In this work, we present the first comprehensive study on the capabilities of pretrained LLMs to infer personal attributes from text. We construct a dataset consisting of real Reddit profiles, and show that current LLMs can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to $85\%$ top-1 and $95\%$ top-3 accuracy at a fraction of the cost ($100\times$) and time ($240\times$) required by humans. As people increasingly interact with LLM-powered chatbots across all aspects of life, we also explore the emerging threat of privacy-invasive chatbots trying to extract personal information through seemingly benign questions. Finally, we show that common mitigations, i.e., text anonymization and model alignment, are currently ineffective at protecting user privacy against LLM inference. Our findings highlight that current LLMs can infer personal data at a previously unattainable scale. In the absence of working defenses, we advocate for a broader discussion around LLM privacy implications beyond memorization, striving for wider privacy protection. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints b/data/2024/iclr/Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints new file mode 100644 index 0000000000..ccd46c5f34 --- /dev/null +++ b/data/2024/iclr/Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints @@ -0,0 +1 @@ +The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model.
Direct Preference Optimization (DPO) has been proposed as an alternative, and it remains equivalent to RLHF under the reverse KL regularization constraint. This paper presents $f$-DPO, a generalized approach to DPO that incorporates diverse divergence constraints. We show that under certain $f$-divergences, including Jensen-Shannon divergence, forward KL divergences and $\alpha$-divergences, the complex relationship between the reward and optimal policy can also be simplified by addressing the Karush-Kuhn-Tucker conditions. This eliminates the need for estimating the normalizing constant in the Bradley-Terry model and enables a tractable mapping between the reward function and the optimal policy. Our approach optimizes LLMs to align with human preferences in a more efficient and supervised manner under a broad set of divergence constraints. Empirically, adopting these divergences ensures a balance between alignment performance and generation diversity. Importantly, $f$-DPO outperforms PPO-based methods in divergence efficiency, and divergence constraints directly influence expected calibration error (ECE). \ No newline at end of file diff --git a/data/2024/iclr/Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods b/data/2024/iclr/Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods new file mode 100644 index 0000000000..c9e7f799d7 --- /dev/null +++ b/data/2024/iclr/Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods @@ -0,0 +1 @@ +Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems. With finite time horizons, such problems are relevant, for instance, for optimal stopping or specific supply chain problems, but also for the training of large language models. In contrast to infinite-horizon MDPs, optimal policies are not stationary; a policy must be learned for every single epoch. In practice, all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming. This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, where the parameters are trained backwards in time. For the tabular softmax parametrisation, we carry out the convergence analysis for simultaneous and dynamic policy gradient towards global optima, both in the exact and sampled gradient settings without regularisation. It turns out that dynamic policy gradient training much better exploits the structure of finite-time problems, which is reflected in improved convergence bounds. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders b/data/2024/iclr/Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders new file mode 100644 index 0000000000..8783e99d56 --- /dev/null +++ b/data/2024/iclr/Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders @@ -0,0 +1 @@ +The posterior collapse phenomenon in the variational autoencoder (VAE), where the variational posterior distribution closely matches the prior distribution, can hinder the quality of the learned latent variables.
As a consequence of posterior collapse, the latent variables extracted by the encoder in VAE preserve less information from the input data and thus fail to produce meaningful representations as input to the reconstruction process in the decoder. While this phenomenon has been an actively addressed topic related to VAE performance, the theory for posterior collapse remains underdeveloped, especially beyond the standard VAE. In this work, we advance the theoretical understanding of posterior collapse to two important and prevalent yet less studied classes of VAE: conditional VAE and hierarchical VAE. Specifically, via a non-trivial theoretical analysis of linear conditional VAE and hierarchical VAE with two levels of latent, we prove that the cause of posterior collapses in these models includes the correlation between the input and output of the conditional VAE and the effect of learnable encoder variance in the hierarchical VAE. We empirically validate our theoretical findings for linear conditional and hierarchical VAE and demonstrate that these results are also predictive for non-linear cases with extensive experiments. \ No newline at end of file diff --git a/data/2024/iclr/Beyond Weisfeiler-Lehman: A Quantitative Framework for GNN Expressiveness b/data/2024/iclr/Beyond Weisfeiler-Lehman: A Quantitative Framework for GNN Expressiveness new file mode 100644 index 0000000000..67ea8db7cd --- /dev/null +++ b/data/2024/iclr/Beyond Weisfeiler-Lehman: A Quantitative Framework for GNN Expressiveness @@ -0,0 +1 @@ +Designing expressive Graph Neural Networks (GNNs) is a fundamental topic in the graph learning community. So far, GNN expressiveness has been primarily assessed via the Weisfeiler-Lehman (WL) hierarchy. However, such an expressivity measure has notable limitations: it is inherently coarse, qualitative, and may not well reflect practical requirements (e.g., the ability to encode substructures). In this paper, we introduce a unified framework for quantitatively studying the expressiveness of GNN architectures, addressing all the above limitations. Specifically, we identify a fundamental expressivity measure termed homomorphism expressivity, which quantifies the ability of GNN models to count graphs under homomorphism. Homomorphism expressivity offers a complete and practical assessment tool: the completeness enables direct expressivity comparisons between GNN models, while the practicality allows for understanding concrete GNN abilities such as subgraph counting. By examining four classes of prominent GNNs as case studies, we derive simple, unified, and elegant descriptions of their homomorphism expressivity for both invariant and equivariant settings. Our results provide novel insights into a series of previous work, unify the landscape of different subareas in the community, and settle several open questions. Empirically, extensive experiments on both synthetic and real-world tasks verify our theory, showing that the practical performance of GNN models aligns well with the proposed metric. 
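The homomorphism counting primitive that the expressivity measure above is built on can be illustrated with a brute-force counter. The following toy sketch is purely illustrative (and exponential in the pattern size); it is not the paper's construction.

# Hedged toy sketch: brute-force graph homomorphism counting, i.e., the number
# of vertex maps from a pattern graph F into a graph G that preserve every edge.
from itertools import product

def hom_count(pattern_edges, pattern_n, graph_edges, graph_n):
    """Number of maps V(F) -> V(G) sending every edge of the pattern F to an edge of G."""
    adj = {(u, v) for u, v in graph_edges} | {(v, u) for u, v in graph_edges}
    return sum(
        all((phi[u], phi[v]) in adj for u, v in pattern_edges)
        for phi in product(range(graph_n), repeat=pattern_n)
    )

# Example: the triangle has no homomorphism into the bipartite 4-cycle,
# but 4 * 3 * 2 = 24 homomorphisms into the complete graph K4.
triangle = [(0, 1), (1, 2), (2, 0)]
c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
k4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
print(hom_count(triangle, 3, c4, 4), hom_count(triangle, 3, k4, 4))  # 0 24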
\ No newline at end of file diff --git a/data/2024/iclr/Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies b/data/2024/iclr/Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies new file mode 100644 index 0000000000..224d962faa --- /dev/null +++ b/data/2024/iclr/Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies @@ -0,0 +1 @@ +In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond only worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic hardness in achieving sublinear regret when the baseline policy is from a general continuous policy class, $\Pi$. This finding prompts us to \textit{refine} the baseline policy class $\Pi$ prior to test time, aiming for efficient adaptation within a finite policy class $\Tilde{\Pi}$, which can resort to an adversarial bandit subroutine. In light of the importance of a small, finite $\Tilde{\Pi}$, we propose a novel training-time algorithm to iteratively discover \textit{non-dominated policies}, forming a near-optimal and minimal $\Tilde{\Pi}$, thereby ensuring both robustness and test-time efficiency. Empirical validation on the MuJoCo benchmark corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios. \ No newline at end of file diff --git a/data/2024/iclr/Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning b/data/2024/iclr/Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning new file mode 100644 index 0000000000..b75f069986 --- /dev/null +++ b/data/2024/iclr/Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning @@ -0,0 +1 @@ +Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not provide enough clues to understand their real capabilities, limitations, and to what extent such models are aligned with human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm, and (1) evaluate 10 recent open-source LMMs from 3B up to 80B parameter scale, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following. Our evaluation on these axes reveals major flaws in LMMs.
While the current go-to solution to align these models is based on training, such as instruction tuning or RLHF, we instead (2) explore training-free in-context learning (ICL) as a solution, and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs' flaws is nuanced; despite its effectiveness for improving explainability and answer abstention, ICL only slightly improves instruction following, does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://github.com/mshukor/EvALign-ICL. \ No newline at end of file diff --git a/data/2024/iclr/Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs b/data/2024/iclr/Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs new file mode 100644 index 0000000000..862995c4ef --- /dev/null +++ b/data/2024/iclr/Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs @@ -0,0 +1 @@ +Recent works have showcased the ability of LLMs to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remains unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks. Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g., an Asian person) spanning 5 socio-demographic groups. Our experiments unveil that LLMs harbor deep-rooted bias against various socio-demographics underneath a veneer of fairness. While they overtly reject stereotypes when explicitly asked ('Are Black people less skilled at mathematics?'), they manifest stereotypical and erroneous presumptions when asked to answer questions while adopting a persona. These can be observed as abstentions in responses, e.g., 'As a Black person, I can't answer this question as it requires math knowledge', and generally result in a substantial performance drop. Our experiments with ChatGPT-3.5 show that this bias is ubiquitous - 80% of our personas demonstrate bias; it is significant - some datasets show performance drops of 70%+; and can be especially harmful for certain groups - some personas suffer statistically significant drops on 80%+ of the datasets. Overall, all 4 LLMs exhibit this bias to varying extents, with GPT-4-Turbo showing the least but still a problematic amount of bias (evident in 42% of the personas). Further analysis shows that these persona-induced errors can be hard to discern and hard to avoid. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.
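The measurement protocol described above (assign a persona via an instruction, ask the same reasoning questions, and compare accuracy against a no-persona baseline) can be sketched as follows. The prompt template, the persona list, and the ask callable are hypothetical placeholders, not the paper's exact setup.

# Hedged sketch of the persona-bias measurement protocol: wrap each question in
# a persona instruction and compare per-persona accuracy with a no-persona
# baseline. Template, personas, and the ask() call are illustrative placeholders.
from typing import Callable, List, Optional, Tuple

PERSONA_TEMPLATE = ("Adopt the identity of {persona}. Answer the question while "
                    "staying in character.\n\n{question}")

def accuracy(ask: Callable[[str], str],
             dataset: List[Tuple[str, str]],
             persona: Optional[str] = None) -> float:
    correct = 0
    for question, gold in dataset:
        prompt = question if persona is None else PERSONA_TEMPLATE.format(persona=persona, question=question)
        correct += ask(prompt).strip().lower() == gold.strip().lower()
    return correct / max(len(dataset), 1)

def persona_accuracy_drops(ask, dataset, personas):
    """Accuracy drop relative to the no-persona baseline, one entry per persona."""
    baseline = accuracy(ask, dataset)
    return {p: baseline - accuracy(ask, dataset, persona=p) for p in personas}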
\ No newline at end of file diff --git a/data/2024/iclr/Biased Temporal Convolution Graph Network for Time Series Forecasting with Missing Values b/data/2024/iclr/Biased Temporal Convolution Graph Network for Time Series Forecasting with Missing Values new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation b/data/2024/iclr/Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation new file mode 100644 index 0000000000..110bda4aa3 --- /dev/null +++ b/data/2024/iclr/Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation @@ -0,0 +1 @@ +We introduce a method to generate temporally coherent human animation from a single image, a video, or a random noise. This problem has been formulated as modeling of an auto-regressive generation, i.e., to regress past frames to decode future frames. However, such unidirectional generation is highly prone to motion drifting over time, generating unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance. To prove our claim, we design a novel human animation framework using a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noises whose intermediate results are cross-conditioned bidirectionally between consecutive frames. In the experiments, our method demonstrates strong performance compared to existing unidirectional approaches with realistic temporal coherence. \ No newline at end of file diff --git a/data/2024/iclr/Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis b/data/2024/iclr/Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis new file mode 100644 index 0000000000..8ddd182030 --- /dev/null +++ b/data/2024/iclr/Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis @@ -0,0 +1 @@ +Bilevel optimization is an important formulation for many machine learning problems. Current bilevel optimization algorithms assume that the gradient of the upper-level function is Lipschitz. However, recent studies reveal that certain neural networks such as recurrent neural networks (RNNs) and long-short-term memory networks (LSTMs) exhibit potential unbounded smoothness, rendering conventional bilevel optimization algorithms unsuitable. In this paper, we design a new bilevel optimization algorithm, namely BO-REP, to address this challenge. This algorithm updates the upper-level variable using normalized momentum and incorporates two novel techniques for updating the lower-level variable: \textit{initialization refinement} and \textit{periodic updates}. Specifically, once the upper-level variable is initialized, a subroutine is invoked to obtain a refined estimate of the corresponding optimal lower-level variable, and the lower-level variable is updated only after every specific period instead of each iteration. When the upper-level problem is nonconvex and unbounded smooth, and the lower-level problem is strongly convex, we prove that our algorithm requires $\widetilde{\mathcal{O}}(1/\epsilon^4)$ iterations to find an $\epsilon$-stationary point in the stochastic setting, where each iteration involves calling a stochastic gradient or Hessian-vector product oracle. 
Notably, this result matches the state-of-the-art complexity results under the bounded smoothness setting and without mean-squared smoothness of the stochastic gradient, up to logarithmic factors. Our proof relies on novel technical lemmas for the periodically updated lower-level variable, which are of independent interest. Our experiments on hyper-representation learning, hyperparameter optimization, and data hyper-cleaning for text classification tasks demonstrate the effectiveness of our proposed algorithm. \ No newline at end of file diff --git a/data/2024/iclr/BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs b/data/2024/iclr/BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs new file mode 100644 index 0000000000..a2cda27772 --- /dev/null +++ b/data/2024/iclr/BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs @@ -0,0 +1 @@ +Foundation models (FMs) are able to leverage large volumes of unlabeled data to demonstrate superior performance across a wide range of tasks. However, FMs developed for biomedical domains have largely remained unimodal, i.e., independently trained and used for tasks on protein sequences alone, small molecule structures alone, or clinical data alone. To overcome this limitation of biomedical FMs, we present BioBridge, a novel parameter-efficient learning framework, to bridge independently trained unimodal FMs to establish multimodal behavior. BioBridge achieves this by utilizing Knowledge Graphs (KG) to learn transformations between one unimodal FM and another without fine-tuning any underlying unimodal FMs. Our empirical results demonstrate that BioBridge can beat the best baseline KG embedding methods (on average by around 76.3%) in cross-modal retrieval tasks. We also find that BioBridge demonstrates out-of-domain generalization ability by extrapolating to unseen modalities or relations. Additionally, we show that BioBridge presents itself as a general-purpose retriever that can aid biomedical multimodal question answering as well as enhance the guided generation of novel drugs. \ No newline at end of file diff --git a/data/2024/iclr/Blending Imitation and Reinforcement Learning for Robust Policy Improvement b/data/2024/iclr/Blending Imitation and Reinforcement Learning for Robust Policy Improvement new file mode 100644 index 0000000000..6390b464c0 --- /dev/null +++ b/data/2024/iclr/Blending Imitation and Reinforcement Learning for Robust Policy Improvement @@ -0,0 +1 @@ +While reinforcement learning (RL) has shown promising performance, its sample complexity continues to be a substantial hurdle, restricting its broader application across a variety of domains. Imitation learning (IL) utilizes oracles to improve sample efficiency, yet it is often constrained by the quality of the oracles deployed. To address this, we propose Robust Policy Improvement (RPI), which actively interleaves between IL and RL based on an online estimate of their performance. RPI draws on the strengths of IL, using oracle queries to facilitate exploration, an aspect that is notably challenging in sparse-reward RL, particularly during the early stages of learning. As learning unfolds, RPI gradually transitions to RL, effectively treating the learned policy as an improved oracle. This algorithm is capable of learning from and improving upon a diverse set of black-box oracles.
Integral to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient (RPG), both of which reason over whether to perform state-wise imitation from the oracles or to learn from the learner's own value function when the learner's performance surpasses that of the oracles in a specific state. Empirical evaluations and theoretical analysis validate that RPI excels in comparison to existing state-of-the-art methodologies, demonstrating superior performance across various benchmark domains. \ No newline at end of file diff --git a/data/2024/iclr/Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World b/data/2024/iclr/Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World new file mode 100644 index 0000000000..9e185b36e9 --- /dev/null +++ b/data/2024/iclr/Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World @@ -0,0 +1 @@ +We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning for machine vision. It originates from the classical Bongard Problems (BPs): Given two sets of images (positive and negative), the model needs to identify the set that query images belong to by inducing the visual concepts, which are exclusively depicted by images from the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already poses a significant challenge to current few-shot reasoning algorithms. We further investigate to what extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs, and combining VLMs and LLMs in an interactive reasoning scheme. We even conceived a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches manage to close the human-machine gap, as the best learner achieves 64% accuracy while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities. \ No newline at end of file diff --git a/data/2024/iclr/BooookScore: A systematic exploration of book-length summarization in the era of LLMs b/data/2024/iclr/BooookScore: A systematic exploration of book-length summarization in the era of LLMs new file mode 100644 index 0000000000..3c56b69737 --- /dev/null +++ b/data/2024/iclr/BooookScore: A systematic exploration of book-length summarization in the era of LLMs @@ -0,0 +1 @@ +Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. 
Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book-length summarization datasets (e.g., BookSum) are in the pretraining data of most public LLMs, and existing evaluation methods struggle to capture errors made by modern LLM summarizers. In this paper, we present the first study of the coherence of LLM-based book-length summarizers implemented via two prompting workflows: (1) hierarchically merging chunk-level summaries, and (2) incrementally updating a running summary. We obtain 1193 fine-grained human annotations on GPT-4 generated summaries of 100 recently-published books and identify eight common types of coherence errors made by LLMs. Because human evaluation is expensive and time-consuming, we develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. BooookScore has high agreement with human annotations and allows us to systematically evaluate the impact of many other critical parameters (e.g., chunk size, base LLM) while saving $15K USD and 500 hours in human evaluation costs. We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than those generated by open-source models. While LLaMA 2 falls behind other models, Mixtral achieves performance on par with GPT-3.5-Turbo. Incremental updating yields lower BooookScore but higher level of detail than hierarchical merging, a trade-off sometimes preferred by annotators. \ No newline at end of file diff --git a/data/2024/iclr/Boosting Graph Anomaly Detection with Adaptive Message Passing b/data/2024/iclr/Boosting Graph Anomaly Detection with Adaptive Message Passing new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Boosting Vanilla Lightweight Vision Transformers via Re-parameterization b/data/2024/iclr/Boosting Vanilla Lightweight Vision Transformers via Re-parameterization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models b/data/2024/iclr/Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models new file mode 100644 index 0000000000..794a0f0bf1 --- /dev/null +++ b/data/2024/iclr/Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models @@ -0,0 +1 @@ +The reasoning performance of Large Language Models (LLMs) on a wide range of problems critically relies on chain-of-thought prompting, which involves providing a few chain of thought demonstrations as exemplars in prompts. Recent work, e.g., Tree of Thoughts, has pointed out the importance of exploration and self-evaluation in reasoning step selection for complex problem solving. In this paper, we present Boosting of Thoughts (BoT), an automated prompting framework for problem solving with LLMs by iteratively exploring and self-evaluating many trees of thoughts in order to acquire an ensemble of trial-and-error reasoning experiences, which will serve as a new form of prompting to solve the complex problem. Starting from a simple prompt without requiring examples, BoT iteratively explores and evaluates a large collection of reasoning steps, and more importantly, uses error analysis obtained from the LLM on them to explicitly revise prompting, which in turn enhances reasoning step generation, until a final answer is attained. 
Our experiments with GPT-4 and Llama2 across extensive complex mathematical problems demonstrate that BoT consistently achieves problem-solving rates higher than or comparable to those of other advanced prompting approaches. \ No newline at end of file diff --git a/data/2024/iclr/Boosting the Adversarial Robustness of Graph Neural Networks: An OOD Perspective b/data/2024/iclr/Boosting the Adversarial Robustness of Graph Neural Networks: An OOD Perspective new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bootstrapping Variational Information Pursuit with Large Language and Vision Models for Interpretable Image Classification b/data/2024/iclr/Bootstrapping Variational Information Pursuit with Large Language and Vision Models for Interpretable Image Classification new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Boundary Denoising for Video Activity Localization b/data/2024/iclr/Boundary Denoising for Video Activity Localization new file mode 100644 index 0000000000..cf38ab0cfe --- /dev/null +++ b/data/2024/iclr/Boundary Denoising for Video Activity Localization @@ -0,0 +1 @@ +Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenoiseLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. Then we attempt to reverse this process by boundary denoising, allowing the localizer to predict activities with precise boundaries and resulting in faster convergence speed. Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset and +1.64% mAP@0.5 on the THUMOS'14 dataset over the baseline. Moreover, DenoiseLoc achieves state-of-the-art performance on the TACoS and MAD datasets, but with far fewer predictions compared to other current methods. \ No newline at end of file diff --git a/data/2024/iclr/Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments b/data/2024/iclr/Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments new file mode 100644 index 0000000000..b2c7e0b719 --- /dev/null +++ b/data/2024/iclr/Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments @@ -0,0 +1 @@ +Bounding boxes uniquely characterize object detection, where a good detector gives accurate bounding boxes of categories of interest. However, in the real world, where test ground truths are not provided, it is non-trivial to find out whether bounding boxes are accurate, thus preventing us from assessing the detector generalization ability. 
In this work, we find that, under feature map dropout, good detectors tend to output bounding boxes whose locations do not change much, while bounding boxes of poor detectors will undergo noticeable position changes. We compute the box stability score (BoS score) to reflect this stability. Specifically, given an image, we compute a normal set of bounding boxes and a second set after feature map dropout. To obtain the BoS score, we use bipartite matching to find the corresponding boxes between the two sets and compute the average Intersection over Union (IoU) across the entire test set. We find that the BoS score has a strong, positive correlation with detection accuracy measured by mean average precision (mAP) under various test environments. This relationship allows us to predict the accuracy of detectors on various real-world test sets without accessing test ground truths, verified on canonical detection tasks such as vehicle detection and pedestrian detection. Code and data are available at https://github.com/YangYangGirl/BoS. \ No newline at end of file diff --git a/data/2024/iclr/Bounding the Expected Robustness of Graph Neural Networks Subject to Node Feature Attacks b/data/2024/iclr/Bounding the Expected Robustness of Graph Neural Networks Subject to Node Feature Attacks new file mode 100644 index 0000000000..db07bcc6e1 --- /dev/null +++ b/data/2024/iclr/Bounding the Expected Robustness of Graph Neural Networks Subject to Node Feature Attacks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) have demonstrated state-of-the-art performance in various graph representation learning tasks. Recently, studies revealed their vulnerability to adversarial attacks. In this work, we theoretically define the concept of expected robustness in the context of attributed graphs and relate it to the classical definition of adversarial robustness in the graph representation learning literature. Our definition allows us to derive an upper bound of the expected robustness of Graph Convolutional Networks (GCNs) and Graph Isomorphism Networks subject to node feature attacks. Building on these findings, we connect the expected robustness of GNNs to the orthonormality of their weight matrices and consequently propose an attack-independent, more robust variant of the GCN, called the Graph Convolutional Orthonormal Robust Networks (GCORNs). We further introduce a probabilistic method to estimate the expected robustness, which allows us to evaluate the effectiveness of GCORN on several real-world datasets. Extensive experiments show that GCORN outperforms available defense methods. Our code is publicly available at: \href{https://github.com/Sennadir/GCORN}{https://github.com/Sennadir/GCORN}. \ No newline at end of file diff --git a/data/2024/iclr/Bounds on Representation-Induced Confounding Bias for Treatment Effect Estimation b/data/2024/iclr/Bounds on Representation-Induced Confounding Bias for Treatment Effect Estimation new file mode 100644 index 0000000000..09c7a3d202 --- /dev/null +++ b/data/2024/iclr/Bounds on Representation-Induced Confounding Bias for Treatment Effect Estimation @@ -0,0 +1 @@ +State-of-the-art methods for conditional average treatment effect (CATE) estimation make widespread use of representation learning. Here, the idea is to reduce the variance of the low-sample CATE estimation by a (potentially constrained) low-dimensional representation. 
However, low-dimensional representations can lose information about the observed confounders and thus lead to bias, which can invalidate representation learning for CATE estimation. In this paper, we propose a new, representation-agnostic refutation framework for estimating bounds on the representation-induced confounding bias that comes from dimensionality reduction (or other constraints on the representations) in CATE estimation. First, we establish theoretically under which conditions CATE is non-identifiable given low-dimensional (constrained) representations. Second, as our remedy, we propose a neural refutation framework which performs partial identification of CATE or, equivalently, aims at estimating lower and upper bounds of the representation-induced confounding bias. We demonstrate the effectiveness of our bounds in a series of experiments. In sum, our refutation framework is of direct relevance in practice where the validity of CATE estimation is of importance. \ No newline at end of file diff --git a/data/2024/iclr/Brain decoding: toward real-time reconstruction of visual perception b/data/2024/iclr/Brain decoding: toward real-time reconstruction of visual perception new file mode 100644 index 0000000000..be8320a96b --- /dev/null +++ b/data/2024/iclr/Brain decoding: toward real-time reconstruction of visual perception @@ -0,0 +1 @@ +In the past five years, the use of generative and foundational AI systems has greatly improved the decoding of brain activity. Visual perception, in particular, can now be decoded from functional Magnetic Resonance Imaging (fMRI) with remarkable fidelity. This neuroimaging technique, however, suffers from a limited temporal resolution ($\approx$0.5 Hz), which fundamentally constrains its real-time usage. Here, we propose an alternative approach based on magnetoencephalography (MEG), a neuroimaging device capable of measuring brain activity with high temporal resolution ($\approx$5,000 Hz). For this, we develop an MEG decoding model trained with both contrastive and regression objectives and consisting of three modules: i) pretrained embeddings obtained from the image, ii) an MEG module trained end-to-end and iii) a pretrained image generator. Our results are threefold: First, our MEG decoder shows a 7X improvement in image retrieval over classic linear decoders. Second, late brain responses to images are best decoded with DINOv2, a recent foundational image model. Third, image retrievals and generations both suggest that high-level visual features can be decoded from MEG signals, although the same approach applied to 7T fMRI also recovers better low-level features. Overall, these results, while preliminary, provide an important step towards the decoding -- in real-time -- of the visual processes continuously unfolding within the human brain. \ No newline at end of file diff --git a/data/2024/iclr/BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity b/data/2024/iclr/BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity new file mode 100644 index 0000000000..b4bae1f2b3 --- /dev/null +++ b/data/2024/iclr/BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity @@ -0,0 +1 @@ +Understanding the functional organization of higher visual cortex is a central focus in neuroscience. 
Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may potentially bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex. \ No newline at end of file diff --git a/data/2024/iclr/Branch-GAN: Improving Text Generation with (not so) Large Language Models b/data/2024/iclr/Branch-GAN: Improving Text Generation with (not so) Large Language Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages b/data/2024/iclr/Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bridging Neural and Symbolic Representations with Transitional Dictionary Learning b/data/2024/iclr/Bridging Neural and Symbolic Representations with Transitional Dictionary Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Bridging State and History Representations: Understanding Self-Predictive RL b/data/2024/iclr/Bridging State and History Representations: Understanding Self-Predictive RL new file mode 100644 index 0000000000..564dd2956a --- /dev/null +++ b/data/2024/iclr/Bridging State and History Representations: Understanding Self-Predictive RL @@ -0,0 +1 @@ +Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. 
These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners. \ No newline at end of file diff --git a/data/2024/iclr/Bridging Vision and Language Spaces with Assignment Prediction b/data/2024/iclr/Bridging Vision and Language Spaces with Assignment Prediction new file mode 100644 index 0000000000..21b98191ce --- /dev/null +++ b/data/2024/iclr/Bridging Vision and Language Spaces with Assignment Prediction @@ -0,0 +1 @@ +This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assigning procedure as an optimal transport problem. We predict the assignment of one modality from the representation of the other modality, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to contain the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, a robust semantic taxonomy of LLMs can be preserved with visual data since the LLMs interpret and reason linguistic information from correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over the previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold a semantic taxonomy of LLMs, making visual semantic arithmetic possible. \ No newline at end of file diff --git a/data/2024/iclr/BroGNet: Momentum-Conserving Graph Neural Stochastic Differential Equation for Learning Brownian Dynamics b/data/2024/iclr/BroGNet: Momentum-Conserving Graph Neural Stochastic Differential Equation for Learning Brownian Dynamics new file mode 100644 index 0000000000..a6d034c9b6 --- /dev/null +++ b/data/2024/iclr/BroGNet: Momentum-Conserving Graph Neural Stochastic Differential Equation for Learning Brownian Dynamics @@ -0,0 +1 @@ +Neural networks (NNs) that exploit strong inductive biases based on physical laws and symmetries have shown remarkable success in learning the dynamics of physical systems directly from their trajectory. However, these works focus only on systems that follow deterministic dynamics, such as Newtonian or Hamiltonian dynamics. Here, we propose a framework, namely Brownian graph neural networks (BroGNet), combining stochastic differential equations (SDEs) and GNNs to learn Brownian dynamics directly from the trajectory. We modify the architecture of BroGNet to enforce linear momentum conservation of the system, which, in turn, provides superior performance on learning dynamics as revealed empirically. 
We demonstrate this approach on several systems, namely, linear spring, linear spring with binary particle types, and non-linear spring systems, all following Brownian dynamics at finite temperatures. We show that BroGNet significantly outperforms proposed baselines across all the benchmarked Brownian systems. In addition, we demonstrate zero-shot generalizability of BroGNet to simulate unseen system sizes that are two orders of magnitude larger and to different temperatures than those used during training. Finally, we show that BroGNet conserves the momentum of the system, resulting in superior performance and data efficiency. Altogether, our study contributes to advancing the understanding of the intricate dynamics of Brownian motion and demonstrates the effectiveness of graph neural networks in modeling such complex systems. \ No newline at end of file diff --git a/data/2024/iclr/Brusleattack: a Query-Efficient Score- based Black-Box Sparse Adversarial Attack b/data/2024/iclr/Brusleattack: a Query-Efficient Score- based Black-Box Sparse Adversarial Attack new file mode 100644 index 0000000000..cdb0e0a198 --- /dev/null +++ b/data/2024/iclr/Brusleattack: a Query-Efficient Score- based Black-Box Sparse Adversarial Attack @@ -0,0 +1 @@ +We study the unique, less well-understood problem of generating sparse adversarial samples simply by observing the score-based replies to model queries. Sparse attacks aim to discover a minimum number of l0-bounded perturbations to model inputs to craft adversarial examples and misguide model decisions. However, in contrast to query-based dense attack counterparts against black-box models, constructing sparse adversarial perturbations, even when models serve confidence score information to queries in a score-based setting, is non-trivial. This is because such an attack leads to i) an NP-hard problem and ii) a non-differentiable search space. We develop BruSLeAttack, a new, faster (more query-efficient) Bayesian algorithm for the problem. We conduct extensive attack evaluations including an attack demonstration against a Machine Learning as a Service (MLaaS) offering exemplified by Google Cloud Vision and robustness testing of adversarial training regimes and a recent defense against black-box attacks. The proposed attack scales to achieve state-of-the-art attack success rates and query efficiency on standard computer vision tasks such as ImageNet across different model architectures. Our artefacts and DIY attack samples are available on GitHub. Importantly, our work facilitates faster evaluation of model vulnerabilities and raises our vigilance on the safety, security and reliability of deployed systems. \ No newline at end of file diff --git a/data/2024/iclr/Building Cooperative Embodied Agents Modularly with Large Language Models b/data/2024/iclr/Building Cooperative Embodied Agents Modularly with Large Language Models new file mode 100644 index 0000000000..794eeb20a7 --- /dev/null +++ b/data/2024/iclr/Building Cooperative Embodied Agents Modularly with Large Language Models @@ -0,0 +1 @@ +In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. 
While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates with perception, memory, and execution. We thus build a Cooperative Embodied Language Agent, CoELA, which can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current Open LMs like LLAMA-2 still underperform, we fine-tune a CoELA with data collected with our agents and show how they can achieve promising performance. We also conducted a user study for human-agent interaction and discovered that CoELA communicating in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation. Videos can be found on the project website https://vis-www.cs.umass.edu/Co-LLM-Agents/. \ No newline at end of file diff --git a/data/2024/iclr/Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression b/data/2024/iclr/Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression new file mode 100644 index 0000000000..022a343015 --- /dev/null +++ b/data/2024/iclr/Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression @@ -0,0 +1 @@ +This work studies training instabilities of behavior cloning with deep neural networks. We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards, despite negligibly affecting the behavior cloning loss. We empirically disentangle the statistical and computational causes of these oscillations, and find them to stem from the chaotic propagation of minibatch SGD noise through unstable closed-loop dynamics. While SGD noise is benign in the single-step action prediction objective, it results in catastrophic error accumulation over long horizons, an effect we term gradient variance amplification (GVA). We show that many standard mitigation techniques do not alleviate GVA, but find an exponential moving average (EMA) of iterates to be surprisingly effective at doing so. We illustrate the generality of this phenomenon by showing the existence of GVA and its amelioration by EMA in both continuous control and autoregressive language generation. Finally, we provide theoretical vignettes that highlight the benefits of EMA in alleviating GVA and shed light on the extent to which classical convex models can help in understanding the benefits of iterate averaging in deep learning. 
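The EMA-of-iterates remedy described in the abstract above is simple to state in code. The following is a minimal sketch, not the authors' implementation, of maintaining an exponential moving average of policy parameters alongside ordinary SGD updates in PyTorch; policy, loader, loss_fn, and the decay rate ema_decay are illustrative placeholders.

# Minimal sketch (assumed setup, not the paper's code): keep an exponential
# moving average (EMA) of the policy weights while training with minibatch SGD.
import copy
import torch

def train_with_ema(policy, loader, loss_fn, lr=1e-3, ema_decay=0.999):
    ema_policy = copy.deepcopy(policy)          # frozen copy holding the averaged weights
    for p in ema_policy.parameters():
        p.requires_grad_(False)
    opt = torch.optim.SGD(policy.parameters(), lr=lr)
    for obs, act in loader:
        opt.zero_grad()
        loss = loss_fn(policy(obs), act)        # single-step behavior cloning loss
        loss.backward()
        opt.step()
        with torch.no_grad():                   # theta_ema <- d * theta_ema + (1 - d) * theta
            for p_ema, p in zip(ema_policy.parameters(), policy.parameters()):
                p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
    return ema_policy                           # roll out the averaged policy at evaluation time

Rollouts would then use the returned averaged policy rather than the raw SGD iterate, which is the iterate averaging the abstract credits with suppressing GVA.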
\ No newline at end of file diff --git a/data/2024/iclr/Byzantine Robust Cooperative Multi-Agent Reinforcement Learning as a Bayesian Game b/data/2024/iclr/Byzantine Robust Cooperative Multi-Agent Reinforcement Learning as a Bayesian Game new file mode 100644 index 0000000000..ca6d7a4709 --- /dev/null +++ b/data/2024/iclr/Byzantine Robust Cooperative Multi-Agent Reinforcement Learning as a Bayesian Game @@ -0,0 +1 @@ +In this study, we explore the robustness of cooperative multi-agent reinforcement learning (c-MARL) against Byzantine failures, where any agent can enact arbitrary, worst-case actions due to malfunction or adversarial attack. To address the uncertainty that any agent can be adversarial, we propose a Bayesian Adversarial Robust Dec-POMDP (BARDec-POMDP) framework, which views Byzantine adversaries as nature-dictated types, represented by a separate transition. This allows agents to learn policies grounded on their posterior beliefs about the type of other agents, fostering collaboration with identified allies and minimizing vulnerability to adversarial manipulation. We define the optimal solution to the BARDec-POMDP as an ex post robust Bayesian Markov perfect equilibrium, which we prove exists and weakly dominates the equilibrium of previous robust MARL approaches. To realize this equilibrium, we put forward a two-timescale actor-critic algorithm with almost sure convergence under specific conditions. Experiments on matrix games, level-based foraging and StarCraft II indicate that, even under worst-case perturbations, our method successfully acquires intricate micromanagement skills and adaptively aligns with allies, demonstrating resilience against non-oblivious adversaries, random allies, observation-based attacks, and transfer-based attacks. \ No newline at end of file diff --git a/data/2024/iclr/C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion b/data/2024/iclr/C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion new file mode 100644 index 0000000000..9cd627fef9 --- /dev/null +++ b/data/2024/iclr/C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion @@ -0,0 +1 @@ +In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration, which is a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts at test time with enhanced calibration. 
Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT. \ No newline at end of file diff --git a/data/2024/iclr/CABINET: Content Relevance-based Noise Reduction for Table Question Answering b/data/2024/iclr/CABINET: Content Relevance-based Noise Reduction for Table Question Answering new file mode 100644 index 0000000000..d1c217d0fc --- /dev/null +++ b/data/2024/iclr/CABINET: Content Relevance-based Noise Reduction for Table Question Answering @@ -0,0 +1 @@ +Table understanding capability of Large Language Models (LLMs) has been extensively studied through the task of question-answering (QA) over tables. Typically, only a small part of the whole table is relevant to derive the answer for a given question. The irrelevant parts act as noise and are distracting information, resulting in sub-optimal performance due to the vulnerability of LLMs to noise. To mitigate this, we propose CABINET (Content RelevAnce-Based NoIse ReductioN for TablE QuesTion-Answering) - a framework to enable LLMs to focus on relevant tabular data by suppressing extraneous information. CABINET comprises an Unsupervised Relevance Scorer (URS), trained differentially with the QA LLM, that weighs the table content based on its relevance to the input question before feeding it to the question-answering LLM (QA LLM). To further aid the relevance scorer, CABINET employs a weakly supervised module that generates a parsing statement describing the criteria of rows and columns relevant to the question and highlights the content of corresponding table cells. CABINET significantly outperforms various tabular LLM baselines, as well as GPT3-based in-context learning methods, is more robust to noise, maintains outperformance on tables of varying sizes, and establishes new SoTA performance on WikiTQ, FeTaQA, and WikiSQL datasets. We release our code and datasets at https://github.com/Sohanpatnaik106/CABINET_QA. \ No newline at end of file diff --git a/data/2024/iclr/CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling b/data/2024/iclr/CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling new file mode 100644 index 0000000000..1f6402ceb5 --- /dev/null +++ b/data/2024/iclr/CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling @@ -0,0 +1 @@ +While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. 
Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 and 512$\times$512 respectively. \ No newline at end of file diff --git a/data/2024/iclr/CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV Perception b/data/2024/iclr/CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV Perception new file mode 100644 index 0000000000..03844dc21c --- /dev/null +++ b/data/2024/iclr/CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV Perception @@ -0,0 +1 @@ +Perception is crucial in the realm of autonomous driving systems, where bird's eye view (BEV)-based architectures have recently reached state-of-the-art performance. The desirability of self-supervised representation learning stems from the expensive and laborious process of annotating 2D and 3D data. Although previous research has investigated pretraining methods for both LiDAR and camera-based 3D object detection, a unified pretraining framework for multimodal BEV perception is missing. In this study, we introduce CALICO, a novel framework that applies contrastive objectives to both LiDAR and camera backbones. Specifically, CALICO incorporates two stages: point-region contrast (PRC) and region-aware distillation (RAD). PRC better balances the region- and scene-level representation learning on the LiDAR modality and offers significant performance improvement compared to existing methods. RAD effectively achieves contrastive distillation on our self-trained teacher model. CALICO's efficacy is substantiated by extensive evaluations on 3D object detection and BEV map segmentation tasks, where it delivers significant performance improvements. Notably, CALICO outperforms the baseline method by 10.5% and 8.6% on NDS and mAP. Moreover, CALICO boosts the robustness of multimodal 3D object detection against adversarial attacks and corruption. Additionally, our framework can be tailored to different backbones and heads, positioning it as a promising approach for multimodal BEV perception. \ No newline at end of file diff --git a/data/2024/iclr/CAMBranch: Contrastive Learning with Augmented MILPs for Branching b/data/2024/iclr/CAMBranch: Contrastive Learning with Augmented MILPs for Branching new file mode 100644 index 0000000000..34cc2c142f --- /dev/null +++ b/data/2024/iclr/CAMBranch: Contrastive Learning with Augmented MILPs for Branching @@ -0,0 +1 @@ +Recent advancements have introduced machine learning frameworks to enhance the Branch and Bound (B\&B) branching policies for solving Mixed Integer Linear Programming (MILP). These methods, primarily relying on imitation learning of Strong Branching, have shown superior performance. However, collecting expert samples for imitation learning, particularly for Strong Branching, is a time-consuming endeavor. To address this challenge, we propose \textbf{C}ontrastive Learning with \textbf{A}ugmented \textbf{M}ILPs for \textbf{Branch}ing (CAMBranch), a framework that generates Augmented MILPs (AMILPs) by applying variable shifting to limited expert data from their original MILPs. This approach enables the acquisition of a considerable number of labeled expert samples. CAMBranch leverages both MILPs and AMILPs for imitation learning and employs contrastive learning to enhance the model's ability to capture MILP features, thereby improving the quality of branching decisions. 
Experimental results demonstrate that CAMBranch, trained with only 10% of the complete dataset, exhibits superior performance. Ablation studies further validate the effectiveness of our method. \ No newline at end of file diff --git a/data/2024/iclr/CAMIL: Context-Aware Multiple Instance Learning for Cancer Detection and Subtyping in Whole Slide Images b/data/2024/iclr/CAMIL: Context-Aware Multiple Instance Learning for Cancer Detection and Subtyping in Whole Slide Images new file mode 100644 index 0000000000..0a8d3031ba --- /dev/null +++ b/data/2024/iclr/CAMIL: Context-Aware Multiple Instance Learning for Cancer Detection and Subtyping in Whole Slide Images @@ -0,0 +1 @@ +The visual examination of tissue biopsy sections is fundamental for cancer diagnosis, with pathologists analyzing sections at multiple magnifications to discern tumor cells and their subtypes. However, existing attention-based multiple instance learning (MIL) models, used for analyzing Whole Slide Images (WSIs) in cancer diagnostics, often overlook the contextual information of tumor and neighboring tiles, leading to misclassifications. To address this, we propose the Context-Aware Multiple Instance Learning (CAMIL) architecture. CAMIL incorporates neighbor-constrained attention to consider dependencies among tiles within a WSI and integrates contextual constraints as prior knowledge into the MIL model. We evaluated CAMIL on subtyping non-small cell lung cancer (TCGA-NSCLC) and detecting lymph node (CAMELYON16) metastasis, achieving test AUCs of 0.959 and 0.975, respectively, outperforming other state-of-the-art methods. Additionally, CAMIL enhances model interpretability by identifying regions of high diagnostic value. \ No newline at end of file diff --git a/data/2024/iclr/CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting b/data/2024/iclr/CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting new file mode 100644 index 0000000000..74814c0daf --- /dev/null +++ b/data/2024/iclr/CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting @@ -0,0 +1 @@ +Recent studies have demonstrated the great power of Transformer models for time series forecasting. One of the key elements that leads to the Transformer's success is the channel-independent (CI) strategy to improve the training robustness. However, ignoring the correlation among different channels in CI limits the model's forecasting capacity. In this work, we design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of CI-type Transformers in time series forecasting. First, CARD introduces a channel-aligned attention structure that allows it to capture both temporal correlations among signals and dynamical dependence among multiple variables over time. Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions. Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue. This new loss function weights the importance of forecasting over a finite horizon based on prediction uncertainties. Our evaluation on multiple long-term and short-term forecasting datasets demonstrates that CARD significantly outperforms state-of-the-art time series forecasting methods. 
The code is available at the following repository:https://github.com/wxie9/CARD \ No newline at end of file diff --git a/data/2024/iclr/CAS: A Probability-Based Approach for Universal Condition Alignment Score b/data/2024/iclr/CAS: A Probability-Based Approach for Universal Condition Alignment Score new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/CCIL: Continuity-Based Data Augmentation for Corrective Imitation Learning b/data/2024/iclr/CCIL: Continuity-Based Data Augmentation for Corrective Imitation Learning new file mode 100644 index 0000000000..05cee5b0bd --- /dev/null +++ b/data/2024/iclr/CCIL: Continuity-Based Data Augmentation for Corrective Imitation Learning @@ -0,0 +1 @@ +We present a new technique to enhance the robustness of imitation learning methods by generating corrective data to account for compounding errors and disturbances. While existing methods rely on interactive expert labeling, additional offline datasets, or domain-specific invariances, our approach requires minimal additional assumptions beyond access to expert data. The key insight is to leverage local continuity in the environment dynamics to generate corrective labels. Our method first constructs a dynamics model from the expert demonstration, encouraging local Lipschitz continuity in the learned model. In locally continuous regions, this model allows us to generate corrective labels within the neighborhood of the demonstrations but beyond the actual set of states and actions in the dataset. Training on this augmented data enhances the agent's ability to recover from perturbations and deal with compounding errors. We demonstrate the effectiveness of our generated labels through experiments in a variety of robotics domains in simulation that have distinct forms of continuity and discontinuity, including classic control problems, drone flying, navigation with high-dimensional sensor observations, legged locomotion, and tabletop manipulation. \ No newline at end of file diff --git a/data/2024/iclr/CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis b/data/2024/iclr/CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis new file mode 100644 index 0000000000..b29798d7a5 --- /dev/null +++ b/data/2024/iclr/CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis @@ -0,0 +1 @@ +Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. 
We also discuss other fields that would benefit from CIFAR-10-W. \ No newline at end of file diff --git a/data/2024/iclr/CLAP: Collaborative Adaptation for Patchwork Learning b/data/2024/iclr/CLAP: Collaborative Adaptation for Patchwork Learning new file mode 100644 index 0000000000..6bbea76fb6 --- /dev/null +++ b/data/2024/iclr/CLAP: Collaborative Adaptation for Patchwork Learning @@ -0,0 +1 @@ +our \ No newline at end of file diff --git a/data/2024/iclr/CLEX: Continuous Length Extrapolation for Large Language Models b/data/2024/iclr/CLEX: Continuous Length Extrapolation for Large Language Models new file mode 100644 index 0000000000..ddb958a42f --- /dev/null +++ b/data/2024/iclr/CLEX: Continuous Length Extrapolation for Large Language Models @@ -0,0 +1 @@ +Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks; however, their exceptional capabilities are restricted within the preset context window of the Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, either exhibit notable limitations in their extrapolation abilities or sacrifice part of the performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX facilitates length extrapolation with impressive performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results reveal that CLEX can effectively extend the context window to over 4x or almost 8x the training length, with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k. Our code is available at https://github.com/DAMO-NLP-SG/CLEX. \ No newline at end of file diff --git a/data/2024/iclr/CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? b/data/2024/iclr/CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? new file mode 100644 index 0000000000..fb3ced9333 --- /dev/null +++ b/data/2024/iclr/CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? @@ -0,0 +1 @@ +We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e. in first- and second-order statistics) in multimodal data. We use M4 to conduct an in-depth analysis taking into account various factors, such as the model, representation, and data size. 
Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems. \ No newline at end of file diff --git a/data/2024/iclr/CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding b/data/2024/iclr/CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding new file mode 100644 index 0000000000..a0f00a2947 --- /dev/null +++ b/data/2024/iclr/CLIP-MUSED: CLIP-Guided Multi-Subject Visual Neural Information Semantic Decoding @@ -0,0 +1 @@ +The study of decoding visual neural information faces challenges in generalizing single-subject decoding models to multiple subjects, due to individual differences. Moreover, the limited availability of data from a single subject has a constraining impact on model performance. Although prior multi-subject decoding methods have made significant progress, they still suffer from several limitations, including difficulty in extracting global neural response features, linear scaling of model parameters with the number of subjects, and inadequate characterization of the relationship between neural responses of different subjects to various stimuli. To overcome these limitations, we propose a CLIP-guided Multi-sUbject visual neural information SEmantic Decoding (CLIP-MUSED) method. Our method consists of a Transformer-based feature extractor to effectively model global neural representations. It also incorporates learnable subject-specific tokens that facilitates the aggregation of multi-subject data without a linear increase of parameters. Additionally, we employ representational similarity analysis (RSA) to guide token representation learning based on the topological relationship of visual stimuli in the representation space of CLIP, enabling full characterization of the relationship between neural responses of different subjects under different stimuli. Finally, token representations are used for multi-subject semantic decoding. Our proposed method outperforms single-subject decoding methods and achieves state-of-the-art performance among the existing multi-subject methods on two fMRI datasets. Visualization results provide insights into the effectiveness of our proposed method. Code is available at https://github.com/CLIP-MUSED/CLIP-MUSED. \ No newline at end of file diff --git a/data/2024/iclr/CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction b/data/2024/iclr/CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction new file mode 100644 index 0000000000..ea4d75e74c --- /dev/null +++ b/data/2024/iclr/CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction @@ -0,0 +1 @@ +Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). 
CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf. \ No newline at end of file diff --git a/data/2024/iclr/CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech b/data/2024/iclr/CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech new file mode 100644 index 0000000000..b492a8b0d2 --- /dev/null +++ b/data/2024/iclr/CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech @@ -0,0 +1 @@ +With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modelling the multiple sequences. To mitigate these issues, we present CLaM-TTS that employs a probabilistic residual vector quantization to (1) achieve superior compression in the token length, and (2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the language models and their text tokenization strategies on performances. \ No newline at end of file diff --git a/data/2024/iclr/CNN Kernels Can Be the Best Shapelets b/data/2024/iclr/CNN Kernels Can Be the Best Shapelets new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/CO2: Efficient Distributed Training with Full Communication-Computation Overlap b/data/2024/iclr/CO2: Efficient Distributed Training with Full Communication-Computation Overlap new file mode 100644 index 0000000000..e8b2451f7a --- /dev/null +++ b/data/2024/iclr/CO2: Efficient Distributed Training with Full Communication-Computation Overlap @@ -0,0 +1 @@ +The fundamental success of large language models hinges upon the efficacious implementation of large-scale distributed training techniques. 
Nevertheless, building a vast, high-performance cluster featuring high-speed communication interconnectivity is prohibitively costly, and accessible only to prominent entities. In this work, we aim to lower this barrier and democratize large-scale training with limited bandwidth clusters. We propose a new approach called CO2 that introduces local-updating and asynchronous communication to distributed data-parallel training, thereby facilitating the full overlap of COmmunication with COmputation. CO2 is able to attain high scalability even on extensive multi-node clusters constrained by very limited communication bandwidth. We further propose the staleness gap penalty and outer momentum clipping techniques together with CO2 to bolster its convergence and training stability. Besides, CO2 exhibits seamless integration with well-established ZeRO-series optimizers, which mitigate the memory consumption of model states in large model training. We also provide a mathematical proof of convergence, accompanied by the establishment of a stringent upper bound. Furthermore, we validate our findings through an extensive set of practical experiments encompassing a wide range of tasks in the fields of computer vision and natural language processing. These experiments serve to demonstrate the capabilities of CO2 in terms of convergence, generalization, and scalability when deployed across configurations comprising up to 128 A100 GPUs. The outcomes emphasize the outstanding capacity of CO2 to greatly improve scalability, whether on clusters with 800Gbps RDMA or 80Gbps TCP/IP inter-node connections. \ No newline at end of file diff --git a/data/2024/iclr/COCO-Periph: Bridging the Gap Between Human and Machine Perception in the Periphery b/data/2024/iclr/COCO-Periph: Bridging the Gap Between Human and Machine Perception in the Periphery new file mode 100644 index 0000000000..234c768bbb --- /dev/null +++ b/data/2024/iclr/COCO-Periph: Bridging the Gap Between Human and Machine Perception in the Periphery @@ -0,0 +1 @@ +Evaluating deep neural networks (DNNs) as models of human perception has given rich insights into both human visual processing and representational properties of DNNs. We extend this work by analyzing how well DNNs perform compared to humans when constrained by peripheral vision – which limits human performance on a variety of tasks, but also benefits the visual system significantly. We evaluate this by (1) modifying the texture tiling model (TTM), a well-tested model of peripheral vision, to be more flexibly used with DNNs, (2) generating a large dataset which we call COCO-Periph that contains images transformed to capture the information available in human peripheral vision, and (3) comparing DNNs to humans at peripheral object detection using a psychophysics experiment. Our results show that common DNNs underperform at object detection compared to humans when simulating peripheral vision with TTM. Training on COCO-Periph begins to reduce the gap between human and DNN performance and leads to small increases in corruption robustness, but DNNs still struggle to capture human-like sensitivity to peripheral clutter. Our work brings us closer to accurately modeling human vision, and paves the way for DNNs to mimic and sometimes benefit from properties of human visual processing. 
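A minimal sketch of the local-updating plus asynchronous-communication idea described in the CO2 abstract above: each worker takes several purely local optimizer steps while the parameter averaging launched in the previous round is still in flight, so communication overlaps with computation. The function and variable names are illustrative assumptions, and the sketch omits CO2's staleness gap penalty, outer momentum clipping, and ZeRO integration.

import torch
import torch.distributed as dist

def co2_style_round(model, optimizer, loss_fn, data_iter, local_steps, pending=None):
    # Finish the synchronization launched in the previous round, if any; it has
    # been overlapping with the local steps executed since it was issued.
    if pending is not None:
        handles, averaged = pending
        for h in handles:
            h.wait()
        with torch.no_grad():
            for p, avg in zip(model.parameters(), averaged):
                p.copy_(avg / dist.get_world_size())

    # Several purely local updates, with no per-step gradient all-reduce.
    for _ in range(local_steps):
        x, y = next(data_iter)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    # Launch an asynchronous all-reduce of the current parameters; the next
    # round's local computation proceeds while these messages are in transit.
    averaged = [p.detach().clone() for p in model.parameters()]
    handles = [dist.all_reduce(t, async_op=True) for t in averaged]
    return handles, averaged

Calling co2_style_round in a loop and feeding the returned (handles, averaged) back in as pending keeps exactly one synchronization in flight at a time, which is the communication-computation overlap the abstract describes.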
\ No newline at end of file diff --git a/data/2024/iclr/COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits b/data/2024/iclr/COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits new file mode 100644 index 0000000000..73f4fb2edc --- /dev/null +++ b/data/2024/iclr/COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits @@ -0,0 +1 @@ +Conformal prediction has shown impressive performance in constructing statistically rigorous prediction sets for arbitrary black-box machine learning models, assuming the data is exchangeable. However, even small adversarial perturbations during inference can violate the exchangeability assumption, challenge the coverage guarantees, and result in a subsequent decline in empirical coverage. In this work, we propose a certifiably robust learning-reasoning conformal prediction framework (COLEP) via probabilistic circuits, which comprise a data-driven learning component that trains statistical models to learn different semantic concepts, and a reasoning component that encodes knowledge and characterizes the relationships among the trained models for logic reasoning. To achieve exact and efficient reasoning, we employ probabilistic circuits (PCs) within the reasoning component. Theoretically, we provide end-to-end certification of prediction coverage for COLEP in the presence of bounded adversarial perturbations. We also provide certified coverage considering the finite size of the calibration set. Furthermore, we prove that COLEP achieves higher prediction coverage and accuracy over a single model as long as the utilities of knowledge models are non-trivial. Empirically, we show the validity and tightness of our certified coverage, demonstrating the robust conformal prediction of COLEP on various datasets, including GTSRB, CIFAR10, and AwA2. We show that COLEP achieves up to 12% improvement in certified coverage on GTSRB, 9% on CIFAR-10, and 14% on AwA2. \ No newline at end of file diff --git a/data/2024/iclr/COLLIE: Systematic Construction of Constrained Text Generation Tasks b/data/2024/iclr/COLLIE: Systematic Construction of Constrained Text Generation Tasks new file mode 100644 index 0000000000..1450c12605 --- /dev/null +++ b/data/2024/iclr/COLLIE: Systematic Construction of Constrained Text Generation Tasks @@ -0,0 +1 @@ +Text generation under constraints has seen increasing interest in natural language processing, especially with the rapidly improving capabilities of large language models. However, existing benchmarks for constrained generation usually focus on fixed constraint types (e.g., generate a sentence containing certain words) that have proved to be easy for state-of-the-art models like GPT-4. We present COLLIE, a grammar-based framework that allows the specification of rich, compositional constraints with diverse generation levels (word, sentence, paragraph, passage) and modeling challenges (e.g., language understanding, logical reasoning, counting, semantic planning). We also develop tools for automatic extraction of task instances given a constraint structure and a raw text corpus. Using COLLIE, we compile the COLLIE-v1 dataset with 2080 instances comprising 13 constraint structures. We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performance to reveal shortcomings. 
COLLIE is designed to be extensible and lightweight, and we hope the community finds it useful to develop more complex constraints and evaluations in the future. \ No newline at end of file diff --git a/data/2024/iclr/COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL b/data/2024/iclr/COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL new file mode 100644 index 0000000000..bf5cc1e346 --- /dev/null +++ b/data/2024/iclr/COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL @@ -0,0 +1 @@ +Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate sample for policy learning and real environment exploration using current policy for dynamics model learning. However, due to the complex real-world environment, it is inevitable to learn an imperfect dynamics model with model prediction error, which can further mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods to address the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. This estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real environment exploration respectively, to choose actions. Consequently, $\texttt{COPlanner}$ can avoid model uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it explores high-reward model uncertain regions to reduce model error actively through optimistic real environment exploration. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any dyna-style model-based methods. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both sample efficiency and asymptotic performance of strong model-based methods are significantly improved combined with $\texttt{COPlanner}$. \ No newline at end of file diff --git a/data/2024/iclr/CORN: Contact-based Object Representation for Nonprehensile Manipulation of General Unseen Objects b/data/2024/iclr/CORN: Contact-based Object Representation for Nonprehensile Manipulation of General Unseen Objects new file mode 100644 index 0000000000..9f93b61159 --- /dev/null +++ b/data/2024/iclr/CORN: Contact-based Object Representation for Nonprehensile Manipulation of General Unseen Objects @@ -0,0 +1 @@ +Nonprehensile manipulation is essential for manipulating objects that are too thin, large, or otherwise ungraspable in the wild. To sidestep the difficulty of contact modeling in conventional modeling-based approaches, reinforcement learning (RL) has recently emerged as a promising alternative. However, previous RL approaches either lack the ability to generalize over diverse object shapes, or use simple action primitives that limit the diversity of robot motions. Furthermore, using RL over diverse object geometry is challenging due to the high cost of training a policy that takes in high-dimensional sensory inputs. We propose a novel contact-based object representation and pretraining pipeline to tackle this. 
To enable massively parallel training, we leverage a lightweight patch-based transformer architecture for our encoder that processes point clouds, thus scaling our training across thousands of environments. Compared to learning from scratch, or other shape representation baselines, our representation facilitates both time- and data-efficient learning. We validate the efficacy of our overall system by zero-shot transferring the trained policy to novel real-world objects. Code and videos are available at https://sites.google.com/view/contact-non-prehensile. \ No newline at end of file diff --git a/data/2024/iclr/COSA: Concatenated Sample Pretrained Vision-Language Foundation Model b/data/2024/iclr/COSA: Concatenated Sample Pretrained Vision-Language Foundation Model new file mode 100644 index 0000000000..a8d1adb12c --- /dev/null +++ b/data/2024/iclr/COSA: Concatenated Sample Pretrained Vision-Language Foundation Model @@ -0,0 +1 @@ +Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA jointly models visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of downstream tasks, including long-form/short-form video-text tasks and image-text tasks such as retrieval, captioning, and question answering. Notably, COSA achieves state-of-the-art results on various competitive benchmarks. Code and model are released at https://github.com/TXH-mercury/COSA. \ No newline at end of file diff --git a/data/2024/iclr/CPPO: Continual Learning for Reinforcement Learning with Human Feedback b/data/2024/iclr/CPPO: Continual Learning for Reinforcement Learning with Human Feedback new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets b/data/2024/iclr/CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets new file mode 100644 index 0000000000..78debd6089 --- /dev/null +++ b/data/2024/iclr/CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets @@ -0,0 +1 @@ +Large language models (LLMs) are often augmented with tools to solve complex tasks. By generating code snippets and executing them through task-specific Application Programming Interfaces (APIs), they can offload certain functions to dedicated external modules, such as image encoding and performing calculations. However, most existing approaches to augment LLMs with tools are constrained by general-purpose APIs and lack the flexibility for tailoring them to specific tasks. In this work, we present CRAFT, a general tool creation and retrieval framework for LLMs. It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. 
For each task, we collect specific code solutions by prompting GPT-4 to solve the training examples. Following a validation step ensuring correctness, these solutions are abstracted into code snippets to enhance reusability, and deduplicated for higher quality. At inference time, the language model retrieves snippets from the toolsets and then executes them or generates the output conditioned on the retrieved snippets. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning. Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines. In addition, our in-depth analysis reveals that: (1) consistent performance improvement can be achieved by scaling up the number of tools and the capability of the backbone models; (2) each component of our approach contributes to the performance gains; (3) the created tools are well-structured and reliable with low complexity and atomicity. The code is available at https://github.com/lifan-yuan/CRAFT. \ No newline at end of file diff --git a/data/2024/iclr/CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing b/data/2024/iclr/CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing new file mode 100644 index 0000000000..48384167b7 --- /dev/null +++ b/data/2024/iclr/CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing @@ -0,0 +1 @@ +Recent developments in large language models (LLMs) have been impressive. However, these models sometimes show inconsistencies and problematic behavior, such as hallucinating facts, generating flawed code, or creating offensive and toxic content. Unlike these models, humans typically utilize external tools to cross-check and refine their initial content, like using a search engine for fact-checking, or a code interpreter for debugging. Inspired by this observation, we introduce a framework called CRITIC that allows LLMs, which are essentially "black boxes", to validate and progressively amend their own outputs in a manner similar to human interaction with tools. More specifically, starting with an initial output, CRITIC interacts with appropriate tools to evaluate certain aspects of the text, and then revises the output based on the feedback obtained during this validation process. Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs. Meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Cameras as Rays: Pose Estimation via Ray Diffusion b/data/2024/iclr/Cameras as Rays: Pose Estimation via Ray Diffusion new file mode 100644 index 0000000000..0f234b7e59 --- /dev/null +++ b/data/2024/iclr/Cameras as Rays: Pose Estimation via Ray Diffusion @@ -0,0 +1 @@ +Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. 
This representation allows for a tight coupling with spatial image features improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures. \ No newline at end of file diff --git a/data/2024/iclr/Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs b/data/2024/iclr/Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs new file mode 100644 index 0000000000..1f4a9f8c75 --- /dev/null +++ b/data/2024/iclr/Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs @@ -0,0 +1 @@ +Empowering large language models to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This leads to a growing need to explore the untapped area of black-box approaches for LLM uncertainty estimation. To better break down the problem, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks-confidence calibration and failure prediction-across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely-used LLMs including GPT-4 and LLaMA 2 Chat. Our analysis uncovers several key insights: 1) LLMs, when verbalizing their confidence, tend to be overconfident, potentially imitating human patterns of expressing confidence. 2) As model capability scales up, both calibration and failure prediction performance improve. 3) Employing our proposed strategies, such as human-inspired prompts, consistency among multiple responses, and better aggregation strategies can help mitigate this overconfidence from various perspectives. 4) Comparisons with white-box methods indicate that while white-box methods perform better, the gap is narrow, e.g., 0.522 to 0.605 in AUROC. Despite these advancements, none of these techniques consistently outperform others, and all investigated methods struggle in challenging tasks, such as those requiring professional knowledge, indicating significant scope for improvement. We believe this study can serve as a strong baseline and provide insights for eliciting confidence in black-box LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory b/data/2024/iclr/Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory new file mode 100644 index 0000000000..9f0810654d --- /dev/null +++ b/data/2024/iclr/Can LLMs Keep a Secret? 
Testing Privacy Implications of Language Models via Contextual Integrity Theory @@ -0,0 +1 @@ +The interactive use of large language models (LLMs) in AI assistants (at work, home, etc.) introduces a new set of inference-time privacy risks: LLMs are fed different types of information from multiple sources in their inputs and are expected to reason about what to share in their outputs, for what purpose and with whom, within a given context. In this work, we draw attention to the highly critical yet overlooked notion of contextual privacy by proposing ConfAIde, a benchmark designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned LLMs. Our experiments show that even the most capable models such as GPT-4 and ChatGPT reveal private information in contexts that humans would not, 39% and 57% of the time, respectively. This leakage persists even when we employ privacy-inducing prompts or chain-of-thought reasoning. Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind. \ No newline at end of file diff --git a/data/2024/iclr/Can Large Language Models Infer Causation from Correlation? b/data/2024/iclr/Can Large Language Models Infer Causation from Correlation? new file mode 100644 index 0000000000..57427295bc --- /dev/null +++ b/data/2024/iclr/Can Large Language Models Infer Causation from Correlation? @@ -0,0 +1 @@ +Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in the recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g., commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). Specifically, we formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task. This shortcoming is somewhat mitigated when we try to re-purpose LLMs for this skill via finetuning, but we find that these models still fail to generalize -- they can only perform causal inference in in-distribution settings when variable names and textual expressions used in the queries are similar to those in the training set, but fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is a challenging task for LLMs, and would be helpful in guiding future research on improving LLMs' pure reasoning skills and generalizability. Our data is at https://huggingface.co/datasets/causalnlp/corr2cause. Our code is at https://github.com/causalNLP/corr2cause. \ No newline at end of file diff --git a/data/2024/iclr/Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks b/data/2024/iclr/Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks new file mode 100644 index 0000000000..048a62f90e --- /dev/null +++ b/data/2024/iclr/Can Sensitive Information Be Deleted From LLMs? 
Objectives for Defending Against Extraction Attacks @@ -0,0 +1 @@ +Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output toxic or harmful text. To mitigate these safety and informational issues, we propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights. We study direct edits to model weights because (1) this approach should guarantee that particular deleted information is never extracted by future prompt attacks, and (2) it should protect against whitebox attacks, which is necessary for making claims about safety/privacy in a setting where publicly available model weights could be used to elicit sensitive information. Our threat model assumes that an attack succeeds if the answer to a sensitive question is located among a set of B generated candidates, based on scenarios where the information would be insecure if the answer is among B candidates. Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time. These attacks leverage two key observations: (1) that traces of deleted information can be found in intermediate model hidden states, and (2) that applying an editing method for one question may not delete information across rephrased versions of the question. Finally, we provide new defense methods that protect against some extraction attacks, but we do not find a single universally effective defense method. Our results suggest that truly deleting sensitive information is a tractable but difficult problem, since even relatively low attack success rates have potentially severe societal implications for real-world deployment of language models. \ No newline at end of file diff --git a/data/2024/iclr/Can Transformers Capture Spatial Relations between Objects? b/data/2024/iclr/Can Transformers Capture Spatial Relations between Objects? new file mode 100644 index 0000000000..c4639e1e4b --- /dev/null +++ b/data/2024/iclr/Can Transformers Capture Spatial Relations between Objects? @@ -0,0 +1 @@ +Spatial relationships between objects represent key scene information for humans to understand and interact with the world. To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches exploiting the long-range attention capabilities of transformers for this task, and evaluate key design principles. We identify a simple "RelatiViT" architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings. The code and datasets are available in \url{https://sites.google.com/view/spatial-relation}. \ No newline at end of file diff --git a/data/2024/iclr/Can We Evaluate Domain Adaptation Models Without Target-Domain Labels? b/data/2024/iclr/Can We Evaluate Domain Adaptation Models Without Target-Domain Labels? 
new file mode 100644 index 0000000000..6f9aec5bda --- /dev/null +++ b/data/2024/iclr/Can We Evaluate Domain Adaptation Models Without Target-Domain Labels? @@ -0,0 +1 @@ +Unsupervised domain adaptation (UDA) involves adapting a model trained on a label-rich source domain to an unlabeled target domain. However, in real-world scenarios, the absence of target-domain labels makes it challenging to evaluate the performance of UDA models. Furthermore, prevailing UDA methods relying on adversarial training and self-training could lead to model degeneration and negative transfer, further exacerbating the evaluation problem. In this paper, we propose a novel metric called the \textit{Transfer Score} to address these issues. The proposed metric enables the unsupervised evaluation of UDA models by assessing the spatial uniformity of the classifier via model parameters, as well as the transferability and discriminability of deep representations. Based on the metric, we achieve three novel objectives without target-domain labels: (1) selecting the best UDA method from a range of available options, (2) optimizing hyperparameters of UDA models to prevent model degeneration, and (3) identifying which checkpoint of UDA model performs optimally. Our work bridges the gap between data-level UDA research and practical UDA scenarios, enabling a realistic assessment of UDA model performance. We validate the effectiveness of our metric through extensive empirical studies on UDA datasets of different scales and imbalanced distributions. The results demonstrate that our metric robustly achieves the aforementioned goals. \ No newline at end of file diff --git a/data/2024/iclr/Can we get the best of both Binary Neural Networks and Spiking Neural Networks for Efficient Computer Vision? b/data/2024/iclr/Can we get the best of both Binary Neural Networks and Spiking Neural Networks for Efficient Computer Vision? new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Cascading Reinforcement Learning b/data/2024/iclr/Cascading Reinforcement Learning new file mode 100644 index 0000000000..2e046e0f5a --- /dev/null +++ b/data/2024/iclr/Cascading Reinforcement Learning @@ -0,0 +1 @@ +Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. Then, the user examines the list, and clicks the first attractive item (if any), and after that, the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influences of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this fact, we propose a generalized cascading RL framework, which considers the impact of user states and state transition into decisions. In cascading RL, we need to select items not only with large attraction probabilities but also leading to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions, and design an oracle BestPerm to efficiently find the optimal item list. 
Equipped with BestPerm, we develop two algorithms, CascadingVI and CascadingBPI, which are both computationally efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice. \ No newline at end of file diff --git a/data/2024/iclr/Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation b/data/2024/iclr/Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation new file mode 100644 index 0000000000..23b2c9b43a --- /dev/null +++ b/data/2024/iclr/Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation @@ -0,0 +1 @@ +The rapid progress in open-source large language models (LLMs) is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 language models including LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with $30\times$ lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models. Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM. \ No newline at end of file diff --git a/data/2024/iclr/Cauchy-Schwarz Divergence Information Bottleneck for Regression b/data/2024/iclr/Cauchy-Schwarz Divergence Information Bottleneck for Regression new file mode 100644 index 0000000000..41f433295f --- /dev/null +++ b/data/2024/iclr/Cauchy-Schwarz Divergence Information Bottleneck for Regression @@ -0,0 +1 @@ +The information bottleneck (IB) approach is a popular way to improve the generalization, robustness and explainability of deep neural networks. Essentially, it aims to find a minimum sufficient representation $\mathbf{t}$ by striking a trade-off between a compression term $I(\mathbf{x};\mathbf{t})$ and a prediction term $I(y;\mathbf{t})$, where $I(\cdot;\cdot)$ refers to the mutual information (MI). For the IB, MI is mostly expressed in terms of the Kullback-Leibler (KL) divergence, which in the regression case corresponds to prediction based on mean squared error (MSE) loss under a Gaussian assumption and compression approximated by variational inference. In this paper, we study the IB principle for the regression problem and develop a new way to parameterize the IB with deep neural networks by exploiting favorable properties of the Cauchy-Schwarz (CS) divergence. 
By doing so, we move away from MSE-based regression and ease estimation by avoiding variational approximations or distributional assumptions. We investigate the improved generalization ability of our proposed CS-IB and demonstrate strong adversarial robustness guarantees. We demonstrate its superior performance on six real-world regression tasks over other popular deep IB approaches. We additionally observe that the solutions discovered by CS-IB always achieve the best trade-off between prediction accuracy and compression ratio in the information plane. The code is available at \url{https://github.com/SJYuCNEL/Cauchy-Schwarz-Information-Bottleneck}. \ No newline at end of file diff --git a/data/2024/iclr/Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework b/data/2024/iclr/Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework new file mode 100644 index 0000000000..49827052a1 --- /dev/null +++ b/data/2024/iclr/Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework @@ -0,0 +1 @@ +Fairness for machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically focuses on settings without unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and, thus, unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under different sources of unobserved confounding. This enables practitioners to examine the sensitivity of their machine learning models to unobserved confounding in fairness-critical applications. Second, we propose a novel neural framework for learning fair predictions, which allows us to offer worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study about predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under unobserved confounding. To this end, our work is of direct practical value as a refutation strategy to ensure the fairness of predictions in high-stakes applications. \ No newline at end of file diff --git a/data/2024/iclr/Causal Inference with Conditional Front-Door Adjustment and Identifiable Variational Autoencoder b/data/2024/iclr/Causal Inference with Conditional Front-Door Adjustment and Identifiable Variational Autoencoder new file mode 100644 index 0000000000..f838a74c4b --- /dev/null +++ b/data/2024/iclr/Causal Inference with Conditional Front-Door Adjustment and Identifiable Variational Autoencoder @@ -0,0 +1 @@ +An essential and challenging problem in causal inference is causal effect estimation from observational data. The problem becomes more difficult with the presence of unobserved confounding variables. The front-door adjustment is a practical approach for dealing with unobserved confounding variables. However, the restriction for the standard front-door adjustment is difficult to satisfy in practice. In this paper, we relax some of the restrictions by proposing the concept of conditional front-door (CFD) adjustment and develop the theorem that guarantees the causal effect identifiability of CFD adjustment. Furthermore, as it is often impossible for a CFD variable to be given in practice, it is desirable to learn it from data. 
By leveraging the ability of deep generative models, we propose CFDiVAE to learn the representation of the CFD adjustment variable directly from data with the identifiable Variational AutoEncoder and formally prove the model identifiability. Extensive experiments on synthetic datasets validate the effectiveness of CFDiVAE and its superiority over existing methods. The experiments also show that the performance of CFDiVAE is less sensitive to the causal strength of unobserved confounding variables. We further apply CFDiVAE to a real-world dataset to demonstrate its potential application. \ No newline at end of file diff --git a/data/2024/iclr/Causal Modelling Agents: Causal Graph Discovery through Synergising Metadata- and Data-driven Reasoning b/data/2024/iclr/Causal Modelling Agents: Causal Graph Discovery through Synergising Metadata- and Data-driven Reasoning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Causal Structure Recovery with Latent Variables under Milder Distributional and Graphical Assumptions b/data/2024/iclr/Causal Structure Recovery with Latent Variables under Milder Distributional and Graphical Assumptions new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Causal-StoNet: Causal Inference for High-Dimensional Complex Data b/data/2024/iclr/Causal-StoNet: Causal Inference for High-Dimensional Complex Data new file mode 100644 index 0000000000..94a7ee8eab --- /dev/null +++ b/data/2024/iclr/Causal-StoNet: Causal Inference for High-Dimensional Complex Data @@ -0,0 +1 @@ +With the advancement of data science, the collection of increasingly complex datasets has become commonplace. In such datasets, the data dimension can be extremely high, and the underlying data generation process can be unknown and highly nonlinear. As a result, the task of making causal inference with high-dimensional complex data has become a fundamental problem in many disciplines, such as medicine, econometrics, and social science. However, the existing methods for causal inference are frequently developed under the assumption that the data dimension is low or that the underlying data generation process is linear or approximately linear. To address these challenges, this paper proposes a novel causal inference approach for dealing with high-dimensional complex data. The proposed approach is based on deep learning techniques, including sparse deep learning theory and stochastic neural networks, that have been developed in recent literature. By using these techniques, the proposed approach can address both the high dimensionality and unknown data generation process in a coherent way. Furthermore, the proposed approach can also be used when missing values are present in the datasets. Extensive numerical studies indicate that the proposed approach outperforms existing ones. \ No newline at end of file diff --git a/data/2024/iclr/CausalLM is not optimal for in-context learning b/data/2024/iclr/CausalLM is not optimal for in-context learning new file mode 100644 index 0000000000..56a866a64d --- /dev/null +++ b/data/2024/iclr/CausalLM is not optimal for in-context learning @@ -0,0 +1 @@ +Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples from attending to future samples. 
While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follows that of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings. \ No newline at end of file diff --git a/data/2024/iclr/CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery b/data/2024/iclr/CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery new file mode 100644 index 0000000000..5988f87978 --- /dev/null +++ b/data/2024/iclr/CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery @@ -0,0 +1 @@ +Time-series causal discovery (TSCD) is a fundamental problem of machine learning. However, existing synthetic datasets cannot properly evaluate or predict the algorithms' performance on real data. This study introduces the CausalTime pipeline to generate time-series that highly resemble the real data and with ground truth causal graphs for quantitative performance evaluation. The pipeline starts from real observations in a specific scenario and produces a matching benchmark dataset. Firstly, we harness deep neural networks along with normalizing flow to accurately capture realistic dynamics. Secondly, we extract hypothesized causal graphs by performing importance analysis on the neural network or leveraging prior knowledge. Thirdly, we derive the ground truth causal graphs by splitting the causal model into causal term, residual term, and noise term. Lastly, using the fitted network and the derived causal graph, we generate corresponding versatile time-series proper for algorithm assessment. In the experiments, we validate the fidelity of the generated data through qualitative and quantitative experiments, followed by a benchmarking of existing TSCD algorithms using these generated datasets. CausalTime offers a feasible solution to evaluating TSCD algorithms in real applications and can be generalized to a wide range of fields. For easy use of the proposed approach, we also provide a user-friendly website, hosted on www.causaltime.cc. 
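To make the prefixLM versus causalLM distinction in the abstract above concrete, the short sketch below builds the two attention masks for a sequence whose first prefix_len tokens are the in-context samples; the helper names and toy sizes are assumptions for illustration only. Under the prefix-LM mask those tokens can all attend to one another, while the causal mask only ever allows attention to earlier positions.

import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where attention is allowed: position i may attend to j only if j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Start from the causal mask, then let every position attend to the whole
    # prefix, making the in-context samples fully bidirectional among themselves.
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True
    return mask

if __name__ == "__main__":
    m_causal = causal_mask(6)
    m_prefix = prefix_lm_mask(6, prefix_len=4)
    # An early in-context token (position 1) can see a later one (position 3)
    # under prefixLM but not under causalLM.
    print(m_causal[1, 3].item(), m_prefix[1, 3].item())  # False True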
\ No newline at end of file diff --git a/data/2024/iclr/Causality-Inspired Spatial-Temporal Explanations for Dynamic Graph Neural Networks b/data/2024/iclr/Causality-Inspired Spatial-Temporal Explanations for Dynamic Graph Neural Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Causally Aligned Curriculum Learning b/data/2024/iclr/Causally Aligned Curriculum Learning new file mode 100644 index 0000000000..41622b4720 --- /dev/null +++ b/data/2024/iclr/Causally Aligned Curriculum Learning @@ -0,0 +1 @@ +, \ No newline at end of file diff --git a/data/2024/iclr/CellPLM: Pre-training of Cell Language Model Beyond Single Cells b/data/2024/iclr/CellPLM: Pre-training of Cell Language Model Beyond Single Cells new file mode 100644 index 0000000000..a352edabcb --- /dev/null +++ b/data/2024/iclr/CellPLM: Pre-training of Cell Language Model Beyond Single Cells @@ -0,0 +1 @@ +The current state-of-the-art single-cell pre-trained models are greatly inspired by the success of large language models. They train transformers by treating genes as tokens and cells as sentences. However, three fundamental differences between single-cell data and natural language data are overlooked: (1) scRNA-seq data are presented as bag-of-genes instead of sequences of RNAs; (2) cell-cell relations are more intricate and important than inter-sentence relations; and (3) the quantity of single-cell data is considerably inferior to text data, and they are very noisy. In light of these characteristics, we propose a new pre-trained model, CellPLM, which takes cells as tokens and tissues as sentences. In addition, we leverage spatially-resolved transcriptomic data in pre-training to facilitate learning cell-cell relationships and introduce a Gaussian mixture prior distribution as an additional inductive bias to overcome data limitations. CellPLM is the first single-cell pre-trained transformer that encodes cell-cell relations, and it consistently outperforms existing pre-trained and non-pre-trained models in diverse downstream tasks, with 100x higher inference speed compared to existing pre-trained models. \ No newline at end of file diff --git a/data/2024/iclr/Certified Adversarial Robustness for Rate Encoded Spiking Neural Networks b/data/2024/iclr/Certified Adversarial Robustness for Rate Encoded Spiking Neural Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Chain of Hindsight aligns Language Models with Feedback b/data/2024/iclr/Chain of Hindsight aligns Language Models with Feedback new file mode 100644 index 0000000000..f610f78959 --- /dev/null +++ b/data/2024/iclr/Chain of Hindsight aligns Language Models with Feedback @@ -0,0 +1 @@ +Learning from human preferences is important for language models to match human needs and to align with human and social values. Prior works have achieved remarkable successes by learning from human feedback to understand and follow instructions. Nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, rendering them inefficient in terms of data utilization and challenging to apply in general, or they depend on reinforcement learning, which often suffers from imperfect reward functions and relies on extremely challenging optimizations. In this work, we propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. 
Our idea is inspired by how humans learn from extensive feedback presented in the form of language. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model, allowing us to take advantage of the language comprehension capabilities of language models. We condition the model on a sequence of model generations paired with feedback. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors. Applying our method to large language models, we observed that Chain of Hindsight significantly surpasses previous methods in aligning language models with human preferences. We report significant improvements on summarization and dialogue benchmarks, with our approach markedly preferred in human evaluations. \ No newline at end of file diff --git a/data/2024/iclr/Chain of Log-Concave Markov Chains b/data/2024/iclr/Chain of Log-Concave Markov Chains new file mode 100644 index 0000000000..2c330d43e1 --- /dev/null +++ b/data/2024/iclr/Chain of Log-Concave Markov Chains @@ -0,0 +1 @@ +We introduce a theoretical framework for sampling from unnormalized densities based on a smoothing scheme that uses an isotropic Gaussian kernel with a single fixed noise scale. We prove one can decompose sampling from a density (minimal assumptions made on the density) into a sequence of sampling from log-concave conditional densities via accumulation of noisy measurements with equal noise levels. Our construction is unique in that it keeps track of a history of samples, making it non-Markovian as a whole, but it is lightweight algorithmically as the history only shows up in the form of a running empirical mean of samples. Our sampling algorithm generalizes walk-jump sampling (Saremi & Hyvärinen, 2019). The "walk" phase becomes a (non-Markovian) chain of (log-concave) Markov chains. The "jump" from the accumulated measurements is obtained by empirical Bayes. We study our sampling algorithm quantitatively using the 2-Wasserstein metric and compare it with various Langevin MCMC algorithms. We also report a remarkable capacity of our algorithm to "tunnel" between modes of a distribution. \ No newline at end of file diff --git a/data/2024/iclr/Chain of Thought Empowers Transformers to Solve Inherently Serial Problems b/data/2024/iclr/Chain of Thought Empowers Transformers to Solve Inherently Serial Problems new file mode 100644 index 0000000000..0c684bcfeb --- /dev/null +++ b/data/2024/iclr/Chain of Thought Empowers Transformers to Solve Inherently Serial Problems @@ -0,0 +1 @@ +Instructing the model to generate a sequence of intermediate steps, a.k.a. a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. 
We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $ \mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers. \ No newline at end of file diff --git a/data/2024/iclr/Chain-of-Experts: When LLMs Meet Complex Operations Research Problems b/data/2024/iclr/Chain-of-Experts: When LLMs Meet Complex Operations Research Problems new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources b/data/2024/iclr/Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources new file mode 100644 index 0000000000..ecff2a32f1 --- /dev/null +++ b/data/2024/iclr/Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources @@ -0,0 +1 @@ +We present chain-of-knowledge (CoK), a novel framework that augments large language models (LLMs) by dynamically incorporating grounding information from heterogeneous sources. It results in more factual rationales and reduced hallucination in generation. Specifically, CoK consists of three stages: reasoning preparation, dynamic knowledge adapting, and answer consolidation. Given a knowledge-intensive question, CoK first prepares several preliminary rationales and answers while identifying the relevant knowledge domains. If there is no majority consensus among the answers from samples, CoK corrects the rationales step by step by adapting knowledge from the identified domains. These corrected rationales can plausibly serve as a better foundation for the final answer consolidation. Unlike prior studies that primarily use unstructured data, CoK also leverages structured knowledge sources such as Wikidata and tables that provide more reliable factual information. To access both unstructured and structured knowledge sources in the dynamic knowledge adapting stage, we propose an adaptive query generator that allows the generation of queries for various types of query languages, including SPARQL, SQL, and natural sentences. Moreover, to minimize error propagation between rationales, CoK corrects the rationales progressively using preceding corrected rationales to generate and correct subsequent rationales. Extensive experiments show that CoK consistently improves the performance of LLMs on knowledge-intensive tasks across different domains. \ No newline at end of file diff --git a/data/2024/iclr/Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding b/data/2024/iclr/Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding new file mode 100644 index 0000000000..20fe1d39d1 --- /dev/null +++ b/data/2024/iclr/Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding @@ -0,0 +1 @@ +Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. 
Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices. \ No newline at end of file diff --git a/data/2024/iclr/Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning b/data/2024/iclr/Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning new file mode 100644 index 0000000000..e2ace01d0e --- /dev/null +++ b/data/2024/iclr/Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning @@ -0,0 +1 @@ +The integration of machine learning (ML) in numerous critical applications introduces a range of privacy concerns for individuals who provide their datasets for model training. One such privacy risk is Membership Inference (MI), in which an attacker seeks to determine whether a particular data sample was included in the training dataset of a model. Current state-of-the-art MI attacks capitalize on access to the model's predicted confidence scores to successfully perform membership inference, and employ data poisoning to further enhance their effectiveness. In this work, we focus on the less explored and more realistic label-only setting, where the model provides only the predicted label on a queried sample. We show that existing label-only MI attacks are ineffective at inferring membership in the low False Positive Rate (FPR) regime. To address this challenge, we propose a new attack Chameleon that leverages a novel adaptive data poisoning strategy and an efficient query selection method to achieve significantly more accurate membership inference than existing label-only attacks, especially at low FPRs. \ No newline at end of file diff --git a/data/2024/iclr/Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words b/data/2024/iclr/Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words new file mode 100644 index 0000000000..31a59ffc17 --- /dev/null +++ b/data/2024/iclr/Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words @@ -0,0 +1 @@ +Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. 
In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a learnable channel embedding that is added to the patch tokens, similar to positional embeddings. We evaluate the performance of ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat (satellite imaging). Our results show that ChannelViT outperforms ViT on classification tasks and generalizes well, even when a subset of input channels is used during testing. Across our experiments, HCS proves to be a powerful regularizer, independent of the architecture employed, suggesting itself as a straightforward technique for robust ViT training. Lastly, we find that ChannelViT generalizes effectively even when there is limited access to all channels during training, highlighting its potential for multi-channel imaging under real-world conditions with sparse sensors. Our code is available at https://github.com/insitro/ChannelViT. \ No newline at end of file diff --git a/data/2024/iclr/ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate b/data/2024/iclr/ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate new file mode 100644 index 0000000000..72cfaf7987 --- /dev/null +++ b/data/2024/iclr/ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate @@ -0,0 +1 @@ +Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval. \ No newline at end of file diff --git a/data/2024/iclr/Circuit Component Reuse Across Tasks in Transformer Language Models b/data/2024/iclr/Circuit Component Reuse Across Tasks in Transformer Language Models new file mode 100644 index 0000000000..53973d7080 --- /dev/null +++ b/data/2024/iclr/Circuit Component Reuse Across Tasks in Transformer Language Models @@ -0,0 +1 @@ +Recent work in mechanistic interpretability has shown that behaviors in language models can be successfully reverse-engineered through circuit analysis. 
A common criticism, however, is that each circuit is task-specific, and thus such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and 1) show that it reproduces on a larger GPT2 model, and 2) that it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the process underlying both tasks is functionally very similar, and contains about a 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in middle layers in order to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy from 49.6% to 93.7% on the Colored Objects task and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain large language models' behavior in terms of a relatively small number of interpretable task-general algorithmic building blocks and computational components. \ No newline at end of file diff --git a/data/2024/iclr/CircuitNet 2.0: An Advanced Dataset for Promoting Machine Learning Innovations in Realistic Chip Design Environment b/data/2024/iclr/CircuitNet 2.0: An Advanced Dataset for Promoting Machine Learning Innovations in Realistic Chip Design Environment new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Circumventing Concept Erasure Methods For Text-To-Image Generative Models b/data/2024/iclr/Circumventing Concept Erasure Methods For Text-To-Image Generative Models new file mode 100644 index 0000000000..77c44445be --- /dev/null +++ b/data/2024/iclr/Circumventing Concept Erasure Methods For Text-To-Image Generative Models @@ -0,0 +1 @@ +Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.
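The attack above hinges on one concrete mechanism: optimizing a new word embedding against a frozen model until the supposedly erased concept reappears. A minimal sketch of that loop, assuming a toy random linear map as the frozen encoder and a random vector as the target concept (both hypothetical stand-ins, not the paper's actual models):

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: a frozen "text encoder" and the embedding of the
# concept the sanitized model was supposed to forget.
dim = 64
encoder = torch.nn.Linear(dim, dim)
for p in encoder.parameters():
    p.requires_grad_(False)                          # model weights stay untouched
target = F.normalize(torch.randn(dim), dim=0)        # embedding of the "erased" concept

# Learn a single soft token embedding so that the frozen encoder maps it close
# to the target concept, i.e. the concept is recovered without changing weights.
soft_token = torch.nn.Parameter(torch.randn(dim) * 0.01)
opt = torch.optim.Adam([soft_token], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    out = F.normalize(encoder(soft_token), dim=0)
    loss = 1.0 - torch.dot(out, target)              # maximize cosine similarity
    loss.backward()
    opt.step()

print(f"final cosine similarity: {1.0 - loss.item():.3f}")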
\ No newline at end of file diff --git a/data/2024/iclr/CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents b/data/2024/iclr/CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents new file mode 100644 index 0000000000..e9c3402181 --- /dev/null +++ b/data/2024/iclr/CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents @@ -0,0 +1 @@ +The generalization of decision-making agents encompasses two fundamental elements: learning from past experiences and reasoning in novel contexts. However, the predominant emphasis in most interactive environments is on learning, often at the expense of complexity in reasoning. In this paper, we introduce CivRealm, an environment inspired by the Civilization game. Civilization's profound alignment with human history and society necessitates sophisticated learning, while its ever-changing situations demand strong reasoning to generalize. Particularly, CivRealm sets up an imperfect-information general-sum game with a changing number of players; it presents a plethora of complex features, challenging the agent to deal with open-ended stochastic environments that require diplomacy and negotiation skills. Within CivRealm, we provide interfaces for two typical agent types: tensor-based agents that focus on learning, and language-based agents that emphasize reasoning. To catalyze further research, we present initial results for both paradigms. The canonical RL-based agents exhibit reasonable performance in mini-games, whereas both RL- and LLM-based agents struggle to make substantial progress in the full game. Overall, CivRealm stands as a unique learning and reasoning challenge for decision-making agents. The code is available at https://github.com/bigai-ai/civrealm. \ No newline at end of file diff --git a/data/2024/iclr/Class Probability Matching with Calibrated Networks for Label Shift Adaption b/data/2024/iclr/Class Probability Matching with Calibrated Networks for Label Shift Adaption new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Classification with Conceptual Safeguards b/data/2024/iclr/Classification with Conceptual Safeguards new file mode 100644 index 0000000000..a0bfa33e74 --- /dev/null +++ b/data/2024/iclr/Classification with Conceptual Safeguards @@ -0,0 +1 @@ +We propose a new approach to promote safety in classification tasks with established concepts. Our approach – called a conceptual safeguard – acts as a verification layer for models that predict a target outcome by first predicting the presence of intermediate concepts. Given this architecture, a safeguard ensures that a model meets a minimal level of accuracy by abstaining from uncertain predictions. In contrast to a standard selective classifier, a safeguard provides an avenue to improve coverage by allowing a human to confirm the presence of uncertain concepts on instances on which it abstains. We develop methods to build safeguards that maximize coverage without compromising safety, namely techniques to propagate the uncertainty in concept predictions and to flag salient concepts for human review. 
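A minimal sketch of the safeguard logic just described, with hypothetical concept weights, an illustrative confidence threshold, and Monte Carlo propagation of concept uncertainty standing in for whatever procedure the paper actually uses:

import numpy as np

rng = np.random.default_rng(0)

# Toy safeguard: concept probabilities come from an upstream model, the target
# is predicted from concepts with fixed (made-up) weights, and the safeguard
# abstains when the propagated uncertainty leaves the prediction too unsure.
w, b = np.array([2.0, -1.5, 1.0]), -0.5
tau = 0.80                                  # minimum confidence to make a prediction

def predict_with_safeguard(concept_probs, confirmed=None):
    p = concept_probs.copy()
    if confirmed:                           # human resolves uncertain concepts to 0 or 1
        for idx, val in confirmed.items():
            p[idx] = val
    # Propagate concept uncertainty by sampling concept vectors.
    samples = (rng.random((2000, p.size)) < p).astype(float)
    scores = 1.0 / (1.0 + np.exp(-(samples @ w + b)))
    pos = float(np.mean(scores > 0.5))      # fraction of samples voting "positive"
    conf = max(pos, 1.0 - pos)
    label = int(pos > 0.5)
    return (label, conf) if conf >= tau else ("abstain", conf)

x = np.array([0.9, 0.55, 0.6])              # one uncertain concept (index 1)
print(predict_with_safeguard(x))                        # abstains
print(predict_with_safeguard(x, confirmed={1: 0.0}))    # human confirmation restores coverage

Once the uncertain concept is confirmed, the same instance clears the threshold, which is the coverage-recovery behaviour described above.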
We benchmark our approach on a collection of real-world and synthetic datasets, showing that it can improve performance and coverage in deep learning tasks \ No newline at end of file diff --git a/data/2024/iclr/Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform b/data/2024/iclr/Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform new file mode 100644 index 0000000000..bd689f7ff3 --- /dev/null +++ b/data/2024/iclr/Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform @@ -0,0 +1 @@ +Distributed Deep Reinforcement Learning (DRL) aims to leverage more computational resources to train autonomous agents with less training time. Despite recent progress in the field, reproducibility issues have not been sufficiently explored. This paper first shows that the typical actor-learner framework can have reproducibility issues even if hyperparameters are controlled. We then introduce Cleanba, a new open-source platform for distributed DRL that proposes a highly reproducible architecture. Cleanba implements highly optimized distributed variants of PPO and IMPALA. Our Atari experiments show that these variants can obtain equivalent or higher scores than strong IMPALA baselines in moolib and torchbeast and PPO baseline in CleanRL. However, Cleanba variants present 1) shorter training time and 2) more reproducible learning curves in different hardware settings. Cleanba's source code is available at \url{https://github.com/vwxyzjn/cleanba} \ No newline at end of file diff --git a/data/2024/iclr/Clifford Group Equivariant Simplicial Message Passing Networks b/data/2024/iclr/Clifford Group Equivariant Simplicial Message Passing Networks new file mode 100644 index 0000000000..08278b4e9d --- /dev/null +++ b/data/2024/iclr/Clifford Group Equivariant Simplicial Message Passing Networks @@ -0,0 +1 @@ +We introduce Clifford Group Equivariant Simplicial Message Passing Networks, a method for steerable E(n)-equivariant message passing on simplicial complexes. Our method integrates the expressivity of Clifford group-equivariant layers with simplicial message passing, which is topologically more intricate than regular graph message passing. Clifford algebras include higher-order objects such as bivectors and trivectors, which express geometric features (e.g., areas, volumes) derived from vectors. Using this knowledge, we represent simplex features through geometric products of their vertices. To achieve efficient simplicial message passing, we share the parameters of the message network across different dimensions. Additionally, we restrict the final message to an aggregation of the incoming messages from different dimensions, leading to what we term shared simplicial message passing. Experimental results show that our method is able to outperform both equivariant and simplicial graph neural networks on a variety of geometric tasks. \ No newline at end of file diff --git a/data/2024/iclr/Closing the Curious Case of Neural Text Degeneration b/data/2024/iclr/Closing the Curious Case of Neural Text Degeneration new file mode 100644 index 0000000000..57f643a6a6 --- /dev/null +++ b/data/2024/iclr/Closing the Curious Case of Neural Text Degeneration @@ -0,0 +1 @@ +Despite their ubiquity in language generation, it remains unknown why truncation sampling heuristics like nucleus sampling are so effective. 
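For readers unfamiliar with these heuristics, threshold-style and nucleus-style truncation can be sketched in a few lines over a toy next-token distribution (the numbers below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def truncate_threshold(probs, eps=0.05):
    """Keep tokens whose probability exceeds eps, then renormalize."""
    out = np.where(probs > eps, probs, 0.0)
    return out / out.sum()

def truncate_nucleus(probs, top_p=0.9):
    """Keep the smallest set of highest-probability tokens covering top_p mass."""
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = np.zeros_like(probs, dtype=bool)
    keep[order[:cutoff]] = True
    out = np.where(keep, probs, 0.0)
    return out / out.sum()

probs = np.array([0.45, 0.30, 0.15, 0.06, 0.03, 0.01])   # toy next-token distribution
print(truncate_threshold(probs))
print(truncate_nucleus(probs))
print(rng.choice(len(probs), p=truncate_nucleus(probs)))  # sample a token from the truncated distribution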
We provide a theoretical explanation for the effectiveness of truncation sampling by proving that truncation methods that discard tokens below some probability threshold (the most common type of truncation) can guarantee that all sampled tokens have nonzero true probability. However, thresholds are a coarse heuristic, and necessarily discard some tokens with nonzero true probability as well. In pursuit of a more precise sampling strategy, we show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability, without relying on a threshold. Based on our findings, we develop an experimental truncation strategy and present pilot studies demonstrating the promise of this type of algorithm. Our evaluations show that our method outperforms its threshold-based counterparts under automatic and human evaluation metrics for low-entropy (i.e., close to greedy) open-ended text generation. Our theoretical findings and pilot experiments provide both insight into why truncation sampling works, and make progress toward more expressive sampling algorithms that better surface the generative capabilities of large language models. \ No newline at end of file diff --git a/data/2024/iclr/Closing the Gap between TD Learning and Supervised Learning - A Generalisation Point of View b/data/2024/iclr/Closing the Gap between TD Learning and Supervised Learning - A Generalisation Point of View new file mode 100644 index 0000000000..c5ce621c60 --- /dev/null +++ b/data/2024/iclr/Closing the Gap between TD Learning and Supervised Learning - A Generalisation Point of View @@ -0,0 +1 @@ +Some reinforcement learning (RL) algorithms can stitch pieces of experience to solve a task never seen before during training. This oft-sought property is one of the few ways in which RL methods based on dynamic programming differ from RL methods based on supervised learning (SL). Yet, certain RL methods based on off-the-shelf SL algorithms achieve excellent results without an explicit mechanism for stitching; it remains unclear whether those methods forgo this important stitching property. This paper studies this question for the problems of achieving a target goal state and achieving a target return value. Our main result is to show that the stitching property corresponds to a form of combinatorial generalization: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen together in the training data. Our analysis shows that this sort of generalization is different from i.i.d. generalization. This connection between stitching and generalisation reveals why we should not expect SL-based RL methods to perform stitching, even in the limit of large datasets and models. Based on this analysis, we construct new datasets to explicitly test for this property, revealing that SL-based methods lack this stitching property and hence fail to perform combinatorial generalization. Nonetheless, the connection between stitching and combinatorial generalisation also suggests a simple remedy for improving generalisation in SL: data augmentation. We propose a temporal data augmentation and demonstrate that adding it to SL-based methods enables them to successfully complete tasks not seen together during training. On a high level, this connection illustrates the importance of combinatorial generalization for data efficiency in time-series data for tasks beyond RL, like audio, video, or text.
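The temporal data augmentation is described only at a high level above; a hindsight-style goal relabelling over toy trajectories gives the flavour, though the paper's exact augmentation may differ:

import numpy as np

rng = np.random.default_rng(0)

def temporal_relabel(trajectories, num_pairs=6):
    """Pair each sampled state with a goal drawn from a *later* state of the same
    trajectory, so the learner sees (state, goal) combinations that never appeared
    as labelled pairs in the raw data."""
    pairs = []
    for _ in range(num_pairs):
        traj = trajectories[rng.integers(len(trajectories))]
        t = rng.integers(len(traj) - 1)
        t_goal = rng.integers(t + 1, len(traj))     # strictly in the future
        pairs.append((traj[t], traj[t_goal]))
    return pairs

# Toy 1-D trajectories (each entry is a state).
trajs = [np.arange(0, 5), np.arange(4, 10)]
for state, goal in temporal_relabel(trajs):
    print(f"state={state}, goal={goal}")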
\ No newline at end of file diff --git a/data/2024/iclr/CoBIT: A Contrastive Bi-directional Image-Text Generation Model b/data/2024/iclr/CoBIT: A Contrastive Bi-directional Image-Text Generation Model new file mode 100644 index 0000000000..22e337ff4d --- /dev/null +++ b/data/2024/iclr/CoBIT: A Contrastive Bi-directional Image-Text Generation Model @@ -0,0 +1 @@ +The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with contrastive objective like CLIP, image-to-text generative objective like PaLI, or text-to-image generative objective like Parti. However, the three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement each other as contrasting provides global alignment capacity and generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generations. CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios. For instance, 82.7% in zero-shot ImageNet classification, 9.37 FID score in zero-shot text-to-image generation and 44.8 CIDEr in zero-shot captioning. \ No newline at end of file diff --git a/data/2024/iclr/CoLiDE: Concomitant Linear DAG Estimation b/data/2024/iclr/CoLiDE: Concomitant Linear DAG Estimation new file mode 100644 index 0000000000..43ac8c0786 --- /dev/null +++ b/data/2024/iclr/CoLiDE: Concomitant Linear DAG Estimation @@ -0,0 +1 @@ +We deal with the combinatorial problem of learning directed acyclic graph (DAG) structure from observational data adhering to a linear structural equation model (SEM). Leveraging advances in differentiable, nonconvex characterizations of acyclicity, recent efforts have advocated a continuous constrained optimization paradigm to efficiently explore the space of DAGs. Most existing methods employ lasso-type score functions to guide this search, which (i) require expensive penalty parameter retuning when the $\textit{unknown}$ SEM noise variances change across problem instances; and (ii) implicitly rely on limiting homoscedasticity assumptions. In this work, we propose a new convex score function for sparsity-aware learning of linear DAGs, which incorporates concomitant estimation of scale and thus effectively decouples the sparsity parameter from the exogenous noise levels. Regularization via a smooth, nonconvex acyclicity penalty term yields CoLiDE ($\textbf{Co}$ncomitant $\textbf{Li}$near $\textbf{D}$AG $\textbf{E}$stimation), a regression-based criterion amenable to efficient gradient computation and closed-form estimation of noise variances in heteroscedastic scenarios. Our algorithm outperforms state-of-the-art methods without incurring added complexity, especially when the DAGs are larger and the noise level profile is heterogeneous. 
We also find CoLiDE exhibits enhanced stability manifested via reduced standard deviations in several domain-specific metrics, underscoring the robustness of our novel linear DAG estimator. \ No newline at end of file diff --git a/data/2024/iclr/CoRe-GD: A Hierarchical Framework for Scalable Graph Visualization with GNNs b/data/2024/iclr/CoRe-GD: A Hierarchical Framework for Scalable Graph Visualization with GNNs new file mode 100644 index 0000000000..82c73afc19 --- /dev/null +++ b/data/2024/iclr/CoRe-GD: A Hierarchical Framework for Scalable Graph Visualization with GNNs @@ -0,0 +1 @@ +Graph Visualization, also known as Graph Drawing, aims to find geometric embeddings of graphs that optimize certain criteria. Stress is a widely used metric; stress is minimized when every pair of nodes is positioned at their shortest path distance. However, stress optimization presents computational challenges due to its inherent complexity and is usually solved using heuristics in practice. We introduce a scalable Graph Neural Network (GNN) based Graph Drawing framework with sub-quadratic runtime that can learn to optimize stress. Inspired by classical stress optimization techniques and force-directed layout algorithms, we create a coarsening hierarchy for the input graph. Beginning at the coarsest level, we iteratively refine and un-coarsen the layout, until we generate an embedding for the original graph. To enhance information propagation within the network, we propose a novel positional rewiring technique based on intermediate node positions. Our empirical evaluation demonstrates that the framework achieves state-of-the-art performance while remaining scalable. \ No newline at end of file diff --git a/data/2024/iclr/CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding b/data/2024/iclr/CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding new file mode 100644 index 0000000000..431cf5310a --- /dev/null +++ b/data/2024/iclr/CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding @@ -0,0 +1 @@ +3D visual grounding is the ability to localize objects in 3D scenes conditioned on utterances. Most existing methods devote the referring head to localizing the referred object directly, causing failure in complex scenarios. In addition, this design does not illustrate how and why the network reaches the final decision. In this paper, we address the question: Can we design an interpretable 3D visual grounding framework that has the potential to mimic the human perception system? To this end, we formulate the 3D visual grounding problem as a sequence-to-sequence Seq2Seq task by first predicting a chain of anchors and then the final target. Interpretability not only improves the overall performance but also helps us identify failure cases. Following the chain-of-thoughts approach enables us to decompose the referring task into interpretable intermediate steps, boosting the performance and making our framework extremely data-efficient. Moreover, our proposed framework can be easily integrated into any existing architecture. We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and ScanRefer benchmarks and show consistent performance gains compared to existing methods without requiring manually annotated data. Furthermore, our proposed framework, dubbed CoT3DRef, is significantly data-efficient: on the Sr3D dataset, when trained on only 10% of the data, we match the SOTA performance obtained by training on the entire dataset.
The code is available at https://eslambakr.github.io/cot3dref.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding b/data/2024/iclr/CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding new file mode 100644 index 0000000000..ab5910070c --- /dev/null +++ b/data/2024/iclr/CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding @@ -0,0 +1 @@ +A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. The vision-to-language and language-to-vision communication are iteratively performed until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on ARO top-1 accuracy). We also achieve state-of-the-art performances on traditional vision-language tasks such as referring expression comprehension and visual question answering. \ No newline at end of file diff --git a/data/2024/iclr/Code Representation Learning at Scale b/data/2024/iclr/Code Representation Learning at Scale new file mode 100644 index 0000000000..7344ed8124 --- /dev/null +++ b/data/2024/iclr/Code Representation Learning at Scale @@ -0,0 +1 @@ +Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most of the existing works on code representation learning train models at a hundred million parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masking language modeling and the structural aspects of programming languages. We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks by large margins.
To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts the cross-lingual semantic search performance; and (iv) how the pretraining scheme affects how downstream task performance scales with model size. \ No newline at end of file diff --git a/data/2024/iclr/CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules b/data/2024/iclr/CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules new file mode 100644 index 0000000000..923a2dc54f --- /dev/null +++ b/data/2024/iclr/CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules @@ -0,0 +1 @@ +Large Language Models (LLMs) have already become quite proficient at solving simpler programming tasks like those in the HumanEval or MBPP benchmarks. However, solving more complex and competitive programming tasks is still quite challenging for these models - possibly due to their tendency to generate solutions as monolithic code blocks instead of decomposing them into logical sub-tasks and sub-modules. On the other hand, experienced programmers instinctively write modularized code with abstraction for solving complex tasks, often reusing previously developed modules. To address this gap, we propose CodeChain, a novel framework for inference that elicits modularized code generation through a chain of self-revisions, each being guided by some representative sub-modules generated in previous iterations. Concretely, CodeChain first instructs the LLM to generate modularized code through chain-of-thought prompting. Then it applies a chain of self-revisions by iterating the two steps: 1) extracting and clustering the generated sub-modules and selecting the cluster representatives as the more generic and re-usable implementations, and 2) augmenting the original chain-of-thought prompt with these selected module-implementations and instructing the LLM to re-generate new modularized solutions. We find that by naturally encouraging the LLM to reuse the previously developed and verified sub-modules, CodeChain can significantly boost both modularity as well as correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is shown to be effective on both OpenAI LLMs as well as open-sourced LLMs like WizardCoder. We also conduct comprehensive ablation studies with different methods of prompting, number of clusters, model sizes, program qualities, etc., to provide useful insights that underpin CodeChain's success.
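Step 1) of the self-revision loop, clustering the extracted sub-modules and keeping one representative per cluster, can be sketched as follows; the hashed-bigram embedding is a toy stand-in for whatever code representation the authors actually use:

import numpy as np
from sklearn.cluster import KMeans

def embed(code, dim=64):
    """Toy code embedding: hashed character-bigram counts."""
    v = np.zeros(dim)
    for a, b in zip(code, code[1:]):
        v[(ord(a) * 31 + ord(b)) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Hypothetical sub-modules extracted from several sampled solutions.
submodules = [
    "def gcd(a, b):\n    while b: a, b = b, a % b\n    return a",
    "def gcd(x, y):\n    return x if y == 0 else gcd(y, x % y)",
    "def is_prime(n):\n    return n > 1 and all(n % k for k in range(2, int(n**0.5) + 1))",
    "def prime(n):\n    return n > 1 and all(n % d != 0 for d in range(2, n))",
]

X = np.stack([embed(s) for s in submodules])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Per cluster, keep the sub-module closest to the centroid as its representative,
# which would then be fed back into the revision prompt.
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    rep = submodules[members[np.argmin(dists)]]
    print(f"cluster {c} representative:\n{rep}\n")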
\ No newline at end of file diff --git a/data/2024/iclr/Coeditor: Leveraging Repo-level Diffs for Code Auto-editing b/data/2024/iclr/Coeditor: Leveraging Repo-level Diffs for Code Auto-editing new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Combinatorial Bandits for Maximum Value Reward Function under Value-Index Feedback b/data/2024/iclr/Combinatorial Bandits for Maximum Value Reward Function under Value-Index Feedback new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Combining Axes Preconditioners through Kronecker Approximation for Deep Learning b/data/2024/iclr/Combining Axes Preconditioners through Kronecker Approximation for Deep Learning new file mode 100644 index 0000000000..768f95206f --- /dev/null +++ b/data/2024/iclr/Combining Axes Preconditioners through Kronecker Approximation for Deep Learning @@ -0,0 +1 @@ +Adaptive regularization based optimization methods such as full-matrix Adagrad, which use gradient second-moment information, hold significant potential for fast convergence in deep neural network (DNN) training, but are memory intensive and computationally demanding for large neural nets. We develop a technique called Combining AxeS PReconditioners (CASPR), which optimizes matrix-shaped DNN parameters by finding different preconditioners for each mode/axis of the parameter and combining them using a Kronecker-sum based approximation. The Kronecker-sum based combination allows us to show that CASPR is ordered between a well-known Kronecker product based combination, Shampoo, and full-matrix Adagrad preconditioners in Loewner order; as a result, it is nearer to full-matrix Adagrad than Shampoo. We also show tighter convergence guarantees in stochastic optimization compared to Shampoo. Furthermore, our experiments demonstrate that CASPR approximates the gradient second-moment matrix in full-matrix Adagrad more accurately, and shows significant improvement in training and generalization performance compared to existing practical adaptive regularization based methods such as Shampoo and Adam in a variety of tasks, including a graph neural network on OGBG-molpcba, a Transformer on a universal dependencies dataset, and auto-regressive large language modeling on the C4 dataset. \ No newline at end of file diff --git a/data/2024/iclr/Communication-Efficient Federated Non-Linear Bandit Optimization b/data/2024/iclr/Communication-Efficient Federated Non-Linear Bandit Optimization new file mode 100644 index 0000000000..9f8ed63498 --- /dev/null +++ b/data/2024/iclr/Communication-Efficient Federated Non-Linear Bandit Optimization @@ -0,0 +1 @@ +Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g. mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function classes with bounded RKHS norm, which severely hinders its practical usage.
In this paper, we propose a new algorithm, named Fed-GO-UCB, for federated bandit optimization with a generic non-linear objective function. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve a sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are a distributed regression oracle and an individual confidence set construction, which can be of independent interest. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm. \ No newline at end of file diff --git a/data/2024/iclr/Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates b/data/2024/iclr/Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates new file mode 100644 index 0000000000..387c0a430c --- /dev/null +++ b/data/2024/iclr/Communication-Efficient Gradient Descent-Accent Methods for Distributed Variational Inequalities: Unified Analysis and Local Updates @@ -0,0 +1 @@ +Distributed and federated learning algorithms and techniques have been associated primarily with minimization problems. However, with the rise of minimax optimization and variational inequality problems in machine learning, the necessity of designing efficient distributed/federated learning approaches for these problems is becoming more apparent. In this paper, we provide a unified convergence analysis of communication-efficient local training methods for distributed variational inequality problems (VIPs). Our approach is based on a general key assumption on the stochastic estimates that allows us to propose and analyze several novel local training algorithms under a single framework for solving a class of structured non-monotone VIPs. We present the first local gradient descent-accent algorithms with provable improved communication complexity for solving distributed variational inequalities on heterogeneous data. The general algorithmic framework recovers state-of-the-art algorithms and their sharp convergence guarantees when the setting is specialized to minimization or minimax optimization problems. Finally, we demonstrate the strong performance of the proposed algorithms compared to state-of-the-art methods when solving federated minimax optimization problems. \ No newline at end of file diff --git a/data/2024/iclr/CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models b/data/2024/iclr/CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models new file mode 100644 index 0000000000..77871e4315 --- /dev/null +++ b/data/2024/iclr/CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models @@ -0,0 +1 @@ +A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs.
Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities. \ No newline at end of file diff --git a/data/2024/iclr/Complete and Efficient Graph Transformers for Crystal Material Property Prediction b/data/2024/iclr/Complete and Efficient Graph Transformers for Crystal Material Property Prediction new file mode 100644 index 0000000000..faa42b91d3 --- /dev/null +++ b/data/2024/iclr/Complete and Efficient Graph Transformers for Crystal Material Property Prediction @@ -0,0 +1 @@ +Crystal structures are characterized by atomic bases within a primitive unit cell that repeats along a regular lattice throughout 3D space. The periodic and infinite nature of crystals poses unique challenges for geometric graph representation learning. Specifically, constructing graphs that effectively capture the complete geometric information of crystals and handle chiral crystals remains an unsolved and challenging problem. In this paper, we introduce a novel approach that utilizes the periodic patterns of unit cells to establish the lattice-based representation for each atom, enabling efficient and expressive graph representations of crystals. Furthermore, we propose ComFormer, a SE(3) transformer designed specifically for crystalline materials. ComFormer includes two variants; namely, iComFormer that employs invariant geometric descriptors of Euclidean distances and angles, and eComFormer that utilizes equivariant vector representations. Experimental results demonstrate the state-of-the-art predictive accuracy of ComFormer variants on various tasks across three widely-used crystal benchmarks. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS). \ No newline at end of file diff --git a/data/2024/iclr/Complex priors and flexible inference in recurrent circuits with dendritic nonlinearities b/data/2024/iclr/Complex priors and flexible inference in recurrent circuits with dendritic nonlinearities new file mode 100644 index 0000000000..21afbd0ef3 --- /dev/null +++ b/data/2024/iclr/Complex priors and flexible inference in recurrent circuits with dendritic nonlinearities @@ -0,0 +1 @@ +Despite many successful examples in which probabilistic inference can account for perception, we have little understanding of how the brain represents and uses structured priors that capture the complexity of natural input statistics. 
Here we construct a recurrent circuit model that can implicitly represent priors over latent variables, and combine them with sensory and contextual sources of information to encode task-specific posteriors. Inspired by the recent success of diffusion models as means of learning and using priors over images, our model uses dendritic nonlinearities optimized for denoising, and stochastic somatic integration with the degree of noise modulated by an oscillating global signal. Combining these elements into a recurrent network yields a dynamical system that samples from the prior at a rate prescribed by the period of the global oscillator. Additional inputs reflecting sensory or top-down contextual information alter these dynamics to generate samples from the corresponding posterior, with different input gating patterns selecting different inference tasks. We demonstrate that this architecture can sample from low dimensional nonlinear manifolds and multimodal posteriors. Overall, the model provides a new framework for circuit-level representation of probabilistic information, in a format that facilitates flexible inference. \ No newline at end of file diff --git a/data/2024/iclr/Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis b/data/2024/iclr/Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis new file mode 100644 index 0000000000..e10e7b4513 --- /dev/null +++ b/data/2024/iclr/Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis @@ -0,0 +1 @@ +Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce \textit{depth disentanglement training} to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce \textit{soft guidance}, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, \textsc{Compose and Conquer (CnC)}, unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/ \ No newline at end of file diff --git a/data/2024/iclr/Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization b/data/2024/iclr/Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization new file mode 100644 index 0000000000..59b35d2dd7 --- /dev/null +++ b/data/2024/iclr/Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization @@ -0,0 +1 @@ +We investigate composed image retrieval with text feedback. Users gradually look for the target of interest by moving from coarse to fine-grained feedback. 
However, existing methods merely focus on the latter, i.e., fine-grained search, by harnessing positive and negative pairs during training. This pair-based paradigm only considers the one-to-one distance between a pair of specific points, which is not aligned with the one-to-many coarse-grained retrieval process and compromises the recall rate. In an attempt to fill this gap, we introduce a unified learning approach to simultaneously modeling the coarse- and fine-grained retrieval by considering the multi-grained uncertainty. The key idea underpinning the proposed method is to integrate fine- and coarse-grained retrieval as matching data points with small and large fluctuations, respectively. Specifically, our method contains two modules: uncertainty modeling and uncertainty regularization. (1) The uncertainty modeling simulates the multi-grained queries by introducing identically distributed fluctuations in the feature space. (2) Based on the uncertainty modeling, we further introduce uncertainty regularization to adapt the matching objective according to the fluctuation range. Compared with existing methods, the proposed strategy explicitly prevents the model from pushing away potential candidates in the early stage, and thus improves the recall rate. On the three public datasets, i.e., FashionIQ, Fashion200k, and Shoes, the proposed method has achieved +4.03%, +3.38%, and +2.40% Recall@50 accuracy over a strong baseline, respectively. \ No newline at end of file diff --git a/data/2024/iclr/Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning b/data/2024/iclr/Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning new file mode 100644 index 0000000000..703c9ee277 --- /dev/null +++ b/data/2024/iclr/Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning @@ -0,0 +1 @@ +Offline reinforcement learning (RL) is a compelling framework for learning optimal policies from past experiences without additional interaction with the environment. Nevertheless, offline RL inevitably faces the problem of distributional shifts, where the states and actions encountered during policy execution may not be in the training dataset distribution. A common solution involves incorporating conservatism into the policy or the value function to safeguard against uncertainties and unknowns. In this work, we focus on achieving the same objectives of conservatism but from a different perspective. We propose COmpositional COnservatism with Anchor-seeking (COCOA) for offline RL, an approach that pursues conservatism in a compositional manner on top of the transductive reparameterization (Netanyahu et al., 2023), which decomposes the input variable (the state in our case) into an anchor and its difference from the original input. Our COCOA seeks both in-distribution anchors and differences by utilizing the learned reverse dynamics model, encouraging conservatism in the compositional input space for the policy or value function. Such compositional conservatism is independent of and agnostic to the prevalent behavioral conservatism in offline RL. We apply COCOA to four state-of-the-art offline RL algorithms and evaluate them on the D4RL benchmark, where COCOA generally improves the performance of each algorithm. The code is available at https://github.com/runamu/compositional-conservatism. 
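The anchor-plus-difference reparameterization at the core of COCOA can be sketched schematically; here the anchor seeker is a plain nearest-neighbour lookup over dataset states and the policy is a random stub, whereas the paper learns anchor seeking with a reverse dynamics model:

import numpy as np

rng = np.random.default_rng(0)

# Transductive decomposition: a state s is rewritten as an in-distribution
# anchor plus a difference, s = anchor + delta, and the policy consumes
# (anchor, delta) instead of s.
dataset_states = rng.normal(size=(500, 4))           # states seen in the offline data
W = rng.normal(size=(2, 8))                          # stand-in policy weights

def decompose(state):
    idx = np.argmin(np.linalg.norm(dataset_states - state, axis=1))
    anchor = dataset_states[idx]
    return anchor, state - anchor

def policy(anchor, delta):
    return np.tanh(W @ np.concatenate([anchor, delta]))

s = rng.normal(size=4) * 2.0                         # possibly out-of-distribution state
anchor, delta = decompose(s)
print("anchor:", np.round(anchor, 2), "delta:", np.round(delta, 2))
print("action:", np.round(policy(anchor, delta), 3))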
\ No newline at end of file diff --git a/data/2024/iclr/Compositional Generative Inverse Design b/data/2024/iclr/Compositional Generative Inverse Design new file mode 100644 index 0000000000..557f5f025b --- /dev/null +++ b/data/2024/iclr/Compositional Generative Inverse Design @@ -0,0 +1 @@ +Inverse design, where we seek to design input variables in order to optimize an underlying objective function, is an important problem that arises in fields ranging from mechanical engineering to aerospace engineering. Inverse design is typically formulated as an optimization problem, with recent works leveraging optimization across learned dynamics models. However, as models are optimized, they tend to fall into adversarial modes, preventing effective sampling. We illustrate that by instead optimizing over the learned energy function captured by the diffusion model, we can avoid such adversarial examples and significantly improve design performance. We further illustrate how such a design system is compositional, enabling us to combine multiple different diffusion models representing subcomponents of our desired system to design systems with every specified component. In an N-body interaction task and a challenging 2D multi-airfoil design task, we demonstrate that by composing the learned diffusion model at test time, our method allows us to design initial states and boundary shapes that are more complex than those in the training data. Our method generalizes to more objects for the N-body dataset and discovers formation flying to minimize drag in the multi-airfoil design task. Project website and code can be found at https://github.com/AI4Science-WestlakeU/cindm. \ No newline at end of file diff --git a/data/2024/iclr/Compositional Preference Models for Aligning LMs b/data/2024/iclr/Compositional Preference Models for Aligning LMs new file mode 100644 index 0000000000..45aa9084e7 --- /dev/null +++ b/data/2024/iclr/Compositional Preference Models for Aligning LMs @@ -0,0 +1 @@ +As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. Through these simple steps, CPMs allow us to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment. Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.
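The aggregation step of a CPM, logistic regression over interpretable feature scores, is straightforward to sketch; the per-feature scores below are simulated at random in place of a prompted LM, and the "true" feature weights are hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Each response is scored along interpretable features (e.g. helpfulness,
# factuality, readability); the preference model is a logistic regression over
# the feature-score differences between the chosen and rejected responses.
n_pairs, n_features = 200, 3
true_w = np.array([1.5, 0.8, 0.2])                    # hypothetical feature importances

chosen = rng.normal(size=(n_pairs, n_features))
rejected = rng.normal(size=(n_pairs, n_features))
diff = chosen - rejected
prob_prefer_chosen = 1.0 / (1.0 + np.exp(-diff @ true_w))
y = (rng.random(n_pairs) < prob_prefer_chosen).astype(int)   # 1 = chosen preferred

cpm = LogisticRegression().fit(diff, y)
print("learned feature weights:", np.round(cpm.coef_[0], 2))

def preference_score(feature_scores):
    """Scalar reward for a single response: weighted sum of its feature scores."""
    return float(feature_scores @ cpm.coef_[0])

print("score of a new response:", round(preference_score(rng.normal(size=n_features)), 3))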
\ No newline at end of file diff --git a/data/2024/iclr/Compressing LLMs: The Truth is Rarely Pure and Never Simple b/data/2024/iclr/Compressing LLMs: The Truth is Rarely Pure and Never Simple new file mode 100644 index 0000000000..627d70f8dd --- /dev/null +++ b/data/2024/iclr/Compressing LLMs: The Truth is Rarely Pure and Never Simple @@ -0,0 +1 @@ +Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs that achieve 50-60% sparsity and reduce the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncompressed baseline. As recent research efforts are focused on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which can remain closely aligned with their dense counterparts on perplexity even as perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity in knowledge-intensive tasks; current quantization methods are more successful than pruning; yet, pruned LLMs even at $\geq 50$% sparsity are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. The code to reproduce our results is available at https://github.com/VITA-Group/llm-kick. \ No newline at end of file diff --git a/data/2024/iclr/Compressing Latent Space via Least Volume b/data/2024/iclr/Compressing Latent Space via Least Volume new file mode 100644 index 0000000000..1cfaa9527a --- /dev/null +++ b/data/2024/iclr/Compressing Latent Space via Least Volume @@ -0,0 +1 @@ +This paper introduces Least Volume, a simple yet effective regularization inspired by geometric intuition that can reduce the number of latent dimensions needed by an autoencoder without requiring any prior knowledge of the intrinsic dimensionality of the dataset. We show that the Lipschitz continuity of the decoder is the key to making it work, provide a proof that PCA is just a linear special case of it, and reveal that it has a similar PCA-like importance ordering effect when applied to nonlinear models. We demonstrate the intuition behind the regularization on some pedagogical toy problems, and its effectiveness on several benchmark problems, including MNIST, CIFAR-10 and CelebA.
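One plausible reading of the regularizer, penalizing the geometric mean of the per-dimension latent spreads while keeping the decoder Lipschitz via spectral normalization, can be sketched as follows; the sizes, penalty weight, and spectral-norm choice are illustrative assumptions rather than the paper's exact recipe:

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

torch.manual_seed(0)

# Autoencoder with a Lipschitz-constrained decoder and a "volume" penalty on the
# latent codes: dimensions the data does not need are squeezed toward zero spread.
enc = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 8))
dec = nn.Sequential(spectral_norm(nn.Linear(8, 16)), nn.ReLU(), spectral_norm(nn.Linear(16, 32)))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
lam, eta = 1e-2, 1e-3

x = torch.randn(256, 32)                              # toy data batch
for step in range(200):
    opt.zero_grad()
    z = enc(x)
    recon = nn.functional.mse_loss(dec(z), x)
    std = z.std(dim=0)                                # per-latent-dimension spread
    volume = torch.exp(torch.log(std + eta).mean())   # geometric mean of the spreads
    loss = recon + lam * volume
    loss.backward()
    opt.step()

print("per-dimension latent std:", z.std(dim=0).detach())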
\ No newline at end of file diff --git a/data/2024/iclr/ConR: Contrastive Regularizer for Deep Imbalanced Regression b/data/2024/iclr/ConR: Contrastive Regularizer for Deep Imbalanced Regression new file mode 100644 index 0000000000..ee6efa75fc --- /dev/null +++ b/data/2024/iclr/ConR: Contrastive Regularizer for Deep Imbalanced Regression @@ -0,0 +1 @@ +Imbalanced distributions are ubiquitous in real-world data. They make it difficult for Deep Neural Networks to represent minority labels and to avoid bias towards majority labels. The extensive body of work on imbalanced learning addresses categorical label spaces but fails to effectively extend to regression problems where the label space is continuous. Local and global correlations among continuous labels provide valuable insights towards effectively modelling relationships in feature space. In this work, we propose ConR, a contrastive regularizer that models global and local label similarities in feature space and prevents the features of minority samples from being collapsed into their majority neighbours. ConR discerns the disagreements between the label space and feature space and imposes a penalty on these disagreements. ConR addresses the continuous nature of label space with two main strategies in a contrastive manner: incorrect proximities are penalized in proportion to the label similarities and the correct ones are encouraged to model local similarities. ConR consolidates essential considerations into a generic, easy-to-integrate, and efficient method that effectively addresses deep imbalanced regression. Moreover, ConR is orthogonal to existing approaches and smoothly extends to uni- and multi-dimensional label spaces. Our comprehensive experiments show that ConR significantly boosts the performance of all the state-of-the-art methods on four large-scale deep imbalanced regression benchmarks. Our code is publicly available at https://github.com/BorealisAI/ConR. \ No newline at end of file diff --git a/data/2024/iclr/Concept Bottleneck Generative Models b/data/2024/iclr/Concept Bottleneck Generative Models new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Conditional Information Bottleneck Approach for Time Series Imputation b/data/2024/iclr/Conditional Information Bottleneck Approach for Time Series Imputation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Conditional Instrumental Variable Regression with Representation Learning for Causal Inference b/data/2024/iclr/Conditional Instrumental Variable Regression with Representation Learning for Causal Inference new file mode 100644 index 0000000000..d1cc8c505a --- /dev/null +++ b/data/2024/iclr/Conditional Instrumental Variable Regression with Representation Learning for Causal Inference @@ -0,0 +1 @@ +This paper studies the challenging problem of estimating causal effects from observational data, in the presence of unobserved confounders. The two-stage least squares (TSLS) method and its variants with a standard instrumental variable (IV) are commonly used to eliminate confounding bias, including the bias caused by unobserved confounders, but they rely on the linearity assumption. Moreover, the strict condition of unconfounded instruments imposed on a standard IV is too strong to be practical.
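The TSLS baseline referenced above is standard enough to sketch on synthetic data with a single instrument and an unobserved confounder; all coefficients below are arbitrary, chosen only to make the confounding bias visible:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the instrument z shifts the treatment x but affects the
# outcome y only through x, while u confounds both x and y.
n = 5000
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument
x = 0.8 * z + 0.9 * u + rng.normal(size=n)    # treatment
y = 1.5 * x + 1.2 * u + rng.normal(size=n)    # outcome; true causal effect = 1.5

def ols(A, b):
    """Ordinary least squares with an intercept; returns [intercept, slope]."""
    return np.linalg.lstsq(np.column_stack([np.ones(len(A)), A]), b, rcond=None)[0]

naive = ols(x, y)[1]                                     # biased by the confounder
x_hat = np.column_stack([np.ones(n), z]) @ ols(z, x)     # stage 1: project x onto z
tsls = ols(x_hat, y)[1]                                  # stage 2: regress y on the projection

print(f"naive OLS estimate: {naive:.2f}")                # noticeably above 1.5
print(f"TSLS estimate:      {tsls:.2f}")                 # close to 1.5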
To address these challenging and practical problems of the standard IV method (linearity assumption and the strict condition), in this paper, we use a conditional IV (CIV) to relax the unconfounded instrument condition of standard IV and propose a non-linear CIV regression with Confounding Balancing Representation Learning, CBRL.CIV, for jointly eliminating the confounding bias from unobserved confounders and balancing the observed confounders, without the linearity assumption. We theoretically demonstrate the soundness of CBRL.CIV. Extensive experiments on synthetic and two real-world datasets show the competitive performance of CBRL.CIV against state-of-the-art IV-based estimators and superiority in dealing with the non-linear situation. \ No newline at end of file diff --git a/data/2024/iclr/Conditional Variational Diffusion Models b/data/2024/iclr/Conditional Variational Diffusion Models new file mode 100644 index 0000000000..81f909a4ec --- /dev/null +++ b/data/2024/iclr/Conditional Variational Diffusion Models @@ -0,0 +1 @@ +Inverse problems aim to determine parameters from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-costly and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results. \ No newline at end of file diff --git a/data/2024/iclr/Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models b/data/2024/iclr/Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models new file mode 100644 index 0000000000..b6716ffbcb --- /dev/null +++ b/data/2024/iclr/Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models @@ -0,0 +1 @@ +Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent. However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models, a phenomenon known as reward overoptimization. To investigate this issue in depth, we introduce the Text-Image Alignment Assessment (TIA2) benchmark, which comprises a diverse collection of text prompts, images, and human annotations. Our evaluation of several state-of-the-art reward models on this benchmark reveals their frequent misalignment with human assessment. We empirically demonstrate that overoptimization occurs notably when a poorly aligned reward model is used as the fine-tuning objective. 
To address this, we propose TextNorm, a simple method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts. We demonstrate that incorporating the confidence-calibrated rewards in fine-tuning effectively reduces overoptimization, resulting in twice as many wins in human evaluation for text-image alignment compared against the baseline reward models. \ No newline at end of file diff --git a/data/2024/iclr/Confidential-DPproof: Confidential Proof of Differentially Private Training b/data/2024/iclr/Confidential-DPproof: Confidential Proof of Differentially Private Training new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Conformal Inductive Graph Neural Networks b/data/2024/iclr/Conformal Inductive Graph Neural Networks new file mode 100644 index 0000000000..ea89153920 --- /dev/null +++ b/data/2024/iclr/Conformal Inductive Graph Neural Networks @@ -0,0 +1 @@ +Conformal prediction (CP) transforms any model's output into prediction sets guaranteed to include (cover) the true label. CP requires exchangeability, a relaxation of the i.i.d. assumption, to obtain a valid distribution-free coverage guarantee. This makes it directly applicable to transductive node-classification. However, conventional CP cannot be applied in inductive settings due to the implicit shift in the (calibration) scores caused by message passing with the new nodes. We fix this issue for both cases of node and edge-exchangeable graphs, recovering the standard coverage guarantee without sacrificing statistical efficiency. We further prove that the guarantee holds independently of the prediction time, e.g. upon arrival of a new node/edge or at any subsequent moment. \ No newline at end of file diff --git a/data/2024/iclr/Conformal Language Modeling b/data/2024/iclr/Conformal Language Modeling new file mode 100644 index 0000000000..7d42fb7f5b --- /dev/null +++ b/data/2024/iclr/Conformal Language Modeling @@ -0,0 +1 @@ +We propose a novel approach to conformal prediction for generative language models (LMs). Standard conformal prediction produces prediction sets -- in place of single predictions -- that have rigorous, statistical performance guarantees. LM responses are typically sampled from the model's predicted distribution over the large, combinatorial output space of natural language. Translating this process to conformal prediction, we calibrate a stopping rule for sampling different outputs from the LM that get added to a growing set of candidates until we are confident that the output set is sufficient. Since some samples may be low-quality, we also simultaneously calibrate and apply a rejection rule for removing candidates from the output set to reduce noise. Similar to conformal prediction, we prove that the sampled set returned by our procedure contains at least one acceptable answer with high probability, while still being empirically precise (i.e., small) on average. Furthermore, within this set of candidate responses, we show that we can also accurately identify subsets of individual components -- such as phrases or sentences -- that are each independently correct (e.g., that are not"hallucinations"), again with statistical guarantees. We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants. 
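A deliberately simplified, split-conformal-style sketch of the stopping-rule calibration described above follows. It ignores the rejection rule and the component-level guarantees, and the admission data, function name, and finite-sample correction are assumptions for exposition rather than the paper's exact procedure.

```python
import numpy as np

def calibrate_sampling_budget(cal_ok, alpha=0.1):
    """Pick the smallest number of samples k such that, on calibration prompts,
    the risk of the first-k set missing every acceptable answer is <= alpha.

    cal_ok: boolean array of shape (n_prompts, max_samples); cal_ok[i, j] is True
            if the j-th sampled response to prompt i is acceptable according to
            some admission function (e.g. a human or automatic judge).
    """
    n, max_k = cal_ok.shape
    for k in range(1, max_k + 1):
        # A prompt is "covered" if at least one of its first k samples is acceptable.
        covered = cal_ok[:, :k].any(axis=1)
        # Finite-sample (conformal-style) correction on the empirical miss rate.
        risk = ((~covered).sum() + 1) / (n + 1)
        if risk <= alpha:
            return k
    return max_k  # even the full budget does not reach the target risk

# Toy calibration data: 200 prompts, up to 20 samples each,
# each sample acceptable with probability 0.3 (purely synthetic).
rng = np.random.default_rng(0)
cal_ok = rng.random((200, 20)) < 0.3
k_star = calibrate_sampling_budget(cal_ok, alpha=0.1)
print("calibrated sampling budget:", k_star)
```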
\ No newline at end of file diff --git a/data/2024/iclr/Conformal Risk Control b/data/2024/iclr/Conformal Risk Control new file mode 100644 index 0000000000..e975a95c0d --- /dev/null +++ b/data/2024/iclr/Conformal Risk Control @@ -0,0 +1 @@ +Score-based generative modeling, informally referred to as diffusion models, continue to grow in popularity across several important domains and tasks. While they provide high-quality and diverse samples from empirical distributions, important questions remain on the reliability and trustworthiness of these sampling procedures for their responsible use in critical scenarios. Conformal prediction is a modern tool to construct finite-sample, distribution-free uncertainty guarantees for any black-box predictor. In this work, we focus on image-to-image regression tasks and we present a generalization of the Risk-Controlling Prediction Sets (RCPS) procedure, that we term $K$-RCPS, which allows to $(i)$ provide entrywise calibrated intervals for future samples of any diffusion model, and $(ii)$ control a certain notion of risk with respect to a ground truth image with minimal mean interval length. Differently from existing conformal risk control procedures, ours relies on a novel convex optimization approach that allows for multidimensional risk control while provably minimizing the mean interval length. We illustrate our approach on two real-world image denoising problems: on natural images of faces as well as on computed tomography (CT) scans of the abdomen, demonstrating state of the art performance. \ No newline at end of file diff --git a/data/2024/iclr/Confronting Reward Model Overoptimization with Constrained RLHF b/data/2024/iclr/Confronting Reward Model Overoptimization with Constrained RLHF new file mode 100644 index 0000000000..1ec5d7e69c --- /dev/null +++ b/data/2024/iclr/Confronting Reward Model Overoptimization with Constrained RLHF @@ -0,0 +1 @@ +Large language models are typically aligned with human preferences by optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run. 
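The sketch below illustrates one plausible form of the constrained objective and the Lagrange-multiplier update described above; the component thresholds, reward values, and update rule are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def lagrangian_reward(component_rewards, thresholds, multipliers):
    """Combine component reward models via a Lagrangian (illustrative sketch).

    The policy maximizes the summed proxy reward while each component RM is
    constrained not to be pushed past its estimated threshold of usefulness.
    """
    r = component_rewards
    return r.sum() - np.dot(multipliers, r - thresholds)

def dual_ascent_step(multipliers, component_rewards, thresholds, lr=0.01):
    """Gradient ascent on the multipliers: a multiplier grows while its constraint
    is violated and shrinks back toward 0 once the component is inside its range,
    so the multipliers act as dynamic weights on the component RMs."""
    violation = component_rewards - thresholds
    return np.maximum(multipliers + lr * violation, 0.0)

# Toy rollout: two component RMs with hypothetical usefulness thresholds.
thresholds = np.array([1.0, 0.5])
multipliers = np.zeros(2)
for step in range(1000):
    component_rewards = np.array([1.2, 0.3])   # stand-in for a policy rollout's rewards
    multipliers = dual_ascent_step(multipliers, component_rewards, thresholds)
objective = lagrangian_reward(np.array([1.2, 0.3]), thresholds, multipliers)
```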
\ No newline at end of file diff --git a/data/2024/iclr/ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection b/data/2024/iclr/ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection new file mode 100644 index 0000000000..fec7139f9a --- /dev/null +++ b/data/2024/iclr/ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection @@ -0,0 +1 @@ +Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimate scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25$\%$ and 28.19$\%$ (FPR95) on CIFAR-100 and ImageNet-1K, respectively. \ No newline at end of file diff --git a/data/2024/iclr/Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data b/data/2024/iclr/Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data new file mode 100644 index 0000000000..ee519ae2b6 --- /dev/null +++ b/data/2024/iclr/Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data @@ -0,0 +1 @@ +Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation. 
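One way to picture the collapse and corrupt steps is sketched below, assuming CLIP-style unit-norm embeddings; the direction of the mean-gap shift, the noise scale `sigma`, and all names are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def collapse_and_corrupt(text_emb, image_emb, sigma=0.1, rng=None):
    """Sketch of the 'collapse' and 'corrupt' steps for bridging the modality gap.

    text_emb, image_emb: (n, d) and (m, d) embeddings from a contrastive model
    such as CLIP, collected from unpaired uni-modal data. During training only
    text embeddings are available; collapsing the mean gap and adding noise lets
    them stand in for the image embeddings seen at test time.
    """
    rng = rng or np.random.default_rng(0)
    # Connect: both modalities already live in one contrastive space.
    # Collapse: remove the constant offset (modality gap) between the two means.
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)
    shifted = text_emb + gap
    # Corrupt: Gaussian noise accounts for the residual per-sample mismatch.
    noisy = shifted + sigma * rng.standard_normal(shifted.shape)
    # Re-normalize, since contrastive embeddings are typically unit length.
    return noisy / np.linalg.norm(noisy, axis=1, keepdims=True)

# Toy embeddings standing in for CLIP text/image features.
rng = np.random.default_rng(1)
text_emb = rng.standard_normal((512, 64))
image_emb = rng.standard_normal((256, 64)) + 0.5   # constant offset mimics the gap
train_inputs = collapse_and_corrupt(text_emb, image_emb)
```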
\ No newline at end of file diff --git a/data/2024/iclr/Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers b/data/2024/iclr/Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers new file mode 100644 index 0000000000..78489bfac8 --- /dev/null +++ b/data/2024/iclr/Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers @@ -0,0 +1 @@ +Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms. \ No newline at end of file diff --git a/data/2024/iclr/Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning b/data/2024/iclr/Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning new file mode 100644 index 0000000000..1ccb818051 --- /dev/null +++ b/data/2024/iclr/Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning @@ -0,0 +1 @@ +Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning framework utilizing spatio-temporal abstractions to generalize better in novel situations. It automatically decomposes the given task into smaller, more manageable subtasks, and thus enables sparse decision-making and focused computation on the relevant parts of the environment. The decomposition relies on the extraction of an abstracted proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end from hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods. 
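As a rough illustration of planning over such an abstracted proxy graph, the sketch below runs a shortest-path search over hypothetical checkpoints with hand-coded edge costs; in the actual method both the vertices and the edge estimates would be learned from hindsight, so everything here is an assumption made for exposition.

```python
import heapq

def plan_on_proxy_graph(edges, start, goal):
    """Shortest-path planning over an abstracted proxy graph (illustrative).

    edges: dict mapping a checkpoint state to a list of (next_checkpoint, cost)
           pairs; in a Skipper-like method both the checkpoints and the costs
           would come from learned estimators, not hand-coded tables.
    Returns the checkpoint sequence handed to the low-level policy as subgoals.
    """
    frontier = [(0.0, start, [start])]
    best = {}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in best and best[node] <= cost:
            continue
        best[node] = cost
        for nxt, c in edges.get(node, []):
            heapq.heappush(frontier, (cost + c, nxt, path + [nxt]))
    return None  # goal unreachable under current edge estimates

# Toy proxy graph with hypothetical checkpoints A..D and stand-in learned costs.
edges = {"A": [("B", 1.0), ("C", 2.5)], "B": [("D", 2.0)], "C": [("D", 0.4)]}
subgoal_sequence = plan_on_proxy_graph(edges, "A", "D")
print(subgoal_sequence)   # ['A', 'C', 'D']
```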
\ No newline at end of file diff --git a/data/2024/iclr/Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training b/data/2024/iclr/Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training new file mode 100644 index 0000000000..afaaa40b4d --- /dev/null +++ b/data/2024/iclr/Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training @@ -0,0 +1 @@ +Adversarial training improves the robustness of neural networks against adversarial attacks, albeit at the expense of the trade-off between standard and robust generalization. To unveil the underlying factors driving this phenomenon, we examine the layer-wise learning capabilities of neural networks during the transition from a standard to an adversarial setting. Our empirical findings demonstrate that selectively updating specific layers while preserving others can substantially enhance the network's learning capacity. We therefore propose CURE, a novel training framework that leverages a gradient prominence criterion to perform selective conservation, updating, and revision of weights. Importantly, CURE is designed to be dataset- and architecture-agnostic, ensuring its applicability across various scenarios. It effectively tackles both memorization and overfitting issues, thus enhancing the trade-off between robustness and generalization; additionally, this training approach also aids in mitigating "robust overfitting". Furthermore, our study provides valuable insights into the mechanisms of selective adversarial training and offers a promising avenue for future research. \ No newline at end of file diff --git a/data/2024/iclr/Consistency Training with Learnable Data Augmentation for Graph Anomaly Detection with Limited Supervision b/data/2024/iclr/Consistency Training with Learnable Data Augmentation for Graph Anomaly Detection with Limited Supervision new file mode 100644 index 0000000000..2a7c47c1f2 --- /dev/null +++ b/data/2024/iclr/Consistency Training with Learnable Data Augmentation for Graph Anomaly Detection with Limited Supervision @@ -0,0 +1 @@ +We conduct extensive experiments on four benchmark datasets, alongside one real-world dataset derived from a production environment. The ensuing results highlight the superiority of our proposed ConsisGAD, as it exhibits enhanced performance in comparison to state-of-the-art baselines. \ No newline at end of file diff --git a/data/2024/iclr/Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion b/data/2024/iclr/Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion new file mode 100644 index 0000000000..1895d08e22 --- /dev/null +++ b/data/2024/iclr/Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion @@ -0,0 +1 @@ +Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can -- in a single forward pass -- output scores (i.e., gradients of log-density) and enables unrestricted traversal between any initial and final time along the Probability Flow Ordinary Differential Equation (ODE) in a diffusion process. 
CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance and achieves new state-of-the-art FIDs for single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at 64x64 resolution (FID 1.92). CTM also enables a new family of sampling schemes, both deterministic and stochastic, involving long jumps along the ODE solution trajectories. It consistently improves sample quality as computational budgets increase, avoiding the degradation seen in CM. Furthermore, unlike CM, CTM's access to the score function can streamline the adoption of established controllable/conditional generation methods from the diffusion community. This access also enables the computation of likelihood. The code is available at https://github.com/sony/ctm. \ No newline at end of file diff --git a/data/2024/iclr/Consistency-guided Prompt Learning for Vision-Language Models b/data/2024/iclr/Consistency-guided Prompt Learning for Vision-Language Models new file mode 100644 index 0000000000..e550cb384c --- /dev/null +++ b/data/2024/iclr/Consistency-guided Prompt Learning for Vision-Language Models @@ -0,0 +1 @@ +We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models. Our approach improves the generalization of large foundation models when fine-tuned on downstream tasks in a few-shot setting. The basic idea of CoPrompt is to enforce a consistency constraint in the prediction of the trainable and pre-trained models to prevent overfitting on the downstream task. Additionally, we introduce the following two components into our consistency constraint to further boost the performance: enforcing consistency on two perturbed inputs and combining two dominant paradigms of tuning, prompting and adapter. Enforcing consistency on perturbed input serves to further regularize the consistency constraint, thereby improving generalization. Moreover, the integration of adapters and prompts not only enhances performance on downstream tasks but also offers increased tuning flexibility in both input and output spaces. This facilitates more effective adaptation to downstream tasks in a few-shot learning setting. Experiments show that CoPrompt outperforms existing methods on a range of evaluation suites, including base-to-novel generalization, domain generalization, and cross-dataset evaluation. On generalization, CoPrompt improves the state-of-the-art on zero-shot tasks and the overall harmonic mean over 11 datasets. Detailed ablation studies show the effectiveness of each of the components in CoPrompt. We make our code available at https://github.com/ShuvenduRoy/CoPrompt. \ No newline at end of file diff --git a/data/2024/iclr/Consistent Multi-Class Classification from Multiple Unlabeled Datasets b/data/2024/iclr/Consistent Multi-Class Classification from Multiple Unlabeled Datasets new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Consistent Video-to-Video Transfer Using Synthetic Dataset b/data/2024/iclr/Consistent Video-to-Video Transfer Using Synthetic Dataset new file mode 100644 index 0000000000..030cf9bcea --- /dev/null +++ b/data/2024/iclr/Consistent Video-to-Video Transfer Using Synthetic Dataset @@ -0,0 +1 @@ +We introduce a novel and efficient approach for text-based video-to-video editing that eliminates the need for resource-intensive per-video-per-model finetuning. At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks. 
Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain. Extending the Prompt-to-Prompt to videos, we efficiently generate paired samples, each with an input video and its edited counterpart. Alongside this, we introduce the Long Video Sampling Correction during sampling, ensuring consistent long videos across batches. Our method surpasses current methods like Tune-A-Video, heralding substantial progress in text-based video-to-video editing and suggesting exciting avenues for further exploration and deployment. \ No newline at end of file diff --git a/data/2024/iclr/Consistent algorithms for multi-label classification with macro-at-k metrics b/data/2024/iclr/Consistent algorithms for multi-label classification with macro-at-k metrics new file mode 100644 index 0000000000..654d633dfd --- /dev/null +++ b/data/2024/iclr/Consistent algorithms for multi-label classification with macro-at-k metrics @@ -0,0 +1 @@ +We consider the optimization of complex performance metrics in multi-label classification under the population utility framework. We mainly focus on metrics linearly decomposable into a sum of binary classification utilities applied separately to each label with an additional requirement of exactly $k$ labels predicted for each instance. These"macro-at-$k$"metrics possess desired properties for extreme classification problems with long tail labels. Unfortunately, the at-$k$ constraint couples the otherwise independent binary classification tasks, leading to a much more challenging optimization problem than standard macro-averages. We provide a statistical framework to study this problem, prove the existence and the form of the optimal classifier, and propose a statistically consistent and practical learning algorithm based on the Frank-Wolfe method. Interestingly, our main results concern even more general metrics being non-linear functions of label-wise confusion matrices. Empirical results provide evidence for the competitive performance of the proposed approach. \ No newline at end of file diff --git "a/data/2024/iclr/Consistent4D: Consistent 360\302\260 Dynamic Object Generation from Monocular Video" "b/data/2024/iclr/Consistent4D: Consistent 360\302\260 Dynamic Object Generation from Monocular Video" new file mode 100644 index 0000000000..005f1f4010 --- /dev/null +++ "b/data/2024/iclr/Consistent4D: Consistent 360\302\260 Dynamic Object Generation from Monocular Video" @@ -0,0 +1 @@ +In this paper, we present Consistent4D, a novel approach for generating 4D dynamic objects from uncalibrated monocular videos. Uniquely, we cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration. This is achieved by leveraging the object-level 3D-aware image diffusion model as the primary supervision signal for training Dynamic Neural Radiance Fields (DyNeRF). Specifically, we propose a Cascade DyNeRF to facilitate stable convergence and temporal continuity under the supervision signal which is discrete along the time axis. To achieve spatial and temporal consistency, we further introduce an Interpolation-driven Consistency Loss. It is optimized by minimizing the discrepancy between rendered frames from DyNeRF and interpolated frames from a pre-trained video interpolation model. 
Extensive experiments show that our Consistent4D can perform competitively with prior art alternatives, opening up new possibilities for 4D dynamic object generation from monocular videos, whilst also demonstrating an advantage on conventional text-to-3D generation tasks. Our project page is https://consistent4d.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/Constrained Bi-Level Optimization: Proximal Lagrangian Value Function Approach and Hessian-free Algorithm b/data/2024/iclr/Constrained Bi-Level Optimization: Proximal Lagrangian Value Function Approach and Hessian-free Algorithm new file mode 100644 index 0000000000..edeceacdfe --- /dev/null +++ b/data/2024/iclr/Constrained Bi-Level Optimization: Proximal Lagrangian Value Function Approach and Hessian-free Algorithm @@ -0,0 +1 @@ +This paper presents a new approach and algorithm for solving a class of constrained Bi-Level Optimization (BLO) problems in which the lower-level problem involves constraints coupling both upper-level and lower-level variables. Such problems have recently gained significant attention due to their broad applicability in machine learning. However, conventional gradient-based methods unavoidably rely on computationally intensive calculations related to the Hessian matrix. To address this challenge, we begin by devising a smooth proximal Lagrangian value function to handle the constrained lower-level problem. Utilizing this construct, we introduce a single-level reformulation for constrained BLOs that transforms the original BLO problem into an equivalent optimization problem with smooth constraints. Enabled by this reformulation, we develop a Hessian-free gradient-based algorithm, termed proximal Lagrangian Value function-based Hessian-free Bi-level Algorithm (LV-HBA), that is straightforward to implement in a single-loop manner. Consequently, LV-HBA is especially well-suited for machine learning applications. Furthermore, we offer non-asymptotic convergence analysis for LV-HBA, eliminating the need for traditional strong convexity assumptions for the lower-level problem while also being capable of accommodating non-singleton scenarios. Empirical results substantiate the algorithm's superior practical performance. \ No newline at end of file diff --git a/data/2024/iclr/Constrained Decoding for Cross-lingual Label Projection b/data/2024/iclr/Constrained Decoding for Cross-lingual Label Projection new file mode 100644 index 0000000000..7bfb85b31b --- /dev/null +++ b/data/2024/iclr/Constrained Decoding for Cross-lingual Label Projection @@ -0,0 +1 @@ +Zero-shot cross-lingual transfer utilizing multilingual LLMs has become a popular learning paradigm for low-resource languages with no labeled training data. However, for NLP tasks that involve fine-grained predictions on words and phrases, the performance of zero-shot cross-lingual transfer learning lags far behind supervised fine-tuning methods. Therefore, it is common to exploit translation and label projection to further improve the performance by (1) translating training data that is available in a high-resource language (e.g., English) together with the gold labels into low-resource languages, and/or (2) translating test data in low-resource languages to a high-resource language to run inference on, then projecting the predicted span-level labels back onto the original test data. 
However, state-of-the-art marker-based label projection methods suffer from translation quality degradation due to the extra label markers injected in the input to the translation model. In this work, we explore a new direction that leverages constrained decoding for label projection to overcome the aforementioned issues. Our new method not only can preserve the quality of translated texts but also has the versatility of being applicable to both translating training and translating test data strategies. This versatility is crucial as our experiments reveal that translating test data can lead to a considerable boost in performance compared to translating only training data. We evaluate on two cross-lingual transfer tasks, namely Named Entity Recognition and Event Argument Extraction, spanning 20 languages. The results demonstrate that our approach outperforms the state-of-the-art marker-based method by a large margin and also shows better performance than other label projection methods that rely on external word alignment. \ No newline at end of file diff --git a/data/2024/iclr/Constraint-Free Structure Learning with Smooth Acyclic Orientations b/data/2024/iclr/Constraint-Free Structure Learning with Smooth Acyclic Orientations new file mode 100644 index 0000000000..2b47ed27c1 --- /dev/null +++ b/data/2024/iclr/Constraint-Free Structure Learning with Smooth Acyclic Orientations @@ -0,0 +1 @@ +The structure learning problem consists of fitting data generated by a Directed Acyclic Graph (DAG) to correctly reconstruct its arcs. In this context, differentiable approaches constrain or regularize the optimization problem using a continuous relaxation of the acyclicity property. The computational cost of evaluating graph acyclicity is cubic on the number of nodes and significantly affects scalability. In this paper we introduce COSMO, a constraint-free continuous optimization scheme for acyclic structure learning. At the core of our method, we define a differentiable approximation of an orientation matrix parameterized by a single priority vector. Differently from previous work, our parameterization fits a smooth orientation matrix and the resulting acyclic adjacency matrix without evaluating acyclicity at any step. Despite the absence of explicit constraints, we prove that COSMO always converges to an acyclic solution. In addition to being asymptotically faster, our empirical analysis highlights how COSMO performance on graph reconstruction compares favorably with competing structure learning methods. \ No newline at end of file diff --git a/data/2024/iclr/Constructing Adversarial Examples for Vertical Federated Learning: Optimal Client Corruption through Multi-Armed Bandit b/data/2024/iclr/Constructing Adversarial Examples for Vertical Federated Learning: Optimal Client Corruption through Multi-Armed Bandit new file mode 100644 index 0000000000..9a7f31bf82 --- /dev/null +++ b/data/2024/iclr/Constructing Adversarial Examples for Vertical Federated Learning: Optimal Client Corruption through Multi-Armed Bandit @@ -0,0 +1 @@ +Vertical federated learning (VFL), where each participating client holds a subset of data features, has found numerous applications in finance, healthcare, and IoT systems. However, adversarial attacks, particularly through the injection of adversarial examples (AEs), pose serious challenges to the security of VFL models. 
In this paper, we investigate such vulnerabilities through developing a novel attack to disrupt the VFL inference process, under a practical scenario where the adversary is able to adaptively corrupt a subset of clients. We formulate the problem of finding optimal attack strategies as an online optimization problem, which is decomposed into an inner problem of adversarial example generation (AEG) and an outer problem of corruption pattern selection (CPS). Specifically, we establish the equivalence between the formulated CPS problem and a multi-armed bandit (MAB) problem, and propose the Thompson sampling with Empirical maximum reward (E-TS) algorithm for the adversary to efficiently identify the optimal subset of clients for corruption. The key idea of E-TS is to introduce an estimation of the expected maximum reward for each arm, which helps to specify a small set of competitive arms, on which the exploration for the optimal arm is performed. This significantly reduces the exploration space, which otherwise can quickly become prohibitively large as the number of clients increases. We analytically characterize the regret bound of E-TS, and empirically demonstrate its capability of efficiently revealing the optimal corruption pattern with the highest attack success rate, under various datasets of popular VFL tasks. \ No newline at end of file diff --git a/data/2024/iclr/Context is Environment b/data/2024/iclr/Context is Environment new file mode 100644 index 0000000000..75e21eed9a --- /dev/null +++ b/data/2024/iclr/Context is Environment @@ -0,0 +1 @@ +Two lines of work are taking the central stage in AI research. On the one hand, the community is making increasing efforts to build models that discard spurious correlations and generalize better in novel test environments. Unfortunately, the bitter lesson so far is that no proposal convincingly outperforms a simple empirical risk minimization baseline. On the other hand, large language models (LLMs) have erupted as algorithms able to learn in-context, generalizing on-the-fly to eclectic contextual circumstances that users enforce by means of prompting. In this paper, we argue that context is environment, and posit that in-context learning holds the key to better domain generalization. Via extensive theory and experiments, we show that paying attention to context$\unicode{x2013}\unicode{x2013}$unlabeled examples as they arrive$\unicode{x2013}\unicode{x2013}$allows our proposed In-Context Risk Minimization (ICRM) algorithm to zoom-in on the test environment risk minimizer, leading to significant out-of-distribution performance improvements. From all of this, two messages are worth taking home. Researchers in domain generalization should consider environment as context, and harness the adaptive power of in-context learning. Researchers in LLMs should consider context as environment, to better structure data towards generalization. \ No newline at end of file diff --git a/data/2024/iclr/Context-Aware Meta-Learning b/data/2024/iclr/Context-Aware Meta-Learning new file mode 100644 index 0000000000..5926361f72 --- /dev/null +++ b/data/2024/iclr/Context-Aware Meta-Learning @@ -0,0 +1 @@ +Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. 
In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts visual meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks. Our code is available at https://github.com/cfifty/CAML. \ No newline at end of file diff --git a/data/2024/iclr/ContextRef: Evaluating Referenceless Metrics for Image Description Generation b/data/2024/iclr/ContextRef: Evaluating Referenceless Metrics for Image Description Generation new file mode 100644 index 0000000000..efa958732d --- /dev/null +++ b/data/2024/iclr/ContextRef: Evaluating Referenceless Metrics for Image Description Generation @@ -0,0 +1 @@ +Referenceless metrics (e.g., CLIPScore) use pretrained vision--language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark though, in large part due to the challenge of context dependence. \ No newline at end of file diff --git a/data/2024/iclr/Contextual Bandits with Online Neural Regression b/data/2024/iclr/Contextual Bandits with Online Neural Regression new file mode 100644 index 0000000000..97501049ef --- /dev/null +++ b/data/2024/iclr/Contextual Bandits with Online Neural Regression @@ -0,0 +1 @@ +Recent works have shown a reduction from contextual bandits to online regression under a realizability assumption [Foster and Rakhlin, 2020, Foster and Krishnamurthy, 2021]. In this work, we investigate the use of neural networks for such online regression and associated Neural Contextual Bandits (NeuCBs). Using existing results for wide networks, one can readily show a ${\mathcal{O}}(\sqrt{T})$ regret for online regression with square loss, which via the reduction implies a ${\mathcal{O}}(\sqrt{K} T^{3/4})$ regret for NeuCBs. Departing from this standard approach, we first show a $\mathcal{O}(\log T)$ regret for online regression with almost convex losses that satisfy QG (Quadratic Growth) condition, a generalization of the PL (Polyak-\L ojasiewicz) condition, and that have a unique minima. Although not directly applicable to wide networks since they do not have unique minima, we show that adding a suitable small random perturbation to the network predictions surprisingly makes the loss satisfy QG with unique minima. 
Based on such a perturbed prediction, we show a ${\mathcal{O}}(\log T)$ regret for online regression with both squared loss and KL loss, and subsequently convert these respectively to $\tilde{\mathcal{O}}(\sqrt{KT})$ and $\tilde{\mathcal{O}}(\sqrt{KL^*} + K)$ regret for NeuCB, where $L^*$ is the loss of the best policy. Separately, we also show that existing regret bounds for NeuCBs are $\Omega(T)$ or assume i.i.d. contexts, unlike this work. Finally, our experimental results on various datasets demonstrate that our algorithms, especially the one based on KL loss, persistently outperform existing algorithms. \ No newline at end of file diff --git a/data/2024/iclr/Continual Learning in the Presence of Spurious Correlations: Analyses and a Simple Baseline b/data/2024/iclr/Continual Learning in the Presence of Spurious Correlations: Analyses and a Simple Baseline new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation b/data/2024/iclr/Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation new file mode 100644 index 0000000000..18d0b542c0 --- /dev/null +++ b/data/2024/iclr/Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation @@ -0,0 +1 @@ +We propose and study a realistic Continual Learning (CL) setting where learning algorithms are granted a restricted computational budget per time step while training. We apply this setting to large-scale semi-supervised Continual Learning scenarios with sparse label rates. Previous proficient CL methods perform very poorly in this challenging setting. Overfitting to the sparse labeled data and insufficient computational budget are the two main culprits for such a poor performance. Our new setting encourages learning methods to effectively and efficiently utilize the unlabeled data during training. To that end, we propose a simple but highly effective baseline, DietCL, which utilizes both unlabeled and labeled data jointly. DietCL meticulously allocates computational budget for both types of data. We validate our baseline, at scale, on several datasets, e.g., CLOC, ImageNet10K, and CGLM, under constraint budget setups. DietCL outperforms, by a large margin, all existing supervised CL algorithms as well as more recent continual semi-supervised methods. Our extensive analysis and ablations demonstrate that DietCL is stable under a full spectrum of label sparsity, computational budget, and various other ablations. \ No newline at end of file diff --git a/data/2024/iclr/Continual Momentum Filtering on Parameter Space for Online Test-time Adaptation b/data/2024/iclr/Continual Momentum Filtering on Parameter Space for Online Test-time Adaptation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Continuous Field Reconstruction from Sparse Observations with Implicit Neural Networks b/data/2024/iclr/Continuous Field Reconstruction from Sparse Observations with Implicit Neural Networks new file mode 100644 index 0000000000..717addea07 --- /dev/null +++ b/data/2024/iclr/Continuous Field Reconstruction from Sparse Observations with Implicit Neural Networks @@ -0,0 +1 @@ +Reliably reconstructing physical fields from sparse sensor data is a challenge that frequently arises in many scientific domains. In practice, the process generating the data often is not understood to sufficient accuracy. 
Therefore, there is a growing interest in using the deep neural network route to address the problem. This work presents a novel approach that learns a continuous representation of the physical field using implicit neural representations (INRs). Specifically, after factorizing spatiotemporal variability into spatial and temporal components using the separation of variables technique, the method learns relevant basis functions from sparsely sampled irregular data points to develop a continuous representation of the data. In experimental evaluations, the proposed model outperforms recent INR methods, offering superior reconstruction quality on simulation data from a state-of-the-art climate model and a second dataset that comprises ultra-high resolution satellite-based sea surface temperature fields. \ No newline at end of file diff --git a/data/2024/iclr/Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach b/data/2024/iclr/Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach new file mode 100644 index 0000000000..ed0b47afc8 --- /dev/null +++ b/data/2024/iclr/Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach @@ -0,0 +1 @@ +Image outpainting aims to generate the content of an input sub-image beyond its original boundaries. It is an important task in content generation yet remains an open problem for generative models. This paper pushes the technical frontier of image outpainting in two directions that have not been resolved in literature: 1) outpainting with arbitrary and continuous multiples (without restriction), and 2) outpainting in a single step (even for large expansion multiples). Moreover, we develop a method that does not depend on a pre-trained backbone network, which is in contrast commonly required by the previous SOTA outpainting methods. The arbitrary multiple outpainting is achieved by utilizing randomly cropped views from the same image during training to capture arbitrary relative positional information. Specifically, by feeding one view and positional embeddings as queries, we can reconstruct another view. At inference, we generate images with arbitrary expansion multiples by inputting an anchor image and its corresponding positional embeddings. The one-step outpainting ability here is particularly noteworthy in contrast to previous methods that need to be performed for $N$ times to obtain a final multiple which is $N$ times of its basic and fixed multiple. We evaluate the proposed approach (called PQDiff as we adopt a diffusion-based generator as our embodiment, under our proposed \textbf{P}ositional \textbf{Q}uery scheme) on public benchmarks, demonstrating its superior performance over state-of-the-art approaches. Specifically, PQDiff achieves state-of-the-art FID scores on the Scenery (\textbf{21.512}), Building Facades (\textbf{25.310}), and WikiArts (\textbf{36.212}) datasets. Furthermore, under the 2.25x, 5x and 11.7x outpainting settings, PQDiff only takes \textbf{40.6\%}, \textbf{20.3\%} and \textbf{10.2\%} of the time of the benchmark state-of-the-art (SOTA) method. 
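The sketch below illustrates, under our own assumptions about the encoding, how two random views and their relative positional query might be formed during training and how an arbitrary expansion multiple could be requested at inference; it is not the released PQDiff code.

```python
import numpy as np

def random_crop_box(rng, img_hw, crop_hw):
    """Sample a crop box (top, left, height, width) inside an image."""
    H, W = img_hw
    h, w = crop_hw
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    return (int(top), int(left), h, w)

def relative_position_query(anchor_box, target_box, img_hw):
    """Encode where the target view sits relative to the anchor view (illustrative).

    Returns normalized offsets and scale ratios; in PQDiff-like training this
    vector (turned into positional embeddings) conditions the generator so that,
    at inference, an arbitrary target box yields the outpainted content in one step.
    """
    H, W = img_hw
    at, al, ah, aw = anchor_box
    tt, tl, th, tw = target_box
    return np.array([
        (tt - at) / H,   # vertical offset of target w.r.t. anchor
        (tl - al) / W,   # horizontal offset
        th / ah,         # vertical expansion multiple
        tw / aw,         # horizontal expansion multiple
    ], dtype=np.float32)

rng = np.random.default_rng(0)
img_hw = (512, 512)
anchor = random_crop_box(rng, img_hw, (128, 128))   # the view the model sees
target = random_crop_box(rng, img_hw, (256, 256))   # the view it must reconstruct
q = relative_position_query(anchor, target, img_hw)
# At test time, setting the scale ratios to e.g. 2.25, 5, or 11.7 requests that
# outpainting multiple directly, without iterating the model N times.
```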
\ No newline at end of file diff --git a/data/2024/iclr/Contrastive Difference Predictive Coding b/data/2024/iclr/Contrastive Difference Predictive Coding new file mode 100644 index 0000000000..891051fef5 --- /dev/null +++ b/data/2024/iclr/Contrastive Difference Predictive Coding @@ -0,0 +1 @@ +Predicting and reasoning about the future lie at the heart of many time-series questions. For example, goal-conditioned reinforcement learning can be viewed as learning representations to predict which states are likely to be visited in the future. While prior methods have used contrastive predictive coding to model time series data, learning representations that encode long-term dependencies usually requires large amounts of data. In this paper, we introduce a temporal difference version of contrastive predictive coding that stitches together pieces of different time series data to decrease the amount of data required to learn predictions of future events. We apply this representation learning method to derive an off-policy algorithm for goal-conditioned RL. Experiments demonstrate that, compared with prior RL methods, ours achieves $2 \times$ median improvement in success rates and can better cope with stochastic environments. In tabular settings, we show that our method is about $20 \times$ more sample efficient than the successor representation and $1500 \times$ more sample efficient than the standard (Monte Carlo) version of contrastive predictive coding. \ No newline at end of file diff --git a/data/2024/iclr/Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning b/data/2024/iclr/Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/ControlVideo: Training-free Controllable Text-to-video Generation b/data/2024/iclr/ControlVideo: Training-free Controllable Text-to-video Generation new file mode 100644 index 0000000000..91719e5402 --- /dev/null +++ b/data/2024/iclr/ControlVideo: Training-free Controllable Text-to-video Generation @@ -0,0 +1 @@ +Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterpart still lags behind due to the excessive training cost of temporal modeling. Besides the training burden, the generated videos also suffer from appearance inconsistency and structural flickers, especially in long video synthesis. To address these challenges, we design a \emph{training-free} framework called \textbf{ControlVideo} to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages coarsely structural consistency from input motion sequences, and introduces three modules to improve video generation. Firstly, to ensure appearance coherence between frames, ControlVideo adds fully cross-frame interaction in self-attention modules. Secondly, to mitigate the flicker effect, it introduces an interleaved-frame smoother that employs frame interpolation on alternated frames. Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that separately synthesizes each short clip with holistic coherency. Empowered with these modules, ControlVideo outperforms the state-of-the-arts on extensive motion-prompt pairs quantitatively and qualitatively. Notably, thanks to the efficient designs, it generates both short and long videos within several minutes using one NVIDIA 2080Ti. 
Code is available at https://github.com/YBYBZhang/ControlVideo. \ No newline at end of file diff --git a/data/2024/iclr/Controlled Text Generation via Language Model Arithmetic b/data/2024/iclr/Controlled Text Generation via Language Model Arithmetic new file mode 100644 index 0000000000..188d48766c --- /dev/null +++ b/data/2024/iclr/Controlled Text Generation via Language Model Arithmetic @@ -0,0 +1 @@ +As Large Language Models (LLMs) are deployed more widely, customization with respect to vocabulary, style, and character becomes more important. In this work, we introduce model arithmetic, a novel inference framework for composing and biasing LLMs without the need for model (re)training or highly specific datasets. In addition, the framework allows for more precise control of generated text than direct prompting and prior controlled text generation (CTG) techniques. Using model arithmetic, we can express prior CTG techniques as simple formulas and naturally extend them to new and more effective formulations. Further, we show that speculative sampling, a technique for efficient LLM sampling, extends to our setting. This enables highly efficient text generation with multiple composed models with only marginal overhead over a single model. Our empirical evaluation demonstrates that model arithmetic allows fine-grained control of generated text while outperforming state-of-the-art on the task of toxicity reduction. We release an open source easy-to-use implementation of our framework at https://github.com/eth-sri/language-model-arithmetic. \ No newline at end of file diff --git a/data/2024/iclr/Controlling Vision-Language Models for Multi-Task Image Restoration b/data/2024/iclr/Controlling Vision-Language Models for Multi-Task Image Restoration new file mode 100644 index 0000000000..004e693293 --- /dev/null +++ b/data/2024/iclr/Controlling Vision-Language Models for Multi-Task Image Restoration @@ -0,0 +1 @@ +Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn a high-fidelity image reconstruction. The controller itself will also output a degradation feature that matches the real corruptions of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both \emph{degradation-specific} and \emph{unified} image restoration tasks, showing a promising direction of prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir. 
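A minimal PyTorch-style sketch of such a controller is given below, assuming a frozen image encoder that returns one feature vector per image; the head shapes, names, and the way the content embedding would be consumed by cross-attention are illustrative assumptions, not DA-CLIP's actual implementation.

```python
import torch
import torch.nn as nn

class DegradationAwareController(nn.Module):
    """Sketch of a DA-CLIP-style controller on top of a frozen image encoder.

    `clip_image_encoder` is assumed to map an image batch to (B, d) features and
    is kept frozen; only the controller heads are trained.
    """
    def __init__(self, clip_image_encoder, feat_dim=512, num_degradations=10):
        super().__init__()
        self.encoder = clip_image_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # Two small heads: one predicts a "clean-content" embedding intended to
        # match the embedding of the underlying high-quality image, the other
        # predicts a degradation embedding used to classify the corruption type.
        self.content_head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                                          nn.Linear(feat_dim, feat_dim))
        self.degradation_head = nn.Linear(feat_dim, num_degradations)

    def forward(self, lq_image):
        with torch.no_grad():
            feat = self.encoder(lq_image)      # frozen features of the degraded input
        content = self.content_head(feat)      # to be injected into the restoration net via cross-attention
        degradation_logits = self.degradation_head(feat)
        return content, degradation_logits

# Toy frozen "encoder": a linear layer standing in for a CLIP image tower.
frozen_encoder = nn.Linear(3 * 32 * 32, 512)
controller = DegradationAwareController(frozen_encoder)
lq = torch.rand(4, 3 * 32 * 32)
content_emb, degradation_logits = controller(lq)
```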
\ No newline at end of file diff --git a/data/2024/iclr/Convergence of Bayesian Bilevel Optimization b/data/2024/iclr/Convergence of Bayesian Bilevel Optimization new file mode 100644 index 0000000000..a0e0e9382f --- /dev/null +++ b/data/2024/iclr/Convergence of Bayesian Bilevel Optimization @@ -0,0 +1 @@ +This paper presents the first theoretical guarantee for Bayesian bilevel optimization (BBO) that we term for the prevalent bilevel framework combining Bayesian optimization at the outer level to tune hyperparameters, and the inner-level stochastic gradient descent (SGD) for training the model. We prove sublinear regret bounds suggesting simultaneous convergence of the inner-level model parameters and outer-level hyperparameters to optimal configurations for generalization capability. A pivotal, technical novelty in the proofs is modeling the excess risk of the SGD-trained parameters as evaluation noise during Bayesian optimization. Our theory implies the inner unit horizon, defined as the number of SGD iterations, shapes the convergence behavior of BBO. This suggests practical guidance on configuring the inner unit horizon to enhance training efficiency and model performance. \ No newline at end of file diff --git a/data/2024/iclr/Conversational Drug Editing Using Retrieval and Domain Feedback b/data/2024/iclr/Conversational Drug Editing Using Retrieval and Domain Feedback new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model b/data/2024/iclr/Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model new file mode 100644 index 0000000000..230154b099 --- /dev/null +++ b/data/2024/iclr/Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model @@ -0,0 +1 @@ +The Segment Anything Model (SAM) stands as a foundational framework for image segmentation. While it exhibits remarkable zero-shot generalization in typical scenarios, its advantage diminishes when applied to specialized domains like medical imagery and remote sensing. To address this limitation, this paper introduces Conv-LoRA, a simple yet effective parameter-efficient fine-tuning approach. By integrating ultra-lightweight convolutional parameters into Low-Rank Adaptation (LoRA), Conv-LoRA can inject image-related inductive biases into the plain ViT encoder, further reinforcing SAM's local prior assumption. Notably, Conv-LoRA not only preserves SAM's extensive segmentation knowledge but also revives its capacity of learning high-level image semantics, which is constrained by SAM's foreground-background segmentation pretraining. Comprehensive experimentation across diverse benchmarks spanning multiple domains underscores Conv-LoRA's superiority in adapting SAM to real-world semantic segmentation tasks. \ No newline at end of file diff --git a/data/2024/iclr/Convolutional Deep Kernel Machines b/data/2024/iclr/Convolutional Deep Kernel Machines new file mode 100644 index 0000000000..11f7665127 --- /dev/null +++ b/data/2024/iclr/Convolutional Deep Kernel Machines @@ -0,0 +1 @@ +Standard infinite-width limits of neural networks sacrifice the ability for intermediate layers to learn representations from data. Recent work (A theory of representation learning gives a deep generalisation of kernel methods, Yang et al. 2023) modified the Neural Network Gaussian Process (NNGP) limit of Bayesian neural networks so that representation learning is retained. 
Furthermore, they found that applying this modified limit to a deep Gaussian process gives a practical learning algorithm which they dubbed the deep kernel machine (DKM). However, they only considered the simplest possible setting: regression in small, fully connected networks with e.g. 10 input features. Here, we introduce convolutional deep kernel machines. This required us to develop a novel inter-domain inducing point approximation, as well as introducing and experimentally assessing a number of techniques not previously seen in DKMs, including analogues to batch normalisation, different likelihoods, and different types of top-layer. The resulting model trains in roughly 77 GPU hours, achieving around 99% test accuracy on MNIST, 72% on CIFAR-100, and 92.7% on CIFAR-10, which is SOTA for kernel methods. \ No newline at end of file diff --git a/data/2024/iclr/Coordinate-Aware Modulation for Neural Fields b/data/2024/iclr/Coordinate-Aware Modulation for Neural Fields new file mode 100644 index 0000000000..20032dc91d --- /dev/null +++ b/data/2024/iclr/Coordinate-Aware Modulation for Neural Fields @@ -0,0 +1 @@ +Neural fields, mapping low-dimensional input coordinates to corresponding signals, have shown promising results in representing various signals. Numerous methodologies have been proposed, and techniques employing MLPs and grid representations have achieved substantial success. MLPs offer compactness and high expressibility, yet often suffer from spectral bias and slow convergence speed. On the other hand, methods using grids are free from spectral bias and achieve fast training speed, however, at the expense of high spatial complexity. In this work, we propose a novel way of exploiting both MLPs and grid representations in neural fields. Unlike the prevalent methods that combine them sequentially (extract features from the grids first and feed them to the MLP), we inject spectral bias-free grid representations into the intermediate features in the MLP. More specifically, we suggest a Coordinate-Aware Modulation (CAM), which modulates the intermediate features using scale and shift parameters extracted from the grid representations. This can maintain the strengths of MLPs while mitigating any remaining potential biases, facilitating the rapid learning of high-frequency components. In addition, we empirically found that feature normalizations, which have not been successful in the neural field literature, proved to be effective when applied in conjunction with the proposed CAM. Experimental results demonstrate that CAM enhances the performance of neural representation and improves learning stability across a range of signals. Especially in the novel view synthesis task, we achieved state-of-the-art performance with the least number of parameters and fast training speed for dynamic scenes and the best performance under 1MB memory for static scenes. CAM also outperforms the best-performing video compression methods using neural fields by a large margin. \ No newline at end of file diff --git a/data/2024/iclr/Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion b/data/2024/iclr/Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion new file mode 100644 index 0000000000..977780485d --- /dev/null +++ b/data/2024/iclr/Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion @@ -0,0 +1 @@ +Learning world models can teach an agent how the world works in an unsupervised manner.
Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose Copilot4D, a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer as discrete diffusion and enhance it with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, Copilot4D reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics. \ No newline at end of file diff --git a/data/2024/iclr/Copula Conformal prediction for multi-step time series prediction b/data/2024/iclr/Copula Conformal prediction for multi-step time series prediction new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Correlated Noise Provably Beats Independent Noise for Differentially Private Learning b/data/2024/iclr/Correlated Noise Provably Beats Independent Noise for Differentially Private Learning new file mode 100644 index 0000000000..a969fe12f0 --- /dev/null +++ b/data/2024/iclr/Correlated Noise Provably Beats Independent Noise for Differentially Private Learning @@ -0,0 +1 @@ +Differentially private learning algorithms inject noise into the learning process. While the most common private learning algorithm, DP-SGD, adds independent Gaussian noise in each iteration, recent work on matrix factorization mechanisms has shown empirically that introducing correlations in the noise can greatly improve their utility. We characterize the asymptotic learning utility for any choice of the correlation function, giving precise analytical bounds for linear regression and as the solution to a convex program for general convex functions. We show, using these bounds, how correlated noise provably improves upon vanilla DP-SGD as a function of problem parameters such as the effective dimension and condition number. Moreover, our analytical expression for the near-optimal correlation function circumvents the cubic complexity of the semi-definite program used to optimize the noise correlation matrix in previous work. We validate our theory with experiments on private deep learning. Our work matches or outperforms prior work while being efficient both in terms of compute and memory. \ No newline at end of file diff --git a/data/2024/iclr/Counterfactual Density Estimation using Kernel Stein Discrepancies b/data/2024/iclr/Counterfactual Density Estimation using Kernel Stein Discrepancies new file mode 100644 index 0000000000..c9e60609e0 --- /dev/null +++ b/data/2024/iclr/Counterfactual Density Estimation using Kernel Stein Discrepancies @@ -0,0 +1 @@ +Causal effects are usually studied in terms of the means of counterfactual distributions, which may be insufficient in many scenarios. 
Given a class of densities known up to normalizing constants, we propose to model counterfactual distributions by minimizing kernel Stein discrepancies in a doubly robust manner. This enables the estimation of counterfactuals over large classes of distributions while exploiting the desired double robustness. We present a theoretical analysis of the proposed estimator, providing sufficient conditions for consistency and asymptotic normality, as well as an examination of its empirical performance. \ No newline at end of file diff --git a/data/2024/iclr/Counting Graph Substructures with Graph Neural Networks b/data/2024/iclr/Counting Graph Substructures with Graph Neural Networks new file mode 100644 index 0000000000..7c92af98c2 --- /dev/null +++ b/data/2024/iclr/Counting Graph Substructures with Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) are powerful representation learning tools that have achieved remarkable performance in various downstream tasks. However, there are still open questions regarding their ability to count and list substructures, which play a crucial role in biological and social networks. In this work, we fill this gap and characterize the representation and generalization power of GNNs in terms of their ability to produce powerful representations that count substructures. In particular, we study the message-passing operations of GNNs with random node input in a novel fashion, and show how they can produce equivariant representations that are associated with high-order statistical moments. Using these representations, we prove that GNNs can learn how to count cycles, cliques, quasi-cliques, and the number of connected components in a graph. We also provide new insights into the generalization capacity of GNNs. Our analysis is constructive and enables the design of a generic GNN architecture that shows remarkable performance in four distinct tasks: cycle detection, cycle counting, graph classification, and molecular property prediction. \ No newline at end of file diff --git a/data/2024/iclr/Course Correcting Koopman Representations b/data/2024/iclr/Course Correcting Koopman Representations new file mode 100644 index 0000000000..7b08d6d02c --- /dev/null +++ b/data/2024/iclr/Course Correcting Koopman Representations @@ -0,0 +1 @@ +Koopman representations aim to learn features of nonlinear dynamical systems (NLDS) which lead to linear dynamics in the latent space. Theoretically, such features can be used to simplify many problems in modeling and control of NLDS. In this work we study autoencoder formulations of this problem, and different ways they can be used to model dynamics, specifically for future state prediction over long horizons. We discover several limitations of predicting future states in the latent space and propose an inference-time mechanism, which we refer to as Periodic Reencoding, for faithfully capturing long term dynamics. We justify this method both analytically and empirically via experiments in low and high dimensional NLDS. 
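A minimal sketch of the Periodic Reencoding idea described above may help: predictions are rolled out with the learned linear latent dynamics, and every few steps the latent state is decoded and re-encoded to pull it back toward the encoder's manifold. The module names, sizes, and re-encoding period below are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative sketch of Periodic Reencoding: advance a linear latent (Koopman-style)
# model step by step, and every `period` steps decode to observation space and
# re-encode to correct latent drift. All modules and sizes are assumptions.
import torch
import torch.nn as nn

obs_dim, latent_dim = 8, 16
encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, obs_dim))
koopman = nn.Linear(latent_dim, latent_dim, bias=False)   # learned linear latent dynamics

@torch.no_grad()
def rollout(x0: torch.Tensor, horizon: int, period: int) -> torch.Tensor:
    """Predict future observations from x0; a very large `period` recovers a pure latent rollout."""
    z = encoder(x0)
    preds = []
    for t in range(1, horizon + 1):
        z = koopman(z)                        # one step of latent dynamics
        if t % period == 0:                   # periodic re-encoding step
            z = encoder(decoder(z))
        preds.append(decoder(z))
    return torch.stack(preds, dim=1)          # (batch, horizon, obs_dim)

x0 = torch.randn(4, obs_dim)
print(rollout(x0, horizon=50, period=10).shape)  # torch.Size([4, 50, 8])
```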
\ No newline at end of file diff --git a/data/2024/iclr/CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping b/data/2024/iclr/CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping new file mode 100644 index 0000000000..2da2550f38 --- /dev/null +++ b/data/2024/iclr/CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping @@ -0,0 +1 @@ +Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models are publicly available at https://github.com/tileb1/CrIBo. \ No newline at end of file diff --git a/data/2024/iclr/Critical Learning Periods Emerge Even in Deep Linear Networks b/data/2024/iclr/Critical Learning Periods Emerge Even in Deep Linear Networks new file mode 100644 index 0000000000..05b8f0a09a --- /dev/null +++ b/data/2024/iclr/Critical Learning Periods Emerge Even in Deep Linear Networks @@ -0,0 +1 @@ +Critical learning periods are periods early in development where temporary sensory deficits can have a permanent effect on behavior and learned representations. Despite the radical differences between biological and artificial networks, critical learning periods have been empirically observed in both systems. This suggests that critical periods may be fundamental to learning and not an accident of biology. Yet, why exactly critical periods emerge in deep networks is still an open question, and in particular it is unclear whether the critical periods observed in both systems depend on particular architectural or optimization details. To isolate the key underlying factors, we focus on deep linear network models, and show that, surprisingly, such networks also display much of the behavior seen in biology and artificial networks, while being amenable to analytical treatment. We show that critical periods depend on the depth of the model and structure of the data distribution. We also show analytically and in simulations that the learning of features is tied to competition between sources. Finally, we extend our analysis to multi-task learning to show that pre-training on certain tasks can damage the transfer performance on new tasks, and show how this depends on the relationship between tasks and the duration of the pre-training stage. To the best of our knowledge, our work provides the first analytically tractable model that sheds light into why critical learning periods emerge in biological and artificial networks. 
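As a toy illustration of the kind of deficit experiment discussed above (not the paper's actual protocol), one can train a two-layer deep linear network while one group of input features is zeroed out for an initial window of updates and then restored, and compare the final clean-data loss across deficit durations; the data, corruption scheme, and hyperparameters below are assumptions made purely for illustration.

```python
# Toy illustration (not the paper's experiments): a two-layer deep *linear* network
# trained while one input "source" is corrupted for the first `deficit_steps`
# updates, then restored. Varying `deficit_steps` probes whether early deficits
# leave a lasting effect on the final fit.
import torch

torch.manual_seed(0)
d_in, d_hidden, n, steps = 10, 32, 512, 3000
X = torch.randn(n, d_in)
w_true = torch.randn(d_in, 1)
y = X @ w_true

def train(deficit_steps: int) -> float:
    W1 = torch.randn(d_in, d_hidden, requires_grad=True)
    W2 = torch.randn(d_hidden, 1, requires_grad=True)
    opt = torch.optim.SGD([W1, W2], lr=1e-3)
    for t in range(steps):
        Xt = X.clone()
        if t < deficit_steps:
            Xt[:, : d_in // 2] = 0.0           # "blur" the first source early in training
        loss = ((Xt @ W1 @ W2 - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                       # evaluate on clean inputs
        return ((X @ W1 @ W2 - y) ** 2).mean().item()

for deficit in (0, 500, 2000):
    print(f"deficit={deficit:4d}  final clean loss={train(deficit):.4f}")
```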
\ No newline at end of file diff --git a/data/2024/iclr/Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing b/data/2024/iclr/Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning b/data/2024/iclr/CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning new file mode 100644 index 0000000000..10d7448447 --- /dev/null +++ b/data/2024/iclr/CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning @@ -0,0 +1 @@ +Human motion driven control (HMDC) is an effective approach for generating natural and compelling robot motions while preserving high-level semantics. However, establishing the correspondence between humans and robots with different body structures is not straightforward due to the mismatches in kinematics and dynamics properties, which causes intrinsic ambiguity to the problem. Many previous algorithms approach this motion retargeting problem with unsupervised learning, which requires the prerequisite skill sets. However, it will be extremely costly to learn all the skills without understanding the given human motions, particularly for high-dimensional robots. In this work, we introduce CrossLoco, a guided unsupervised reinforcement learning framework that simultaneously learns robot skills and their correspondence to human motions. Our key innovation is to introduce a cycle-consistency-based reward term designed to maximize the mutual information between human motions and robot states. We demonstrate that the proposed framework can generate compelling robot motions by translating diverse human motions, such as running, hopping, and dancing. We quantitatively compare our CrossLoco against the manually engineered and unsupervised baseline algorithms along with the ablated versions of our framework and demonstrate that our method translates human motions with better accuracy, diversity, and user preference. We also showcase its utility in other applications, such as synthesizing robot movements from language input and enabling interactive robot control. \ No newline at end of file diff --git a/data/2024/iclr/CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity b/data/2024/iclr/CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity new file mode 100644 index 0000000000..38e8ddc25a --- /dev/null +++ b/data/2024/iclr/CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity @@ -0,0 +1 @@ +Sample efficiency is a crucial problem in deep reinforcement learning. Recent algorithms, such as REDQ and DroQ, found a way to improve the sample efficiency by increasing the update-to-data (UTD) ratio to 20 gradient update steps on the critic per environment sample. However, this comes at the expense of a greatly increased computational cost. To reduce this computational burden, we introduce CrossQ: A lightweight algorithm for continuous control tasks that makes careful use of Batch Normalization and removes target networks to surpass the current state-of-the-art in sample efficiency while maintaining a low UTD ratio of 1. Notably, CrossQ does not rely on advanced bias-reduction schemes used in current methods. 
CrossQ's contributions are threefold: (1) it matches or surpasses current state-of-the-art methods in terms of sample efficiency, (2) it substantially reduces the computational cost compared to REDQ and DroQ, and (3) it is easy to implement, requiring just a few lines of code on top of SAC. \ No newline at end of file diff --git a/data/2024/iclr/Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding b/data/2024/iclr/Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding new file mode 100644 index 0000000000..f126c46b7c --- /dev/null +++ b/data/2024/iclr/Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding @@ -0,0 +1 @@ +Predicting physical properties of materials from their crystal structures is a fundamental problem in materials science. In peripheral areas such as the prediction of molecular properties, fully connected attention networks have been shown to be successful. However, unlike these finite atom arrangements, crystal structures are infinitely repeating, periodic arrangements of atoms, whose fully connected attention results in infinitely connected attention. In this work, we show that this infinitely connected attention can lead to a computationally tractable formulation, interpreted as neural potential summation, that performs infinite interatomic potential summations in a deeply learned feature space. We then propose a simple yet effective Transformer-based encoder architecture for crystal structures called Crystalformer. Compared to an existing Transformer-based model, the proposed model requires only 29.4% of the number of parameters, with minimal modifications to the original Transformer architecture. Despite the architectural simplicity, the proposed method outperforms state-of-the-art methods for various property regression tasks on the Materials Project and JARVIS-DFT datasets. \ No newline at end of file diff --git a/data/2024/iclr/Curiosity-driven Red-teaming for Large Language Models b/data/2024/iclr/Curiosity-driven Red-teaming for Large Language Models new file mode 100644 index 0000000000..72cd4e3a47 --- /dev/null +++ b/data/2024/iclr/Curiosity-driven Red-teaming for Large Language Models @@ -0,0 +1 @@ +Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a \textit{red team} of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods are only able to generate a small number of effective test cases, resulting in low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration that optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods.
Our method, CRT, successfully provokes toxic responses from a LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at \url{https://github.com/Improbable-AI/curiosity_redteam} \ No newline at end of file diff --git a/data/2024/iclr/Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning b/data/2024/iclr/Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning new file mode 100644 index 0000000000..6963c36e13 --- /dev/null +++ b/data/2024/iclr/Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning @@ -0,0 +1 @@ +Modular and composable transfer learning is an emerging direction in the field of Parameter Efficient Fine-Tuning, as it enables neural networks to better organize various aspects of knowledge, leading to improved cross-task generalization. In this paper, we introduce a novel approach, Customized Polytropon (C-Poly), that combines task-common skills and task-specific skills, with the skill parameters being highly parameterized using low-rank techniques. Each task is associated with a customizable number of exclusive specialized skills and also benefits from skills shared with peer tasks. A skill assignment matrix is jointly learned. To evaluate our approach, we conducted extensive experiments on the Super-NaturalInstructions and the SuperGLUE benchmarks. Our findings demonstrate that C-Poly outperforms fully-shared, task-specific, and skill-indistinguishable baselines, significantly enhancing the sample efficiency in multi-task learning scenarios. \ No newline at end of file diff --git a/data/2024/iclr/Cycle Consistency Driven Object Discovery b/data/2024/iclr/Cycle Consistency Driven Object Discovery new file mode 100644 index 0000000000..5296dee4ac --- /dev/null +++ b/data/2024/iclr/Cycle Consistency Driven Object Discovery @@ -0,0 +1 @@ +Developing deep learning models that effectively learn object-centric representations, akin to human cognition, remains a challenging task. Existing approaches facilitate object discovery by representing objects as fixed-size vectors, called ``slots'' or ``object files''. While these approaches have shown promise in certain scenarios, they still exhibit certain limitations. First, they rely on architectural priors which can be unreliable and usually require meticulous engineering to identify the correct objects. Second, there has been a notable gap in investigating the practical utility of these representations in downstream tasks. To address the first limitation, we introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot. We formalize this constraint by introducing consistency objectives which are cyclic in nature. By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance. These enhancements consistently hold true across both synthetic and real-world scenes, underscoring the effectiveness and adaptability of the proposed approach. To tackle the second limitation, we apply the learned object-centric representations from the proposed method to two downstream reinforcement learning tasks, demonstrating considerable performance enhancements compared to conventional slot-based and monolithic representation learning methods.
Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks. \ No newline at end of file diff --git a/data/2024/iclr/D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning b/data/2024/iclr/D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning new file mode 100644 index 0000000000..3ed9fc90cc --- /dev/null +++ b/data/2024/iclr/D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning @@ -0,0 +1 @@ +Analytical theories suggest that higher-quality data can lead to lower test errors in models trained on a fixed data budget. Moreover, a model can be trained on a lower compute budget without compromising performance if a dataset can be stripped of its redundancies. Coreset selection (or data pruning) seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as coreset. There are two dominant approaches: (1) geometry-based data selection for maximizing data diversity in the coreset, and (2) functions that assign difficulty scores to samples based on training dynamics. Optimizing for data diversity leads to a coreset that is biased towards easier samples, whereas, selection by difficulty ranking omits easy samples that are necessary for the training of deep learning models. This demonstrates that data diversity and importance scores are two complementary factors that need to be jointly considered during coreset selection. We represent a dataset as an undirected graph and propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection. D2 Pruning updates the difficulty scores of each example by incorporating the difficulty of its neighboring examples in the dataset graph. Then, these updated difficulty scores direct a graph-based sampling method to select a coreset that encapsulates both diverse and difficult regions of the dataset space. We evaluate supervised and self-supervised versions of our method on various vision and language datasets. Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates. Additionally, we find that using D2 Pruning for filtering large multimodal datasets leads to increased diversity in the dataset and improved generalization of pretrained models. \ No newline at end of file diff --git a/data/2024/iclr/DAFA: Distance-Aware Fair Adversarial Training b/data/2024/iclr/DAFA: Distance-Aware Fair Adversarial Training new file mode 100644 index 0000000000..da34efdde7 --- /dev/null +++ b/data/2024/iclr/DAFA: Distance-Aware Fair Adversarial Training @@ -0,0 +1 @@ +The disparity in accuracy between classes in standard training is amplified during adversarial training, a phenomenon termed the robust fairness problem. Existing methodologies aimed to enhance robust fairness by sacrificing the model's performance on easier classes in order to improve its performance on harder ones. However, we observe that under adversarial attacks, the majority of the model's predictions for samples from the worst class are biased towards classes similar to the worst class, rather than towards the easy classes. Through theoretical and empirical analysis, we demonstrate that robust fairness deteriorates as the distance between classes decreases. 
Motivated by these insights, we introduce the Distance-Aware Fair Adversarial training (DAFA) methodology, which addresses robust fairness by taking into account the similarities between classes. Specifically, our method assigns distinct loss weights and adversarial margins to each class and adjusts them to encourage a trade-off in robustness among similar classes. Experimental results across various datasets demonstrate that our method not only maintains average robust accuracy but also significantly improves the worst robust accuracy, indicating a marked improvement in robust fairness compared to existing methods. \ No newline at end of file diff --git a/data/2024/iclr/DAM: Towards a Foundation Model for Forecasting b/data/2024/iclr/DAM: Towards a Foundation Model for Forecasting new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DATS: Difficulty-Aware Task Sampler for Meta-Learning Physics-Informed Neural Networks b/data/2024/iclr/DATS: Difficulty-Aware Task Sampler for Meta-Learning Physics-Informed Neural Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DDMI: Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations b/data/2024/iclr/DDMI: Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations new file mode 100644 index 0000000000..12e220a980 --- /dev/null +++ b/data/2024/iclr/DDMI: Domain-agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations @@ -0,0 +1 @@ +Recent studies have introduced a new class of generative models for synthesizing implicit neural representations (INRs) that capture arbitrary continuous signals in various domains. These models opened the door for domain-agnostic generative models, but they often fail to achieve high-quality generation. We observed that the existing methods generate the weights of neural networks to parameterize INRs and evaluate the network with fixed positional embeddings (PEs). Arguably, this architecture limits the expressive power of generative models and results in low-quality INR generation. To address this limitation, we propose Domain-agnostic Latent Diffusion Model for INRs (DDMI) that generates adaptive positional embeddings instead of neural networks' weights. Specifically, we develop a Discrete-to-continuous space Variational AutoEncoder (D2C-VAE), which seamlessly connects discrete data and the continuous signal functions in the shared latent space. Additionally, we introduce a novel conditioning mechanism for evaluating INRs with the hierarchically decomposed PEs to further enhance expressive power. Extensive experiments across four modalities, e.g., 2D images, 3D shapes, Neural Radiance Fields, and videos, with seven benchmark datasets, demonstrate the versatility of DDMI and its superior performance compared to the existing INR generative models. \ No newline at end of file diff --git a/data/2024/iclr/DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation b/data/2024/iclr/DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation new file mode 100644 index 0000000000..1b2ed0cd74 --- /dev/null +++ b/data/2024/iclr/DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation @@ -0,0 +1 @@ +We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. 
DFormer has two new key innovations: 1) Unlike previous works that encode RGB-D information with an RGB-pretrained backbone, we pretrain the backbone using image-depth pairs from ImageNet-1K, and hence DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB pretrained backbones, an issue that is widespread in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets. Our code is available at: https://github.com/VCIP-RGBD/DFormer. \ No newline at end of file diff --git a/data/2024/iclr/DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models b/data/2024/iclr/DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models new file mode 100644 index 0000000000..69bf192cb7 --- /dev/null +++ b/data/2024/iclr/DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models @@ -0,0 +1 @@ +Recent text-to-image diffusion models have shown surprising performance in generating high-quality images. However, concerns have arisen regarding the unauthorized usage of data during the training process. One example is when a model trainer collects a set of images created by a particular artist and attempts to train a model capable of generating similar images without obtaining permission from the artist. To address this issue, it becomes crucial to detect unauthorized data usage. In this paper, we propose a method for detecting such unauthorized data usage by planting injected memorization into the text-to-image diffusion models trained on the protected dataset. Specifically, we modify the protected image dataset by adding unique content to the images, such as stealthy image wrapping functions that are imperceptible to human vision but can be captured and memorized by diffusion models. By analyzing whether the model has memorization for the injected content (i.e., whether the generated images are processed by the chosen post-processing function), we can detect models that have illegally utilized the unauthorized data. Our experiments conducted on Stable Diffusion and LoRA models demonstrate the effectiveness of the proposed method in detecting unauthorized data usages. \ No newline at end of file diff --git a/data/2024/iclr/DIFFTACTILE: A Physics-based Differentiable Tactile Simulator for Contact-rich Robotic Manipulation b/data/2024/iclr/DIFFTACTILE: A Physics-based Differentiable Tactile Simulator for Contact-rich Robotic Manipulation new file mode 100644 index 0000000000..af761cb1ba --- /dev/null +++ b/data/2024/iclr/DIFFTACTILE: A Physics-based Differentiable Tactile Simulator for Contact-rich Robotic Manipulation @@ -0,0 +1 @@ +We introduce DIFFTACTILE, a physics-based differentiable tactile simulation system designed to enhance robotic manipulation with dense and physically accurate tactile feedback.
In contrast to prior tactile simulators which primarily focus on manipulating rigid bodies and often rely on simplified approximations to model stress and deformations of materials in contact, DIFFTACTILE emphasizes physics-based contact modeling with high fidelity, supporting simulations of diverse contact modes and interactions with objects possessing a wide range of material properties. Our system incorporates several key components, including a Finite Element Method (FEM)-based soft body model for simulating the sensing elastomer, a multi-material simulator for modeling diverse object types (such as elastic, elastoplastic, cables) under manipulation, a penalty-based contact model for handling contact dynamics. The differentiable nature of our system facilitates gradient-based optimization for both 1) refining physical properties in simulation using real-world data, hence narrowing the sim-to-real gap and 2) efficient learning of tactile-assisted grasping and contact-rich manipulation skills. Additionally, we introduce a method to infer the optical response of our tactile sensor to contact using an efficient pixel-based neural module. We anticipate that DIFFTACTILE will serve as a useful platform for studying contact-rich manipulations, leveraging the benefits of dense tactile feedback and differentiable physics. Code and supplementary materials are available at the project website https://difftactile.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/DMBP: Diffusion model-based predictor for robust offline reinforcement learning against state observation perturbations b/data/2024/iclr/DMBP: Diffusion model-based predictor for robust offline reinforcement learning against state observation perturbations new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model b/data/2024/iclr/DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model new file mode 100644 index 0000000000..dadce75001 --- /dev/null +++ b/data/2024/iclr/DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model @@ -0,0 +1 @@ +We propose \textbf{DMV3D}, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view images via NeRF reconstruction and rendering, achieving single-stage 3D generation in $\sim$30s on single A100 GPU. We train \textbf{DMV3D} on large-scale multi-view image datasets of highly diverse objects using only image reconstruction losses, without accessing 3D assets. We demonstrate state-of-the-art results for the single-image reconstruction problem where probabilistic modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results outperforming previous 3D diffusion models. Our project website is at: https://justimyhxu.github.io/projects/dmv3d/ . 
\ No newline at end of file diff --git a/data/2024/iclr/DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text b/data/2024/iclr/DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text new file mode 100644 index 0000000000..b67c7b9cf4 --- /dev/null +++ b/data/2024/iclr/DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text @@ -0,0 +1 @@ +Large language models (LLMs) have notably enhanced the fluency and diversity of machine-generated text. However, this progress also presents a significant challenge in detecting the origin of a given text, and current research on detection methods lags behind the rapid evolution of LLMs. Conventional training-based methods have limitations in flexibility, particularly when adapting to new domains, and they often lack explanatory power. To address this gap, we propose a novel training-free detection strategy called Divergent N-Gram Analysis (DNA-GPT). Given a text, we first truncate it in the middle and then use only the preceding portion as input to the LLMs to regenerate the new remaining parts. By analyzing the differences between the original and new remaining parts through N-gram analysis in the black-box setting or probability divergence in the white-box setting, we unveil significant discrepancies between the distribution of machine-generated text and the distribution of human-written text. We conducted extensive experiments on the most advanced LLMs from OpenAI, including text-davinci-003, GPT-3.5-turbo, and GPT-4, as well as open-source models such as GPT-NeoX-20B and LLaMa-13B. Results show that our zero-shot approach exhibits state-of-the-art performance in distinguishing between human and GPT-generated text on four English and one German dataset, outperforming OpenAI's own classifier, which is trained on millions of texts. Additionally, our methods provide reasonable explanations and evidence to support our claim, which is a unique feature of explainable detection. Our method is also robust under the revised text attack and can additionally solve model sourcing. Codes are available at https://github.com/Xianjun-Yang/DNA-GPT. \ No newline at end of file diff --git a/data/2024/iclr/DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes b/data/2024/iclr/DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DORSal: Diffusion for Object-centric Representations of Scenes et al b/data/2024/iclr/DORSal: Diffusion for Object-centric Representations of Scenes et al new file mode 100644 index 0000000000..ea33d1a989 --- /dev/null +++ b/data/2024/iclr/DORSal: Diffusion for Object-centric Representations of Scenes et al @@ -0,0 +1 @@ +Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing, are now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree.
In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on frozen object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and on the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches. \ No newline at end of file diff --git a/data/2024/iclr/DOS: Diverse Outlier Sampling for Out-of-Distribution Detection b/data/2024/iclr/DOS: Diverse Outlier Sampling for Out-of-Distribution Detection new file mode 100644 index 0000000000..3cc7f9cd20 --- /dev/null +++ b/data/2024/iclr/DOS: Diverse Outlier Sampling for Out-of-Distribution Detection @@ -0,0 +1 @@ +Modern neural networks are known to give overconfident prediction for out-of-distribution inputs when deployed in the open world. It is common practice to leverage a surrogate outlier dataset to regularize the model during training, and recent studies emphasize the role of uncertainty in designing the sampling strategy for outlier dataset. However, the OOD samples selected solely based on predictive uncertainty can be biased towards certain types, which may fail to capture the full outlier distribution. In this work, we empirically show that diversity is critical in sampling outliers for OOD detection performance. Motivated by the observation, we propose a straightforward and novel sampling strategy named DOS (Diverse Outlier Sampling) to select diverse and informative outliers. Specifically, we cluster the normalized features at each iteration, and the most informative outlier from each cluster is selected for model training with absent category loss. With DOS, the sampled outliers efficiently shape a globally compact decision boundary between ID and OOD data. Extensive experiments demonstrate the superiority of DOS, reducing the average FPR95 by up to 25.79% on CIFAR-100 with TI-300K. \ No newline at end of file diff --git a/data/2024/iclr/DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer b/data/2024/iclr/DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer new file mode 100644 index 0000000000..8cc7b9c1b4 --- /dev/null +++ b/data/2024/iclr/DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer @@ -0,0 +1 @@ +Large Language Models (LLMs) have emerged as dominant tools for various tasks, particularly when tailored for a specific target by prompt tuning. Nevertheless, concerns surrounding data privacy present obstacles due to the tuned prompts' dependency on sensitive private information. A practical solution is to host a local LLM and optimize a soft prompt privately using data. Yet, hosting a local model becomes problematic when model ownership is protected. Alternative methods, like sending data to the model's provider for training, intensify these privacy issues facing an untrusted provider. In this paper, we present a novel solution called Differentially-Private Offsite Prompt Tuning (DP-OPT) to address this challenge. Our approach involves tuning a discrete prompt on the client side and then applying it to the desired cloud models. We demonstrate that prompts suggested by LLMs themselves can be transferred without compromising performance significantly. 
To ensure that the prompts do not leak private information, we introduce the first private prompt generation mechanism, by a differentially-private (DP) ensemble of in-context learning with private demonstrations. With DP-OPT, generating privacy-preserving prompts by Vicuna-7b can yield competitive performance compared to non-private in-context learning on GPT3.5 or local private prompt tuning. Codes are available at https://github.com/VITA-Group/DP-OPT . \ No newline at end of file diff --git a/data/2024/iclr/DP-SGD Without Clipping: The Lipschitz Neural Network Way b/data/2024/iclr/DP-SGD Without Clipping: The Lipschitz Neural Network Way new file mode 100644 index 0000000000..af7a06f099 --- /dev/null +++ b/data/2024/iclr/DP-SGD Without Clipping: The Lipschitz Neural Network Way @@ -0,0 +1 @@ +State-of-the-art approaches for training Differentially Private (DP) Deep Neural Networks (DNN) face difficulties to estimate tight bounds on the sensitivity of the network's layers, and instead rely on a process of per-sample gradient clipping. This clipping process not only biases the direction of gradients but also proves costly both in memory consumption and in computation. To provide sensitivity bounds and bypass the drawbacks of the clipping process, we propose to rely on Lipschitz constrained networks. Our theoretical analysis reveals an unexplored link between the Lipschitz constant with respect to their input and the one with respect to their parameters. By bounding the Lipschitz constant of each layer with respect to its parameters, we prove that we can train these networks with privacy guarantees. Our analysis not only allows the computation of the aforementioned sensitivities at scale, but also provides guidance on how to maximize the gradient-to-noise ratio for fixed privacy guarantees. The code has been released as a Python package available at https://github.com/Algue-Rythme/lip-dp \ No newline at end of file diff --git a/data/2024/iclr/DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning b/data/2024/iclr/DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DREAM: Dual Structured Exploration with Mixup for Open-set Graph Domain Adaption b/data/2024/iclr/DREAM: Dual Structured Exploration with Mixup for Open-set Graph Domain Adaption new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness b/data/2024/iclr/DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness new file mode 100644 index 0000000000..0ab8d4f5de --- /dev/null +++ b/data/2024/iclr/DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness @@ -0,0 +1 @@ +Machine Learning (ML) models have been utilized for malware detection for over two decades. Consequently, this ignited an ongoing arms race between malware authors and antivirus systems, compelling researchers to propose defenses for malware-detection models against evasion attacks. However, most if not all existing defenses against evasion attacks suffer from sizable performance degradation and/or can defend against only specific attacks, which makes them less practical in real-world settings. In this work, we develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection. 
Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables. After showing how DRSM is theoretically robust against attacks with contiguous adversarial bytes, we verify its performance and certified robustness experimentally, where we observe only marginal accuracy drops as the cost of robustness. To our knowledge, we are the first to offer certified robustness in the realm of static detection of malware executables. More surprisingly, through evaluating DRSM against 9 empirical attacks of different types, we observe that the proposed defense is empirically robust to some extent against a diverse set of attacks, some of which even fall out of the scope of its original threat model. In addition, we collected 15.5K recent benign raw executables from diverse sources, which will be made public as a dataset called PACE (Publicly Accessible Collection(s) of Executables) to alleviate the scarcity of publicly available benign datasets for studying malware detection and provide future research with more representative data of the time. \ No newline at end of file diff --git a/data/2024/iclr/DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines b/data/2024/iclr/DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation b/data/2024/iclr/DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation new file mode 100644 index 0000000000..e33573f777 --- /dev/null +++ b/data/2024/iclr/DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation @@ -0,0 +1 @@ +Accurate 3D lane estimation is crucial for ensuring safety in autonomous driving. However, prevailing monocular techniques suffer from depth loss and lighting variations, hampering accurate 3D lane detection. In contrast, LiDAR points offer geometric cues and enable precise localization. In this paper, we present DV-3DLane, a novel end-to-end Dual-View multi-modal 3D Lane detection framework that synergizes the strengths of both images and LiDAR points. We propose to learn multi-modal features in dual-view spaces, i.e., perspective view (PV) and bird's-eye-view (BEV), effectively leveraging the modal-specific information. To achieve this, we introduce three designs: 1) A bidirectional feature fusion strategy that integrates multi-modal features into each view space, exploiting their unique strengths. 2) A unified query generation approach that leverages lane-aware knowledge from both PV and BEV spaces to generate queries. 3) A 3D dual-view deformable attention mechanism, which aggregates discriminative features from both PV and BEV spaces into queries for accurate 3D lane detection. Extensive experiments on the public benchmark, OpenLane, demonstrate the efficacy and efficiency of DV-3DLane. It achieves state-of-the-art performance, with a remarkable 11.2 gain in F1 score and a substantial 53.5% reduction in errors. The code is available at \url{https://github.com/JMoonr/dv-3dlane}. 
\ No newline at end of file diff --git a/data/2024/iclr/Data Debugging with Shapley Importance over Machine Learning Pipelines b/data/2024/iclr/Data Debugging with Shapley Importance over Machine Learning Pipelines new file mode 100644 index 0000000000..7ff5d26ece --- /dev/null +++ b/data/2024/iclr/Data Debugging with Shapley Importance over Machine Learning Pipelines @@ -0,0 +1 @@ +When a machine learning (ML) model exhibits poor quality (e.g., poor accuracy or fairness), the problem can often be traced back to errors in the training data. Being able to discover the data examples that are the most likely culprits is a fundamental concern that has received a lot of attention recently. One prominent way to measure "data importance" with respect to model quality is the Shapley value. Unfortunately, existing methods only focus on the ML model in isolation, without considering the broader ML pipeline for data preparation and feature extraction, which appears in the majority of real-world ML code. This presents a major limitation to applying existing methods in practical settings. In this paper, we propose Datascope, a method for efficiently computing Shapley-based data importance over ML pipelines. We introduce several approximations that lead to dramatic improvements in terms of computational speed. Finally, our experimental evaluation demonstrates that our methods are capable of data error discovery that is as effective as existing Monte Carlo baselines, and in some cases even outperform them. We release our code as an open-source data debugging library available at github.com/easeml/datascope. \ No newline at end of file diff --git a/data/2024/iclr/Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality b/data/2024/iclr/Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality new file mode 100644 index 0000000000..7e2a884aa3 --- /dev/null +++ b/data/2024/iclr/Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality @@ -0,0 +1 @@ +Dataset distillation aims to minimize the time and memory needed for training deep networks on large datasets, by creating a small set of synthetic images that has a similar generalization performance to that of the full dataset. However, current dataset distillation techniques fall short, showing a notable performance gap when compared to training on the original data. In this work, we are the first to argue that using just one synthetic subset for distillation will not yield optimal generalization performance. This is because the training dynamics of deep networks drastically change during training. Hence, multiple synthetic subsets are required to capture the training dynamics at different phases of training. To address this issue, we propose Progressive Dataset Distillation (PDD). PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets without requiring additional training time. Our extensive experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%. In addition, our method enables, for the first time, the generation of considerably larger synthetic datasets.
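The multi-stage scheme described above can be sketched as follows, using plain gradient matching as a stand-in for the base distillation objective (the abstract presents PDD as a wrapper around existing distillation methods); the toy data, objective, and hyperparameters are assumptions made only for illustration.

```python
# Schematic sketch of progressive, multi-stage distillation: each stage synthesizes
# a new small subset conditioned on the current model, then the model is trained on
# the cumulative union of all subsets so far. Gradient matching is a simple stand-in
# for the per-stage distillation objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, n_classes, n_real = 20, 4, 2000
X_real = torch.randn(n_real, d)
y_real = X_real[:, :n_classes].argmax(dim=1)              # toy labels

model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n_classes))

def grad_vector(loss, params, create_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

synthetic = []                                            # cumulative list of (X_syn, y_syn) stages
for stage in range(3):                                    # P progressive stages
    params = list(model.parameters())
    g_real = grad_vector(F.cross_entropy(model(X_real), y_real), params)
    # 1) synthesize a new small subset, conditioned on the current model state
    idx = torch.randperm(n_real)[:16]
    X_syn = X_real[idx].clone().requires_grad_(True)
    y_syn = y_real[idx]
    syn_opt = torch.optim.Adam([X_syn], lr=0.05)
    for _ in range(100):
        g_syn = grad_vector(F.cross_entropy(model(X_syn), y_syn), params, create_graph=True)
        match_loss = 1 - F.cosine_similarity(g_real, g_syn, dim=0)
        syn_opt.zero_grad(); match_loss.backward(); syn_opt.step()
    synthetic.append((X_syn.detach(), y_syn))

    # 2) train the model on the cumulative union of all synthetic subsets so far
    X_union = torch.cat([x for x, _ in synthetic]); y_union = torch.cat([y for _, y in synthetic])
    model_opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(200):
        loss = F.cross_entropy(model(X_union), y_union)
        model_opt.zero_grad(); loss.backward(); model_opt.step()

acc = (model(X_real).argmax(1) == y_real).float().mean().item()
print(f"accuracy of a model trained only on the synthetic union: {acc:.2f}")
```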
\ No newline at end of file diff --git a/data/2024/iclr/Data Filtering Networks b/data/2024/iclr/Data Filtering Networks new file mode 100644 index 0000000000..0c00acb999 --- /dev/null +++ b/data/2024/iclr/Data Filtering Networks @@ -0,0 +1 @@ +Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data. \ No newline at end of file diff --git a/data/2024/iclr/DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models b/data/2024/iclr/DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models new file mode 100644 index 0000000000..71c38bc17a --- /dev/null +++ b/data/2024/iclr/DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models @@ -0,0 +1 @@ +Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores. Moreover, it can help to identify which data points are mislabeled. 
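A small sketch of the kind of layer-wise closed-form approximation the DataInf abstract alludes to: the average of damped per-example gradient outer products is inverted by swapping averaging and inversion and applying the Sherman-Morrison identity, so no explicit Hessian is ever formed. This is an illustrative reconstruction rather than the authors' reference implementation, and the damping value and sign convention are assumptions.

```python
# Illustrative, layer-wise influence approximation: swap the order of averaging and
# matrix inversion over per-example gradient outer products, then use the
# Sherman-Morrison identity so only vector operations are needed.
import numpy as np

def approx_ihvp(train_grads: np.ndarray, v: np.ndarray, damping: float = 0.1) -> np.ndarray:
    """train_grads: (n, d) per-example gradients for one layer; v: (d,) query gradient."""
    dots = train_grads @ v                                  # (n,)  g_i . v
    norms = np.sum(train_grads**2, axis=1)                  # (n,)  ||g_i||^2
    # (g_i g_i^T + damping*I)^{-1} v = (v - (g_i.v)/(damping + ||g_i||^2) * g_i) / damping
    per_example = (v[None, :] - (dots / (damping + norms))[:, None] * train_grads) / damping
    return per_example.mean(axis=0)                         # average the per-example closed forms

def influence_scores(train_grads: np.ndarray, val_grad: np.ndarray, damping: float = 0.1) -> np.ndarray:
    """Score each training point's influence on a validation loss (sign conventions vary)."""
    ihvp = approx_ihvp(train_grads, val_grad, damping)
    return -train_grads @ ihvp

rng = np.random.default_rng(0)
G = rng.normal(size=(100, 50))                              # 100 training examples, 50 layer parameters
v = rng.normal(size=50)                                     # gradient of a validation loss
print(influence_scores(G, v)[:5])
```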
\ No newline at end of file diff --git a/data/2024/iclr/Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation b/data/2024/iclr/Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation new file mode 100644 index 0000000000..0dfa0127a1 --- /dev/null +++ b/data/2024/iclr/Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation @@ -0,0 +1 @@ +Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model are consistent with the prompt-based answers. This kind of evaluation is naturally dependent on the quality of the underlying QG and VQA models. We identify and address several reliability challenges in existing QG/A work: (a) QG questions should respect the prompt (avoiding hallucinations, duplications, and omissions) and (b) VQA answers should be consistent (not asserting that there is no motorcycle in an image while also claiming the motorcycle is blue). We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics, which is adaptable to any QG/A frameworks. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions. \ No newline at end of file diff --git a/data/2024/iclr/De novo Protein Design Using Geometric Vector Field Networks b/data/2024/iclr/De novo Protein Design Using Geometric Vector Field Networks new file mode 100644 index 0000000000..b6c832647e --- /dev/null +++ b/data/2024/iclr/De novo Protein Design Using Geometric Vector Field Networks @@ -0,0 +1 @@ +Innovations like protein diffusion have enabled significant progress in de novo protein design, which is a vital topic in life science. These methods typically depend on protein structure encoders to model residue backbone frames, where atoms do not exist. Most prior encoders rely on atom-wise features, such as angles and distances between atoms, which are not available in this context. Thus far, only several simple encoders, such as IPA, have been proposed for this scenario, exposing the frame modeling as a bottleneck. In this work, we proffer the Vector Field Network (VFN), which enables network layers to perform learnable vector computations between coordinates of frame-anchored virtual atoms, thus achieving a higher capability for modeling frames. The vector computation operates in a manner similar to a linear layer, with each input channel receiving 3D virtual atom coordinates instead of scalar values. 
The multiple feature vectors output by the vector computation are then used to update the residue representations and virtual atom coordinates via attention aggregation. Remarkably, VFN also excels in modeling both frames and atoms, as the real atoms can be treated as the virtual atoms for modeling, positioning VFN as a potential universal encoder. In protein diffusion (frame modeling), VFN exhibits an impressive performance advantage over IPA, excelling in terms of both designability (67.04% vs. 53.58%) and diversity (66.54% vs. 51.98%). In inverse folding (frame and atom modeling), VFN outperforms the previous SoTA model, PiFold (54.7% vs. 51.66%), on sequence recovery rate. We also propose a method of equipping VFN with the ESM model, which surpasses the previous ESM-based SoTA, LM-Design, by a substantial margin (62.67% vs. 55.65%). \ No newline at end of file diff --git a/data/2024/iclr/DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning b/data/2024/iclr/DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning new file mode 100644 index 0000000000..d7f9ae2c70 --- /dev/null +++ b/data/2024/iclr/DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning @@ -0,0 +1 @@ +Prompt tuning (PT), where a small number of trainable soft (continuous) prompt vectors is affixed to the input of language models (LM), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters and does not drastically scale up its parameters as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly impacts training and inference time and memory usage due to the Transformer's quadratic complexity. This is particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are then optimised with two different learning rates. This allows DePT to achieve better performance while saving substantial memory and time costs compared to vanilla PT and its variants, without changing trainable parameter sizes. Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches, including the full fine-tuning baseline, in some scenarios. Additionally, we empirically show that DePT grows more efficient as the model size increases. Our further study reveals that DePT integrates seamlessly with parameter-efficient transfer learning in the few-shot learning setting and highlights its adaptability to various model architectures and sizes. \ No newline at end of file diff --git a/data/2024/iclr/Debiased Collaborative Filtering with Kernel-Based Causal Balancing b/data/2024/iclr/Debiased Collaborative Filtering with Kernel-Based Causal Balancing new file mode 100644 index 0000000000..178cf54298 --- /dev/null +++ b/data/2024/iclr/Debiased Collaborative Filtering with Kernel-Based Causal Balancing @@ -0,0 +1 @@ +Debiased collaborative filtering aims to learn an unbiased prediction model by removing different biases in observational datasets.
To solve this problem, one simple and effective method is based on the propensity score, which adjusts the observational sample distribution to the target one by reweighting observed instances. Ideally, propensity scores should be learned with causal balancing constraints. However, existing methods usually ignore such constraints or implement them with unreasonable approximations, which may affect the accuracy of the learned propensity scores. To bridge this gap, in this paper, we first analyze the gaps between the causal balancing requirements and existing methods such as learning the propensity with cross-entropy loss or manually selecting functions to balance. Inspired by these gaps, we propose to approximate the balancing functions in reproducing kernel Hilbert space and demonstrate that, based on the universal property and representer theorem of kernel functions, the causal balancing constraints can be better satisfied. Meanwhile, we propose an algorithm that adaptively balances the kernel function and theoretically analyze the generalization error bound of our methods. We conduct extensive experiments to demonstrate the effectiveness of our methods, and to promote this research direction, we have released our project at https://github.com/haoxuanli-pku/ICLR24-Kernel-Balancing. \ No newline at end of file diff --git a/data/2024/iclr/Debiasing Algorithm through Model Adaptation b/data/2024/iclr/Debiasing Algorithm through Model Adaptation new file mode 100644 index 0000000000..9a261e2147 --- /dev/null +++ b/data/2024/iclr/Debiasing Algorithm through Model Adaptation @@ -0,0 +1 @@ +Large language models are becoming the go-to solution for an ever-growing number of tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are the most prone to conveying bias. Based on the analysis results, we intervene in the model by applying a linear projection to the weight matrices of these layers. Our titular method, DAMA, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. We release code for our method and models, which retain LLaMA's state-of-the-art performance while being significantly less biased. \ No newline at end of file diff --git a/data/2024/iclr/Debiasing Attention Mechanism in Transformer without Demographics b/data/2024/iclr/Debiasing Attention Mechanism in Transformer without Demographics new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Deceptive Fairness Attacks on Graphs via Meta Learning b/data/2024/iclr/Deceptive Fairness Attacks on Graphs via Meta Learning new file mode 100644 index 0000000000..d6d155dc87 --- /dev/null +++ b/data/2024/iclr/Deceptive Fairness Attacks on Graphs via Meta Learning @@ -0,0 +1 @@ +We study deceptive fairness attacks on graphs to answer the following question: How can we achieve poisoning attacks on a graph learning model to exacerbate the bias deceptively? We answer this question via a bi-level optimization problem and propose a meta learning-based framework named FATE. FATE is broadly applicable with respect to various fairness definitions and graph learning models, as well as arbitrary choices of manipulation operations.
We further instantiate FATE to attack statistical parity and individual fairness on graph neural networks. We conduct extensive experimental evaluations on real-world datasets in the task of semi-supervised node classification. The experimental results demonstrate that FATE could amplify the bias of graph neural networks with or without fairness consideration while maintaining the utility on the downstream task. We hope this paper provides insights into the adversarial robustness of fair graph learning and can shed light on designing robust and fair graph learning in future studies. \ No newline at end of file diff --git a/data/2024/iclr/Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making b/data/2024/iclr/Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making new file mode 100644 index 0000000000..bfa188ed23 --- /dev/null +++ b/data/2024/iclr/Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making @@ -0,0 +1 @@ +The recent success of Transformer in natural language processing has sparked its use in various domains. In offline reinforcement learning (RL), Decision Transformer (DT) is emerging as a promising model based on Transformer. However, we discovered that the attention module of DT is not appropriate to capture the inherent local dependence pattern in trajectories of RL modeled as a Markov decision process. To overcome the limitations of DT, we propose a novel action sequence predictor, named Decision ConvFormer (DC), based on the architecture of MetaFormer, which is a general structure to process multiple entities in parallel and understand the interrelationship among the multiple entities. DC employs local convolution filtering as the token mixer and can effectively capture the inherent local associations of the RL dataset. In extensive experiments, DC achieved state-of-the-art performance across various standard RL benchmarks while requiring fewer resources. Furthermore, we show that DC better understands the underlying meaning in data and exhibits enhanced generalization capability. \ No newline at end of file diff --git a/data/2024/iclr/Decodable and Sample Invariant Continuous Object Encoder b/data/2024/iclr/Decodable and Sample Invariant Continuous Object Encoder new file mode 100644 index 0000000000..0aa07166bc --- /dev/null +++ b/data/2024/iclr/Decodable and Sample Invariant Continuous Object Encoder @@ -0,0 +1 @@ +We propose Hyper-Dimensional Function Encoding (HDFE). Given samples of a continuous object (e.g. a function), HDFE produces an explicit vector representation of the given object, invariant to the sample distribution and density. Sample distribution and density invariance enables HDFE to consistently encode continuous objects regardless of their sampling, and therefore allows neural networks to receive continuous objects as inputs for machine learning tasks, such as classification and regression. Besides, HDFE does not require any training and is proved to map the object into an organized embedding space, which facilitates the training of the downstream tasks. In addition, the encoding is decodable, which enables neural networks to regress continuous objects by regressing their encodings. Therefore, HDFE serves as an interface for processing continuous objects. We apply HDFE to function-to-function mapping, where vanilla HDFE achieves competitive performance as the state-of-the-art algorithm. 
We apply HDFE to point cloud surface normal estimation, where a simple replacement from PointNet to HDFE leads to immediate 12% and 15% error reductions in two benchmarks. In addition, by integrating HDFE into the PointNet-based SOTA network, we improve the SOTA baseline by 2.5% and 1.7% in the same benchmarks. \ No newline at end of file diff --git a/data/2024/iclr/Decoding Natural Images from EEG for Object Recognition b/data/2024/iclr/Decoding Natural Images from EEG for Object Recognition new file mode 100644 index 0000000000..7bbb560c90 --- /dev/null +++ b/data/2024/iclr/Decoding Natural Images from EEG for Object Recognition @@ -0,0 +1 @@ +Electroencephalography (EEG) signals, known for convenient non-invasive acquisition but low signal-to-noise ratio, have recently gained substantial attention due to the potential to decode natural images. This paper presents a self-supervised framework to demonstrate the feasibility of learning image representations from EEG signals, particularly for object recognition. The framework utilizes image and EEG encoders to extract features from paired image stimuli and EEG responses. Contrastive learning aligns these two modalities by constraining their similarity. With the framework, we attain significantly above-chance results on a comprehensive EEG-image dataset, achieving a top-1 accuracy of 15.6% and a top-5 accuracy of 42.8% in challenging 200-way zero-shot tasks. Moreover, we perform extensive experiments to explore the biological plausibility by resolving the temporal, spatial, spectral, and semantic aspects of EEG signals. Besides, we introduce attention modules to capture spatial correlations, providing implicit evidence of the brain activity perceived from EEG data. These findings yield valuable insights for neural decoding and brain-computer interfaces in real-world scenarios. The code will be released on https://github.com/eeyhsong/NICE-EEG. \ No newline at end of file diff --git a/data/2024/iclr/DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization b/data/2024/iclr/DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization new file mode 100644 index 0000000000..3dff53428c --- /dev/null +++ b/data/2024/iclr/DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization @@ -0,0 +1 @@ +Recently, 3D generative models have shown promising performances in structure-based drug design by learning to generate ligands given target binding sites. However, only modeling the target-ligand distribution can hardly fulfill one of the main goals in drug discovery -- designing novel ligands with desired properties, e.g., high binding affinity, easily synthesizable, etc. This challenge becomes particularly pronounced when the target-ligand pairs used for training do not align with these desired properties. Moreover, most existing methods aim at solving \textit{de novo} design task, while many generative scenarios requiring flexible controllability, such as R-group optimization and scaffold hopping, have received little attention. In this work, we propose DecompOpt, a structure-based molecular optimization method based on a controllable and decomposed diffusion model. DecompOpt presents a new generation paradigm which combines optimization with conditional diffusion models to achieve desired properties while adhering to the molecular grammar. 
Additionally, DecompOpt offers a unified framework covering both \textit{de novo} design and controllable generation. To achieve this, ligands are decomposed into substructures, which allows fine-grained control and local optimization. Experiments show that DecompOpt can efficiently generate molecules with better properties than strong de novo baselines, and demonstrates great potential in controllable generation tasks. \ No newline at end of file diff --git a/data/2024/iclr/Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems b/data/2024/iclr/Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems new file mode 100644 index 0000000000..fb19d33417 --- /dev/null +++ b/data/2024/iclr/Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems @@ -0,0 +1 @@ +The Krylov subspace, which is generated by multiplying a given vector by the matrix of a linear transformation and its successive powers, has been extensively studied in classical optimization literature to design algorithms that converge quickly for large linear inverse problems. For example, the conjugate gradient method (CG), one of the most popular Krylov subspace methods, is based on the idea of minimizing the residual error in the Krylov subspace. However, with the recent advancement of high-performance diffusion solvers for inverse problems, it is not clear how classical wisdom can be synergistically combined with modern diffusion models. In this study, we propose a novel and efficient diffusion sampling strategy that synergistically combines diffusion sampling and Krylov subspace methods. Specifically, we prove that if the tangent space at a denoised sample by Tweedie's formula forms a Krylov subspace, then CG initialized with the denoised data ensures that the data consistency update remains in the tangent space. This negates the need to compute the manifold-constrained gradient (MCG), leading to a more efficient diffusion sampling method. Our method is applicable regardless of the parametrization and setting (i.e., VE, VP). Notably, we achieve state-of-the-art reconstruction quality on challenging real-world medical inverse imaging problems, including multi-coil MRI reconstruction and 3D CT reconstruction. Moreover, our proposed method achieves more than 80 times faster inference time than the previous state-of-the-art method. Code is available at https://github.com/HJ-harry/DDS \ No newline at end of file diff --git a/data/2024/iclr/Decongestion by Representation: Learning to Improve Economic Welfare in Marketplaces b/data/2024/iclr/Decongestion by Representation: Learning to Improve Economic Welfare in Marketplaces new file mode 100644 index 0000000000..07ad973e84 --- /dev/null +++ b/data/2024/iclr/Decongestion by Representation: Learning to Improve Economic Welfare in Marketplaces @@ -0,0 +1 @@ +Congestion is a common failure mode of markets, where consumers compete inefficiently on the same subset of goods (e.g., chasing the same small set of properties on a vacation rental platform). The typical economic story is that prices decongest by balancing supply and demand. But in modern online marketplaces, prices are typically set in a decentralized way by sellers, and the information about items is inevitably partial. The power of a platform is limited to controlling representations -- the subset of information about items presented by default to users.
This motivates the present study of decongestion by representation, where a platform seeks to learn representations that reduce congestion and thus improve social welfare. The technical challenge is twofold: relying only on revealed preferences from the choices of consumers, rather than true preferences; and the combinatorial problem associated with representations that determine the features to reveal in the default view. We tackle both challenges by proposing a differentiable proxy of welfare that can be trained end-to-end on consumer choice data. We develop sufficient conditions for when decongestion promotes welfare, and present the results of extensive experiments on both synthetic and real data that demonstrate the utility of our approach. \ No newline at end of file diff --git a/data/2024/iclr/Decoupled Marked Temporal Point Process using Neural Ordinary Differential Equations b/data/2024/iclr/Decoupled Marked Temporal Point Process using Neural Ordinary Differential Equations new file mode 100644 index 0000000000..cedf0b6e4e --- /dev/null +++ b/data/2024/iclr/Decoupled Marked Temporal Point Process using Neural Ordinary Differential Equations @@ -0,0 +1 @@ +A Marked Temporal Point Process (MTPP) is a stochastic process whose realization is a set of event-time data. MTPP is often used to understand complex dynamics of asynchronous temporal events such as money transaction, social media, healthcare, etc. Recent studies have utilized deep neural networks to capture complex temporal dependencies of events and generate embedding that aptly represent the observed events. While most previous studies focus on the inter-event dependencies and their representations, how individual events influence the overall dynamics over time has been under-explored. In this regime, we propose a Decoupled MTPP framework that disentangles characterization of a stochastic process into a set of evolving influences from different events. Our approach employs Neural Ordinary Differential Equations (Neural ODEs) to learn flexible continuous dynamics of these influences while simultaneously addressing multiple inference problems, such as density estimation and survival rate computation. We emphasize the significance of disentangling the influences by comparing our framework with state-of-the-art methods on real-life datasets, and provide analysis on the model behavior for potential applications. \ No newline at end of file diff --git a/data/2024/iclr/Decoupling Weighing and Selecting for Integrating Multiple Graph Pre-training Tasks b/data/2024/iclr/Decoupling Weighing and Selecting for Integrating Multiple Graph Pre-training Tasks new file mode 100644 index 0000000000..eb69cb55c6 --- /dev/null +++ b/data/2024/iclr/Decoupling Weighing and Selecting for Integrating Multiple Graph Pre-training Tasks @@ -0,0 +1 @@ +Recent years have witnessed the great success of graph pre-training for graph representation learning. With hundreds of graph pre-training tasks proposed, integrating knowledge acquired from multiple pre-training tasks has become a popular research topic. In this paper, we identify two important collaborative processes for this topic: (1) select: how to select an optimal task combination from a given task pool based on their compatibility, and (2) weigh: how to weigh the selected tasks based on their importance. While there currently has been a lot of work focused on weighing, comparatively little effort has been devoted to selecting. 
This paper proposes a novel instance-level framework for integrating multiple graph pre-training tasks, Weigh And Select (WAS), where the two collaborative processes, weighing and selecting, are combined by decoupled siamese networks. Specifically, it first adaptively learns an optimal combination of tasks for each instance from a given task pool, based on which a customized instance-level task weighing strategy is learned. Extensive experiments on 16 graph datasets across node-level and graph-level downstream tasks have demonstrated that by combining a few simple but classical tasks, WAS can achieve comparable performance to other leading counterparts. The code is available at https://github.com/TianyuFan0504/WAS. \ No newline at end of file diff --git a/data/2024/iclr/Decoupling regularization from the action space b/data/2024/iclr/Decoupling regularization from the action space new file mode 100644 index 0000000000..dcc59913a5 --- /dev/null +++ b/data/2024/iclr/Decoupling regularization from the action space @@ -0,0 +1 @@ +Regularized reinforcement learning (RL), particularly the entropy-regularized kind, has gained traction in optimal control and inverse RL. While standard unregularized RL methods remain unaffected by changes in the number of actions, we show that such changes can severely impact their regularized counterparts. This paper demonstrates the importance of decoupling the regularizer from the action space: that is, maintaining a consistent level of regularization regardless of how many actions are involved, so as to avoid over-regularization. Although the problem can be avoided by introducing a task-specific temperature parameter, this is often undesirable and cannot solve the problem when action spaces are state-dependent. In the state-dependent action context, different states with varying action spaces are regularized inconsistently. We introduce two solutions: a static temperature selection approach and a dynamic counterpart, universally applicable where this problem arises. Implementing these changes improves performance on the DeepMind control suite in static and dynamic temperature regimes and a biological sequence design task. \ No newline at end of file diff --git a/data/2024/iclr/Deep Confident Steps to New Pockets: Strategies for Docking Generalization b/data/2024/iclr/Deep Confident Steps to New Pockets: Strategies for Docking Generalization new file mode 100644 index 0000000000..0fc7d05e72 --- /dev/null +++ b/data/2024/iclr/Deep Confident Steps to New Pockets: Strategies for Docking Generalization @@ -0,0 +1 @@ +Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models.
We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods. \ No newline at end of file diff --git a/data/2024/iclr/Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders b/data/2024/iclr/Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Deep Neural Networks Tend To Extrapolate Predictably b/data/2024/iclr/Deep Neural Networks Tend To Extrapolate Predictably new file mode 100644 index 0000000000..1215f29c0f --- /dev/null +++ b/data/2024/iclr/Deep Neural Networks Tend To Extrapolate Predictably @@ -0,0 +1 @@ +Conventional wisdom suggests that neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs. Our work reassesses this assumption for neural networks with high-dimensional inputs. Rather than extrapolating in arbitrary ways, we observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD. Moreover, we find that this value often closely approximates the optimal constant solution (OCS), i.e., the prediction that minimizes the average loss over the training data without observing the input. We present results showing this phenomenon across 8 datasets with different distributional shifts (including CIFAR10-C and ImageNet-R, S), different loss functions (cross entropy, MSE, and Gaussian NLL), and different architectures (CNNs and transformers). Furthermore, we present an explanation for this behavior, which we first validate empirically and then study theoretically in a simplified setting involving deep homogeneous networks with ReLU activations. Finally, we show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs. \ No newline at end of file diff --git a/data/2024/iclr/Deep Orthogonal Hypersphere Compression for Anomaly Detection b/data/2024/iclr/Deep Orthogonal Hypersphere Compression for Anomaly Detection new file mode 100644 index 0000000000..0995b81f7d --- /dev/null +++ b/data/2024/iclr/Deep Orthogonal Hypersphere Compression for Anomaly Detection @@ -0,0 +1 @@ +Many well-known and effective anomaly detection methods assume that a reasonable decision boundary has a hypersphere shape, which however is difficult to obtain in practice and is not sufficiently compact, especially when the data are in high-dimensional spaces. In this paper, we first propose a novel deep anomaly detection model that improves the original hypersphere learning through an orthogonal projection layer, which ensures that the training data distribution is consistent with the hypersphere hypothesis, thereby increasing the true positive rate and decreasing the false negative rate. Moreover, we propose a bi-hypersphere compression method to obtain a hyperspherical shell that yields a more compact decision region than a hyperball, which is demonstrated theoretically and numerically. The proposed methods are not confined to common datasets such as image and tabular data, but are also extended to a more challenging but promising scenario, graph-level anomaly detection, which learns graph representation with maximum mutual information between the substructure and global structure features while exploring orthogonal single- or bi-hypersphere anomaly decision boundaries. 
The numerical and visualization results on benchmark datasets demonstrate the superiority of our methods in comparison to many baselines and state-of-the-art methods. \ No newline at end of file diff --git a/data/2024/iclr/Deep Reinforcement Learning Guided Improvement Heuristic for Job Shop Scheduling b/data/2024/iclr/Deep Reinforcement Learning Guided Improvement Heuristic for Job Shop Scheduling new file mode 100644 index 0000000000..5773cc8f6f --- /dev/null +++ b/data/2024/iclr/Deep Reinforcement Learning Guided Improvement Heuristic for Job Shop Scheduling @@ -0,0 +1 @@ +Recent studies in using deep reinforcement learning (DRL) to solve Job-shop scheduling problems (JSSP) focus on construction heuristics. However, their performance is still far from optimality, mainly because the underlying graph representation scheme is unsuitable for modelling partial solutions at each construction step. This paper proposes a novel DRL-guided improvement heuristic for solving JSSP, where graph representation is employed to encode complete solutions. We design a Graph Neural-Network-based representation scheme, consisting of two modules to effectively capture the information of dynamic topology and different types of nodes in graphs encountered during the improvement process. To speed up solution evaluation during improvement, we present a novel message-passing mechanism that can evaluate multiple solutions simultaneously. We prove that the computational complexity of our method scales linearly with problem size. Experiments on classic benchmarks show that the improvement policy learned by our method outperforms state-of-the-art DRL-based methods by a large margin. \ No newline at end of file diff --git a/data/2024/iclr/Deep Reinforcement Learning for Modelling Protein Complexes b/data/2024/iclr/Deep Reinforcement Learning for Modelling Protein Complexes new file mode 100644 index 0000000000..22c7ca4f2c --- /dev/null +++ b/data/2024/iclr/Deep Reinforcement Learning for Modelling Protein Complexes @@ -0,0 +1 @@ +AlphaFold can be used for both single-chain and multi-chain protein structure prediction, while the latter becomes extremely challenging as the number of chains increases. In this work, by taking each chain as a node and assembly actions as edges, we show that an acyclic undirected connected graph can be used to predict the structure of multi-chain protein complexes (a.k.a., protein complex modelling, PCM). However, there are still two challenges: 1) The huge combinatorial optimization space of $N^{N-2}$ ($N$ is the number of chains) for the PCM problem can easily lead to high computational cost. 2) The scales of protein complexes exhibit distribution shift due to variance in chain numbers, which calls for the generalization in modelling complexes of various scales. To address these challenges, we propose GAPN, a Generative Adversarial Policy Network powered by domain-specific rewards and adversarial loss through policy gradient for automatic PCM prediction. Specifically, GAPN learns to efficiently search through the immense assembly space and optimize the direct docking reward through policy gradient. Importantly, we design an adversarial reward function to enhance the receptive field of our model. In this way, GAPN will simultaneously focus on a specific batch of complexes and the global assembly rules learned from complexes with varied chain numbers. 
Empirically, we achieve significant improvements in both accuracy (measured by RMSD and TM-Score) and efficiency compared to leading PCM software. \ No newline at end of file diff --git a/data/2024/iclr/Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks b/data/2024/iclr/Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks new file mode 100644 index 0000000000..697ce356a9 --- /dev/null +++ b/data/2024/iclr/Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks @@ -0,0 +1 @@ +Many robot manipulation tasks can be framed as geometric reasoning tasks, where an agent must be able to precisely manipulate an object into a position that satisfies the task from a set of initial conditions. Often, task success is defined based on the relationship between two objects - for instance, hanging a mug on a rack. In such cases, the solution should be equivariant to the initial position of the objects as well as the agent, and invariant to the pose of the camera. This poses a challenge for learning systems which attempt to solve this task by learning directly from high-dimensional demonstrations: the agent must learn to be both equivariant as well as precise, which can be challenging without any inductive biases about the problem. In this work, we propose a method for precise relative pose prediction which is provably SE(3)-equivariant, can be learned from only a few demonstrations, and can generalize across variations in a class of objects. We accomplish this by factoring the problem into learning an SE(3) invariant task-specific representation of the scene and then interpreting this representation with novel geometric reasoning layers which are provably SE(3) equivariant. We demonstrate that our method can yield substantially more precise placement predictions in simulated placement tasks than previous methods trained with the same amount of data, and can accurately represent relative placement relationships in data collected from real-world demonstrations. Supplementary information and videos can be found at https://sites.google.com/view/reldist-iclr-2023. \ No newline at end of file diff --git a/data/2024/iclr/Deep Temporal Graph Clustering b/data/2024/iclr/Deep Temporal Graph Clustering new file mode 100644 index 0000000000..5973ecbb6a --- /dev/null +++ b/data/2024/iclr/Deep Temporal Graph Clustering @@ -0,0 +1 @@ +Deep graph clustering has recently received significant attention due to its ability to enhance the representation learning capabilities of models in unsupervised scenarios. Nevertheless, deep clustering for temporal graphs, which could capture crucial dynamic interaction information, has not been fully explored. This means that in many clustering-oriented real-world scenarios, temporal graphs can only be processed as static graphs. This not only loses dynamic information but also incurs a huge computational cost. To solve the problem, we propose a general framework for deep Temporal Graph Clustering called TGC, which introduces deep clustering techniques to suit the interaction sequence-based batch-processing pattern of temporal graphs. In addition, we discuss differences between temporal graph clustering and static graph clustering from several levels. To verify the superiority of the proposed framework TGC, we conduct extensive experiments.
The experimental results show that temporal graph clustering enables more flexibility in finding a balance between time and space requirements, and our framework can effectively improve the performance of existing temporal graph learning methods. The code is released: https://github.com/MGitHubL/Deep-Temporal-Graph-Clustering. \ No newline at end of file diff --git a/data/2024/iclr/DeepSPF: Spherical SO(3)-Equivariant Patches for Scan-to-CAD Estimation b/data/2024/iclr/DeepSPF: Spherical SO(3)-Equivariant Patches for Scan-to-CAD Estimation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training b/data/2024/iclr/DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training new file mode 100644 index 0000000000..7e6ec80e00 --- /dev/null +++ b/data/2024/iclr/DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training @@ -0,0 +1 @@ +Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: Its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To our best knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinate-wise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsity-induced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing DL with black box. \ No newline at end of file diff --git a/data/2024/iclr/Defining Expertise: Applications to Treatment Effect Estimation b/data/2024/iclr/Defining Expertise: Applications to Treatment Effect Estimation new file mode 100644 index 0000000000..8f251e1dad --- /dev/null +++ b/data/2024/iclr/Defining Expertise: Applications to Treatment Effect Estimation @@ -0,0 +1 @@ +Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients, we can tell treatments prescribed more frequently are likely to be more effective. 
Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and"expertise"is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise - particularly the type of expertise the decision-makers of a domain are likely to have - can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection. \ No newline at end of file diff --git a/data/2024/iclr/Defining and extracting generalizable interaction primitives from DNNs b/data/2024/iclr/Defining and extracting generalizable interaction primitives from DNNs new file mode 100644 index 0000000000..5ad5391c20 --- /dev/null +++ b/data/2024/iclr/Defining and extracting generalizable interaction primitives from DNNs @@ -0,0 +1 @@ +Faithfully summarizing the knowledge encoded by a deep neural network (DNN) into a few symbolic primitive patterns without losing much information represents a core challenge in explainable AI. To this end, Ren et al. (2023c) have derived a series of theorems to prove that the inference score of a DNN can be explained as a small set of interactions between input variables. However, the lack of generalization power makes it still hard to consider such interactions as faithful primitive patterns encoded by the DNN. Therefore, given different DNNs trained for the same task, we develop a new method to extract interactions that are shared by these DNNs. Experiments show that the extracted interactions can better reflect common knowledge shared by different DNNs. \ No newline at end of file diff --git a/data/2024/iclr/Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding b/data/2024/iclr/Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding new file mode 100644 index 0000000000..27b02ce91b --- /dev/null +++ b/data/2024/iclr/Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding @@ -0,0 +1 @@ +A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. 
We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice. \ No newline at end of file diff --git a/data/2024/iclr/Delta-AI: Local objectives for amortized inference in sparse graphical models b/data/2024/iclr/Delta-AI: Local objectives for amortized inference in sparse graphical models new file mode 100644 index 0000000000..e71d9806ab --- /dev/null +++ b/data/2024/iclr/Delta-AI: Local objectives for amortized inference in sparse graphical models @@ -0,0 +1 @@ +We present a new algorithm for amortized inference in sparse probabilistic graphical models (PGMs), which we call $\Delta$-amortized inference ($\Delta$-AI). Our approach is based on the observation that when the sampling of variables in a PGM is seen as a sequence of actions taken by an agent, sparsity of the PGM enables local credit assignment in the agent's policy learning objective. This yields a local constraint that can be turned into a local loss in the style of generative flow networks (GFlowNets) that enables off-policy training but avoids the need to instantiate all the random variables for each parameter update, thus speeding up training considerably. The $\Delta$-AI objective matches the conditional distribution of a variable given its Markov blanket in a tractable learned sampler, which has the structure of a Bayesian network, with the same conditional distribution under the target PGM. As such, the trained sampler recovers marginals and conditional distributions of interest and enables inference of partial subsets of variables. We illustrate $\Delta$-AI's effectiveness for sampling from synthetic PGMs and training latent variable models with sparse factor structure. \ No newline at end of file diff --git a/data/2024/iclr/Democratizing Fine-grained Visual Recognition with Large Language Models b/data/2024/iclr/Democratizing Fine-grained Visual Recognition with Large Language Models new file mode 100644 index 0000000000..2c9dd1432a --- /dev/null +++ b/data/2024/iclr/Democratizing Fine-grained Visual Recognition with Large Language Models @@ -0,0 +1 @@ +Identifying subordinate-level categories from images is a longstanding task in computer vision and is referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications since an average layperson does not excel at differentiating species of birds or mushrooms due to subtle differences among the species. A major bottleneck in developing FGVR systems is caused by the need of high-quality paired expert annotations. To circumvent the need of expert knowledge we propose Fine-grained Semantic Category Reasoning (FineR) that internally leverages the world knowledge of large language models (LLMs) as a proxy in order to reason about fine-grained category names. In detail, to bridge the modality gap between images and LLM, we extract part-level visual attributes from images as text and feed that information to a LLM. Based on the visual attributes and its internal world knowledge the LLM reasons about the subordinate-level category names. Our training-free FineR outperforms several state-of-the-art FGVR and language and vision assistant models and shows promise in working in the wild and in new domains where gathering expert annotation is arduous. 
\ No newline at end of file diff --git a/data/2024/iclr/Demonstration-Regularized RL b/data/2024/iclr/Demonstration-Regularized RL new file mode 100644 index 0000000000..407fb1a6a7 --- /dev/null +++ b/data/2024/iclr/Demonstration-Regularized RL @@ -0,0 +1 @@ +Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of action, $S$ the number of states in the finite case and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works. \ No newline at end of file diff --git a/data/2024/iclr/Demystifying CLIP Data b/data/2024/iclr/Demystifying CLIP Data new file mode 100644 index 0000000000..60b4acd7f7 --- /dev/null +++ b/data/2024/iclr/Demystifying CLIP Data @@ -0,0 +1 @@ +Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP. 
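As a rough illustration of the metadata-balanced curation MetaCLIP describes, the sketch below caps how many image-text pairs any single metadata entry can contribute, so frequent ("head") concepts are sub-sampled while rare ("tail") concepts keep everything they have. The cap value, the metadata list, and the substring matching rule are invented placeholders, not the paper's actual settings.

```python
from collections import defaultdict
import random

def balance_pool(pairs, metadata, cap=2, seed=0):
    """pairs: list of (url, caption); metadata: list of concept strings."""
    random.seed(seed)
    per_entry = defaultdict(list)
    for url, text in pairs:
        for entry in metadata:
            if entry in text.lower():  # toy matching rule
                per_entry[entry].append((url, text))
    curated = set()
    for entry, matched in per_entry.items():
        # Keep everything for tail entries; sub-sample head entries to the cap.
        keep = matched if len(matched) <= cap else random.sample(matched, cap)
        curated.update(keep)
    return sorted(curated)

pool = [("u1", "a photo of a dog"), ("u2", "dog running"), ("u3", "my dog sleeping"),
        ("u4", "a rare axolotl"), ("u5", "a cat on a sofa")]
print(balance_pool(pool, metadata=["dog", "cat", "axolotl"]))
```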
\ No newline at end of file diff --git a/data/2024/iclr/Demystifying Embedding Spaces using Large Language Models b/data/2024/iclr/Demystifying Embedding Spaces using Large Language Models new file mode 100644 index 0000000000..6304dc56a8 --- /dev/null +++ b/data/2024/iclr/Demystifying Embedding Spaces using Large Language Models @@ -0,0 +1 @@ +Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing Large Language Models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Demystifying Linear MDPs and Novel Dynamics Aggregation Framework b/data/2024/iclr/Demystifying Linear MDPs and Novel Dynamics Aggregation Framework new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Demystifying Local & Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition b/data/2024/iclr/Demystifying Local & Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition new file mode 100644 index 0000000000..41f03a8fbb --- /dev/null +++ b/data/2024/iclr/Demystifying Local & Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition @@ -0,0 +1 @@ +This work presents an information-theoretic perspective to group fairness trade-offs in federated learning (FL) with respect to sensitive attributes, such as gender, race, etc. Existing works often focus on either $\textit{global fairness}$ (overall disparity of the model across all clients) or $\textit{local fairness}$ (disparity of the model at each client), without always considering their trade-offs. There is a lack of understanding regarding the interplay between global and local fairness in FL, particularly under data heterogeneity, and if and when one implies the other. To address this gap, we leverage a body of work in information theory called partial information decomposition (PID), which first identifies three sources of unfairness in FL, namely, $\textit{Unique Disparity}$, $\textit{Redundant Disparity}$, and $\textit{Masked Disparity}$. We demonstrate how these three disparities contribute to global and local fairness using canonical examples. This decomposition helps us derive fundamental limits on the trade-off between global and local fairness, highlighting where they agree or disagree. 
We introduce the $\textit{Accuracy and Global-Local Fairness Optimality Problem (AGLFOP)}$, a convex optimization that defines the theoretical limits of accuracy and fairness trade-offs, identifying the best possible performance any FL strategy can attain given a dataset and client distribution. We also present experimental results on synthetic datasets and the ADULT dataset to support our theoretical findings. \ No newline at end of file diff --git a/data/2024/iclr/Demystifying Poisoning Backdoor Attacks from a Statistical Perspective b/data/2024/iclr/Demystifying Poisoning Backdoor Attacks from a Statistical Perspective new file mode 100644 index 0000000000..cd19c3ae7b --- /dev/null +++ b/data/2024/iclr/Demystifying Poisoning Backdoor Attacks from a Statistical Perspective @@ -0,0 +1 @@ +The growing dependence on machine learning in real-world applications emphasizes the importance of understanding and ensuring its safety. Backdoor attacks pose a significant security risk due to their stealthy nature and potentially serious consequences. Such attacks involve embedding triggers within a learning model with the intention of causing malicious behavior when an active trigger is present while maintaining regular functionality without it. This paper evaluates the effectiveness of any backdoor attack incorporating a constant trigger, by establishing tight lower and upper boundaries for the performance of the compromised model on both clean and backdoor test data. The developed theory answers a series of fundamental but previously underexplored problems, including (1) what are the determining factors for a backdoor attack's success, (2) what is the direction of the most effective backdoor attack, and (3) when will a human-imperceptible trigger succeed. Our derived understanding applies to both discriminative and generative models. We also demonstrate the theory by conducting experiments using benchmark datasets and state-of-the-art backdoor attack scenarios. \ No newline at end of file diff --git a/data/2024/iclr/Denevil: towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning b/data/2024/iclr/Denevil: towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning new file mode 100644 index 0000000000..3a991287eb --- /dev/null +++ b/data/2024/iclr/Denevil: towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning @@ -0,0 +1 @@ +Large Language Models (LLMs) have made unprecedented breakthroughs, yet their increasing integration into everyday life might raise societal risks due to generated unethical content. Despite extensive study on specific issues like bias, the intrinsic values of LLMs remain largely unexplored from a moral philosophy perspective. This work delves into ethical values utilizing Moral Foundation Theory. Moving beyond conventional discriminative evaluations with poor reliability, we propose DeNEVIL, a novel prompt generation algorithm tailored to dynamically exploit LLMs' value vulnerabilities and elicit the violation of ethics in a generative manner, revealing their underlying value inclinations. On such a basis, we construct MoralPrompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values across a spectrum of LLMs. We discovered that most models are essentially misaligned, necessitating further ethical value alignment. 
In response, we develop VILMO, an in-context alignment method that substantially enhances the value compliance of LLM outputs by learning to generate appropriate value instructions, outperforming existing competitors. Our methods are suitable for black-box and open-source models, offering a promising initial step in studying the ethical values of LLMs. \ No newline at end of file diff --git a/data/2024/iclr/Denoising Diffusion Bridge Models b/data/2024/iclr/Denoising Diffusion Bridge Models new file mode 100644 index 0000000000..a4000324a1 --- /dev/null +++ b/data/2024/iclr/Denoising Diffusion Bridge Models @@ -0,0 +1 @@ +Diffusion models are powerful generative models that map noise to data using stochastic processes. However, for many applications such as image editing, the model input comes from a distribution that is not random noise. As such, diffusion models must rely on cumbersome methods like guidance or projected sampling to incorporate this information in the generative process. In our work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural alternative to this paradigm based on diffusion bridges, a family of processes that interpolate between two paired distributions given as endpoints. Our method learns the score of the diffusion bridge from data and maps from one endpoint distribution to the other by solving a (stochastic) differential equation based on the learned score. Our method naturally unifies several classes of generative models, such as score-based diffusion models and OT-Flow-Matching, allowing us to adapt existing design and architectural choices to our more general problem. Empirically, we apply DDBMs to challenging image datasets in both pixel and latent space. On standard image translation problems, DDBMs achieve significant improvement over baseline methods, and, when we reduce the problem to image generation by setting the source distribution to random noise, DDBMs achieve comparable FID scores to state-of-the-art methods despite being built for a more general task. \ No newline at end of file diff --git a/data/2024/iclr/Denoising Diffusion Step-aware Models b/data/2024/iclr/Denoising Diffusion Step-aware Models new file mode 100644 index 0000000000..87f092539e --- /dev/null +++ b/data/2024/iclr/Denoising Diffusion Step-aware Models @@ -0,0 +1 @@ +Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for data generation across various domains. However, a significant bottleneck is the necessity for whole-network computation during every step of the generative process, leading to high computational overheads. This paper presents a novel framework, Denoising Diffusion Step-aware Models (DDSM), to address this challenge. Unlike conventional approaches, DDSM employs a spectrum of neural networks whose sizes are adapted according to the importance of each generative step, as determined through evolutionary search. This step-wise network variation effectively circumvents redundant computational efforts, particularly in less critical steps, thereby enhancing the efficiency of the diffusion model. Furthermore, the step-aware design can be seamlessly integrated with other efficiency-geared diffusion models such as DDIMs and latent diffusion, thus broadening the scope of computational savings. Empirical evaluations demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all without compromising the generation quality. 
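The step-aware dispatch that DDSM describes can be pictured with a small sketch. The snippet below is a minimal illustration rather than the authors' implementation: it assumes a standard DDPM noise schedule, toy MLP denoisers in place of U-Nets, and a hand-written step-to-network assignment standing in for the result of the evolutionary search.

```python
import torch
import torch.nn as nn

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # standard DDPM noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def make_denoiser(width: int, dim: int = 2) -> nn.Module:
    # Toy epsilon-predictor; the real method would use U-Nets of different capacities.
    return nn.Sequential(nn.Linear(dim + 1, width), nn.SiLU(), nn.Linear(width, dim))

# A "spectrum" of networks of different sizes, stand-ins for the searched sizes.
denoisers = {"small": make_denoiser(32), "medium": make_denoiser(128), "large": make_denoiser(512)}

def assign_network(t: int) -> str:
    # Hypothetical step-to-size assignment; in DDSM this comes from evolutionary search.
    if t > 700:
        return "large"
    elif t > 300:
        return "medium"
    return "small"

@torch.no_grad()
def sample(n: int = 16, dim: int = 2) -> torch.Tensor:
    x = torch.randn(n, dim)
    for t in reversed(range(T)):
        net = denoisers[assign_network(t)]          # step-aware dispatch
        t_in = torch.full((n, 1), t / T)
        eps = net(torch.cat([x, t_in], dim=1))      # predicted noise
        a, ab = alphas[t], alpha_bars[t]
        mean = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

samples = sample()
```

The point of the sketch is only the dispatch inside the sampling loop: cheaper networks handle the steps deemed less important, so per-sample cost drops without changing the sampler itself.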
\ No newline at end of file diff --git a/data/2024/iclr/Denoising Diffusion via Image-Based Rendering b/data/2024/iclr/Denoising Diffusion via Image-Based Rendering new file mode 100644 index 0000000000..a2dd8be23b --- /dev/null +++ b/data/2024/iclr/Denoising Diffusion via Image-Based Rendering @@ -0,0 +1 @@ +Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not capable of reconstructing detailed, large-scale scenes in the wild, as they use limited-capacity 3D scene representations, require aligned camera poses, or rely on additional regularizers. In this work, we introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes. To achieve this, we make three contributions. First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes, dynamically allocating more capacity as needed to capture details visible in each image. Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images without the need for any additional supervision signal such as masks or depths. This supports 3D reconstruction and generation in a unified architecture. Third, we develop a principled approach to avoid trivial 3D solutions when integrating the image-based rendering with the diffusion model, by dropping out representations of some images. We evaluate the model on several challenging datasets of real and synthetic images, and demonstrate superior results on generation, novel view synthesis and 3D reconstruction. \ No newline at end of file diff --git a/data/2024/iclr/Denoising Task Routing for Diffusion Models b/data/2024/iclr/Denoising Task Routing for Diffusion Models new file mode 100644 index 0000000000..62ad80f1ea --- /dev/null +++ b/data/2024/iclr/Denoising Task Routing for Diffusion Models @@ -0,0 +1 @@ +Diffusion models generate highly realistic images by learning a multi-step denoising process, naturally embodying the principles of multi-task learning (MTL). Despite the inherent connection between diffusion models and MTL, there remains an unexplored area in designing neural architectures that explicitly incorporate MTL into the framework of diffusion models. In this paper, we present Denoising Task Routing (DTR), a simple add-on strategy for existing diffusion model architectures to establish distinct information pathways for individual tasks within a single architecture by selectively activating subsets of channels in the model. What makes DTR particularly compelling is its seamless integration of prior knowledge of denoising tasks into the framework: (1) Task Affinity: DTR activates similar channels for tasks at adjacent timesteps and shifts activated channels as sliding windows through timesteps, capitalizing on the inherent strong affinity between tasks at adjacent timesteps. 
(2) Task Weights: During the early stages (higher timesteps) of the denoising process, DTR assigns a greater number of task-specific channels, leveraging the insight that diffusion models prioritize reconstructing global structure and perceptually rich contents in earlier stages, and focus on simple noise removal in later stages. Our experiments reveal that DTR not only consistently boosts diffusion models' performance across different evaluation protocols without adding extra parameters but also accelerates training convergence. Finally, we show the complementarity between our architectural approach and existing MTL optimization techniques, providing a more complete view of MTL in the context of diffusion training. Significantly, by leveraging this complementarity, we attain matched performance of DiT-XL using the smaller DiT-L with a reduction in training iterations from 7M to 2M. \ No newline at end of file diff --git a/data/2024/iclr/Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit b/data/2024/iclr/Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit new file mode 100644 index 0000000000..4ee1773dbb --- /dev/null +++ b/data/2024/iclr/Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit @@ -0,0 +1 @@ +The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit. \ No newline at end of file diff --git a/data/2024/iclr/Designing Skill-Compatible AI: Methodologies and Frameworks in Chess b/data/2024/iclr/Designing Skill-Compatible AI: Methodologies and Frameworks in Chess new file mode 100644 index 0000000000..8504cf2c4b --- /dev/null +++ b/data/2024/iclr/Designing Skill-Compatible AI: Methodologies and Frameworks in Chess @@ -0,0 +1 @@ +Powerful artificial intelligence systems are often used in settings where they must interact with agents that are computationally much weaker, for example when they work alongside humans or operate in complex environments where some tasks are handled by algorithms, heuristics, or other entities of varying computational power. For AI agents to successfully interact in these settings, however, achieving superhuman performance alone is not sufficient; they also need to account for suboptimal actions or idiosyncratic style from their less-skilled counterparts. 
We propose a formal evaluation framework for assessing the compatibility of near-optimal AI with interaction partners who may have much lower levels of skill; we use popular collaborative chess variants as model systems to study and develop AI agents that can successfully interact with lower-skill entities. Traditional chess engines designed to output near-optimal moves prove to be inadequate partners when paired with engines of various lower skill levels in this domain, as they are not designed to consider the presence of other agents. We contribute three methodologies to explicitly create skill-compatible AI agents in complex decision-making settings, and two chess game frameworks designed to foster collaboration between powerful AI agents and less-skilled partners. On these frameworks, our agents outperform state-of-the-art chess AI (based on AlphaZero) despite being weaker in conventional chess, demonstrating that skill-compatibility is a tangible trait that is qualitatively and measurably distinct from raw performance. Our evaluations further explore and clarify the mechanisms by which our agents achieve skill-compatibility. \ No newline at end of file diff --git a/data/2024/iclr/Det-CGD: Compressed Gradient Descent with Matrix Stepsizes for Non-Convex Optimization b/data/2024/iclr/Det-CGD: Compressed Gradient Descent with Matrix Stepsizes for Non-Convex Optimization new file mode 100644 index 0000000000..c375cc6c5a --- /dev/null +++ b/data/2024/iclr/Det-CGD: Compressed Gradient Descent with Matrix Stepsizes for Non-Convex Optimization @@ -0,0 +1 @@ +This paper introduces a new method for minimizing matrix-smooth non-convex objectives through the use of novel Compressed Gradient Descent (CGD) algorithms enhanced with a matrix-valued stepsize. The proposed algorithms are theoretically analyzed first in the single-node and subsequently in the distributed settings. Our theoretical results reveal that the matrix stepsize in CGD can capture the objective's structure and lead to faster convergence compared to a scalar stepsize. As a byproduct of our general results, we emphasize the importance of selecting the compression mechanism and the matrix stepsize in a layer-wise manner, taking advantage of model structure. Moreover, we provide theoretical guarantees for free compression, by designing specific layer-wise compressors for the non-convex matrix smooth objectives. Our findings are supported with empirical evidence. \ No newline at end of file diff --git a/data/2024/iclr/Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy b/data/2024/iclr/Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy new file mode 100644 index 0000000000..a6a59b4dc3 --- /dev/null +++ b/data/2024/iclr/Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy @@ -0,0 +1 @@ +Large language models (LLMs) such as ChatGPT have exhibited remarkable performance in generating human-like texts. However, machine-generated texts (MGTs) may carry critical risks, such as plagiarism issues, misleading information, or hallucination issues. Therefore, it is very urgent and important to detect MGTs in many situations. Unfortunately, it is challenging to distinguish MGTs and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of LLMs. 
In this paper, we seek to exploit \textit{maximum mean discrepancy} (MMD) to address this issue, since MMD is well suited to identifying distributional discrepancies. However, directly training a detector with MMD using diverse MGTs will incur a significantly increased variance of MMD since MGTs may contain \textit{multiple text populations} due to various LLMs. This will severely impair MMD's ability to measure the difference between two samples. To tackle this, we propose a novel \textit{multi-population} aware optimization method for MMD called MMD-MP, which can \textit{avoid variance increases} and thus improve the stability of measuring the distributional discrepancy. Relying on MMD-MP, we develop two methods for paragraph-based and sentence-based detection, respectively. Extensive experiments on various LLMs, e.g., GPT2 and ChatGPT, show superior detection performance of our MMD-MP. The source code is available at \url{https://github.com/ZSHsh98/MMD-MP}. \ No newline at end of file diff --git a/data/2024/iclr/Detecting Pretraining Data from Large Language Models b/data/2024/iclr/Detecting Pretraining Data from Large Language Models new file mode 100644 index 0000000000..55fa5bf89f --- /dev/null +++ b/data/2024/iclr/Detecting Pretraining Data from Large Language Models @@ -0,0 +1 @@ +Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to two real-world scenarios, copyrighted book detection and contaminated downstream example detection, and find it a consistently effective solution. \ No newline at end of file diff --git a/data/2024/iclr/Detecting, Explaining, and Mitigating Memorization in Diffusion Models b/data/2024/iclr/Detecting, Explaining, and Mitigating Memorization in Diffusion Models new file mode 100644 index 0000000000..db4fbbda1d --- /dev/null +++ b/data/2024/iclr/Detecting, Explaining, and Mitigating Memorization in Diffusion Models @@ -0,0 +1 @@ +Recent breakthroughs in diffusion models have exhibited exceptional image-generation capabilities. However, studies show that some outputs are merely replications of training data.
Such replications present potential legal challenges for model owners, especially when the generated content contains proprietary information. In this work, we introduce a straightforward yet effective method for detecting memorized prompts by inspecting the magnitude of text-conditional predictions. Our proposed method seamlessly integrates without disrupting sampling algorithms, and delivers high accuracy even at the first generation step, with a single generation per prompt. Building on our detection strategy, we unveil an explainable approach that shows the contribution of individual words or tokens to memorization. This offers an interactive medium for users to adjust their prompts. Moreover, we propose two strategies to mitigate memorization by leveraging the magnitude of text-conditional predictions: minimization during inference and filtering during training. These proposed strategies effectively counteract memorization while maintaining high generation quality. Code is available at https://github.com/YuxinWenRick/diffusion_memorization. \ No newline at end of file diff --git a/data/2024/iclr/DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models b/data/2024/iclr/DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models new file mode 100644 index 0000000000..b0ca2e2d58 --- /dev/null +++ b/data/2024/iclr/DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models @@ -0,0 +1 @@ +Recent advancements in autonomous driving have relied on data-driven approaches, which are widely adopted but face challenges including dataset bias, overfitting, and uninterpretability. Drawing inspiration from the knowledge-driven nature of human driving, we explore the question of how to instill similar capabilities into autonomous driving systems and summarize a paradigm that integrates an interactive environment, a driver agent, as well as a memory component to address this question. Leveraging large language models (LLMs) with emergent abilities, we propose the DiLu framework, which combines a Reasoning and a Reflection module to enable the system to perform decision-making based on common-sense knowledge and evolve continuously. Extensive experiments prove DiLu's capability to accumulate experience and demonstrate a significant advantage in generalization ability over reinforcement learning-based methods. Moreover, DiLu is able to directly acquire experiences from real-world datasets, which highlights its potential to be deployed on practical autonomous driving systems. To the best of our knowledge, we are the first to leverage knowledge-driven capability in decision-making for autonomous vehicles. Through the proposed DiLu framework, the LLM is strengthened to apply knowledge and to reason causally in the autonomous driving domain. Project page: https://pjlab-adg.github.io/DiLu/ \ No newline at end of file diff --git a/data/2024/iclr/Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making b/data/2024/iclr/Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making new file mode 100644 index 0000000000..517052dca2 --- /dev/null +++ b/data/2024/iclr/Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making @@ -0,0 +1 @@ +Pre-trained transformers are often fine-tuned to aid clinical decision-making using limited clinical notes.
Model interpretability is crucial, especially in high-stakes domains like medicine, to establish trust and ensure safety, which requires human engagement. We introduce SUFO, a systematic framework that enhances interpretability of fine-tuned transformer feature spaces. SUFO utilizes a range of analytic and visualization techniques, including Supervised probing, Unsupervised similarity analysis, Feature dynamics, and Outlier analysis to address key questions about model trust and interpretability. We conduct a case study investigating the impact of pre-training data where we focus on real-world pathology classification tasks, and validate our findings on MedNLI. We evaluate five 110M-sized pre-trained transformer models, categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal that: (1) while PubMedBERT, the domain-specific model, contains valuable information for fine-tuning, it can overfit to minority classes when class imbalances exist. In contrast, mixed-domain models exhibit greater resistance to overfitting, suggesting potential improvements in domain-specific model robustness; (2) in-domain pre-training accelerates feature disambiguation during fine-tuning; and (3) feature spaces undergo significant sparsification during this process, enabling clinicians to identify common outlier modes among fine-tuned models as demonstrated in this paper. These findings showcase the utility of SUFO in enhancing trust and safety when using transformers in medicine, and we believe SUFO can aid practitioners in evaluating fine-tuned language models for other applications in medicine and in more critical domains. \ No newline at end of file diff --git a/data/2024/iclr/Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking b/data/2024/iclr/Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking new file mode 100644 index 0000000000..8caf90727c --- /dev/null +++ b/data/2024/iclr/Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking @@ -0,0 +1 @@ +Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped at a solution corresponding to a kernel predictor for a long time, and then a very sharp transition to min-norm/max-margin predictors occurs, leading to a dramatic change in test accuracy.
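The regime the paper analyzes (large initialization scale combined with small weight decay) is easy to probe empirically. Below is a minimal, hedged sketch of such an experiment on the modular-addition task of Power et al. (2022); the architecture, optimizer, and hyperparameter values are illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy modular-addition task: predict (a + b) mod p from the pair (a, b).
p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2 :]

model = nn.Sequential(nn.Embedding(p, 128), nn.Flatten(), nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, p))

# The regime studied in the paper: scale up the initialization, keep weight decay small.
init_scale, weight_decay = 8.0, 1e-4       # illustrative values, not taken from the paper
with torch.no_grad():
    for w in model.parameters():
        w.mul_(init_scale)

opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(50_000):
    opt.zero_grad()
    loss = F.cross_entropy(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # Grokking shows up as train accuracy saturating long before test accuracy jumps.
        print(step, accuracy(train_idx), accuracy(test_idx))
```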
\ No newline at end of file diff --git a/data/2024/iclr/Dictionary Contrastive Learning for Efficient Local Supervision without Auxiliary Networks b/data/2024/iclr/Dictionary Contrastive Learning for Efficient Local Supervision without Auxiliary Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation b/data/2024/iclr/DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation new file mode 100644 index 0000000000..8310fcff94 --- /dev/null +++ b/data/2024/iclr/DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation @@ -0,0 +1 @@ +Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has focused on generating spectrograms, which then require a subsequent model (i.e., a vocoder) to convert the spectrogram to a waveform. This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize an unlimited speech duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical advantages. Specifically, it allows the creation of local acoustic behaviors, like vocal fry, which makes the overall waveform sound more natural. Furthermore, the proposed diffusion model is stochastic and not deterministic; therefore, each inference generates a slightly different waveform variation, enabling an abundance of valid realizations. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems. \ No newline at end of file diff --git a/data/2024/iclr/DiffEnc: Variational Diffusion with a Learned Encoder b/data/2024/iclr/DiffEnc: Variational Diffusion with a Learned Encoder new file mode 100644 index 0000000000..5f234dcba6 --- /dev/null +++ b/data/2024/iclr/DiffEnc: Variational Diffusion with a Learned Encoder @@ -0,0 +1 @@ +Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditional distributions in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two changes to the diffusion model that retain these advantages while adding flexibility to the model. Firstly, we introduce a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves a statistically significant improvement in likelihood on CIFAR-10. Secondly, we let the ratio of the noise variance of the reverse encoder process and the generative process be a free weight parameter rather than being fixed to 1. This leads to theoretical insights: For a finite depth hierarchy, the evidence lower bound (ELBO) can be used as an objective for a weighted diffusion loss approach and for optimizing the noise schedule specifically for inference. For the infinite-depth hierarchy, on the other hand, the weight parameter has to be 1 to have a well-defined ELBO.
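To make the first change concrete, the sketch below shows one plausible way to wire a data- and depth-dependent mean into the forward marginal of a diffusion model. It is only a schematic under assumed names and shapes: the encoder architecture, the residual parameterization, and the paper's modified diffusion loss and weighted-ELBO analysis are not reproduced here.

```python
import torch
import torch.nn as nn

class DepthDependentEncoder(nn.Module):
    # Maps data x and timestep t to a data- and depth-dependent mean (hypothetical parameterization).
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def forward(self, x, t):
        # Residual around the identity: with a zero residual this reduces to the usual fixed mean x.
        return x + self.net(torch.cat([x, t], dim=-1))

def forward_marginal(x, t, encoder, alpha_t, sigma_t):
    # z_t ~ N(alpha_t * x_t, sigma_t^2 I), where x_t = encoder(x, t) replaces the fixed mean x
    # used by a standard variational diffusion model.
    eps = torch.randn_like(x)
    x_t = encoder(x, t)
    return alpha_t * x_t + sigma_t * eps, eps

encoder = DepthDependentEncoder(dim=8)
x = torch.randn(4, 8)                      # a batch of toy data
t = torch.full((4, 1), 0.3)                # normalized timestep
z_t, eps = forward_marginal(x, t, encoder, alpha_t=0.9, sigma_t=0.436)
```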
\ No newline at end of file diff --git a/data/2024/iclr/Diffeomorphic Mesh Deformation via Efficient Optimal Transport for Cortical Surface Reconstruction b/data/2024/iclr/Diffeomorphic Mesh Deformation via Efficient Optimal Transport for Cortical Surface Reconstruction new file mode 100644 index 0000000000..7c7591d51b --- /dev/null +++ b/data/2024/iclr/Diffeomorphic Mesh Deformation via Efficient Optimal Transport for Cortical Surface Reconstruction @@ -0,0 +1 @@ +Mesh deformation plays a pivotal role in many 3D vision tasks including dynamic simulations, rendering, and reconstruction. However, defining an efficient discrepancy between predicted and target meshes remains an open problem. A prevalent approach in current deep learning is the set-based approach, which measures the discrepancy between two surfaces by comparing two randomly sampled point-clouds from the two meshes with the Chamfer pseudo-distance. Nevertheless, the set-based approach still has limitations, such as the lack of a theoretical guarantee for choosing the number of points in the sampled point-clouds, and the pseudo-metricity and quadratic complexity of the Chamfer divergence. To address these issues, we propose a novel metric for learning mesh deformation. The metric is defined by sliced Wasserstein distance on meshes represented as probability measures that generalize the set-based approach. By leveraging probability measure space, we gain flexibility in encoding meshes using diverse forms of probability measures, such as continuous, empirical, and discrete measures via varifold representation. After having encoded probability measures, we can compare meshes by using the sliced Wasserstein distance, which is an effective optimal transport distance with linear computational complexity and can provide a fast statistical rate for approximating the surface of meshes. To this end, we employ a neural ordinary differential equation (ODE) to deform the input surface into the target shape by modeling the trajectories of the points on the surface. Our experiments on cortical surface reconstruction demonstrate that our approach surpasses other competing methods in multiple datasets and metrics. \ No newline at end of file diff --git a/data/2024/iclr/Differentiable Euler Characteristic Transforms for Shape Classification b/data/2024/iclr/Differentiable Euler Characteristic Transforms for Shape Classification new file mode 100644 index 0000000000..f2bd2020f4 --- /dev/null +++ b/data/2024/iclr/Differentiable Euler Characteristic Transforms for Shape Classification @@ -0,0 +1 @@ +The Euler Characteristic Transform (ECT) has proven to be a powerful representation, combining geometrical and topological characteristics of shapes and graphs. However, the ECT was hitherto unable to learn task-specific representations. We overcome this issue and develop a novel computational layer that enables learning the ECT in an end-to-end fashion. Our method, the Differentiable Euler Characteristic Transform (DECT), is fast and computationally efficient, while exhibiting performance on a par with more complex models in both graph and point cloud classification tasks. Moreover, we show that this seemingly simple statistic provides the same topological expressivity as more complex topological deep learning layers.
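As a rough illustration of what the ECT computes and where differentiability enters, the sketch below evaluates Euler characteristic curves of a graph over a set of directions, first with hard indicators and then with a sigmoid relaxation. The relaxation is one plausible way to obtain gradients and may differ from the exact layer used in the paper; all names and shapes are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def euler_characteristic_transform(verts, edges, directions, thresholds):
    # Plain (non-differentiable) ECT of a graph embedded in R^d.
    # verts: (V, d) coordinates; edges: (E, 2) index pairs;
    # directions: (K, d) unit vectors; thresholds: (T,) filtration values.
    vert_h = verts @ directions.T                                        # (V, K) heights per direction
    edge_h = torch.maximum(vert_h[edges[:, 0]], vert_h[edges[:, 1]])     # (E, K) edge entry heights
    curves = []
    for t in thresholds:
        n_v = (vert_h <= t).sum(dim=0)        # vertices present at level t
        n_e = (edge_h <= t).sum(dim=0)        # edges present at level t
        curves.append(n_v - n_e)              # chi = V - E for a graph
    return torch.stack(curves, dim=1).float() # (K, T)

def soft_ect(verts, edges, directions, thresholds, temperature=0.1):
    # Replace the hard indicator (h <= t) with a sigmoid so the transform is
    # differentiable with respect to the coordinates and the directions.
    vert_h = verts @ directions.T
    edge_h = torch.maximum(vert_h[edges[:, 0]], vert_h[edges[:, 1]])
    t = thresholds.view(-1, 1, 1)
    n_v = torch.sigmoid((t - vert_h) / temperature).sum(dim=1)           # (T, K)
    n_e = torch.sigmoid((t - edge_h) / temperature).sum(dim=1)
    return (n_v - n_e).T                                                 # (K, T)

verts = torch.randn(50, 3, requires_grad=True)
edges = torch.randint(0, 50, (120, 2))
dirs = F.normalize(torch.randn(16, 3), dim=1)
ts = torch.linspace(-3, 3, 32)
curves = soft_ect(verts, edges, dirs, ts)     # differentiable (16, 32) ECT image
```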
\ No newline at end of file diff --git a/data/2024/iclr/Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks b/data/2024/iclr/Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks new file mode 100644 index 0000000000..470a8d406a --- /dev/null +++ b/data/2024/iclr/Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks @@ -0,0 +1 @@ +This paper investigates efficient deep neural networks (DNNs) to replace dense unstructured weight matrices with structured ones that possess desired properties. The challenge arises because the optimal weight matrix structure in popular neural network models is obscure in most cases and may vary from layer to layer even in the same network. Prior structured matrices proposed for efficient DNNs were mostly hand-crafted without a generalized framework to systematically learn them. To address this issue, we propose a generalized and differentiable framework to learn efficient structures of weight matrices by gradient descent. We first define a new class of structured matrices that covers a wide range of structured matrices in the literature by adjusting the structural parameters. Then, the frequency-domain differentiable parameterization scheme based on the Gaussian-Dirichlet kernel is adopted to learn the structural parameters by proximal gradient descent. On the image and language tasks, our method learns efficient DNNs with structured matrices, achieving lower complexity and/or higher performance than prior approaches that employ low-rank, block-sparse, or block-low-rank matrices. \ No newline at end of file diff --git a/data/2024/iclr/Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach b/data/2024/iclr/Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach new file mode 100644 index 0000000000..59b7d4898e --- /dev/null +++ b/data/2024/iclr/Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach @@ -0,0 +1 @@ +Differentially Private Stochastic Gradient Descent with Gradient Clipping (DPSGD-GC) is a powerful tool for training deep learning models using sensitive data, providing both a solid theoretical privacy guarantee and high efficiency. However, using DPSGD-GC to ensure Differential Privacy (DP) comes at the cost of model performance degradation due to DP noise injection and gradient clipping. Existing research has extensively analyzed the theoretical convergence of DPSGD-GC, and has shown that it only converges when using large clipping thresholds that are dependent on problem-specific parameters. Unfortunately, these parameters are often unknown in practice, making it hard to choose the optimal clipping threshold. Therefore, in practice, DPSGD-GC suffers from degraded performance due to the {\it constant} bias introduced by the clipping. In our work, we propose a new error-feedback (EF) DP algorithm as an alternative to DPSGD-GC, which not only offers a diminishing utility bound without inducing a constant clipping bias, but more importantly, it allows for an arbitrary choice of clipping threshold that is independent of the problem. We establish an algorithm-specific DP analysis for our proposed algorithm, providing privacy guarantees based on R{\'e}nyi DP. Additionally, we demonstrate that under mild conditions, our algorithm can achieve nearly the same utility bound as DPSGD without gradient clipping. 
Our empirical results on CIFAR-10/100 and E2E datasets show that the proposed algorithm achieves higher accuracies than DPSGD while maintaining the same level of DP guarantee. \ No newline at end of file diff --git a/data/2024/iclr/Differentially Private Synthetic Data via Foundation Model APIs 1: Images b/data/2024/iclr/Differentially Private Synthetic Data via Foundation Model APIs 1: Images new file mode 100644 index 0000000000..6f9d72f642 --- /dev/null +++ b/data/2024/iclr/Differentially Private Synthetic Data via Foundation Model APIs 1: Images @@ -0,0 +1 @@ +Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy, as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models, which are only accessible via their inference APIs. However, this comes with greater challenges due to strictly more restrictive model access and the need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID $\leq$ 7.9 with privacy cost $\epsilon$ = 0.67, significantly improving over the previous SOTA of $\epsilon$ = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images. The code and data are released at https://github.com/microsoft/DPSDA. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization b/data/2024/iclr/Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization new file mode 100644 index 0000000000..a158698b3b --- /dev/null +++ b/data/2024/iclr/Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization @@ -0,0 +1 @@ +We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment, since entire trajectories must be used and the learning signal is present only at the terminal time. In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments, via parameterizing an additional "flow function". Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals.
Through various challenging experiments, we demonstrate that DGFS achieves more accurate estimates of the normalization constant than closely-related prior methods. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion Model for Dense Matching b/data/2024/iclr/Diffusion Model for Dense Matching new file mode 100644 index 0000000000..7989a8289a --- /dev/null +++ b/data/2024/iclr/Diffusion Model for Dense Matching @@ -0,0 +1 @@ +The objective for establishing dense correspondence between paired images consists of two terms: a data term and a prior term. While conventional techniques focused on defining hand-designed prior terms, which are difficult to formulate, recent approaches have focused on learning the data term with deep neural networks without explicitly modeling the prior, assuming that the model itself has the capacity to learn an optimal prior from a large-scale dataset. The performance improvement was substantial; however, these approaches often fail to address inherent ambiguities of matching, such as textureless regions, repetitive patterns, and large displacements. To address this, we propose DiffMatch, a novel conditional diffusion-based framework designed to explicitly model both the data and prior terms. Unlike previous approaches, this is accomplished by leveraging a conditional denoising diffusion model. DiffMatch consists of two main components: a conditional denoising diffusion module and a cost injection module. We stabilize the training process and reduce memory usage with a stage-wise training strategy. Furthermore, to boost performance, we introduce an inference technique that finds a better path to the accurate matching field. Our experimental results demonstrate significant performance improvements of our method over existing approaches, and the ablation studies validate our design choices along with the effectiveness of each component. Project page is available at https://ku-cvlab.github.io/DiffMatch/. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion Models for Multi-Task Generative Modeling b/data/2024/iclr/Diffusion Models for Multi-Task Generative Modeling new file mode 100644 index 0000000000..c0061f5400 --- /dev/null +++ b/data/2024/iclr/Diffusion Models for Multi-Task Generative Modeling @@ -0,0 +1 @@ +Diffusion-based generative modeling has been achieving state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to modeling a single generation task. Can we generalize diffusion models with multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space. We define the forward diffusion process to be driven by an information aggregation from multiple types of task-data, e.g., images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-modal data with a multi-task loss, which is derived from a new multi-modal variational lower bound that generalizes the standard diffusion model. We propose several multimodal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling.
Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-modal generative modeling, which we believe is an important research direction worthy of more future explorations. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion Posterior Sampling for Linear Inverse Problem Solving: A Filtering Perspective b/data/2024/iclr/Diffusion Posterior Sampling for Linear Inverse Problem Solving: A Filtering Perspective new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Diffusion Sampling with Momentum for Mitigating Divergence Artifacts b/data/2024/iclr/Diffusion Sampling with Momentum for Mitigating Divergence Artifacts new file mode 100644 index 0000000000..43abdf1952 --- /dev/null +++ b/data/2024/iclr/Diffusion Sampling with Momentum for Mitigating Divergence Artifacts @@ -0,0 +1 @@ +Despite the remarkable success of diffusion models in image generation, slow sampling remains a persistent issue. To accelerate the sampling process, prior studies have reformulated diffusion sampling as an ODE/SDE and introduced higher-order numerical methods. However, these methods often produce divergence artifacts, especially with a low number of sampling steps, which limits the achievable acceleration. In this paper, we investigate the potential causes of these artifacts and suggest that the small stability regions of these methods could be the principal cause. To address this issue, we propose two novel techniques. The first technique involves the incorporation of Heavy Ball (HB) momentum, a well-known technique for improving optimization, into existing diffusion numerical methods to expand their stability regions. We also prove that the resulting methods have first-order convergence. The second technique, called Generalized Heavy Ball (GHVB), constructs a new high-order method that offers a variable trade-off between accuracy and artifact suppression. Experimental results show that our techniques are highly effective in reducing artifacts and improving image quality, surpassing state-of-the-art diffusion solvers on both pixel-based and latent-based diffusion models for low-step sampling. Our research provides novel insights into the design of numerical methods for future diffusion work. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation b/data/2024/iclr/Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation new file mode 100644 index 0000000000..999941ce34 --- /dev/null +++ b/data/2024/iclr/Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation @@ -0,0 +1 @@ +Originating from the diffusion phenomenon in physics that describes particle movement, the diffusion generative models inherit the characteristics of stochastic random walk in the data space along the denoising trajectory. However, the intrinsic mutual interference among image regions contradicts the need for practical downstream application scenarios where the preservation of low-level pixel information from given conditioning is desired (e.g., customization tasks like personalized generation and inpainting based on a user-provided single image). 
In this work, we investigate the diffusion (physics) in diffusion (machine learning) properties and propose our Cyclic One-Way Diffusion (COW) method to control the direction of the diffusion phenomenon given a pre-trained frozen diffusion model for versatile customization application scenarios, where the low-level pixel information from the conditioning needs to be preserved. Notably, unlike most current methods that incorporate additional conditions by fine-tuning the base text-to-image diffusion model or learning auxiliary networks, our method provides a novel perspective to understand the task needs and is applicable to a wider range of customization scenarios in a learning-free manner. Extensive experimental results show that our proposed COW can achieve more flexible customization based on strict visual conditions in different application settings. Project page: https://wangruoyu02.github.io/cow.github.io/. \ No newline at end of file diff --git a/data/2024/iclr/Diffusion-TS: Interpretable Diffusion for General Time Series Generation b/data/2024/iclr/Diffusion-TS: Interpretable Diffusion for General Time Series Generation new file mode 100644 index 0000000000..c714851705 --- /dev/null +++ b/data/2024/iclr/Diffusion-TS: Interpretable Diffusion for General Time Series Generation @@ -0,0 +1 @@ +Denoising diffusion probabilistic models (DDPMs) are becoming the leading paradigm for generative models. They have recently shown breakthroughs in audio synthesis, time series imputation, and forecasting. In this paper, we propose Diffusion-TS, a novel diffusion-based framework that generates multivariate time series samples of high quality by using an encoder-decoder transformer with disentangled temporal representations, in which the decomposition technique guides Diffusion-TS to capture the semantic meaning of time series while transformers mine detailed sequential information from the noisy model input. Different from existing diffusion-based approaches, we train the model to directly reconstruct the sample instead of the noise in each diffusion step, combining a Fourier-based loss term. Diffusion-TS is expected to generate time series satisfying both interpretability and realness. In addition, it is shown that the proposed Diffusion-TS can be easily extended to conditional generation tasks, such as forecasting and imputation, without any model changes. This also motivates us to further explore the performance of Diffusion-TS under irregular settings. Finally, through qualitative and quantitative experiments, results show that Diffusion-TS achieves state-of-the-art results on various realistic analyses of time series. \ No newline at end of file diff --git a/data/2024/iclr/DiffusionNAG: Predictor-guided Neural Architecture Generation with Diffusion Models b/data/2024/iclr/DiffusionNAG: Predictor-guided Neural Architecture Generation with Diffusion Models new file mode 100644 index 0000000000..47c371ba9f --- /dev/null +++ b/data/2024/iclr/DiffusionNAG: Predictor-guided Neural Architecture Generation with Diffusion Models @@ -0,0 +1 @@ +Existing NAS methods suffer from an excessive amount of time spent on repetitive sampling and training of many task-irrelevant architectures. To tackle such limitations of existing NAS methods, we propose a paradigm shift from NAS to a novel conditional Neural Architecture Generation (NAG) framework based on diffusion models, dubbed DiffusionNAG.
Specifically, we consider the neural architectures as directed graphs and propose a graph diffusion model for generating them. Moreover, with the guidance of parameterized predictors, DiffusionNAG can flexibly generate task-optimal architectures with the desired properties for diverse tasks, by sampling from a region that is more likely to satisfy the properties. This conditional NAG scheme is significantly more efficient than previous NAS schemes which sample the architectures and filter them using the property predictors. We validate the effectiveness of DiffusionNAG through extensive experiments in two predictor-based NAS scenarios: Transferable NAS and Bayesian Optimization (BO)-based NAS. DiffusionNAG achieves superior performance with speedups of up to 35 times when compared to the baselines on Transferable NAS benchmarks. Furthermore, when integrated into a BO-based algorithm, DiffusionNAG outperforms existing BO-based NAS approaches, particularly in the large MobileNetV3 search space on the ImageNet 1K dataset. Code is available at https://github.com/CownowAn/DiffusionNAG. \ No newline at end of file diff --git a/data/2024/iclr/DiffusionSat: A Generative Foundation Model for Satellite Imagery b/data/2024/iclr/DiffusionSat: A Generative Foundation Model for Satellite Imagery new file mode 100644 index 0000000000..8987d2b3ad --- /dev/null +++ b/data/2024/iclr/DiffusionSat: A Generative Foundation Model for Satellite Imagery @@ -0,0 +1 @@ +Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets. As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, superresolution given multi-spectral inputs and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale generative foundation model for satellite imagery. The project website can be found here: https://samar-khanna.github.io/DiffusionSat/ \ No newline at end of file diff --git a/data/2024/iclr/Directly Fine-Tuning Diffusion Models on Differentiable Rewards b/data/2024/iclr/Directly Fine-Tuning Diffusion Models on Differentiable Rewards new file mode 100644 index 0000000000..2071975659 --- /dev/null +++ b/data/2024/iclr/Directly Fine-Tuning Diffusion Models on Differentiable Rewards @@ -0,0 +1 @@ +We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. 
We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms. \ No newline at end of file diff --git a/data/2024/iclr/Dirichlet-based Per-Sample Weighting by Transition Matrix for Noisy Label Learning b/data/2024/iclr/Dirichlet-based Per-Sample Weighting by Transition Matrix for Noisy Label Learning new file mode 100644 index 0000000000..b12b99abee --- /dev/null +++ b/data/2024/iclr/Dirichlet-based Per-Sample Weighting by Transition Matrix for Noisy Label Learning @@ -0,0 +1 @@ +For learning with noisy labels, the transition matrix, which explicitly models the relation between noisy label distribution and clean label distribution, has been utilized to achieve the statistical consistency of either the classifier or the risk. Previous research has focused more on how to estimate this transition matrix well, rather than how to utilize it. We propose that good utilization of the transition matrix is crucial and suggest a new utilization method based on resampling, coined RENT. Specifically, we first demonstrate that current utilizations can have potential limitations for implementation. As an extension to Reweighting, we suggest the Dirichlet distribution-based per-sample Weight Sampling (DWS) framework, and compare reweighting and resampling under the DWS framework. With the analyses from DWS, we propose RENT, a REsampling method with Noise Transition matrix. Empirically, RENT consistently outperforms existing transition matrix utilization methods, including reweighting, on various benchmark datasets. Our code is available at \url{https://github.com/BaeHeeSun/RENT}. \ No newline at end of file diff --git a/data/2024/iclr/Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search b/data/2024/iclr/Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search new file mode 100644 index 0000000000..a46f60e310 --- /dev/null +++ b/data/2024/iclr/Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search @@ -0,0 +1 @@ +Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly. Common failures include: (i) natural-looking text prompts generating images with the wrong content, or (ii) different random samples of the latent variables that generate vastly different, and even unrelated, outputs despite being conditioned on the same text prompt. In this work, we aim to study and understand the failure modes of TDMs in more detail. To achieve this, we propose SAGE, the first adversarial search method on TDMs that systematically explores the discrete prompt space and the high-dimensional latent space, to automatically discover undesirable behaviors and failure cases in image generation.
We use image classifiers as surrogate loss functions during searching, and employ human inspections to validate the identified failures. For the first time, our method enables efficient exploration of both the discrete and intricate human language space and the challenging latent space, overcoming the gradient vanishing problem. Then, we demonstrate the effectiveness of SAGE on five widely used generative models and reveal four typical failure modes: (1) We find a variety of natural text prompts that generate images failing to capture the semantics of input texts. We further discuss the underlying causes and potential solutions based on the results. (2) We find regions in the latent space that lead to distorted images independent of the text prompt, suggesting that parts of the latent space are not well-structured. (3) We also find latent samples that result in natural-looking images unrelated to the text prompt, implying a possible misalignment between the latent and prompt spaces. (4) By appending a single adversarial token embedding to any input prompts, we can generate a variety of specified target objects. Project page: https://sage-diffusion.github.io/ \ No newline at end of file diff --git a/data/2024/iclr/Discovering Temporally-Aware Reinforcement Learning Algorithms b/data/2024/iclr/Discovering Temporally-Aware Reinforcement Learning Algorithms new file mode 100644 index 0000000000..6567164a6b --- /dev/null +++ b/data/2024/iclr/Discovering Temporally-Aware Reinforcement Learning Algorithms @@ -0,0 +1 @@ +Recent advancements in meta-learning have enabled the automatic discovery of novel reinforcement learning algorithms parameterized by surrogate objective functions. To improve upon manually designed algorithms, the parameterization of this learned objective function must be expressive enough to represent novel principles of learning (instead of merely recovering already established ones) while still generalizing to a wide range of settings outside of its meta-training distribution. However, existing methods focus on discovering objective functions that, like many widely used objective functions in reinforcement learning, do not take into account the total number of steps allowed for training, or"training horizon". In contrast, humans use a plethora of different learning objectives across the course of acquiring a new ability. For instance, students may alter their studying techniques based on the proximity to exam deadlines and their self-assessed capabilities. This paper contends that ignoring the optimization time horizon significantly restricts the expressive potential of discovered learning algorithms. We propose a simple augmentation to two existing objective discovery approaches that allows the discovered algorithm to dynamically update its objective function throughout the agent's training procedure, resulting in expressive schedules and increased generalization across different training horizons. In the process, we find that commonly used meta-gradient approaches fail to discover such adaptive objective functions while evolution strategies discover highly dynamic learning rules. We demonstrate the effectiveness of our approach on a wide range of tasks and analyze the resulting learned algorithms, which we find effectively balance exploration and exploitation by modifying the structure of their learning rules throughout the agent's lifetime. 
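One way to picture the proposed augmentation is an objective function that receives the normalized training progress as an extra input, so the learning rule itself can change over the agent's lifetime. The sketch below is a hypothetical, simplified parameterization (a small MLP over log-probability ratio, advantage, and progress), not the discovered objectives from the paper, whose parameters would be tuned by an outer evolution-strategies loop.

```python
import torch
import torch.nn as nn

class TemporallyAwareObjective(nn.Module):
    # Meta-parameterized surrogate objective that also sees training progress.
    # Inputs per sample: policy log-prob ratio, advantage, and progress = step / total_steps.
    # The progress input is what lets the objective behave differently early vs. late in training.
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, log_ratio, advantage, progress):
        feats = torch.stack([log_ratio, advantage, progress.expand_as(advantage)], dim=-1)
        return self.net(feats).mean()     # scalar loss for the inner-loop policy update

# Inner loop (sketch): the agent minimizes the learned objective at each update,
# while an outer loop tunes the objective's weights against final returns.
objective = TemporallyAwareObjective()
log_ratio = torch.zeros(64)               # placeholder policy log-prob ratios
advantage = torch.randn(64)               # placeholder advantage estimates
progress = torch.tensor(0.25)             # 25% of the training horizon used
loss = objective(log_ratio, advantage, progress)
```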
\ No newline at end of file diff --git a/data/2024/iclr/Discovering modular solutions that generalize compositionally b/data/2024/iclr/Discovering modular solutions that generalize compositionally new file mode 100644 index 0000000000..8a4e6c374f --- /dev/null +++ b/data/2024/iclr/Discovering modular solutions that generalize compositionally @@ -0,0 +1 @@ +Many complex tasks can be decomposed into simpler, independent parts. Discovering such underlying compositional structure has the potential to enable compositional generalization. Despite progress, our most powerful systems struggle to compose flexibly. It therefore seems natural to make models more modular to help capture the compositional nature of many tasks. However, it is unclear under which circumstances modular systems can discover hidden compositional structure. To shed light on this question, we study a teacher-student setting with a modular teacher where we have full control over the composition of ground truth modules. This allows us to relate the problem of compositional generalization to that of identification of the underlying modules. In particular we study modularity in hypernetworks representing a general class of multiplicative interactions. We show theoretically that identification up to linear transformation purely from demonstrations is possible without having to learn an exponential number of module combinations. We further demonstrate empirically that under the theoretically identified conditions, meta-learning from finite data can discover modular policies that generalize compositionally in a number of complex environments. \ No newline at end of file diff --git a/data/2024/iclr/DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation b/data/2024/iclr/DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation new file mode 100644 index 0000000000..f15776ab1d --- /dev/null +++ b/data/2024/iclr/DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation @@ -0,0 +1 @@ +Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to the failure of subject-driven text-to-image generation as follows: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, resulting in the generated images heavily dependent on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding can not be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle the problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. 
We further design novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Additionally, by combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability. \ No newline at end of file diff --git a/data/2024/iclr/Disentangling Time Series Representations via Contrastive Independence-of-Support on l-Variational Inference b/data/2024/iclr/Disentangling Time Series Representations via Contrastive Independence-of-Support on l-Variational Inference new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI b/data/2024/iclr/Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI new file mode 100644 index 0000000000..aef6212a3c --- /dev/null +++ b/data/2024/iclr/Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI @@ -0,0 +1 @@ +Characterizing samples that are difficult to learn from is crucial to developing highly performant ML models. This has led to numerous Hardness Characterization Methods (HCMs) that aim to identify "hard" samples. However, there is a lack of consensus regarding the definition and evaluation of "hardness". Unfortunately, current HCMs have only been evaluated on specific types of hardness and often only qualitatively or with respect to downstream performance, overlooking the fundamental quantitative identification task. We address this gap by presenting a fine-grained taxonomy of hardness types. Additionally, we propose the Hardness Characterization Analysis Toolkit (H-CAT), which supports comprehensive and quantitative benchmarking of HCMs across the hardness taxonomy and can easily be extended to new HCMs, hardness types, and datasets. We use H-CAT to evaluate 13 different HCMs across 8 hardness types. This comprehensive evaluation, encompassing over 14K setups, uncovers strengths and weaknesses of different HCMs, leading to practical tips to guide HCM selection and future development. Our findings highlight the need for more comprehensive HCM evaluation, while we hope our hardness taxonomy and toolkit will advance the principled evaluation and uptake of data-centric AI methods. \ No newline at end of file diff --git a/data/2024/iclr/Dissecting learning and forgetting in language model finetuning b/data/2024/iclr/Dissecting learning and forgetting in language model finetuning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DistillSpec: Improving Speculative Decoding via Knowledge Distillation b/data/2024/iclr/DistillSpec: Improving Speculative Decoding via Knowledge Distillation new file mode 100644 index 0000000000..bf00c4744b --- /dev/null +++ b/data/2024/iclr/DistillSpec: Improving Speculative Decoding via Knowledge Distillation @@ -0,0 +1 @@ +Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in text generated according to the target model's distribution.
However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improving the draft and target alignment: utilizing on-policy data generation from the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation. \ No newline at end of file diff --git a/data/2024/iclr/Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF b/data/2024/iclr/Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF new file mode 100644 index 0000000000..3874ac8277 --- /dev/null +++ b/data/2024/iclr/Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF @@ -0,0 +1 @@ +In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. 
Our code and data are available at https://github.com/cassidylaidlaw/hidden-context \ No newline at end of file diff --git a/data/2024/iclr/Distributionally Robust Optimization with Bias and Variance Reduction b/data/2024/iclr/Distributionally Robust Optimization with Bias and Variance Reduction new file mode 100644 index 0000000000..023434fdc0 --- /dev/null +++ b/data/2024/iclr/Distributionally Robust Optimization with Bias and Variance Reduction @@ -0,0 +1 @@ +We consider the distributionally robust optimization (DRO) problem with a spectral risk-based uncertainty set and $f$-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and average top-$k$ loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3$\times$ faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains. \ No newline at end of file diff --git a/data/2024/iclr/DittoGym: Learning to Control Soft Shape-Shifting Robots b/data/2024/iclr/DittoGym: Learning to Control Soft Shape-Shifting Robots new file mode 100644 index 0000000000..a629a550c9 --- /dev/null +++ b/data/2024/iclr/DittoGym: Learning to Control Soft Shape-Shifting Robots @@ -0,0 +1 @@ +Robot co-design, where the morphology of a robot is optimized jointly with a learned policy to solve a specific task, is an emerging area of research. It holds particular promise for soft robots, which are amenable to novel manufacturing techniques that can realize learned morphologies and actuators. Inspired by nature and recent novel robot designs, we propose to go a step further and explore novel reconfigurable robots, defined as robots that can change their morphology within their lifetime. We formalize control of reconfigurable soft robots as a high-dimensional reinforcement learning (RL) problem. We unify morphology change, locomotion, and environment interaction in the same action space, and introduce an appropriate, coarse-to-fine curriculum that enables us to discover policies that accomplish fine-grained control of the resulting robots. We also introduce DittoGym, a comprehensive RL benchmark for reconfigurable soft robots that require fine-grained morphology changes to accomplish their tasks. Finally, we evaluate our proposed coarse-to-fine algorithm on DittoGym and demonstrate robots that learn to change their morphology several times within a sequence, uniquely enabled by our RL algorithm. More results are available at https://dittogym.github.io.
\ No newline at end of file diff --git a/data/2024/iclr/Diverse Projection Ensembles for Distributional Reinforcement Learning b/data/2024/iclr/Diverse Projection Ensembles for Distributional Reinforcement Learning new file mode 100644 index 0000000000..20fc758bca --- /dev/null +++ b/data/2024/iclr/Diverse Projection Ensembles for Distributional Reinforcement Learning @@ -0,0 +1 @@ +In contrast to classical reinforcement learning, distributional reinforcement learning algorithms aim to learn the distribution of returns rather than their expected value. Since the nature of the return distribution is generally unknown a priori or arbitrarily complex, a common approach finds approximations within a set of representable, parametric distributions. Typically, this involves a projection of the unconstrained distribution onto the set of simplified distributions. We argue that this projection step entails a strong inductive bias when coupled with neural networks and gradient descent, thereby profoundly impacting the generalization behavior of learned models. In order to facilitate reliable uncertainty estimation through diversity, this work studies the combination of several different projections and representations in a distributional ensemble. We establish theoretical properties of such projection ensembles and derive an algorithm that uses ensemble disagreement, measured by the average $1$-Wasserstein distance, as a bonus for deep exploration. We evaluate our algorithm on the Behavior Suite benchmark and find that diverse projection ensembles lead to significant performance improvements over existing methods on a wide variety of tasks, with the most pronounced gains in directed exploration problems. \ No newline at end of file diff --git a/data/2024/iclr/Divide and not forget: Ensemble of selectively trained experts in Continual Learning b/data/2024/iclr/Divide and not forget: Ensemble of selectively trained experts in Continual Learning new file mode 100644 index 0000000000..d902e297cb --- /dev/null +++ b/data/2024/iclr/Divide and not forget: Ensemble of selectively trained experts in Continual Learning @@ -0,0 +1 @@ +Class-incremental learning is becoming more popular as it helps models widen their applicability while not forgetting what they already know. A trend in this area is to use a mixture-of-experts technique, where different models work together to solve the task. However, the experts are usually trained all at once using whole task data, which makes them all prone to forgetting and increases the computational burden. To address this limitation, we introduce a novel approach named SEED. SEED selects only one expert, the optimal one for a considered task, and uses data from this task to fine-tune only this expert. For this purpose, each expert represents each class with a Gaussian distribution, and the optimal expert is selected based on the similarity of those distributions. Consequently, SEED increases diversity and heterogeneity within the experts while maintaining the high stability of this ensemble method. Extensive experiments demonstrate that SEED achieves state-of-the-art performance in exemplar-free settings across various scenarios, showing the potential of expert diversification through data in continual learning. \ No newline at end of file diff --git a/data/2024/iclr/Do Generated Data Always Help Contrastive Learning? b/data/2024/iclr/Do Generated Data Always Help Contrastive Learning?
new file mode 100644 index 0000000000..642b4c70c6 --- /dev/null +++ b/data/2024/iclr/Do Generated Data Always Help Contrastive Learning? @@ -0,0 +1 @@ +Contrastive Learning (CL) has emerged as one of the most successful paradigms for unsupervised visual representation learning, yet it often depends on intensive manual data augmentations. With the rise of generative models, especially diffusion models, the ability to generate realistic images close to the real data distribution has been well recognized. These generated high-quality images have been successfully applied to enhance contrastive representation learning, a technique termed ``data inflation''. However, we find that the generated data (even from a good diffusion model like DDPM) may sometimes even harm contrastive learning. We investigate the causes behind this failure from the perspective of both data inflation and data augmentation. For the first time, we reveal the complementary roles of the two: stronger data inflation should be accompanied by weaker augmentations, and vice versa. We also provide rigorous theoretical explanations for these phenomena by deriving generalization bounds under data inflation. Drawing from these insights, we propose Adaptive Inflation (AdaInf), a purely data-centric strategy without introducing any extra computation cost. On benchmark datasets, AdaInf can bring significant improvements for various contrastive learning methods. Notably, without using external data, AdaInf obtains 94.70% linear accuracy on CIFAR-10 with SimCLR, setting a new record that surpasses many sophisticated methods. Code is available at https://github.com/PKU-ML/adainf. \ No newline at end of file diff --git a/data/2024/iclr/DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models b/data/2024/iclr/DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models new file mode 100644 index 0000000000..4f1023eb5d --- /dev/null +++ b/data/2024/iclr/DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models @@ -0,0 +1 @@ +Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge or additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLM has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves truthfulness across multiple-choice tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17% absolute points, demonstrating its potential in making LLMs reliably generate truthful facts. \ No newline at end of file diff --git a/data/2024/iclr/Does CLIP's generalization performance mainly stem from high train-test similarity? b/data/2024/iclr/Does CLIP's generalization performance mainly stem from high train-test similarity?
new file mode 100644 index 0000000000..d437952881 --- /dev/null +++ b/data/2024/iclr/Does CLIP's generalization performance mainly stem from high train-test similarity? @@ -0,0 +1 @@ +Foundation models like CLIP are trained on hundreds of millions of samples and effortlessly generalize to new tasks and inputs. Out of the box, CLIP shows stellar zero-shot and few-shot capabilities on a wide range of out-of-distribution (OOD) benchmarks, which prior works attribute mainly to today's large and comprehensive training dataset (like LAION). However, it is questionable how meaningful terms like out-of-distribution generalization are for CLIP as it seems likely that web-scale datasets like LAION simply contain many samples that are similar to common OOD benchmarks originally designed for ImageNet. To test this hypothesis, we retrain CLIP on pruned LAION splits that replicate ImageNet's train-test similarity with respect to common OOD benchmarks. While we observe a performance drop on some benchmarks, surprisingly, CLIP's overall performance remains high. This shows that high train-test similarity is insufficient to explain CLIP's OOD performance, and other properties of the training data must drive CLIP to learn more generalizable representations. Additionally, by pruning data points that are dissimilar to the OOD benchmarks, we uncover a 100M split of LAION ($\frac{1}{4}$th of its original size) on which CLIP can be trained to match its original OOD performance. \ No newline at end of file diff --git a/data/2024/iclr/Does Progress On Object Recognition Benchmarks Improve Generalization on Crowdsourced, Global Data? b/data/2024/iclr/Does Progress On Object Recognition Benchmarks Improve Generalization on Crowdsourced, Global Data? new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Does Writing with Language Models Reduce Content Diversity? b/data/2024/iclr/Does Writing with Language Models Reduce Content Diversity? new file mode 100644 index 0000000000..bc41635b69 --- /dev/null +++ b/data/2024/iclr/Does Writing with Language Models Reduce Content Diversity? @@ -0,0 +1 @@ +Large language models (LLMs) have led to a surge in collaborative writing with model assistance. As different users incorporate suggestions from the same model, there is a risk of decreased diversity in the produced content, potentially limiting diverse perspectives in public discourse. In this work, we measure the impact of co-writing on diversity via a controlled experiment, where users write argumentative essays in three setups -- using a base LLM (GPT3), a feedback-tuned LLM (InstructGPT), and writing without model help. We develop a set of diversity metrics and find that writing with InstructGPT (but not the GPT3) results in a statistically significant reduction in diversity. Specifically, it increases the similarity between the writings of different authors and reduces the overall lexical and content diversity. We additionally find that this effect is mainly attributable to InstructGPT contributing less diverse text to co-written essays. In contrast, the user-contributed text remains unaffected by model collaboration. This suggests that the recent improvement in generation quality from adapting models to human feedback might come at the cost of more homogeneous and less diverse content. 
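The essay-diversity abstract above refers to a set of diversity metrics without defining them in this summary. As a rough illustration only, and not the paper's exact metrics, corpus-level lexical diversity and cross-author homogeneity can be approximated with distinct-n and mean pairwise n-gram overlap:

```python
# Illustrative sketch: simple lexical-diversity and homogeneity measures for a
# set of essays. These are plausible stand-ins, not the metrics used in the paper.
from itertools import combinations

def ngrams(text: str, n: int = 2) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def distinct_n(essays: list[str], n: int = 2) -> float:
    """Fraction of unique n-grams across the corpus (higher = more diverse)."""
    unique, total = set(), 0
    for e in essays:
        toks = e.lower().split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        unique.update(grams)
        total += len(grams)
    return len(unique) / max(total, 1)

def mean_pairwise_jaccard(essays: list[str], n: int = 2) -> float:
    """Average n-gram overlap between author pairs (higher = more homogeneous)."""
    sims = [len(ngrams(a, n) & ngrams(b, n)) / max(len(ngrams(a, n) | ngrams(b, n)), 1)
            for a, b in combinations(essays, 2)]
    return sum(sims) / max(len(sims), 1)
```

A reduction in content diversity of the kind described would show up as lower distinct-n and higher mean pairwise overlap for essays co-written with the feedback-tuned model than for essays written without model help.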
\ No newline at end of file diff --git a/data/2024/iclr/Domain Randomization via Entropy Maximization b/data/2024/iclr/Domain Randomization via Entropy Maximization new file mode 100644 index 0000000000..2f883d2109 --- /dev/null +++ b/data/2024/iclr/Domain Randomization via Entropy Maximization @@ -0,0 +1 @@ +Varying dynamics parameters in simulation is a popular Domain Randomization (DR) approach for overcoming the reality gap in Reinforcement Learning (RL). Nevertheless, DR heavily hinges on the choice of the sampling distribution of the dynamics parameters, since high variability is crucial to regularize the agent's behavior but notoriously leads to overly conservative policies when randomizing excessively. In this paper, we propose a novel approach to address sim-to-real transfer, which automatically shapes dynamics distributions during training in simulation without requiring real-world data. We introduce DOmain RAndomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution while retaining generalization capabilities. In achieving this, DORAEMON gradually increases the diversity of sampled dynamics parameters as long as the probability of success of the current policy is sufficiently high. We empirically validate the consistent benefits of DORAEMON in obtaining highly adaptive and generalizable policies, i.e. solving the task at hand across the widest range of dynamics parameters, as opposed to representative baselines from the DR literature. Notably, we also demonstrate the Sim2Real applicability of DORAEMON through its successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters. \ No newline at end of file diff --git a/data/2024/iclr/Domain constraints improve risk prediction when outcome data is missing b/data/2024/iclr/Domain constraints improve risk prediction when outcome data is missing new file mode 100644 index 0000000000..539958e20d --- /dev/null +++ b/data/2024/iclr/Domain constraints improve risk prediction when outcome data is missing @@ -0,0 +1 @@ +Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that historical decision-making determines whether the outcome is observed: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. 
Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings. \ No newline at end of file diff --git a/data/2024/iclr/Domain-Agnostic Molecular Generation with Chemical Feedback b/data/2024/iclr/Domain-Agnostic Molecular Generation with Chemical Feedback new file mode 100644 index 0000000000..25fe0122e7 --- /dev/null +++ b/data/2024/iclr/Domain-Agnostic Molecular Generation with Chemical Feedback @@ -0,0 +1 @@ +The generation of molecules with desired properties has become increasingly popular, revolutionizing the way scientists design molecular structures and providing valuable support for chemical and drug design. However, despite the potential of language models in molecule generation, they face challenges such as generating syntactically or chemically flawed molecules, having narrow domain focus, and struggling to create diverse and feasible molecules due to limited annotated data or external molecular databases. To tackle these challenges, we introduce MolGen, a pre-trained molecular language model tailored specifically for molecule generation. Through the reconstruction of over 100 million molecular SELFIES, MolGen internalizes structural and grammatical insights. This is further enhanced by domain-agnostic molecular prefix tuning, fostering robust knowledge transfer across diverse domains. Importantly, our chemical feedback paradigm steers the model away from molecular hallucinations, ensuring alignment between the model's estimated probabilities and real-world chemical preferences. Extensive experiments on well-known benchmarks underscore MolGen's optimization capabilities in properties such as penalized logP, QED, and molecular docking. Additional analyses confirm its proficiency in accurately capturing molecule distributions, discerning intricate structural patterns, and efficiently exploring the chemical space. Code is available at https://github.com/zjunlp/MolGen. \ No newline at end of file diff --git a/data/2024/iclr/Domain-Inspired Sharpness-Aware Minimization Under Domain Shifts b/data/2024/iclr/Domain-Inspired Sharpness-Aware Minimization Under Domain Shifts new file mode 100644 index 0000000000..b637825e6b --- /dev/null +++ b/data/2024/iclr/Domain-Inspired Sharpness-Aware Minimization Under Domain Shifts @@ -0,0 +1 @@ +This paper presents a Domain-Inspired Sharpness-Aware Minimization (DISAM) algorithm for optimization under domain shifts. It is motivated by the inconsistent convergence degree of SAM across different domains, which induces optimization bias towards certain domains and thus impairs the overall convergence. To address this issue, we consider the domain-level convergence consistency in the sharpness estimation to prevent the overwhelming (deficient) perturbations for less (well) optimized domains. Specifically, DISAM introduces the constraint of minimizing variance in the domain loss, which allows the elastic gradient calibration in perturbation generation: when one domain is optimized above the averaging level \textit{w.r.t.} loss, the gradient perturbation towards that domain will be weakened automatically, and vice versa. Under this mechanism, we theoretically show that DISAM can achieve faster overall convergence and improved generalization in principle when inconsistent convergence emerges. Extensive experiments on various domain generalization benchmarks show the superiority of DISAM over a range of state-of-the-art methods. 
Furthermore, we show the superior efficiency of DISAM in parameter-efficient fine-tuning combined with pretrained models. The source code is released at https://github.com/MediaBrain-SJTU/DISAM. \ No newline at end of file diff --git a/data/2024/iclr/Don't Judge by the Look: Towards Motion Coherent Video Representation b/data/2024/iclr/Don't Judge by the Look: Towards Motion Coherent Video Representation new file mode 100644 index 0000000000..b56acb72c0 --- /dev/null +++ b/data/2024/iclr/Don't Judge by the Look: Towards Motion Coherent Video Representation @@ -0,0 +1 @@ +Current training pipelines in object recognition neglect Hue Jittering during data augmentation, as it not only brings appearance changes that are detrimental to classification but is also inefficient to implement in practice. In this study, we investigate the effect of hue variance in the context of video understanding and find this variance to be beneficial, since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video understanding, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances. Concretely, we propose an operation, SwapMix, to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, forcing the model to learn appearance-invariant representations. Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the application of VA in other augmentation methods. Code is available at https://github.com/BeSpontaneous/MCA-pytorch. \ No newline at end of file diff --git a/data/2024/iclr/Don't Play Favorites: Minority Guidance for Diffusion Models b/data/2024/iclr/Don't Play Favorites: Minority Guidance for Diffusion Models new file mode 100644 index 0000000000..de437035b9 --- /dev/null +++ b/data/2024/iclr/Don't Play Favorites: Minority Guidance for Diffusion Models @@ -0,0 +1 @@ +We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating a sufficient number of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making them ineffective and time-consuming for the minority-generation task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first highlight that Tweedie's denoising formula yields favorable results for majority samples. This observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating high-quality minority samples over existing generative samplers.
We showcase that the performance benefit of our framework persists even in demanding real-world scenarios such as medical imaging, further underscoring the practical significance of our work. Code is available at https://github.com/soobin-um/minority-guidance. \ No newline at end of file diff --git a/data/2024/iclr/Don't Trust: Verify - Grounding LLM Quantitative Reasoning with Autoformalization b/data/2024/iclr/Don't Trust: Verify - Grounding LLM Quantitative Reasoning with Autoformalization new file mode 100644 index 0000000000..83f1b2463a --- /dev/null +++ b/data/2024/iclr/Don't Trust: Verify - Grounding LLM Quantitative Reasoning with Autoformalization @@ -0,0 +1 @@ +Large language models (LLMs), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of an LLM contains sufficiently many examples of formal mathematics (e.g., in Isabelle, a formal theorem-proving environment), it can be prompted to translate, i.e., autoformalize, informal mathematical statements into formal Isabelle code, which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on the GSM8K, MATH, and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting, the previous best method for identifying correct answers, outperforming it by more than 12% on GSM8K. In our experiments, it improves results consistently across all datasets and LLM model sizes. The code can be found at https://github.com/jinpz/dtv. \ No newline at end of file diff --git a/data/2024/iclr/Doubly Robust Instance-Reweighted Adversarial Training b/data/2024/iclr/Doubly Robust Instance-Reweighted Adversarial Training new file mode 100644 index 0000000000..5023909f5c --- /dev/null +++ b/data/2024/iclr/Doubly Robust Instance-Reweighted Adversarial Training @@ -0,0 +1 @@ +Assigning importance weights to adversarial data has achieved great success in training adversarially robust networks under limited model capacity. However, existing instance-reweighted adversarial training (AT) methods heavily depend on heuristics and/or geometric interpretations to determine those importance weights, making these algorithms lack rigorous theoretical justification or guarantees. Moreover, recent research has shown that adversarial training suffers from a severe non-uniform robust performance across the training distribution, e.g., data points belonging to some classes can be much more vulnerable to adversarial attacks than others. To address both issues, in this paper, we propose a novel doubly-robust instance reweighted AT framework, which obtains the importance weights by exploring distributionally robust optimization (DRO) techniques, and at the same time boosts the robustness on the most vulnerable examples. In particular, our importance weights are obtained by optimizing a KL-divergence-regularized loss function, which allows us to devise new algorithms with a theoretical convergence guarantee.
Experiments on standard classification datasets demonstrate that our proposed approach outperforms related state-of-the-art baseline methods in terms of average robust performance, and at the same time improves the robustness against attacks on the weakest data points. Code will be available soon. \ No newline at end of file diff --git a/data/2024/iclr/Doubly Robust Proximal Causal Learning for Continuous Treatments b/data/2024/iclr/Doubly Robust Proximal Causal Learning for Continuous Treatments new file mode 100644 index 0000000000..ee1964b2e7 --- /dev/null +++ b/data/2024/iclr/Doubly Robust Proximal Causal Learning for Continuous Treatments @@ -0,0 +1 @@ +Proximal causal learning is a promising framework for identifying the causal effect in the presence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can handle continuous treatments well. Exploiting its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications. \ No newline at end of file diff --git a/data/2024/iclr/DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization b/data/2024/iclr/DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization new file mode 100644 index 0000000000..238b82cd94 --- /dev/null +++ b/data/2024/iclr/DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization @@ -0,0 +1 @@ +Visual reinforcement learning (RL) has shown promise in continuous control tasks. Despite its progress, current algorithms are still unsatisfactory in virtually every aspect of performance, such as sample efficiency, asymptotic performance, and robustness to the choice of random seeds. In this paper, we identify a major shortcoming of existing visual RL methods: agents often exhibit sustained inactivity during early training, thereby limiting their ability to explore effectively. Expanding upon this crucial observation, we additionally unveil a significant correlation between the agents' inclination towards motorically inactive exploration and the absence of neuronal activity within their policy networks. To quantify this inactivity, we adopt the dormant ratio as a metric to measure inactivity in the RL agent's network. Empirically, we also recognize that the dormant ratio can act as a standalone indicator of an agent's activity level, regardless of the received reward signals. Leveraging the aforementioned insights, we introduce DrM, a method that uses three core mechanisms to guide agents' exploration-exploitation trade-offs by actively minimizing the dormant ratio.
Experiments demonstrate that DrM achieves significant improvements in sample efficiency and asymptotic performance with no broken seeds (76 seeds in total) across three continuous control benchmark environments, including DeepMind Control Suite, MetaWorld, and Adroit. Most importantly, DrM is the first model-free algorithm that consistently solves tasks in both the Dog and Manipulator domains from the DeepMind Control Suite as well as three dexterous hand manipulation tasks without demonstrations in Adroit, all based on pixel observations. \ No newline at end of file diff --git a/data/2024/iclr/DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks b/data/2024/iclr/DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks new file mode 100644 index 0000000000..1142bd984f --- /dev/null +++ b/data/2024/iclr/DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks @@ -0,0 +1 @@ +The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demand substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and demonstrations if given. The learned rewards can be \textit{reused} in unseen tasks, thus reducing the human effort for reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered rewards on some tasks. See our project page (https://sites.google.com/view/iclr24drs) for more details. \ No newline at end of file diff --git a/data/2024/iclr/DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models b/data/2024/iclr/DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models new file mode 100644 index 0000000000..2221f4bdb8 --- /dev/null +++ b/data/2024/iclr/DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models @@ -0,0 +1 @@ +Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It can transform the editing signals into gradients via feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, we also build a multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention is added to maintain the consistency between the original image and the editing result. Our method, through an efficient design, achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging. It is worth noting that all editing and content preservation signals come from the image itself, and the model does not require fine-tuning or additional modules. 
Our source code will be available at https://github.com/MC-E/DragonDiffusion. \ No newline at end of file diff --git a/data/2024/iclr/DreamClean: Restoring Clean Image Using Deep Diffusion Prior b/data/2024/iclr/DreamClean: Restoring Clean Image Using Deep Diffusion Prior new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior b/data/2024/iclr/DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior new file mode 100644 index 0000000000..0641986785 --- /dev/null +++ b/data/2024/iclr/DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior @@ -0,0 +1 @@ +We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes geometry consistency but compromises texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, DreamBooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation. Code available at https://github.com/deepseek-ai/DreamCraft3D. \ No newline at end of file diff --git a/data/2024/iclr/DreamFlow: High-quality text-to-3D generation by Approximating Probability Flow b/data/2024/iclr/DreamFlow: High-quality text-to-3D generation by Approximating Probability Flow new file mode 100644 index 0000000000..7db74b4996 --- /dev/null +++ b/data/2024/iclr/DreamFlow: High-quality text-to-3D generation by Approximating Probability Flow @@ -0,0 +1 @@ +Recent progress in text-to-3D generation has been achieved through the utilization of score distillation methods: they make use of pre-trained text-to-image (T2I) diffusion models by distilling via the diffusion model training objective. However, such an approach inevitably results in the use of random timesteps at each update, which increases the variance of the gradient and ultimately prolongs the optimization process. In this paper, we propose to enhance the text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule. To this end, we interpret text-to-3D optimization as a multi-view image-to-image translation problem, and propose a solution by approximating the probability flow.
By leveraging the proposed novel optimization algorithm, we design DreamFlow, a practical three-stage coarse-to-fine text-to-3D optimization framework that enables fast generation of high-quality and high-resolution (i.e., 1024x1024) 3D content. For example, we demonstrate that DreamFlow is 5 times faster than the existing state-of-the-art text-to-3D method, while producing more photorealistic 3D content. Visit our project page (https://kyungmnlee.github.io/dreamflow.github.io/) for visualizations. \ No newline at end of file diff --git a/data/2024/iclr/DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation b/data/2024/iclr/DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation new file mode 100644 index 0000000000..8430e49f89 --- /dev/null +++ b/data/2024/iclr/DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation @@ -0,0 +1 @@ +Recent advances in 3D content creation mostly leverage optimization-based 3D generation via score distillation sampling (SDS). Though promising results have been exhibited, these methods often suffer from slow per-sample optimization, limiting their practical usage. In this paper, we propose DreamGaussian, a novel 3D content generation framework that achieves both efficiency and quality simultaneously. Our key insight is to design a generative 3D Gaussian Splatting model with accompanying mesh extraction and texture refinement in UV space. In contrast to the occupancy pruning used in Neural Radiance Fields, we demonstrate that the progressive densification of 3D Gaussians converges significantly faster for 3D generative tasks. To further enhance the texture quality and facilitate downstream applications, we introduce an efficient algorithm to convert 3D Gaussians into textured meshes and apply a fine-tuning stage to refine the details. Extensive experiments demonstrate the superior efficiency and competitive generation quality of our proposed approach. Notably, DreamGaussian produces high-quality textured meshes in just 2 minutes from a single-view image, achieving approximately 10 times acceleration compared to existing methods. \ No newline at end of file diff --git a/data/2024/iclr/DreamLLM: Synergistic Multimodal Comprehension and Creation b/data/2024/iclr/DreamLLM: Synergistic Multimodal Comprehension and Creation new file mode 100644 index 0000000000..16d682dbbe --- /dev/null +++ b/data/2024/iclr/DreamLLM: Synergistic Multimodal Comprehension and Creation @@ -0,0 +1 @@ +This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with the frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content.
Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, benefiting from the enhanced learning synergy. Project page: https://dreamllm.github.io. \ No newline at end of file diff --git a/data/2024/iclr/DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing b/data/2024/iclr/DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing new file mode 100644 index 0000000000..d333ab8c98 --- /dev/null +++ b/data/2024/iclr/DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing @@ -0,0 +1 @@ +Model-based reinforcement learning (MBRL) has gained much attention for its ability to learn complex behaviors in a sample-efficient way: planning actions by generating imaginary trajectories with predicted rewards. Despite its success, we find that, surprisingly, reward prediction is often a bottleneck of MBRL, especially for sparse rewards that are challenging (or even ambiguous) to predict. Motivated by the intuition that humans can learn from rough reward estimates, we propose a simple yet effective reward smoothing approach, DreamSmooth, which learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep. We empirically show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks both in sample efficiency and final performance, without losing performance on common benchmarks such as the DeepMind Control Suite and Atari. \ No newline at end of file diff --git a/data/2024/iclr/DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation b/data/2024/iclr/DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation new file mode 100644 index 0000000000..fe36cc7f23 --- /dev/null +++ b/data/2024/iclr/DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation @@ -0,0 +1 @@ +Text-to-image diffusion models pre-trained on billions of image-text pairs have recently enabled 3D content creation by optimizing a randomly initialized differentiable 3D representation with score distillation. However, the optimization process suffers from slow convergence and the resultant 3D models often exhibit two limitations: (a) quality concerns such as missing attributes and distorted shape and texture; (b) extremely low diversity compared to text-guided image synthesis. In this paper, we show that the conflict between the 3D optimization process and uniform timestep sampling in score distillation is the main reason for these limitations. To resolve this conflict, we propose to prioritize timestep sampling with monotonically non-increasing functions, which aligns the 3D optimization process with the sampling process of the diffusion model. Extensive experiments show that our simple redesign significantly improves 3D content creation with faster convergence, better quality, and greater diversity.
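To illustrate the scheduling idea described in the DreamTime abstract, the snippet below is a sketch under assumptions; the paper's exact schedule and any timestep weighting are not reproduced here. It contrasts the standard uniform timestep draw used in score distillation with a monotonically non-increasing schedule over the course of 3D optimization:

```python
# Sketch only: replace uniform timestep sampling in score distillation with a
# monotonically non-increasing timestep schedule (linear annealing assumed here).
import random

def uniform_timestep(num_train_timesteps: int = 1000) -> int:
    # Standard SDS: draw a diffusion timestep uniformly at every update.
    return random.randint(1, num_train_timesteps - 1)

def annealed_timestep(step: int, total_steps: int,
                      t_max: int = 980, t_min: int = 20) -> int:
    # Non-increasing schedule: early updates use large (noisy) timesteps,
    # later updates use small timesteps.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return round(t_max - progress * (t_max - t_min))
```

Early iterations then use large, noisy timesteps that shape coarse structure, while later iterations use small timesteps that refine detail, mirroring the order of the diffusion sampling process.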
\ No newline at end of file diff --git a/data/2024/iclr/Dropout Enhanced Bilevel Training b/data/2024/iclr/Dropout Enhanced Bilevel Training new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation b/data/2024/iclr/Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation new file mode 100644 index 0000000000..8bf03cf24c --- /dev/null +++ b/data/2024/iclr/Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation @@ -0,0 +1 @@ +Predictive multiplicity refers to the phenomenon in which classification tasks may admit multiple competing models that achieve almost-equally-optimal performance, yet generate conflicting outputs for individual samples. This presents significant concerns, as it can potentially result in systemic exclusion, inexplicable discrimination, and unfairness in practical applications. Measuring and mitigating predictive multiplicity, however, is computationally challenging due to the need to explore all such almost-equally-optimal models, known as the Rashomon set, in potentially huge hypothesis spaces. To address this challenge, we propose a novel framework that utilizes dropout techniques for exploring models in the Rashomon set. We provide rigorous theoretical derivations to connect the dropout parameters to properties of the Rashomon set, and empirically evaluate our framework through extensive experimentation. Numerical results show that our technique consistently outperforms baselines in terms of the effectiveness of predictive multiplicity metric estimation, with runtime speedup up to $20\times \sim 5000\times$. With efficient Rashomon set exploration and metric estimation, mitigation of predictive multiplicity is then achieved through dropout ensemble and model selection. \ No newline at end of file diff --git a/data/2024/iclr/Dual Associated Encoder for Face Restoration b/data/2024/iclr/Dual Associated Encoder for Face Restoration new file mode 100644 index 0000000000..7098ee0b68 --- /dev/null +++ b/data/2024/iclr/Dual Associated Encoder for Face Restoration @@ -0,0 +1 @@ +Restoring facial details from low-quality (LQ) images has remained a challenging problem due to its ill-posedness induced by various degradations in the wild. The existing codebook prior mitigates the ill-posedness by leveraging an autoencoder and learned codebook of high-quality (HQ) features, achieving remarkable quality. However, existing approaches in this paradigm frequently depend on a single encoder pre-trained on HQ data for restoring HQ images, disregarding the domain gap between LQ and HQ images. As a result, the encoding of LQ inputs may be insufficient, resulting in suboptimal performance. To tackle this problem, we propose a novel dual-branch framework named DAEFR. Our method introduces an auxiliary LQ branch that extracts crucial information from the LQ inputs. Additionally, we incorporate association training to promote effective synergy between the two branches, enhancing code prediction and output quality. We evaluate the effectiveness of DAEFR on both synthetic and real-world datasets, demonstrating its superior performance in restoring facial details. 
Project page: https://liagm.github.io/DAEFR/ \ No newline at end of file diff --git a/data/2024/iclr/Dual RL: Unification and New Methods for Reinforcement and Imitation Learning b/data/2024/iclr/Dual RL: Unification and New Methods for Reinforcement and Imitation Learning new file mode 100644 index 0000000000..d31da54de0 --- /dev/null +++ b/data/2024/iclr/Dual RL: Unification and New Methods for Reinforcement and Imitation Learning @@ -0,0 +1 @@ +The goal of reinforcement learning (RL) is to find a policy that maximizes the expected cumulative return. It has been shown that this objective can be represented as an optimization problem over the state-action visitation distribution under linear constraints. The dual problem of this formulation, which we refer to as dual RL, is unconstrained and easier to optimize. In this work, we first cast several state-of-the-art offline RL and offline imitation learning (IL) algorithms as instances of dual RL approaches with shared structures. Such unification allows us to identify the root cause of the shortcomings of prior methods. For offline IL, our analysis shows that prior methods are based on a restrictive coverage assumption that greatly limits their performance in practice. To fix this limitation, we propose a new discriminator-free method, ReCOIL, that learns to imitate from arbitrary off-policy data to obtain near-expert performance. For offline RL, our analysis frames a recent offline RL method, XQL, in the dual framework, and we further propose a new method, f-DVL, that provides alternatives to the Gumbel regression loss and fixes the known training instability issue of XQL. The performance improvements from both of our proposed methods, ReCOIL and f-DVL, in IL and RL are validated on an extensive suite of simulated robot locomotion and manipulation tasks. Project code and details can be found at https://hari-sikchi.github.io/dual-rl. \ No newline at end of file diff --git a/data/2024/iclr/Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment b/data/2024/iclr/Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment new file mode 100644 index 0000000000..83e88f46fe --- /dev/null +++ b/data/2024/iclr/Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment @@ -0,0 +1 @@ +We introduce a novel task within the field of 3D dance generation, termed dance accompaniment, which necessitates the generation of responsive movements from a dance partner, the "follower", synchronized with the lead dancer's movements and the underlying musical rhythm. Unlike existing solo or group dance generation tasks, a duet dance scenario entails a heightened degree of interaction between the two participants, requiring delicate coordination in both pose and position. To support this task, we first build a large-scale and diverse duet interactive dance dataset, DD100, by recording about 117 minutes of professional dancers' performances. To address the challenges inherent in this task, we propose a GPT-based model, Duolando, which autoregressively predicts the subsequent tokenized motion conditioned on the coordinated information of the music and the leader's and follower's movements. To further enhance the GPT's capability to generate stable results under unseen conditions (music and leader motions), we devise an off-policy reinforcement learning strategy that allows the model to explore viable trajectories from out-of-distribution samples, guided by human-defined rewards.
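The linear-constraint formulation behind the Dual RL abstract above can be sketched with the standard regularized visitation problem and its Lagrangian dual; the f-divergence regularizer toward an offline distribution d^O and the notation below are assumptions in the spirit of this line of work, not necessarily the paper's exact statement.

% Primal: optimize the state-action visitation d under the Bellman flow constraints,
% regularized by an f-divergence toward an offline distribution d^O.
\begin{aligned}
\max_{d \ge 0}\quad & \mathbb{E}_{(s,a)\sim d}\big[r(s,a)\big] \;-\; \alpha\, D_f\big(d \,\|\, d^{O}\big) \\
\text{s.t.}\quad & \sum_{a} d(s,a) \;=\; (1-\gamma)\,\rho_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s .
\end{aligned}

% Dual: Lagrangian duality removes the constraints, leaving an unconstrained
% problem over a value function V, where f^* is the convex conjugate of f.
\min_{V}\quad (1-\gamma)\,\mathbb{E}_{s\sim\rho_0}\big[V(s)\big]
\;+\; \alpha\, \mathbb{E}_{(s,a)\sim d^{O}}\Big[ f^{*}\Big( \tfrac{r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}[V(s')] - V(s)}{\alpha} \Big) \Big]

As the regularization weight \alpha goes to zero this recovers the classic LP duality for MDPs, whose dual carries the Bellman inequalities as explicit constraints; the regularizer is what makes the dual unconstrained and amenable to stochastic optimization.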
Based on the collected dataset and proposed method, we establish a benchmark with several carefully designed metrics. \ No newline at end of file diff --git a/data/2024/iclr/DyST: Towards Dynamic Neural Scene Representations on Real-World Videos b/data/2024/iclr/DyST: Towards Dynamic Neural Scene Representations on Real-World Videos new file mode 100644 index 0000000000..041aefecfb --- /dev/null +++ b/data/2024/iclr/DyST: Towards Dynamic Neural Scene Representations on Real-World Videos @@ -0,0 +1 @@ +Visual understanding of the world goes beyond the semantics and flat structure of individual images. In this work, we aim to capture both the 3D structure and dynamics of real-world scenes from monocular real-world videos. Our Dynamic Scene Transformer (DyST) model leverages recent work in neural scene representation to learn a latent decomposition of monocular real-world videos into scene content, per-view scene dynamics, and camera pose. This separation is achieved through a novel co-training scheme on monocular videos and our new synthetic dataset DySO. DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene. \ No newline at end of file diff --git a/data/2024/iclr/DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks b/data/2024/iclr/DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks new file mode 100644 index 0000000000..95f0a85ff5 --- /dev/null +++ b/data/2024/iclr/DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks @@ -0,0 +1 @@ +Large language models (LLMs) have achieved remarkable performance on various evaluation benchmarks. However, concerns have been raised about potential data contamination in their vast training corpora. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a general and flexible protocol for dynamic evaluation of LLMs. Based on our framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse on DyVal-generated evaluation samples of different complexities, highlighting the significance of dynamic evaluation. We also analyze the failure cases and results of different prompting methods. Moreover, DyVal-generated samples are not only evaluation sets, but also helpful data for fine-tuning to improve the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on future evaluation research of LLMs. Code is available at https://github.com/microsoft/promptbench. \ No newline at end of file diff --git a/data/2024/iclr/DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization b/data/2024/iclr/DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization new file mode 100644 index 0000000000..c7ccc00f45 --- /dev/null +++ b/data/2024/iclr/DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization @@ -0,0 +1 @@ +Unsupervised learning of object-centric representations in dynamic visual scenes is challenging.
Unlike most previous approaches that learn to decompose 2D images, we present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning in a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers the probability distribution over objects at individual spatial locations. These voxel features evolve over time through a canonical-space deformation function, forming the basis for global representation learning via slot attention. The voxel features and global features are complementary and are both leveraged by a compositional NeRF decoder for volume rendering. DynaVol remarkably outperforms existing approaches for unsupervised dynamic scene decomposition. Once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve: it is possible to freely edit the geometric shapes or manipulate the motion trajectories of the objects. \ No newline at end of file diff --git a/data/2024/iclr/Dynamic Discounted Counterfactual Regret Minimization b/data/2024/iclr/Dynamic Discounted Counterfactual Regret Minimization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2024/iclr/Dynamic Layer Tying for Parameter-Efficient Transformers b/data/2024/iclr/Dynamic Layer Tying for Parameter-Efficient Transformers new file mode 100644 index 0000000000..6e57220c4e --- /dev/null +++ b/data/2024/iclr/Dynamic Layer Tying for Parameter-Efficient Transformers @@ -0,0 +1 @@ +In the pursuit of reducing the number of trainable parameters in deep transformer networks, we employ Reinforcement Learning to dynamically select layers during training and tie them together. Every few iterations, the RL agent is asked whether to train each layer $i$ independently or to copy the weights of a previous layer $j