Merge pull request #36 from ComplexData-MILA/gh-actions/auto-update-publications-1730448151

[Automatic PR] Automatically add papers from authors
Showing 5 changed files with 76 additions and 4 deletions.
@@ -0,0 +1,23 @@
---
title: A Simulation System Towards Solving Societal-Scale Manipulation
venue: ''
names: M. P. Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao,
  Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, Camille Thibault,
  Busra Tugce Gurbuz, Reihaneh Rabbany, J. Godbout, Kellin Pelrine
tags:
- ''
link: https://arxiv.org/abs/2410.13915
author: Zachary Yang
categories: Publications

---

*{{ page.names }}*

**{{ page.venue }}**

{% include display-publication-links.html pub=page %}

## Abstract

The rise of AI-driven manipulation poses significant risks to societal trust and democratic processes. Yet studying these effects in real-world settings at scale is ethically and logistically impractical, highlighting the need for simulation tools that can model these dynamics in controlled settings and enable experimentation with possible defenses. We present a simulation environment designed to address this. We extend the Concordia framework, which simulates offline, 'real life' activity, by adding online social media interactions through the integration of a Mastodon server. We improve simulation efficiency and information flow, and add a set of measurement tools, particularly longitudinal surveys. We demonstrate the simulator with a tailored example in which we track agents' political positions and show how partisan manipulation of agents can affect election results.
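
To make the described setup more concrete, below is a minimal, hypothetical sketch of how LLM agents might be wired to a Mastodon instance and surveyed over time, using the Mastodon.py client. It is an illustrative reading of the abstract, not the authors' implementation; the helper names (`generate_reply`, `ask`) and the local server URL are assumptions.

```python
# Hypothetical sketch: LLM agents interacting through a Mastodon server and a
# longitudinal survey, loosely following the abstract. Not the authors' code.
from mastodon import Mastodon  # pip install Mastodon.py


def make_client(access_token: str, base_url: str = "https://localhost:3000") -> Mastodon:
    """One authenticated client per simulated agent (assumed local test server)."""
    return Mastodon(access_token=access_token, api_base_url=base_url)


def agent_step(client: Mastodon, generate_reply) -> None:
    """Read recent posts and publish a response produced by the agent's LLM."""
    recent = client.timeline_home(limit=5)            # online interactions
    context = [post["content"] for post in recent]
    client.status_post(generate_reply(context))       # post back to the shared server


def survey_political_position(agents, ask) -> list[float]:
    """Longitudinal survey: poll every agent for a numeric position on a -1..1 scale."""
    return [ask(agent, "Where do you stand politically, from -1 (left) to 1 (right)?")
            for agent in agents]
```

In a simulation loop, `agent_step` would run once per agent per step, and `survey_political_position` would be called at fixed intervals to build the longitudinal record the abstract mentions.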
@@ -0,0 +1,23 @@
---
title: 'Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model
  Training'
venue: ''
names: Shahrad Mohammadzadeh, Juan David Guerra, Marco Bonizzato, Reihaneh Rabbany,
  G. Farnadi
tags:
- ''
link: https://arxiv.org/abs/2410.15460
author: Shahrad Mohammadzadeh
categories: Publications

---

*{{ page.names }}*

**{{ page.venue }}**

{% include display-publication-links.html pub=page %}

## Abstract

As large language models (LLMs) become increasingly deployed across various industries, concerns regarding their reliability, particularly due to hallucinations (outputs that are factually inaccurate or irrelevant to user input), have grown. Our research investigates the relationship between the training process and the emergence of hallucinations, addressing a key gap in existing research, which focuses primarily on post hoc detection and mitigation strategies. Using models from the Pythia suite (70M-12B parameters) and several hallucination detection metrics, we analyze hallucination trends throughout training and explore LLM internal dynamics. We introduce SEnsitive Neuron Dropout (SeND), a novel training protocol designed to mitigate hallucinations by reducing variance during training. SeND achieves this by deterministically dropping neurons with significant variability on a dataset, referred to as Sensitive Neurons. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at twice the speed. This efficient metric is integrated into our protocol, allowing SeND to be both computationally scalable and effective at reducing hallucinations. Our empirical evaluation demonstrates that our approach improves LLM reliability at test time by up to 40% compared to normal training, while also providing an efficient method to improve factual accuracy when adapting LLMs to domains such as Wikipedia and medical datasets.
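
As a rough, hedged illustration of the mechanism the abstract describes (estimate which neurons vary most across a dataset, then deterministically drop them), here is a small PyTorch sketch. It is an interpretation of the abstract, not the authors' SeND implementation; `model_layer`, `calib_batches`, and `drop_fraction` are placeholder names.

```python
# Hedged sketch of "sensitive neuron dropout": measure per-neuron activation
# variance on a dataset and zero out the most variable ("sensitive") neurons.
import torch


@torch.no_grad()
def sensitive_neuron_mask(model_layer, calib_batches, drop_fraction=0.01):
    """Return a 0/1 mask over hidden units, zeroing the highest-variance ones."""
    acts = []
    for batch in calib_batches:                  # batch: (B, T, d_in)
        h = model_layer(batch)                   # activations: (B, T, d_hidden)
        acts.append(h.reshape(-1, h.shape[-1]))
    acts = torch.cat(acts, dim=0)                # (N, d_hidden)
    variance = acts.var(dim=0)                   # per-neuron variance across the data
    k = max(1, int(drop_fraction * variance.numel()))
    sensitive = torch.topk(variance, k).indices  # the "sensitive neurons"
    mask = torch.ones_like(variance)
    mask[sensitive] = 0.0
    return mask                                  # multiply into activations during training
```

Under this reading, the mask would be applied to the layer's activations (for example via a forward hook) and recomputed periodically during training, which is where the variance-reduction effect would come from.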
@@ -0,0 +1,22 @@
---
title: 'FairLoRA: Unpacking Bias Mitigation in Vision Models with Fairness-Driven
  Low-Rank Adaptation'
venue: ''
names: Rohan Sukumaran, Aarash Feizi, Adriana Romero-Sorian, G. Farnadi
tags:
- ''
link: https://arxiv.org/abs/2410.17358
author: Aarash Feizi
categories: Publications

---

*{{ page.names }}*

**{{ page.venue }}**

{% include display-publication-links.html pub=page %}

## Abstract

Recent advances in parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), have gained significant attention for their ability to efficiently adapt large foundation models to various downstream tasks. These methods are appreciated for achieving performance comparable to full fine-tuning on aggregate-level metrics while significantly reducing computational costs. To systematically address fairness in LLMs, previous studies fine-tune on fairness-specific data using a larger LoRA rank than is typically used. In this paper, we introduce FairLoRA, a novel fairness-specific regularizer for LoRA aimed at reducing performance disparities across data subgroups by minimizing per-class variance in loss. To the best of our knowledge, we are the first to introduce fairness-based fine-tuning through LoRA. Our results demonstrate that the need for higher ranks to mitigate bias is not universal; it depends on factors such as the pre-trained model, dataset, and task. More importantly, we systematically evaluate FairLoRA across various vision models, including ViT, DiNO, and CLIP, in scenarios involving distribution shifts. We further emphasize the necessity of using multiple fairness metrics to obtain a holistic assessment of fairness, rather than relying solely on the metric optimized during training.
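
For intuition, here is a hedged sketch of what "minimizing per-class variance in loss" could look like as a regularizer added to the task loss. It is an interpretation of the abstract, not the released FairLoRA code; `lam` (the regularization weight) is a placeholder hyperparameter.

```python
# Hedged sketch of a per-class loss-variance regularizer in the spirit the
# abstract describes; not the authors' FairLoRA implementation.
import torch
import torch.nn.functional as F


def fairness_regularized_loss(logits, labels, num_classes, lam=1.0):
    """Cross-entropy plus lam * variance of the per-class mean losses."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")  # shape (B,)
    class_means = [per_sample[labels == c].mean()
                   for c in range(num_classes) if (labels == c).any()]
    class_means = torch.stack(class_means)
    # Penalize spread across classes so no subgroup lags far behind the average.
    return per_sample.mean() + lam * class_means.var(unbiased=False)
```

In a LoRA setting, this loss would presumably be minimized while only the low-rank adapter parameters are trainable (for example, adapters injected with a library such as peft), so the fairness pressure is applied entirely through the adaptation.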