From 7e5b9db672a5808f00f77a3ffa1a2f7421d5b694 Mon Sep 17 00:00:00 2001 From: GitHub Actions Date: Thu, 16 Jan 2025 00:44:36 +0000 Subject: [PATCH] Auto. Make Doomgrad HF Review on 16 January --- d/2025-01-15_zh_reading_task.html | 322 ++++++++ d/2025-01-16.html | 1270 +++++++++++++++++++++++++++++ d/2025-01-16.json | 1112 +++++++++++++++++++++++++ hf_papers.json | 332 ++++---- index.html | 22 +- log.txt | 6 +- logs/2025-01-16_last_log.txt | 170 ++++ m/2025-01.html | 8 +- 8 files changed, 3058 insertions(+), 184 deletions(-) create mode 100644 d/2025-01-15_zh_reading_task.html create mode 100644 d/2025-01-16.html create mode 100644 d/2025-01-16.json create mode 100644 logs/2025-01-16_last_log.txt diff --git a/d/2025-01-15_zh_reading_task.html b/d/2025-01-15_zh_reading_task.html new file mode 100644 index 00000000..2c93630f --- /dev/null +++ b/d/2025-01-15_zh_reading_task.html @@ -0,0 +1,322 @@ + + + + + + + + + + + Chinese reading task about ML + + + +
+

MiniMax-01: Scaling Foundation Models with Lightning Attention

+

1. 我们介绍了 MiniMax-01 系列,包括 MiniMax-Text-01 和 MiniMax-VL-01。

+

2. 这些模型在处理长上下文方面具有卓越能力。

+

3. 核心在于闪电注意力和其高效扩展。

+

4. 我们将其与混合专家模型(MoE)集成,创建了一个具有 32 个专家和 4560 亿总参数的模型。

+

5. 我们开发了优化的并行策略和高效的计算通信重叠技术。

+

6. 这使我们能够在数百亿参数的模型上进行高效训练和推理。

+

7. MiniMax-Text-01 的上下文窗口在训练期间可达到 100 万个标记,并在推理期间扩展到 400 万个标记。

+

8. MiniMax-VL-01 通过使用 5120 亿视觉语言标记进行持续训练。

+

9. 实验表明,我们的模型在标准和内部基准上的性能与 GPT-4o 和 Claude-3.5-Sonnet 相当,同时提供 20-32 倍的上下文窗口。

+

10. 我们在 https://github.com/MiniMax-AI 公开发布了 MiniMax-01。

+
+

1. Wǒmen jièshào le MiniMax-01 xìliè, bāokuò MiniMax-Text-01 hé MiniMax-VL-01

+

2. Zhèxiē móxíng zài chǔlǐ cháng shàngxiàwén fāngmiàn jùyǒu zhuóyuè nénglì

+

3. Héxīn zàiyú shǎndiàn zhùyìlì hé qí gāoxiào kuòzhǎn

+

4. Wǒmen jiāng qí yǔ hùnhé zhuānjiā móxíng (MoE) jíchéng, chuàngjiàn le yīgè jùyǒu 32 gè zhuānjiā hé 4560 yì zǒng cānshù de móxíng

+

5. Wǒmen kāifā le yōuhuà de bìngxíng cèlüè hé gāoxiào de jìsuàn tōngxìn chóngdié jìshù

+

6. Zhè shǐ wǒmen nénggòu zài shùbǎiyì cānshù de móxíng shàng jìnxíng gāoxiào xùnliàn hé tuīlǐ

+

7. MiniMax-Text-01 de shàngxiàwén chuāngkǒu zài xùnliàn qījiān kě dádào 100 wàn gè biāojì, bìng zài tuīlǐ qījiān kuòzhǎn dào 400 wàn gè biāojì

+

8. MiniMax-VL-01 tōngguò shǐyòng 5120 yì shìjué yǔyán biāojì jìnxíng chíxù xùnliàn

+

9. Shíyàn biǎomíng, wǒmen de móxíng zài biāozhǔn hé nèibù jīzhǔn shàng de xìngnéng yǔ GPT-4o hé Claude-3.5-Sonnet xiāngdāng, tóngshí tígōng 20-32 bèi de shàngxiàwén chuāngkǒu

+

10. Wǒmen zài https://github.com/MiniMax-AI gōngkāi fābù le MiniMax-01

+
+

1. We introduced the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01.

+

2. These models have outstanding capabilities in handling long contexts.

+

3. The core lies in lightning attention and its efficient scaling.

+

4. We integrated it with a Mixture of Experts (MoE) model, creating a model with 32 experts and 456 billion total parameters.

+

5. We developed optimized parallel strategies and efficient computation-communication overlap techniques.

+

6. This enables us to perform efficient training and inference on models with hundreds of billions of parameters.

+

7. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and expands to 4 million tokens during inference.

+

8. MiniMax-VL-01 undergoes continued training using 512 billion vision-language tokens.

+

9. Experiments show that our models perform comparably to GPT-4o and Claude-3.5-Sonnet on standard and internal benchmarks while providing a 20-32 times larger context window.

+

10. We have made MiniMax-01 publicly available at https://github.com/MiniMax-AI.

+

Vocabulary

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
WordPinyinTranslation
介绍jiè shàointroduce
系列xì lièseries
模型mó xíngmodel
处理chǔ lǐprocess
上下文shàng xià wéncontext
卓越zhuó yuèoutstanding
能力néng lìability
核心hé xīncore
闪电shǎn diànlightning
注意力zhù yì lìattention
高效gāo xiàoefficient
扩展kuò zhǎnexpand
混合hùn héhybrid
专家zhuān jiāexpert
集成jí chéngintegrate
并行bìng xíngparallel
策略cè lüèstrategy
通信tōng xìncommunication
重叠chóng diéoverlap
技术jì shùtechnology
训练xùn liàntrain
推理tuī lǐinference
窗口chuāng kǒuwindow
标记biāo jìtoken
视觉shì juévisual
语言yǔ yánlanguage
持续chí xùcontinuous
实验shí yànexperiment
性能xìng néngperformance
基准jī zhǔnbenchmark
公开gōng kāipublic
发布fā bùrelease
+
+ + + \ No newline at end of file diff --git a/d/2025-01-16.html b/d/2025-01-16.html new file mode 100644 index 00000000..5ee9ef6d --- /dev/null +++ b/d/2025-01-16.html @@ -0,0 +1,1270 @@ + + + + + + + + HF. 18 papers. January 15. + + + + + + + +
+
+

🔺

hf daily

+

15 января | 18 papers

+
+
+ +
+
+ +
+
+
+ +
+
+ + +
+
+
+
+
+ 🏷️ Фильтр + + + +
+
+
+ + +
+
+
+ 🧹 + +
+
+ +
+
+ + + + + \ No newline at end of file diff --git a/d/2025-01-16.json b/d/2025-01-16.json new file mode 100644 index 00000000..ada2814f --- /dev/null +++ b/d/2025-01-16.json @@ -0,0 +1,1112 @@ +{ + "date": { + "ru": "15 января", + "en": "January 15", + "zh": "1月15日" + }, + "time_utc": "2025-01-15 23:09", + "weekday": 2, + "issue_id": 1692, + "home_page_url": "https://huggingface.co/papers", + "papers": [ + { + "id": "https://huggingface.co/papers/2501.08313", + "title": "MiniMax-01: Scaling Foundation Models with Lightning Attention", + "url": "https://huggingface.co/papers/2501.08313", + "abstract": "We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. 
We publicly release MiniMax-01 at https://github.com/MiniMax-AI.", + "score": 193, + "issue_id": 1672, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "a57d7b1914e7383a", + "authors": [ + "MiniMax", + "Aonian Li", + "Bangwei Gong", + "Bo Yang", + "Boji Shan", + "Chang Liu", + "Cheng Zhu", + "Chunhao Zhang", + "Congchao Guo", + "Da Chen", + "Dong Li", + "Enwei Jiao", + "Gengxin Li", + "Guojun Zhang", + "Haohai Sun", + "Houze Dong", + "Jiadai Zhu", + "Jiaqi Zhuang", + "Jiayuan Song", + "Jin Zhu", + "Jingtao Han", + "Jingyang Li", + "Junbin Xie", + "Junhao Xu", + "Junjie Yan", + "Kaishun Zhang", + "Kecheng Xiao", + "Kexi Kang", + "Le Han", + "Leyang Wang", + "Lianfei Yu", + "Liheng Feng", + "Lin Zheng", + "Linbo Chai", + "Long Xing", + "Meizhi Ju", + "Mingyuan Chi", + "Mozhi Zhang", + "Peikai Huang", + "Pengcheng Niu", + "Pengfei Li", + "Pengyu Zhao", + "Qi Yang", + "Qidi Xu", + "Qiexiang Wang", + "Qin Wang", + "Qiuhui Li", + "Ruitao Leng", + "Shengmin Shi", + "Shuqi Yu", + "Sichen Li", + "Songquan Zhu", + "Tao Huang", + "Tianrun Liang", + "Weigao Sun", + "Weixuan Sun", + "Weiyu Cheng", + "Wenkai Li", + "Xiangjun Song", + "Xiao Su", + "Xiaodong Han", + "Xinjie Zhang", + "Xinzhu Hou", + "Xu Min", + "Xun Zou", + "Xuyang Shen", + "Yan Gong", + "Yingjie Zhu", + "Yipeng Zhou", + "Yiran Zhong", + "Yongyi Hu", + "Yuanxiang Fan", + "Yue Yu", + "Yufeng Yang", + "Yuhao Li", + "Yunan Huang", + "Yunji Li", + "Yunpeng Huang", + "Yunzhi Xu", + "Yuxin Mao", + "Zehan Li", + "Zekang Li", + "Zewei Tao", + "Zewen Ying", + "Zhaoyang Cong", + "Zhen Qin", + "Zhenhua Fan", + "Zhihang Yu", + "Zhuo Jiang", + "Zijia Wu" + ], + "affiliations": [], + "pdf_title_img": "assets/pdf/title_img/2501.08313.jpg", + "data": { + "categories": [ + "#open_source", + "#architecture", + "#optimization", + "#benchmark", + "#long_context", + "#training" + ], + "emoji": "🚀", + "ru": { + "title": "MiniMax-01: Революция в обработке длинных контекстов", + "desc": "Исследователи представили серию моделей MiniMax-01, включая MiniMax-Text-01 и MiniMax-VL-01, которые сравнимы с лучшими моделями, но обладают улучшенными возможностями обработки длинных контекстов. В основе лежит технология lightning attention и ее эффективное масштабирование, интегрированные с Mixture of Experts (MoE). Модель имеет 32 эксперта и 456 миллиардов параметров, из которых 45,9 миллиардов активируются для каждого токена. Контекстное окно MiniMax-Text-01 может достигать 1 миллиона токенов при обучении и экстраполироваться до 4 миллионов токенов при инференсе." + }, + "en": { + "title": "Unleashing Long Contexts with MiniMax-01 Models", + "desc": "The MiniMax-01 series introduces advanced models, MiniMax-Text-01 and MiniMax-VL-01, designed to handle longer contexts effectively. These models utilize lightning attention and a Mixture of Experts (MoE) architecture, featuring 32 experts and a staggering 456 billion parameters, optimizing the activation of 45.9 billion parameters per token. By implementing efficient parallel strategies and computation-communication overlap techniques, the models can train and infer on extensive datasets, reaching context windows of up to 1 million tokens during training and 4 million during inference. Performance evaluations indicate that MiniMax-01 models rival leading models like GPT-4o and Claude-3.5-Sonnet while significantly extending context capabilities." 
+ }, + "zh": { + "title": "MiniMax-01:超长上下文处理的新纪元", + "desc": "我们介绍了MiniMax-01系列,包括MiniMax-Text-01和MiniMax-VL-01,这些模型在处理更长的上下文时具有优越的能力。核心技术是闪电注意力和高效的扩展能力。为了最大化计算能力,我们将其与专家混合模型(MoE)结合,创建了一个拥有32个专家和4560亿参数的模型。我们的实验表明,这些模型在标准和内部基准测试中表现出色,能够与最先进的模型相媲美,同时提供20到32倍更长的上下文窗口。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08332", + "title": "MangaNinja: Line Art Colorization with Precise Reference Following", + "url": "https://huggingface.co/papers/2501.08332", + "abstract": "Derived from diffusion models, MangaNinjia specializes in the task of reference-guided line art colorization. We incorporate two thoughtful designs to ensure precise character detail transcription, including a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. We further showcase the potential of the proposed interactive point control in handling challenging cases, cross-character colorization, multi-reference harmonization, beyond the reach of existing algorithms.", + "score": 31, + "issue_id": 1673, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "20ea6b75639e2ced", + "authors": [ + "Zhiheng Liu", + "Ka Leong Cheng", + "Xi Chen", + "Jie Xiao", + "Hao Ouyang", + "Kai Zhu", + "Yu Liu", + "Yujun Shen", + "Qifeng Chen", + "Ping Luo" + ], + "affiliations": [ + "Ant Group", + "HKU", + "HKUST", + "Tongyi Lab" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08332.jpg", + "data": { + "categories": [ + "#cv", + "#diffusion", + "#benchmark" + ], + "emoji": "🎨", + "ru": { + "title": "Прецизионное раскрашивание манги с помощью ИИ", + "desc": "MangaNinjia - это модель для раскрашивания линейных рисунков манги, основанная на диффузионных моделях. Она использует модуль перемешивания патчей для обучения соответствиям между цветным изображением-образцом и целевым линейным рисунком. Модель также включает схему точечного контроля для точного подбора цветов. Эксперименты показывают превосходство MangaNinjia над существующими решениями в точности раскрашивания." + }, + "en": { + "title": "MangaNinjia: Mastering Line Art Colorization with Precision", + "desc": "MangaNinjia is a model designed for coloring line art by using reference images. It employs a patch shuffling module to help the model learn how to match colors from the reference image to the target line art accurately. Additionally, it features a point-driven control scheme that allows for detailed color adjustments, ensuring that colors are applied precisely. Our experiments show that MangaNinjia outperforms existing methods in colorization tasks, especially in complex scenarios involving multiple references and different characters." + }, + "zh": { + "title": "MangaNinjia:精准上色的新方法", + "desc": "MangaNinjia 是一种基于扩散模型的参考引导线条艺术上色技术。我们设计了两个模块来确保角色细节的准确转录,包括补丁洗牌模块和点驱动控制方案,以实现精细的颜色匹配。实验结果表明,我们的模型在精确上色方面优于现有解决方案。我们还展示了所提议的交互式点控制在处理复杂案例和多参考协调方面的潜力,超越了现有算法的能力。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.06751", + "title": "Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models", + "url": "https://huggingface.co/papers/2501.06751", + "abstract": "Text-to-image (T2I) diffusion models rely on encoded prompts to guide the image generation process. 
Typically, these prompts are extended to a fixed length by adding padding tokens before text encoding. Despite being a default practice, the influence of padding tokens on the image generation process has not been investigated. In this work, we conduct the first in-depth analysis of the role padding tokens play in T2I models. We develop two causal techniques to analyze how information is encoded in the representation of tokens across different components of the T2I pipeline. Using these techniques, we investigate when and how padding tokens impact the image generation process. Our findings reveal three distinct scenarios: padding tokens may affect the model's output during text encoding, during the diffusion process, or be effectively ignored. Moreover, we identify key relationships between these scenarios and the model's architecture (cross or self-attention) and its training process (frozen or trained text encoder). These insights contribute to a deeper understanding of the mechanisms of padding tokens, potentially informing future model design and training practices in T2I systems.", + "score": 27, + "issue_id": 1677, + "pub_date": "2025-01-12", + "pub_date_card": { + "ru": "12 января", + "en": "January 12", + "zh": "1月12日" + }, + "hash": "05733e8e82e23568", + "authors": [ + "Michael Toker", + "Ido Galil", + "Hadas Orgad", + "Rinon Gal", + "Yoad Tewel", + "Gal Chechik", + "Yonatan Belinkov" + ], + "affiliations": [ + "Bar-Ilan University", + "NVIDIA", + "Technion Israel Institute of Technology" + ], + "pdf_title_img": "assets/pdf/title_img/2501.06751.jpg", + "data": { + "categories": [ + "#cv", + "#architecture", + "#interpretability", + "#diffusion", + "#training" + ], + "emoji": "🧩", + "ru": { + "title": "Раскрытие тайн токенов заполнения в генерации изображений", + "desc": "Исследователи провели первый глубокий анализ роли токенов заполнения в моделях преобразования текста в изображение (T2I). Они разработали две причинно-следственные техники для изучения того, как информация кодируется в представлении токенов в различных компонентах конвейера T2I. Результаты показали три различных сценария влияния токенов заполнения на процесс генерации изображений. Исследование выявило ключевые взаимосвязи между этими сценариями и архитектурой модели, а также процессом ее обучения." + }, + "en": { + "title": "Unpacking Padding: The Hidden Role in Text-to-Image Models", + "desc": "This paper explores the impact of padding tokens in text-to-image (T2I) diffusion models, which are used to generate images from text prompts. The authors analyze how these padding tokens influence the image generation process at different stages, including text encoding and the diffusion process. They identify three scenarios where padding tokens can either affect the output or be ignored, depending on the model's architecture and training methods. The findings provide valuable insights that could guide future improvements in T2I model design and training practices." + }, + "zh": { + "title": "填充标记在图像生成中的关键作用", + "desc": "本文研究了文本到图像(T2I)扩散模型中填充标记的作用。填充标记通常用于将提示扩展到固定长度,但其对图像生成过程的影响尚未被深入探讨。我们开发了两种因果分析技术,探讨填充标记在T2I模型不同组件中的信息编码方式。研究结果表明,填充标记在文本编码、扩散过程中的影响各不相同,并与模型架构和训练过程存在重要关系。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08316", + "title": "Diffusion Adversarial Post-Training for One-Step Video Generation", + "url": "https://huggingface.co/papers/2501.08316", + "abstract": "The diffusion models are widely used for image and video generation, but their iterative generation process is slow and expansive. 
While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve the training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarial post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.", + "score": 19, + "issue_id": 1672, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "4122a780e8356ce7", + "authors": [ + "Shanchuan Lin", + "Xin Xia", + "Yuxi Ren", + "Ceyuan Yang", + "Xuefeng Xiao", + "Lu Jiang" + ], + "affiliations": [ + "ByteDance" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08316.jpg", + "data": { + "categories": [ + "#architecture", + "#optimization", + "#video", + "#diffusion", + "#training" + ], + "emoji": "🎬", + "ru": { + "title": "Революция в генерации видео: от итераций к мгновенному результату", + "desc": "Эта статья представляет новый метод под названием Adversarial Post-Training (APT) для одношаговой генерации видео. Авторы предлагают улучшения архитектуры модели и процедур обучения, включая аппроксимированную регуляризацию R1. Их модель Seaweed-APT способна генерировать 2-секундные видео высокого разрешения в реальном времени за один проход. Кроме того, модель может создавать изображения размером 1024px за один шаг, достигая качества, сравнимого с современными методами." + }, + "en": { + "title": "Fast and High-Quality Video Generation with Seaweed-APT", + "desc": "This paper addresses the slow and costly iterative process of generating images and videos using diffusion models. The authors introduce Adversarial Post-Training (APT) to enhance one-step video generation while maintaining high quality. They implement architectural and procedural improvements, including an approximated R1 regularization, to stabilize training. Their model, Seaweed-APT, successfully generates high-quality 2-second videos and 1024px images in real time with a single forward evaluation step." + }, + "zh": { + "title": "对抗后训练:快速高质量视频生成的新方法", + "desc": "扩散模型广泛应用于图像和视频生成,但其迭代生成过程较慢且成本高昂。现有的蒸馏方法在图像领域展示了单步生成的潜力,但仍存在显著的质量下降。本文提出了一种针对真实数据的对抗后训练(APT)方法,以实现单步视频生成。我们的实验表明,经过对抗后训练的模型Seaweed-APT能够实时生成1280x720、24fps的2秒视频,并且在单步生成1024px图像时,其质量可与最先进的方法相媲美。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08187", + "title": "A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following", + "url": "https://huggingface.co/papers/2501.08187", + "abstract": "Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the \"language of cellular biology\", capturing intricate gene expression patterns at the single-cell level. However, interacting with this \"language\" through conventional tools is often inefficient and unintuitive, posing challenges for researchers. 
To address these limitations, we present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. We construct a comprehensive multi-modal instruction dataset that pairs text-based instructions with scRNA-seq profiles from diverse tissues and species. Building on this, we develop a multi-modal cell language architecture capable of simultaneously interpreting and processing both modalities. InstructCell empowers researchers to accomplish critical tasks-such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction-using straightforward natural language commands. Extensive evaluations demonstrate that InstructCell consistently meets or exceeds the performance of existing single-cell foundation models, while adapting to diverse experimental conditions. More importantly, InstructCell provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.", + "score": 19, + "issue_id": 1672, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "de984ce7cc62fa5e", + "authors": [ + "Yin Fang", + "Xinle Deng", + "Kangwei Liu", + "Ningyu Zhang", + "Jingyang Qian", + "Penghui Yang", + "Xiaohui Fan", + "Huajun Chen" + ], + "affiliations": [ + "College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China", + "College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China", + "Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314100, China", + "Innovation Center in Zhejiang University, State Key Laboratory of Component-Based Chinese Medicine, Hangzhou 310058, China", + "School of Software Technology, Zhejiang University, Ningbo 315048, China", + "ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou 311200, China" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08187.jpg", + "data": { + "categories": [ + "#architecture", + "#multimodal", + "#dataset", + "#science", + "#healthcare" + ], + "emoji": "🧬", + "ru": { + "title": "Естественный язык как ключ к расшифровке клеточной биологии", + "desc": "InstructCell - это мультимодальный ИИ-помощник для анализа данных одноклеточного РНК-секвенирования (scRNA-seq). Он использует архитектуру, способную интерпретировать как естественный язык, так и профили экспрессии генов. InstructCell позволяет исследователям выполнять такие задачи, как аннотация типов клеток и предсказание чувствительности к лекарствам, с помощью простых текстовых команд. Модель демонстрирует высокую производительность и адаптивность к различным экспериментальным условиям." + }, + "en": { + "title": "InstructCell: Bridging Language and Biology for Seamless Single-Cell Analysis", + "desc": "This paper introduces InstructCell, an AI tool designed to simplify the analysis of single-cell RNA sequencing (scRNA-seq) data using natural language instructions. By creating a dataset that links text commands with scRNA-seq profiles, InstructCell allows researchers to perform complex tasks like cell type annotation and drug sensitivity prediction more intuitively. The model employs a multi-modal architecture that processes both text and biological data simultaneously, enhancing its usability. 
Evaluations show that InstructCell outperforms existing models, making single-cell analysis more accessible and efficient for researchers in the life sciences." + }, + "zh": { + "title": "用自然语言解锁单细胞数据的潜力", + "desc": "这篇论文介绍了InstructCell,一个多模态的人工智能助手,旨在通过自然语言简化单细胞RNA测序(scRNA-seq)数据的分析。传统工具在处理细胞生物学的复杂数据时效率低下,而InstructCell通过将文本指令与scRNA-seq数据结合,提供了更直接和灵活的分析方式。该系统能够执行细胞类型注释、条件伪细胞生成和药物敏感性预测等关键任务,且使用简单的自然语言命令即可完成。评估结果表明,InstructCell在性能上优于现有的单细胞基础模型,同时适应多种实验条件,降低了技术门槛,促进了生物学的深入理解。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08326", + "title": "Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks", + "url": "https://huggingface.co/papers/2501.08326", + "abstract": "We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.", + "score": 12, + "issue_id": 1678, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "463580cacfaa6789", + "authors": [ + "Miran Heo", + "Min-Hung Chen", + "De-An Huang", + "Sifei Liu", + "Subhashree Radhakrishnan", + "Seon Joo Kim", + "Yu-Chiang Frank Wang", + "Ryo Hachiuma" + ], + "affiliations": [ + "NVIDIA", + "Yonsei University" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08326.jpg", + "data": { + "categories": [ + "#multimodal", + "#reasoning", + "#agi", + "#cv", + "#dataset", + "#video", + "#benchmark" + ], + "emoji": "🎥", + "ru": { + "title": "Omni-RGPT: Новый уровень понимания изображений и видео искусственным интеллектом", + "desc": "Omni-RGPT - это мультимодальная большая языковая модель, разработанная для понимания изображений и видео на уровне регионов. Модель использует технологию Token Mark для выделения целевых регионов в визуальном пространстве признаков. Для улучшения понимания видео без необходимости трекинга объектов введена вспомогательная задача, использующая согласованность токенов. Авторы также представили большой набор данных RegVID-300k для обучения на видео с инструкциями на уровне регионов." + }, + "en": { + "title": "Omni-RGPT: Bridging Visual and Textual Understanding with Token Mark", + "desc": "Omni-RGPT is a multimodal large language model that enhances understanding of specific regions in images and videos. It uses a novel approach called Token Mark, which embeds tokens into visual features to highlight target areas, linking them with text prompts. This model also includes an auxiliary task that ensures consistent token representation across video frames, improving video comprehension. 
With the introduction of the RegVID-300k dataset, Omni-RGPT sets new benchmarks in commonsense reasoning, captioning, and referring expression tasks." + }, + "zh": { + "title": "Omni-RGPT:图像与视频的区域理解新突破", + "desc": "本文介绍了Omni-RGPT,这是一种多模态的大型语言模型,旨在促进图像和视频的区域级理解。为了在时空维度上实现一致的区域表示,我们引入了Token Mark,这是一组突出视觉特征空间中目标区域的标记。通过使用区域提示(如框或掩码),这些标记被直接嵌入到空间区域中,并同时与文本提示结合,以指定目标,从而建立视觉和文本标记之间的直接联系。此外,我们还引入了一个辅助任务,通过利用标记的一致性来指导Token Mark,从而支持稳健的视频理解。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08225", + "title": "FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors", + "url": "https://huggingface.co/papers/2501.08225", + "abstract": "Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models, so necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, so that inherit powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various of editing signals: it domainantly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, \\eg, automatically adjust the reflection of the cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, \\eg, transform the clownfish into shark-like shape. Our code will be available at https://github.com/YBYBZhang/FramePainter.", + "score": 12, + "issue_id": 1673, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "811cfd0f18eb1e53", + "authors": [ + "Yabo Zhang", + "Xinpeng Zhou", + "Yihan Zeng", + "Hang Xu", + "Hui Li", + "Wangmeng Zuo" + ], + "affiliations": [ + "Harbin Institute of Technology", + "Huawei Noahs Ark Lab" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08225.jpg", + "data": { + "categories": [ + "#video", + "#cv", + "#optimization", + "#diffusion" + ], + "emoji": "🎨", + "ru": { + "title": "FramePainter: эффективное редактирование изображений через генерацию видео", + "desc": "Статья представляет FramePainter - новый подход к интерактивному редактированию изображений, основанный на генерации видео. В отличие от существующих методов, использующих модели диффузии текст-изображение, FramePainter опирается на мощные видео-диффузионные модели для обеспечения временной согласованности и снижения затрат на обучение. Метод использует легковесный энкодер для внедрения сигналов редактирования и вводит механизм согласованного внимания для улучшения обработки крупных движений между кадрами. 
FramePainter превосходит современные методы, требуя значительно меньше обучающих данных и демонстрируя высокую обобщающую способность." + }, + "en": { + "title": "Revolutionizing Image Editing with Efficient Video Diffusion", + "desc": "This paper presents FramePainter, a novel approach to interactive image editing that reformulates the task as image-to-video generation. By leveraging video diffusion models, FramePainter reduces the need for extensive training data while ensuring temporal consistency in edited images. It utilizes a lightweight sparse control encoder to effectively incorporate editing signals, and introduces matching attention to improve the handling of large motion between frames. The results demonstrate that FramePainter significantly outperforms existing methods, achieving seamless image edits and showcasing strong generalization capabilities." + }, + "zh": { + "title": "FramePainter:高效的图像编辑新方法", + "desc": "本文提出了一种交互式图像编辑的新方法,称为FramePainter。该方法将图像编辑任务重新定义为图像到视频的生成问题,从而利用强大的视频扩散先验,降低训练成本并确保时间一致性。FramePainter使用轻量级的稀疏控制编码器来注入编辑信号,并通过匹配注意力机制增强了对大运动的处理能力。实验结果表明,FramePainter在各种编辑信号下表现优异,能够实现无缝且连贯的图像编辑,且在未见过的场景中也展现出卓越的泛化能力。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.07730", + "title": "Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens", + "url": "https://huggingface.co/papers/2501.07730", + "abstract": "Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.", + "score": 10, + "issue_id": 1673, + "pub_date": "2025-01-13", + "pub_date_card": { + "ru": "13 января", + "en": "January 13", + "zh": "1月13日" + }, + "hash": "80f40715084c602b", + "authors": [ + "Dongwon Kim", + "Ju He", + "Qihang Yu", + "Chenglin Yang", + "Xiaohui Shen", + "Suha Kwak", + "Liang-Chieh Chen" + ], + "affiliations": [ + "ByteDance Seed", + "POSTECH" + ], + "pdf_title_img": "assets/pdf/title_img/2501.07730.jpg", + "data": { + "categories": [ + "#dataset", + "#data", + "#training", + "#cv", + "#open_source" + ], + "emoji": "🖼️", + "ru": { + "title": "Демократизация генерации изображений с помощью эффективной токенизации и открытых данных", + "desc": "В этой статье представлен новый подход к токенизации изображений для генеративных моделей текст-в-изображение под названием TA-TiTok. 
Данный токенизатор использует одномерные токены и интегрирует текстовую информацию на этапе детокенизации, что ускоряет сходимость и улучшает производительность. На основе TA-TiTok авторы разработали семейство моделей MaskGen, обученных исключительно на открытых данных. Целью работы является демократизация области генеративных моделей текст-в-изображение путем публикации эффективных токенизаторов и моделей с открытыми весами." + }, + "en": { + "title": "Democratizing Text-to-Image Generation with TA-TiTok", + "desc": "This paper presents TA-TiTok, a novel image tokenizer designed for text-to-image generative models, which simplifies the training process and improves performance. Unlike traditional models that require large private datasets, TA-TiTok can effectively utilize open data, making it more accessible for researchers. The tokenizer incorporates textual information during the decoding stage, which helps it learn faster and perform better. Additionally, the authors introduce MaskGen, a family of generative models that leverage TA-TiTok and are trained on publicly available datasets, aiming to democratize access to advanced text-to-image generation technology." + }, + "zh": { + "title": "高效的文本到图像生成模型,推动开放数据的使用", + "desc": "本文介绍了一种新的图像标记器,称为TA-TiTok,它可以有效地处理文本到图像的生成任务。TA-TiTok在解码阶段整合了文本信息,从而加快了模型的收敛速度并提高了性能。与以往的标记器不同,TA-TiTok采用了一种简化的一阶段训练过程,避免了复杂的两阶段蒸馏过程。我们还提出了一系列基于开放数据训练的文本到图像生成模型MaskGen,旨在促进更广泛的访问和民主化。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.05131", + "title": "3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering", + "url": "https://huggingface.co/papers/2501.05131", + "abstract": "The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. 
Project Page: https://limuloo.github.io/3DIS/.", + "score": 9, + "issue_id": 1684, + "pub_date": "2025-01-09", + "pub_date_card": { + "ru": "9 января", + "en": "January 9", + "zh": "1月9日" + }, + "hash": "ca5ad23cb146f3aa", + "authors": [ + "Dewei Zhou", + "Ji Xie", + "Zongxin Yang", + "Yi Yang" + ], + "affiliations": [ + "DBMI, HMS, Harvard University", + "RELER, CCAI, Zhejiang University" + ], + "pdf_title_img": "assets/pdf/title_img/2501.05131.jpg", + "data": { + "categories": [ + "#cv", + "#games", + "#architecture", + "#multimodal", + "#optimization" + ], + "emoji": "🎨", + "ru": { + "title": "3DIS-FLUX: Новый уровень контролируемой генерации мульти-объектных изображений", + "desc": "Статья представляет метод 3DIS-FLUX для управляемой генерации изображений с несколькими объектами. Этот подход разделяет процесс на создание сцены на основе глубины и детализированный рендеринг с использованием предобученных моделей контроля глубины. 3DIS-FLUX интегрирует модель FLUX для улучшенного рендеринга, манипулируя маской внимания в механизме совместного внимания FLUX. Эксперименты показывают, что 3DIS-FLUX превосходит предыдущие методы по производительности и качеству изображений." + }, + "en": { + "title": "Enhancing Text-to-Image Generation with 3DIS-FLUX", + "desc": "This paper introduces a new method called 3DIS-FLUX for improving text-to-image generation by enhancing the multi-instance generation (MIG) process. The 3DIS framework separates the generation into two phases: constructing the scene based on depth and rendering details using pre-trained models. By integrating the FLUX model, the method allows for better control over the rendering of fine details while reducing the need for retraining adapters. Experimental results show that 3DIS-FLUX outperforms previous methods in both performance and image quality, making it a significant advancement in controllable image generation." + }, + "zh": { + "title": "深度驱动解耦实例合成:提升图像生成的可控性与质量", + "desc": "随着对可控文本到图像生成输出的需求增加,多实例生成(MIG)技术得到了显著进展。现有的MIG方法主要基于适配器,但每次新模型发布时都需要重新训练适配器,消耗大量资源。本文提出了一种名为深度驱动解耦实例合成(3DIS)的方法,将MIG分为两个阶段:基于深度的场景构建和细节渲染。通过引入FLUX模型,3DIS-FLUX在细节渲染方面实现了更高的性能和图像质量。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08328", + "title": "PokerBench: Training Large Language Models to become Professional Poker Players", + "url": "https://huggingface.co/papers/2501.08328", + "abstract": "We introduce PokerBench - a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes Poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. 
Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for a quick and reliable evaluation of the poker-playing ability of LLMs as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: https://github.com/pokerllm/pokerbench.", + "score": 9, + "issue_id": 1674, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "7b4dacedffdbfa15", + "authors": [ + "Richard Zhuang", + "Akshat Gupta", + "Richard Yang", + "Aniket Rahane", + "Zhengyu Li", + "Gopala Anumanchipalli" + ], + "affiliations": [ + "Georgia Institute of Technology", + "University of California, Berkeley" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08328.jpg", + "data": { + "categories": [ + "#training", + "#reasoning", + "#games", + "#optimization", + "#benchmark" + ], + "emoji": "🃏", + "ru": { + "title": "PokerBench: новый рубеж для оценки стратегических способностей языковых моделей", + "desc": "PokerBench - это новый бенчмарк для оценки способностей больших языковых моделей (LLM) играть в покер. Он включает 11000 важнейших сценариев игры, разработанных совместно с профессиональными игроками. Авторы оценили производительность современных LLM, таких как GPT-4 и ChatGPT 3.5, обнаружив, что все модели показывают результаты ниже оптимальных. После дообучения модели демонстрируют значительное улучшение, но авторы отмечают ограничения простого обучения с учителем для освоения оптимальной стратегии игры." + }, + "en": { + "title": "PokerBench: Elevating LLMs to Master the Game of Poker", + "desc": "PokerBench is a new benchmark designed to assess the poker-playing skills of large language models (LLMs). It focuses on the unique challenges of poker, which requires a blend of mathematical skills, strategic reasoning, and an understanding of human psychology. The benchmark includes 11,000 scenarios that cover various aspects of the game, and it has been tested on several leading models, revealing that they initially struggle with optimal poker play. However, after fine-tuning, these models show significant improvement, highlighting the need for advanced training techniques to enhance their performance in complex games." + }, + "zh": { + "title": "PokerBench:评估语言模型扑克能力的新基准", + "desc": "我们介绍了PokerBench,这是一个用于评估大型语言模型(LLMs)扑克游戏能力的基准。扑克是一种不完全信息游戏,需要数学、推理、规划、策略以及对博弈论和人类心理的深刻理解。PokerBench包含11,000个重要场景,分为翻牌前和翻牌后游戏,经过训练的扑克玩家共同开发。通过对不同模型的评估,我们发现尽管当前的LLMs在扑克游戏中表现不佳,但经过微调后,它们的表现有显著提升。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08319", + "title": "Enhancing Automated Interpretability with Output-Centric Feature Descriptions", + "url": "https://huggingface.co/papers/2501.08319", + "abstract": "Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model's representation space. 
However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary \"unembedding\" head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be \"dead\".", + "score": 7, + "issue_id": 1677, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "22615e3bb16f93af", + "authors": [ + "Yoav Gur-Arieh", + "Roy Mayan", + "Chen Agassy", + "Atticus Geiger", + "Mor Geva" + ], + "affiliations": [ + "Blavatnik School of Computer Science and AI, Tel Aviv University", + "Pr(Ai)2R Group" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08319.jpg", + "data": { + "categories": [ + "#interpretability", + "#inference", + "#training", + "#data" + ], + "emoji": "🔍", + "ru": { + "title": "Взгляд изнутри: новый метод интерпретации больших языковых моделей", + "desc": "Статья описывает новый подход к автоматической интерпретации нейронных сетей, фокусируясь на выходных данных модели вместо входных. Авторы предлагают эффективные методы для генерации описаний признаков, основанные на токенах с наибольшим весом после стимуляции признака. Эксперименты показывают, что ориентированные на выход описания лучше отражают причинно-следственное влияние признака на результаты модели. Комбинация подходов, ориентированных на вход и выход, дает наилучшие результаты в оценке как входных, так и выходных данных." + }, + "en": { + "title": "Unlocking Feature Interpretability in Language Models", + "desc": "This paper discusses how automated interpretability pipelines can create natural language descriptions for features in large language models (LLMs). It highlights the challenge of identifying inputs that activate these features, which is essential for understanding their role in model behavior. The authors propose new methods that focus on the output effects of features, leading to more accurate descriptions of their causal impact. By combining both input-centric and output-centric approaches, the proposed methods improve the overall interpretability of LLMs and can even identify previously overlooked features." + }, + "zh": { + "title": "以输出为中心的特征描述生成方法", + "desc": "这篇论文讨论了自动化可解释性管道如何为大型语言模型中的特征生成自然语言描述。特征的描述是通过激活特征的输入生成的,但识别这些输入的过程成本高昂。研究表明,现有的描述方法未能有效捕捉特征对输出的因果影响。为此,作者提出了一种以输出为中心的方法,能够更好地生成特征描述,并结合输入和输出的评估来提高性能。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08197", + "title": "OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training", + "url": "https://huggingface.co/papers/2501.08197", + "abstract": "Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. 
For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.", + "score": 5, + "issue_id": 1675, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "27267ae1a569051c", + "authors": [ + "Yijiong Yu", + "Ziyun Dai", + "Zekun Wang", + "Wei Wang", + "Ran Chen", + "Ji Pei" + ], + "affiliations": [ + "OpenCSG", + "Tsinghua University" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08197.jpg", + "data": { + "categories": [ + "#data", + "#open_source", + "#dataset", + "#synthetic", + "#training", + "#low_resource" + ], + "emoji": "🐉", + "ru": { + "title": "Прорыв в обучении китайских языковых моделей: OpenCSG Chinese Corpus", + "desc": "Эта статья представляет OpenCSG Chinese Corpus - набор высококачественных китайских датасетов для предобучения, пост-обучения и тонкой настройки больших языковых моделей (LLM). Корпус включает в себя несколько датасетов, каждый с уникальными характеристиками: от отфильтрованного веб-контента до синтетических учебных данных и разговорных форматов. Авторы подчеркивают высокое качество текста, разнообразие тематик и масштабируемость процесса сбора данных. Эксперименты показали значительное улучшение производительности моделей на различных задачах, включая C-Eval." + }, + "en": { + "title": "Empowering Chinese LLMs with OpenCSG Corpus", + "desc": "This paper introduces the OpenCSG Chinese Corpus, a collection of high-quality datasets aimed at improving the performance of Chinese large language models (LLMs). The corpus includes several datasets, each tailored for different training needs: Fineweb-edu datasets focus on high-quality web content, Cosmopedia-chinese offers synthetic textbook-style data, and Smoltalk-chinese provides diverse chat-format data. The authors highlight the importance of quality pretraining data for LLMs and demonstrate through experiments that using this corpus leads to significant performance gains in various evaluation tasks. Overall, the OpenCSG Chinese Corpus addresses the challenge of limited high-quality datasets for Chinese LLMs, promoting better training outcomes." 
+ }, + "zh": { + "title": "提升中文LLM性能的高质量语料库", + "desc": "大型语言模型(LLMs)在处理自然语言方面表现出色,但其成功依赖于高质量的预训练语料库。针对中文LLMs,优质中文数据集的稀缺性成为了一个重大挑战,限制了它们的性能。为了解决这个问题,我们提出了OpenCSG中文语料库,这是一系列专门为LLM预训练、后训练和微调设计的高质量数据集。该语料库包括Fineweb-edu-chinese、Fineweb-edu-chinese-v2、Cosmopedia-chinese和Smoltalk-chinese,涵盖了多样化的内容和风格,显著提升了中文LLMs的训练效果。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08167", + "title": "Potential and Perils of Large Language Models as Judges of Unstructured Textual Data", + "url": "https://huggingface.co/papers/2501.08167", + "abstract": "Rapid advancements in large language models have unlocked remarkable capabilities when it comes to processing and summarizing unstructured text data. This has implications for the analysis of rich, open-ended datasets, such as survey responses, where LLMs hold the promise of efficiently distilling key themes and sentiments. However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises, can we trust LLMs to accurately represent the perspectives contained within these text based datasets? While LLMs excel at generating human-like summaries, there is a risk that their outputs may inadvertently diverge from the true substance of the original responses. Discrepancies between the LLM-generated outputs and the actual themes present in the data could lead to flawed decision-making, with far-reaching consequences for organizations. This research investigates the effectiveness of LLMs as judge models to evaluate the thematic alignment of summaries generated by other LLMs. We utilized an Anthropic Claude model to generate thematic summaries from open-ended survey responses, with Amazon's Titan Express, Nova Pro, and Meta's Llama serving as LLM judges. The LLM-as-judge approach was compared to human evaluations using Cohen's kappa, Spearman's rho, and Krippendorff's alpha, validating a scalable alternative to traditional human centric evaluation methods. Our findings reveal that while LLMs as judges offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances. This research contributes to the growing body of knowledge on AI assisted text analysis. We discuss limitations and provide recommendations for future research, emphasizing the need for careful consideration when generalizing LLM judge models across various contexts and use cases.", + "score": 5, + "issue_id": 1675, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "866161709624c632", + "authors": [ + "Rewina Bedemariam", + "Natalie Perez", + "Sreyoshi Bhaduri", + "Satya Kapoor", + "Alex Gil", + "Elizabeth Conjar", + "Ikkei Itoku", + "David Theil", + "Aman Chadha", + "Naumaan Nayyar" + ], + "affiliations": [], + "pdf_title_img": "assets/pdf/title_img/2501.08167.jpg", + "data": { + "categories": [ + "#data", + "#dataset", + "#science", + "#ethics", + "#multimodal", + "#benchmark", + "#interpretability" + ], + "emoji": "🤖", + "ru": { + "title": "LLM как судьи: масштабируемая альтернатива человеческим оценкам в анализе текста", + "desc": "Исследование посвящено использованию больших языковых моделей (LLM) для анализа неструктурированных текстовых данных, таких как ответы на опросы. Авторы изучают эффективность применения LLM в качестве судей для оценки тематического соответствия сгенерированных другими LLM резюме. 
Результаты показывают, что LLM-судьи предлагают масштабируемое решение, сопоставимое с оценками людей, хотя люди все еще могут превосходить их в обнаружении тонких, контекстно-зависимых нюансов. Исследование вносит вклад в растущий объем знаний об анализе текста с помощью искусственного интеллекта." + }, + "en": { + "title": "Trusting AI: Evaluating LLMs for Accurate Text Analysis", + "desc": "This paper explores the use of large language models (LLMs) for summarizing and analyzing unstructured text data, particularly from open-ended survey responses. It raises concerns about the trustworthiness of LLM-generated summaries, as they may not accurately reflect the original sentiments and themes present in the data. The research introduces an LLM-as-judge framework, where one LLM generates summaries while others evaluate their thematic alignment, comparing this method to human evaluations. The findings suggest that while LLMs can provide a scalable alternative to human raters, they may struggle with detecting subtle nuances that humans can identify, highlighting the importance of careful application in different contexts." + }, + "zh": { + "title": "信任大型语言模型的总结能力吗?", + "desc": "这篇论文探讨了大型语言模型(LLMs)在处理和总结非结构化文本数据方面的能力,尤其是在分析开放式调查反馈时的应用。研究表明,虽然LLMs能够生成类似人类的总结,但它们的输出可能与原始文本的真实主题存在偏差,这可能导致错误的决策。为了评估LLMs生成的总结与实际主题的一致性,研究使用了LLMs作为评判模型,并与人类评估进行了比较。结果显示,LLMs作为评判者提供了一种可扩展的解决方案,但人类在捕捉细微的上下文特征方面仍然表现更佳。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.07888", + "title": "Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding", + "url": "https://huggingface.co/papers/2501.07888", + "abstract": "We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8\\% over GPT-4o and 5.8\\% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6\\% performance advantage over GPT-4o and +24.9\\% over Gemini-1.5-Pro. 
Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.", + "score": 5, + "issue_id": 1674, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "54780a4b6f93fb10", + "authors": [ + "Liping Yuan", + "Jiawei Wang", + "Haomiao Sun", + "Yuchen Zhang", + "Yuan Lin" + ], + "affiliations": [ + "ByteDance Research" + ], + "pdf_title_img": "assets/pdf/title_img/2501.07888.jpg", + "data": { + "categories": [ + "#dataset", + "#training", + "#cv", + "#hallucinations", + "#optimization", + "#video", + "#benchmark" + ], + "emoji": "🎥", + "ru": { + "title": "Tarsier2: Революция в понимании видео искусственным интеллектом", + "desc": "Tarsier2 - это современная крупномасштабная модель для понимания видео и языка (LVLM), разработанная для создания детальных и точных описаний видео. Модель достигает значительных улучшений благодаря увеличению объема обучающих данных, точной временной синхронизации при тонкой настройке и применению обучения с предпочтениями (DPO). Tarsier2-7B превосходит ведущие проприетарные модели, такие как GPT-4o и Gemini 1.5 Pro, в задачах детального описания видео. Модель также устанавливает новые рекорды в 15 публичных бенчмарках, демонстрируя свою универсальность как надежная модель общего назначения для понимания видео и языка." + }, + "en": { + "title": "Tarsier2: Redefining Video Understanding with Advanced LVLM Technology", + "desc": "Tarsier2 is a cutting-edge large vision-language model (LVLM) that excels in generating precise and detailed descriptions of videos while showcasing advanced video comprehension skills. The model's improvements stem from three main enhancements: increasing the pre-training dataset from 11 million to 40 million video-text pairs, implementing fine-grained temporal alignment during fine-tuning, and utilizing model-based sampling for preference data construction with DPO training for optimization. Extensive testing reveals that Tarsier2-7B surpasses top proprietary models like GPT-4o and Gemini 1.5 Pro in video description tasks, achieving notable F1 score improvements on the DREAM-1K benchmark. Additionally, Tarsier2-7B sets new records across 15 public benchmarks, proving its effectiveness in various tasks such as video question-answering and video grounding." + }, + "zh": { + "title": "Tarsier2:视频描述的新标杆", + "desc": "Tarsier2是一种先进的大型视觉语言模型,专门用于生成详细且准确的视频描述,同时具备出色的视频理解能力。该模型通过三个关键升级实现了显著进步:首先,预训练数据从1100万对视频文本扩展到4000万对,增加了数据的数量和多样性;其次,在监督微调过程中进行精细的时间对齐;最后,采用基于模型的采样自动构建偏好数据,并应用DPO训练进行优化。实验结果表明,Tarsier2-7B在视频描述任务中持续超越领先的专有模型,展现出其作为强大通用视觉语言模型的多样性。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08292", + "title": "HALoGEN: Fantastic LLM Hallucinations and Where to Find Them", + "url": "https://huggingface.co/papers/2501.08292", + "abstract": "Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. 
In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.", + "score": 5, + "issue_id": 1673, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "f6751d682ff824ed", + "authors": [ + "Abhilasha Ravichander", + "Shrusti Ghela", + "David Wadden", + "Yejin Choi" + ], + "affiliations": [ + "Google", + "NVIDIA", + "University of Washington" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08292.jpg", + "data": { + "categories": [ + "#dataset", + "#hallucinations", + "#benchmark" + ], + "emoji": "🔍", + "ru": { + "title": "HALoGEN: Автоматическая проверка галлюцинаций в языковых моделях", + "desc": "Эта статья представляет HALoGEN - комплексный инструмент для оценки галлюцинаций в больших языковых моделях (LLM). Авторы создали набор из 10,923 промптов в девяти различных областях и автоматические верификаторы высокой точности для проверки генераций LLM. Исследование выявило, что даже лучшие модели страдают от галлюцинаций, иногда до 86% сгенерированных фактов оказываются неверными. Авторы также предложили новую классификацию ошибок LLM, разделив их на три типа в зависимости от источника галлюцинаций." + }, + "en": { + "title": "HALoGEN: A Benchmark for Measuring Hallucinations in Language Models", + "desc": "This paper introduces HALoGEN, a new benchmark designed to measure hallucinations in generative large language models (LLMs). Hallucinations refer to incorrect statements generated by these models that do not align with known facts or the given context. The benchmark includes over 10,000 prompts across various domains and employs automatic verifiers to assess the accuracy of model outputs. The study reveals that even top-performing models exhibit significant hallucinations, prompting a classification system for different types of errors to better understand their origins and improve model reliability." 
+ }, + "zh": { + "title": "揭示生成模型的幻觉问题", + "desc": "尽管生成性大型语言模型(LLMs)能够生成高质量和流畅的文本,但它们也会产生幻觉,即与已知世界知识或输入上下文不一致的陈述。测量幻觉的难度在于,实时验证模型生成的内容既昂贵又耗时。为此,我们推出了HALoGEN,这是一个全面的幻觉基准,包含10,923个跨越九个领域的提示和自动高精度验证器。我们的研究发现,即使是表现最好的模型,其生成的原子事实中也有高达86%可能存在幻觉,这为理解生成模型的幻觉提供了基础。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08284", + "title": "AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages", + "url": "https://huggingface.co/papers/2501.08284", + "abstract": "Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available on https://github.com/AfriHate/AfriHate", + "score": 3, + "issue_id": 1676, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "8c76dd102740009c", + "authors": [ + "Shamsuddeen Hassan Muhammad", + "Idris Abdulmumin", + "Abinew Ali Ayele", + "David Ifeoluwa Adelani", + "Ibrahim Said Ahmad", + "Saminu Mohammad Aliyu", + "Nelson Odhiambo Onyango", + "Lilian D. A. 
Wanzare", + "Samuel Rutunda", + "Lukman Jibril Aliyu", + "Esubalew Alemneh", + "Oumaima Hourrane", + "Hagos Tesfahun Gebremichael", + "Elyas Abdi Ismail", + "Meriem Beloucif", + "Ebrahim Chekol Jibril", + "Andiswa Bukula", + "Rooweither Mabuya", + "Salomey Osei", + "Abigail Oppong", + "Tadesse Destaw Belay", + "Tadesse Kebede Guge", + "Tesfa Tegegne Asfaw", + "Chiamaka Ijeoma Chukwuneke", + "Paul Röttger", + "Seid Muhie Yimam", + "Nedjma Ousidhoum" + ], + "affiliations": [ + "Addis Ababa University", + "Al Akhawayn University", + "Bahir Dar University", + "Bayero University Kano", + "Bocconi University", + "Cardiff University", + "DSFSI, University of Pretoria", + "Digital Umuganda", + "Haramaya University", + "HausaNLP", + "Imperial College London", + "Independent Researcher", + "Instituto Politécnico Nacional", + "Istanbul Technical University", + "Lancaster University", + "Maseno University", + "Mila, McGill University & Canada CIFAR AI Chair", + "Northeastern University", + "SADiLaR", + "University of Deusto", + "University of Hamburg", + "Uppsala University", + "Wollo University" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08284.jpg", + "data": { + "categories": [ + "#dataset", + "#ethics", + "#multilingual", + "#data", + "#low_resource", + "#open_source" + ], + "emoji": "🌍", + "ru": { + "title": "AfriHate: борьба с языком вражды в Африке с помощью локальных данных и экспертизы", + "desc": "Статья представляет AfriHate - многоязычный набор данных по языку вражды и оскорбительной лексике на 15 африканских языках. Датасет создан для решения проблемы недостатка качественных данных на местных языках и отсутствия вовлечения локальных сообществ в процессы сбора, разметки и модерации контента. Каждый пример в AfriHate размечен носителями языка, знакомыми с местной культурой. Авторы описывают трудности, связанные с созданием датасетов, и представляют результаты базовых классификационных моделей, в том числе с использованием языковых моделей." + }, + "en": { + "title": "Empowering Local Voices Against Hate Speech with AfriHate", + "desc": "This paper addresses the challenges of identifying and moderating hate speech in the Global South, particularly in African languages. It highlights the limitations of existing moderation techniques that rely on keyword spotting without cultural context, leading to ineffective censorship and oversight of targeted hate campaigns. To combat this, the authors introduce AfriHate, a multilingual dataset of hate speech and abusive language in 15 African languages, annotated by native speakers. The paper also discusses the difficulties faced during dataset construction and presents baseline classification results, demonstrating the potential of using large language models (LLMs) for this task." + }, + "zh": { + "title": "构建多语言仇恨言论数据集,助力社会文化理解", + "desc": "本论文介绍了AfriHate,这是一个包含15种非洲语言的仇恨言论和辱骂语言数据集。该数据集由熟悉当地文化的母语者进行标注,以解决全球南方地区在仇恨言论管理中的数据缺乏问题。研究还探讨了数据集构建过程中的挑战,并展示了使用和不使用大型语言模型(LLMs)进行分类的基线结果。所有数据集、标注和相关词汇表均可在指定网站上获取。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08120", + "title": "In-situ graph reasoning and knowledge expansion using Graph-PReFLexOR", + "url": "https://huggingface.co/papers/2501.08120", + "abstract": "The pursuit of automated scientific discovery has fueled progress from symbolic logic to modern AI, forging new frontiers in reasoning and pattern recognition. Transformers function as potential systems, where every possible relationship remains latent potentiality until tasks impose constraints, akin to measurement. 
Yet, refining their sampling requires more than probabilistic selection: solutions must conform to specific structures or rules, ensuring consistency and the invocation of general principles. We present Graph-PReFLexOR (Graph-based Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning), a framework that combines graph reasoning with symbolic abstraction to dynamically expand domain knowledge. Inspired by reinforcement learning, Graph-PReFLexOR defines reasoning as a structured mapping, where tasks yield knowledge graphs, abstract patterns, and ultimately, final answers. Inspired by category theory, it encodes concepts as nodes and their relationships as edges, supporting hierarchical inference and adaptive learning through isomorphic representations. Demonstrations include hypothesis generation, materials design, and creative reasoning, such as discovering relationships between mythological concepts like 'thin places' with materials science. We propose a 'knowledge garden growth' strategy that integrates insights across domains, promoting interdisciplinary connections. Results with a 3-billion-parameter Graph-PReFLexOR model show superior reasoning depth and adaptability, underscoring the potential for transparent, multidisciplinary AI-driven discovery. It lays the groundwork for general autonomous reasoning solutions.", + "score": 1, + "issue_id": 1683, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "f8f5360d1fb8fb75", + "authors": [ + "Markus J. Buehler" + ], + "affiliations": [ + "Laboratory for Atomistic and Molecular Mechanics, MIT, Cambridge, MA 02139, USA" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08120.jpg", + "data": { + "categories": [ + "#multimodal", + "#reasoning", + "#agents", + "#graphs", + "#rl", + "#science", + "#architecture" + ], + "emoji": "🧠", + "ru": { + "title": "Graph-PReFLexOR: Новый горизонт в автономном научном открытии", + "desc": "Статья представляет Graph-PReFLexOR - фреймворк, объединяющий графовые рассуждения с символьной абстракцией для динамического расширения предметных знаний. Вдохновленный теорией категорий, он кодирует концепции как узлы, а их отношения как ребра, поддерживая иерархический вывод и адаптивное обучение. Демонстрации включают генерацию гипотез, дизайн материалов и творческие рассуждения, такие как обнаружение связей между мифологическими концепциями и материаловедением. Результаты с 3-миллиардной моделью Graph-PReFLexOR показывают превосходную глубину рассуждений и адаптивность, подчеркивая потенциал для прозрачных, междисциплинарных решений на основе ИИ." + }, + "en": { + "title": "Empowering AI with Graph-Based Reasoning for Scientific Discovery", + "desc": "This paper introduces Graph-PReFLexOR, a novel framework that enhances automated scientific discovery by integrating graph reasoning with symbolic abstraction. It utilizes a structured mapping approach inspired by reinforcement learning, allowing for the generation of knowledge graphs and abstract patterns from various tasks. The framework supports hierarchical inference and adaptive learning, enabling it to explore interdisciplinary connections effectively. Demonstrations of its capabilities include hypothesis generation and creative reasoning, showcasing its potential for deep and adaptable reasoning in AI-driven discovery." 
+ }, + "zh": { + "title": "知识花园的成长:跨领域的智能推理", + "desc": "这篇论文介绍了一种名为Graph-PReFLexOR的框架,它结合了图推理和符号抽象,以动态扩展领域知识。该框架通过结构化映射定义推理,利用知识图谱和抽象模式来生成最终答案。它的灵感来自强化学习和范畴理论,将概念编码为节点,关系编码为边,支持层次推理和自适应学习。实验结果表明,Graph-PReFLexOR在推理深度和适应性方面表现优越,为自动化推理解决方案奠定了基础。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.07556", + "title": "MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training", + "url": "https://huggingface.co/papers/2501.07556", + "abstract": "Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. In recent years, deep learning-based image matching algorithms have dramatically outperformed humans in rapidly and accurately finding large amounts of correspondences. However, when dealing with images captured under different imaging modalities that result in significant appearance changes, the performance of these algorithms often deteriorates due to the scarcity of annotated cross-modal training data. This limitation hinders applications in various fields that rely on multiple image modalities to obtain complementary information. To address this challenge, we propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images. This capability is transferable to real-world, unseen cross-modality image matching tasks. Our key finding is that the matching model trained with our framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks using the same network weight, substantially outperforming existing methods, whether designed for generalization or tailored for specific tasks. This advancement significantly enhances the applicability of image matching technologies across various scientific disciplines and paves the way for new applications in multi-modality human and artificial intelligence analysis and beyond.", + "score": 0, + "issue_id": 1688, + "pub_date": "2025-01-13", + "pub_date_card": { + "ru": "13 января", + "en": "January 13", + "zh": "1月13日" + }, + "hash": "ad0c408491c545d5", + "authors": [ + "Xingyi He", + "Hao Yu", + "Sida Peng", + "Dongli Tan", + "Zehong Shen", + "Hujun Bao", + "Xiaowei Zhou" + ], + "affiliations": [ + "Shandong University", + "State Key Lab of CAD&CG, Zhejiang University" + ], + "pdf_title_img": "assets/pdf/title_img/2501.07556.jpg", + "data": { + "categories": [ + "#synthetic", + "#dataset", + "#multimodal", + "#transfer_learning", + "#cv" + ], + "emoji": "🔍", + "ru": { + "title": "Универсальное сопоставление изображений разных модальностей с помощью глубокого обучения", + "desc": "Статья представляет новый подход к сопоставлению изображений разных модальностей с использованием глубокого обучения. Авторы предлагают фреймворк для предварительного обучения на синтетических кросс-модальных данных, что позволяет модели распознавать фундаментальные структуры в изображениях. Обученная модель демонстрирует впечатляющую обобщаемость на более чем восемь новых задач кросс-модальной регистрации, значительно превосходя существующие методы. Это достижение открывает новые возможности для применения технологий сопоставления изображений в различных научных дисциплинах." 
+ }, + "en": { + "title": "Enhancing Image Matching Across Modalities with Synthetic Training", + "desc": "This paper presents a new framework for image matching that helps identify corresponding pixel locations between images taken in different ways. Traditional deep learning methods struggle with this task due to a lack of annotated training data for different image types. The proposed solution uses synthetic training signals from diverse sources to improve the model's ability to recognize and match structures across various images. As a result, the model shows excellent performance in unseen cross-modal tasks, making it highly useful for applications in many scientific fields." + }, + "zh": { + "title": "跨模态图像匹配的新突破", + "desc": "本文提出了一种大规模预训练框架,用于解决图像匹配中的跨模态问题。该框架利用合成的跨模态训练信号,结合来自不同来源的多样化数据,训练模型识别和匹配图像中的基本结构。研究发现,使用该框架训练的匹配模型在超过八个未见的跨模态配准任务中表现出显著的泛化能力,远超现有方法。此进展大大增强了图像匹配技术在各科学领域的适用性,并为多模态人类和人工智能分析的新应用铺平了道路。" + } + } + } + ], + "link_prev": "2025-01-14.html", + "link_next": "2025-01-16.html", + "link_month": "2025-01.html", + "short_date_prev": { + "ru": "14.01", + "en": "01/14", + "zh": "1月14日" + }, + "short_date_next": { + "ru": "16.01", + "en": "01/16", + "zh": "1月16日" + }, + "categories": { + "#dataset": 9, + "#data": 5, + "#benchmark": 7, + "#agents": 1, + "#cv": 8, + "#rl": 1, + "#rlhf": 0, + "#rag": 0, + "#plp": 0, + "#inference": 1, + "#3d": 0, + "#audio": 0, + "#video": 4, + "#multimodal": 6, + "#math": 0, + "#multilingual": 1, + "#architecture": 6, + "#healthcare": 1, + "#training": 8, + "#robotics": 0, + "#agi": 1, + "#games": 2, + "#interpretability": 3, + "#reasoning": 3, + "#transfer_learning": 1, + "#graphs": 1, + "#ethics": 2, + "#security": 0, + "#optimization": 6, + "#survey": 0, + "#diffusion": 4, + "#alignment": 0, + "#story_generation": 0, + "#hallucinations": 2, + "#long_context": 1, + "#synthetic": 2, + "#machine_translation": 0, + "#leakage": 0, + "#open_source": 4, + "#small_models": 0, + "#science": 3, + "#low_resource": 2 + }, + "zh": { + "text": "我们介绍了 MiniMax-01 系列,包括 MiniMax-Text-01 和 MiniMax-VL-01。这些模型在处理长上下文方面具有卓越能力。核心在于闪电注意力和其高效扩展。我们将其与混合专家模型(MoE)集成,创建了一个具有 32 个专家和 4560 亿总参数的模型。我们开发了优化的并行策略和高效的计算通信重叠技术。这使我们能够在数百亿参数的模型上进行高效训练和推理。MiniMax-Text-01 的上下文窗口在训练期间可达到 100 万个标记,并在推理期间扩展到 400 万个标记。MiniMax-VL-01 通过使用 5120 亿视觉语言标记进行持续训练。实验表明,我们的模型在标准和内部基准上的性能与 GPT-4o 和 Claude-3.5-Sonnet 相当,同时提供 20-32 倍的上下文窗口。我们在 https://github.com/MiniMax-AI 公开发布了 MiniMax-01。", + "title": "MiniMax-01: Scaling Foundation Models with Lightning Attention", + "pinyin": "Wǒmen jièshào le MiniMax-01 xìliè, bāokuò MiniMax-Text-01 hé MiniMax-VL-01. Zhèxiē móxíng zài chǔlǐ cháng shàngxìawén fāngmiàn jùyǒu zhuóyuè nénglì. Héxīn zàiyú shǎndiǎn zhùyìlì hé qí gāoxiào kuòzhǎn. Wǒmen jiāng qí yǔ hùn hé zhuānjiā móxíng (MoE) jíchéng, chuàngjiàn le yīgè jùyǒu 32 gè zhuānjiā hé 4560 yì zǒng cānshù de móxíng. Wǒmen kāifā le yōuhuà de bìngxíng cèlüè hé gāoxiào de jìsuàn tōngxìn zhòngdié jìshù. Zhè shǐ wǒmen nénggòu zài shùbǎiyì cānshù de móxíng shàng jìnxíng gāoxiào xùnliàn hé tuìlǐ. MiniMax-Text-01 de shàngxìawén chuāngkǒu zài xùnliàn qījiān kě dádào 100 wàn gè biāojì, bìng zài tuìlǐ qījiān kuòzhǎn dào 400 wàn gè biāojì. MiniMax-VL-01 tōngguò shǐyòng 5120 yì shìjué yǔyán biāojì jìnxíng chíxù xùnliàn. Shìyàn biǎomíng, wǒmen de móxíng zài biāozhǔn hé nèibù jīzhǔn shàng de xiàonénglì yǔ GPT-4o hé Claude-3.5-Sonnet xiāngdāng, tóngshí tígōng 20-32 bèi de shàngxìawén chuāngkǒu. 
Wǒmen zài https://github.com/MiniMax-AI gōngkāi fābù le MiniMax-01.", + "vocab": "[\n {\"word\": \"介绍\", \"pinyin\": \"jiè shào\", \"trans\": \"introduce\"},\n {\"word\": \"系列\", \"pinyin\": \"xì liè\", \"trans\": \"series\"},\n {\"word\": \"模型\", \"pinyin\": \"mó xíng\", \"trans\": \"model\"},\n {\"word\": \"处理\", \"pinyin\": \"chǔ lǐ\", \"trans\": \"process\"},\n {\"word\": \"上下文\", \"pinyin\": \"shàng xià wén\", \"trans\": \"context\"},\n {\"word\": \"卓越\", \"pinyin\": \"zhuó yuè\", \"trans\": \"outstanding\"},\n {\"word\": \"能力\", \"pinyin\": \"néng lì\", \"trans\": \"ability\"},\n {\"word\": \"核心\", \"pinyin\": \"hé xīn\", \"trans\": \"core\"},\n {\"word\": \"闪电\", \"pinyin\": \"shǎn diàn\", \"trans\": \"lightning\"},\n {\"word\": \"注意力\", \"pinyin\": \"zhù yì lì\", \"trans\": \"attention\"},\n {\"word\": \"高效\", \"pinyin\": \"gāo xiào\", \"trans\": \"efficient\"},\n {\"word\": \"扩展\", \"pinyin\": \"kuò zhǎn\", \"trans\": \"expand\"},\n {\"word\": \"混合\", \"pinyin\": \"hùn hé\", \"trans\": \"hybrid\"},\n {\"word\": \"专家\", \"pinyin\": \"zhuān jiā\", \"trans\": \"expert\"},\n {\"word\": \"集成\", \"pinyin\": \"jí chéng\", \"trans\": \"integrate\"},\n {\"word\": \"并行\", \"pinyin\": \"bìng xíng\", \"trans\": \"parallel\"},\n {\"word\": \"策略\", \"pinyin\": \"cè lüè\", \"trans\": \"strategy\"},\n {\"word\": \"通信\", \"pinyin\": \"tōng xìn\", \"trans\": \"communication\"},\n {\"word\": \"重叠\", \"pinyin\": \"chóng dié\", \"trans\": \"overlap\"},\n {\"word\": \"技术\", \"pinyin\": \"jì shù\", \"trans\": \"technology\"},\n {\"word\": \"训练\", \"pinyin\": \"xùn liàn\", \"trans\": \"train\"},\n {\"word\": \"推理\", \"pinyin\": \"tuī lǐ\", \"trans\": \"inference\"},\n {\"word\": \"窗口\", \"pinyin\": \"chuāng kǒu\", \"trans\": \"window\"},\n {\"word\": \"标记\", \"pinyin\": \"biāo jì\", \"trans\": \"token\"},\n {\"word\": \"视觉\", \"pinyin\": \"shì jué\", \"trans\": \"visual\"},\n {\"word\": \"语言\", \"pinyin\": \"yǔ yán\", \"trans\": \"language\"},\n {\"word\": \"持续\", \"pinyin\": \"chí xù\", \"trans\": \"continuous\"},\n {\"word\": \"实验\", \"pinyin\": \"shí yàn\", \"trans\": \"experiment\"},\n {\"word\": \"性能\", \"pinyin\": \"xìng néng\", \"trans\": \"performance\"},\n {\"word\": \"基准\", \"pinyin\": \"jī zhǔn\", \"trans\": \"benchmark\"},\n {\"word\": \"公开\", \"pinyin\": \"gōng kāi\", \"trans\": \"public\"},\n {\"word\": \"发布\", \"pinyin\": \"fā bù\", \"trans\": \"release\"}\n]", + "trans": "We introduced the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01. These models excel in handling long contexts, with a core focus on flash attention and its efficient scaling. We integrated them with a Mixture of Experts (MoE) model, creating a model with 32 experts and a total of 4560 billion parameters. We developed optimized parallel strategies and efficient computation-communication overlap techniques. This enables us to perform efficient training and inference on models with hundreds of billions of parameters. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and expands to 4 million tokens during inference. MiniMax-VL-01 undergoes continuous training using 5120 billion vision-language tokens. Experiments show that our models perform comparably to GPT-4o and Claude-3.5-Sonnet on standard and internal benchmarks while providing a 20-32 times larger context window. 
We have made MiniMax-01 publicly available at https://github.com/MiniMax-AI.", + "update_ts": "2025-01-15 09:11" + } +} \ No newline at end of file diff --git a/hf_papers.json b/hf_papers.json index ada2814f..05d9e32d 100644 --- a/hf_papers.json +++ b/hf_papers.json @@ -1,12 +1,12 @@ { "date": { - "ru": "15 января", - "en": "January 15", - "zh": "1月15日" + "ru": "16 января", + "en": "January 16", + "zh": "1月16日" }, - "time_utc": "2025-01-15 23:09", - "weekday": 2, - "issue_id": 1692, + "time_utc": "2025-01-16 00:44", + "weekday": 3, + "issue_id": 1693, "home_page_url": "https://huggingface.co/papers", "papers": [ { @@ -14,7 +14,7 @@ "title": "MiniMax-01: Scaling Foundation Models with Lightning Attention", "url": "https://huggingface.co/papers/2501.08313", "abstract": "We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.", - "score": 193, + "score": 195, "issue_id": 1672, "pub_date": "2025-01-14", "pub_date_card": { @@ -252,7 +252,7 @@ "title": "Diffusion Adversarial Post-Training for One-Step Video Generation", "url": "https://huggingface.co/papers/2501.08316", "abstract": "The diffusion models are widely used for image and video generation, but their iterative generation process is slow and expansive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve the training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarial post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. 
Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.", - "score": 19, + "score": 20, "issue_id": 1672, "pub_date": "2025-01-14", "pub_date_card": { @@ -357,7 +357,7 @@ "title": "Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks", "url": "https://huggingface.co/papers/2501.08326", "abstract": "We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.", - "score": 12, + "score": 14, "issue_id": 1678, "pub_date": "2025-01-14", "pub_date_card": { @@ -406,61 +406,12 @@ } } }, - { - "id": "https://huggingface.co/papers/2501.08225", - "title": "FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors", - "url": "https://huggingface.co/papers/2501.08225", - "abstract": "Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models, so necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, so that inherit powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various of editing signals: it domainantly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, \\eg, automatically adjust the reflection of the cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, \\eg, transform the clownfish into shark-like shape. 
Our code will be available at https://github.com/YBYBZhang/FramePainter.", - "score": 12, - "issue_id": 1673, - "pub_date": "2025-01-14", - "pub_date_card": { - "ru": "14 января", - "en": "January 14", - "zh": "1月14日" - }, - "hash": "811cfd0f18eb1e53", - "authors": [ - "Yabo Zhang", - "Xinpeng Zhou", - "Yihan Zeng", - "Hang Xu", - "Hui Li", - "Wangmeng Zuo" - ], - "affiliations": [ - "Harbin Institute of Technology", - "Huawei Noahs Ark Lab" - ], - "pdf_title_img": "assets/pdf/title_img/2501.08225.jpg", - "data": { - "categories": [ - "#video", - "#cv", - "#optimization", - "#diffusion" - ], - "emoji": "🎨", - "ru": { - "title": "FramePainter: эффективное редактирование изображений через генерацию видео", - "desc": "Статья представляет FramePainter - новый подход к интерактивному редактированию изображений, основанный на генерации видео. В отличие от существующих методов, использующих модели диффузии текст-изображение, FramePainter опирается на мощные видео-диффузионные модели для обеспечения временной согласованности и снижения затрат на обучение. Метод использует легковесный энкодер для внедрения сигналов редактирования и вводит механизм согласованного внимания для улучшения обработки крупных движений между кадрами. FramePainter превосходит современные методы, требуя значительно меньше обучающих данных и демонстрируя высокую обобщающую способность." - }, - "en": { - "title": "Revolutionizing Image Editing with Efficient Video Diffusion", - "desc": "This paper presents FramePainter, a novel approach to interactive image editing that reformulates the task as image-to-video generation. By leveraging video diffusion models, FramePainter reduces the need for extensive training data while ensuring temporal consistency in edited images. It utilizes a lightweight sparse control encoder to effectively incorporate editing signals, and introduces matching attention to improve the handling of large motion between frames. The results demonstrate that FramePainter significantly outperforms existing methods, achieving seamless image edits and showcasing strong generalization capabilities." - }, - "zh": { - "title": "FramePainter:高效的图像编辑新方法", - "desc": "本文提出了一种交互式图像编辑的新方法,称为FramePainter。该方法将图像编辑任务重新定义为图像到视频的生成问题,从而利用强大的视频扩散先验,降低训练成本并确保时间一致性。FramePainter使用轻量级的稀疏控制编码器来注入编辑信号,并通过匹配注意力机制增强了对大运动的处理能力。实验结果表明,FramePainter在各种编辑信号下表现优异,能够实现无缝且连贯的图像编辑,且在未见过的场景中也展现出卓越的泛化能力。" - } - } - }, { "id": "https://huggingface.co/papers/2501.07730", "title": "Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens", "url": "https://huggingface.co/papers/2501.07730", "abstract": "Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. 
Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.", - "score": 10, + "score": 12, "issue_id": 1673, "pub_date": "2025-01-13", "pub_date_card": { @@ -506,12 +457,61 @@ } } }, + { + "id": "https://huggingface.co/papers/2501.08225", + "title": "FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors", + "url": "https://huggingface.co/papers/2501.08225", + "abstract": "Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models, so necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, so that inherit powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various of editing signals: it domainantly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, \\eg, automatically adjust the reflection of the cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, \\eg, transform the clownfish into shark-like shape. Our code will be available at https://github.com/YBYBZhang/FramePainter.", + "score": 12, + "issue_id": 1673, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "811cfd0f18eb1e53", + "authors": [ + "Yabo Zhang", + "Xinpeng Zhou", + "Yihan Zeng", + "Hang Xu", + "Hui Li", + "Wangmeng Zuo" + ], + "affiliations": [ + "Harbin Institute of Technology", + "Huawei Noahs Ark Lab" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08225.jpg", + "data": { + "categories": [ + "#video", + "#cv", + "#optimization", + "#diffusion" + ], + "emoji": "🎨", + "ru": { + "title": "FramePainter: эффективное редактирование изображений через генерацию видео", + "desc": "Статья представляет FramePainter - новый подход к интерактивному редактированию изображений, основанный на генерации видео. В отличие от существующих методов, использующих модели диффузии текст-изображение, FramePainter опирается на мощные видео-диффузионные модели для обеспечения временной согласованности и снижения затрат на обучение. 
Метод использует легковесный энкодер для внедрения сигналов редактирования и вводит механизм согласованного внимания для улучшения обработки крупных движений между кадрами. FramePainter превосходит современные методы, требуя значительно меньше обучающих данных и демонстрируя высокую обобщающую способность." + }, + "en": { + "title": "Revolutionizing Image Editing with Efficient Video Diffusion", + "desc": "This paper presents FramePainter, a novel approach to interactive image editing that reformulates the task as image-to-video generation. By leveraging video diffusion models, FramePainter reduces the need for extensive training data while ensuring temporal consistency in edited images. It utilizes a lightweight sparse control encoder to effectively incorporate editing signals, and introduces matching attention to improve the handling of large motion between frames. The results demonstrate that FramePainter significantly outperforms existing methods, achieving seamless image edits and showcasing strong generalization capabilities." + }, + "zh": { + "title": "FramePainter:高效的图像编辑新方法", + "desc": "本文提出了一种交互式图像编辑的新方法,称为FramePainter。该方法将图像编辑任务重新定义为图像到视频的生成问题,从而利用强大的视频扩散先验,降低训练成本并确保时间一致性。FramePainter使用轻量级的稀疏控制编码器来注入编辑信号,并通过匹配注意力机制增强了对大运动的处理能力。实验结果表明,FramePainter在各种编辑信号下表现优异,能够实现无缝且连贯的图像编辑,且在未见过的场景中也展现出卓越的泛化能力。" + } + } + }, { "id": "https://huggingface.co/papers/2501.05131", "title": "3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering", "url": "https://huggingface.co/papers/2501.05131", "abstract": "The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. 
Project Page: https://limuloo.github.io/3DIS/.", - "score": 9, + "score": 10, "issue_id": 1684, "pub_date": "2025-01-09", "pub_date_card": { @@ -652,6 +652,103 @@ } } }, + { + "id": "https://huggingface.co/papers/2501.07888", + "title": "Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding", + "url": "https://huggingface.co/papers/2501.07888", + "abstract": "We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8\\% over GPT-4o and 5.8\\% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6\\% performance advantage over GPT-4o and +24.9\\% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.", + "score": 6, + "issue_id": 1674, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "54780a4b6f93fb10", + "authors": [ + "Liping Yuan", + "Jiawei Wang", + "Haomiao Sun", + "Yuchen Zhang", + "Yuan Lin" + ], + "affiliations": [ + "ByteDance Research" + ], + "pdf_title_img": "assets/pdf/title_img/2501.07888.jpg", + "data": { + "categories": [ + "#dataset", + "#training", + "#cv", + "#hallucinations", + "#optimization", + "#video", + "#benchmark" + ], + "emoji": "🎥", + "ru": { + "title": "Tarsier2: Революция в понимании видео искусственным интеллектом", + "desc": "Tarsier2 - это современная крупномасштабная модель для понимания видео и языка (LVLM), разработанная для создания детальных и точных описаний видео. Модель достигает значительных улучшений благодаря увеличению объема обучающих данных, точной временной синхронизации при тонкой настройке и применению обучения с предпочтениями (DPO). Tarsier2-7B превосходит ведущие проприетарные модели, такие как GPT-4o и Gemini 1.5 Pro, в задачах детального описания видео. Модель также устанавливает новые рекорды в 15 публичных бенчмарках, демонстрируя свою универсальность как надежная модель общего назначения для понимания видео и языка." + }, + "en": { + "title": "Tarsier2: Redefining Video Understanding with Advanced LVLM Technology", + "desc": "Tarsier2 is a cutting-edge large vision-language model (LVLM) that excels in generating precise and detailed descriptions of videos while showcasing advanced video comprehension skills. 
The model's improvements stem from three main enhancements: increasing the pre-training dataset from 11 million to 40 million video-text pairs, implementing fine-grained temporal alignment during fine-tuning, and utilizing model-based sampling for preference data construction with DPO training for optimization. Extensive testing reveals that Tarsier2-7B surpasses top proprietary models like GPT-4o and Gemini 1.5 Pro in video description tasks, achieving notable F1 score improvements on the DREAM-1K benchmark. Additionally, Tarsier2-7B sets new records across 15 public benchmarks, proving its effectiveness in various tasks such as video question-answering and video grounding." + }, + "zh": { + "title": "Tarsier2:视频描述的新标杆", + "desc": "Tarsier2是一种先进的大型视觉语言模型,专门用于生成详细且准确的视频描述,同时具备出色的视频理解能力。该模型通过三个关键升级实现了显著进步:首先,预训练数据从1100万对视频文本扩展到4000万对,增加了数据的数量和多样性;其次,在监督微调过程中进行精细的时间对齐;最后,采用基于模型的采样自动构建偏好数据,并应用DPO训练进行优化。实验结果表明,Tarsier2-7B在视频描述任务中持续超越领先的专有模型,展现出其作为强大通用视觉语言模型的多样性。" + } + } + }, + { + "id": "https://huggingface.co/papers/2501.08292", + "title": "HALoGEN: Fantastic LLM Hallucinations and Where to Find Them", + "url": "https://huggingface.co/papers/2501.08292", + "abstract": "Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.", + "score": 6, + "issue_id": 1673, + "pub_date": "2025-01-14", + "pub_date_card": { + "ru": "14 января", + "en": "January 14", + "zh": "1月14日" + }, + "hash": "f6751d682ff824ed", + "authors": [ + "Abhilasha Ravichander", + "Shrusti Ghela", + "David Wadden", + "Yejin Choi" + ], + "affiliations": [ + "Google", + "NVIDIA", + "University of Washington" + ], + "pdf_title_img": "assets/pdf/title_img/2501.08292.jpg", + "data": { + "categories": [ + "#dataset", + "#hallucinations", + "#benchmark" + ], + "emoji": "🔍", + "ru": { + "title": "HALoGEN: Автоматическая проверка галлюцинаций в языковых моделях", + "desc": "Эта статья представляет HALoGEN - комплексный инструмент для оценки галлюцинаций в больших языковых моделях (LLM). Авторы создали набор из 10,923 промптов в девяти различных областях и автоматические верификаторы высокой точности для проверки генераций LLM. 
Исследование выявило, что даже лучшие модели страдают от галлюцинаций, иногда до 86% сгенерированных фактов оказываются неверными. Авторы также предложили новую классификацию ошибок LLM, разделив их на три типа в зависимости от источника галлюцинаций." + }, + "en": { + "title": "HALoGEN: A Benchmark for Measuring Hallucinations in Language Models", + "desc": "This paper introduces HALoGEN, a new benchmark designed to measure hallucinations in generative large language models (LLMs). Hallucinations refer to incorrect statements generated by these models that do not align with known facts or the given context. The benchmark includes over 10,000 prompts across various domains and employs automatic verifiers to assess the accuracy of model outputs. The study reveals that even top-performing models exhibit significant hallucinations, prompting a classification system for different types of errors to better understand their origins and improve model reliability." + }, + "zh": { + "title": "揭示生成模型的幻觉问题", + "desc": "尽管生成性大型语言模型(LLMs)能够生成高质量和流畅的文本,但它们也会产生幻觉,即与已知世界知识或输入上下文不一致的陈述。测量幻觉的难度在于,实时验证模型生成的内容既昂贵又耗时。为此,我们推出了HALoGEN,这是一个全面的幻觉基准,包含10,923个跨越九个领域的提示和自动高精度验证器。我们的研究发现,即使是表现最好的模型,其生成的原子事实中也有高达86%可能存在幻觉,这为理解生成模型的幻觉提供了基础。" + } + } + }, { "id": "https://huggingface.co/papers/2501.08197", "title": "OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training", @@ -756,103 +853,6 @@ } } }, - { - "id": "https://huggingface.co/papers/2501.07888", - "title": "Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding", - "url": "https://huggingface.co/papers/2501.07888", - "abstract": "We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8\\% over GPT-4o and 5.8\\% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6\\% performance advantage over GPT-4o and +24.9\\% over Gemini-1.5-Pro. 
Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.", - "score": 5, - "issue_id": 1674, - "pub_date": "2025-01-14", - "pub_date_card": { - "ru": "14 января", - "en": "January 14", - "zh": "1月14日" - }, - "hash": "54780a4b6f93fb10", - "authors": [ - "Liping Yuan", - "Jiawei Wang", - "Haomiao Sun", - "Yuchen Zhang", - "Yuan Lin" - ], - "affiliations": [ - "ByteDance Research" - ], - "pdf_title_img": "assets/pdf/title_img/2501.07888.jpg", - "data": { - "categories": [ - "#dataset", - "#training", - "#cv", - "#hallucinations", - "#optimization", - "#video", - "#benchmark" - ], - "emoji": "🎥", - "ru": { - "title": "Tarsier2: Революция в понимании видео искусственным интеллектом", - "desc": "Tarsier2 - это современная крупномасштабная модель для понимания видео и языка (LVLM), разработанная для создания детальных и точных описаний видео. Модель достигает значительных улучшений благодаря увеличению объема обучающих данных, точной временной синхронизации при тонкой настройке и применению обучения с предпочтениями (DPO). Tarsier2-7B превосходит ведущие проприетарные модели, такие как GPT-4o и Gemini 1.5 Pro, в задачах детального описания видео. Модель также устанавливает новые рекорды в 15 публичных бенчмарках, демонстрируя свою универсальность как надежная модель общего назначения для понимания видео и языка." - }, - "en": { - "title": "Tarsier2: Redefining Video Understanding with Advanced LVLM Technology", - "desc": "Tarsier2 is a cutting-edge large vision-language model (LVLM) that excels in generating precise and detailed descriptions of videos while showcasing advanced video comprehension skills. The model's improvements stem from three main enhancements: increasing the pre-training dataset from 11 million to 40 million video-text pairs, implementing fine-grained temporal alignment during fine-tuning, and utilizing model-based sampling for preference data construction with DPO training for optimization. Extensive testing reveals that Tarsier2-7B surpasses top proprietary models like GPT-4o and Gemini 1.5 Pro in video description tasks, achieving notable F1 score improvements on the DREAM-1K benchmark. Additionally, Tarsier2-7B sets new records across 15 public benchmarks, proving its effectiveness in various tasks such as video question-answering and video grounding." - }, - "zh": { - "title": "Tarsier2:视频描述的新标杆", - "desc": "Tarsier2是一种先进的大型视觉语言模型,专门用于生成详细且准确的视频描述,同时具备出色的视频理解能力。该模型通过三个关键升级实现了显著进步:首先,预训练数据从1100万对视频文本扩展到4000万对,增加了数据的数量和多样性;其次,在监督微调过程中进行精细的时间对齐;最后,采用基于模型的采样自动构建偏好数据,并应用DPO训练进行优化。实验结果表明,Tarsier2-7B在视频描述任务中持续超越领先的专有模型,展现出其作为强大通用视觉语言模型的多样性。" - } - } - }, - { - "id": "https://huggingface.co/papers/2501.08292", - "title": "HALoGEN: Fantastic LLM Hallucinations and Where to Find Them", - "url": "https://huggingface.co/papers/2501.08292", - "abstract": "Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. 
In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.", - "score": 5, - "issue_id": 1673, - "pub_date": "2025-01-14", - "pub_date_card": { - "ru": "14 января", - "en": "January 14", - "zh": "1月14日" - }, - "hash": "f6751d682ff824ed", - "authors": [ - "Abhilasha Ravichander", - "Shrusti Ghela", - "David Wadden", - "Yejin Choi" - ], - "affiliations": [ - "Google", - "NVIDIA", - "University of Washington" - ], - "pdf_title_img": "assets/pdf/title_img/2501.08292.jpg", - "data": { - "categories": [ - "#dataset", - "#hallucinations", - "#benchmark" - ], - "emoji": "🔍", - "ru": { - "title": "HALoGEN: Автоматическая проверка галлюцинаций в языковых моделях", - "desc": "Эта статья представляет HALoGEN - комплексный инструмент для оценки галлюцинаций в больших языковых моделях (LLM). Авторы создали набор из 10,923 промптов в девяти различных областях и автоматические верификаторы высокой точности для проверки генераций LLM. Исследование выявило, что даже лучшие модели страдают от галлюцинаций, иногда до 86% сгенерированных фактов оказываются неверными. Авторы также предложили новую классификацию ошибок LLM, разделив их на три типа в зависимости от источника галлюцинаций." - }, - "en": { - "title": "HALoGEN: A Benchmark for Measuring Hallucinations in Language Models", - "desc": "This paper introduces HALoGEN, a new benchmark designed to measure hallucinations in generative large language models (LLMs). Hallucinations refer to incorrect statements generated by these models that do not align with known facts or the given context. The benchmark includes over 10,000 prompts across various domains and employs automatic verifiers to assess the accuracy of model outputs. The study reveals that even top-performing models exhibit significant hallucinations, prompting a classification system for different types of errors to better understand their origins and improve model reliability." 
- }, - "zh": { - "title": "揭示生成模型的幻觉问题", - "desc": "尽管生成性大型语言模型(LLMs)能够生成高质量和流畅的文本,但它们也会产生幻觉,即与已知世界知识或输入上下文不一致的陈述。测量幻觉的难度在于,实时验证模型生成的内容既昂贵又耗时。为此,我们推出了HALoGEN,这是一个全面的幻觉基准,包含10,923个跨越九个领域的提示和自动高精度验证器。我们的研究发现,即使是表现最好的模型,其生成的原子事实中也有高达86%可能存在幻觉,这为理解生成模型的幻觉提供了基础。" - } - } - }, { "id": "https://huggingface.co/papers/2501.08284", "title": "AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages", @@ -951,7 +951,7 @@ "title": "In-situ graph reasoning and knowledge expansion using Graph-PReFLexOR", "url": "https://huggingface.co/papers/2501.08120", "abstract": "The pursuit of automated scientific discovery has fueled progress from symbolic logic to modern AI, forging new frontiers in reasoning and pattern recognition. Transformers function as potential systems, where every possible relationship remains latent potentiality until tasks impose constraints, akin to measurement. Yet, refining their sampling requires more than probabilistic selection: solutions must conform to specific structures or rules, ensuring consistency and the invocation of general principles. We present Graph-PReFLexOR (Graph-based Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning), a framework that combines graph reasoning with symbolic abstraction to dynamically expand domain knowledge. Inspired by reinforcement learning, Graph-PReFLexOR defines reasoning as a structured mapping, where tasks yield knowledge graphs, abstract patterns, and ultimately, final answers. Inspired by category theory, it encodes concepts as nodes and their relationships as edges, supporting hierarchical inference and adaptive learning through isomorphic representations. Demonstrations include hypothesis generation, materials design, and creative reasoning, such as discovering relationships between mythological concepts like 'thin places' with materials science. We propose a 'knowledge garden growth' strategy that integrates insights across domains, promoting interdisciplinary connections. Results with a 3-billion-parameter Graph-PReFLexOR model show superior reasoning depth and adaptability, underscoring the potential for transparent, multidisciplinary AI-driven discovery. It lays the groundwork for general autonomous reasoning solutions.", - "score": 1, + "score": 2, "issue_id": 1683, "pub_date": "2025-01-14", "pub_date_card": { @@ -1044,18 +1044,18 @@ } } ], - "link_prev": "2025-01-14.html", - "link_next": "2025-01-16.html", + "link_prev": "2025-01-15.html", + "link_next": "2025-01-17.html", "link_month": "2025-01.html", "short_date_prev": { - "ru": "14.01", - "en": "01/14", - "zh": "1月14日" + "ru": "15.01", + "en": "01/15", + "zh": "1月15日" }, "short_date_next": { - "ru": "16.01", - "en": "01/16", - "zh": "1月16日" + "ru": "17.01", + "en": "01/17", + "zh": "1月17日" }, "categories": { "#dataset": 9, diff --git a/index.html b/index.html index 5ee9ef6d..b9c2223e 100644 --- a/index.html +++ b/index.html @@ -10,7 +10,7 @@ gtag('config', 'G-C1CRWDNJ1J'); - HF. 18 papers. January 15. + HF. 18 papers. January 16. @@ -765,7 +765,7 @@

 🔺
 hf daily
-15 января | 18 papers
+16 января | 18 papers