Auto. Make Doomgrad HF Review on 17 January

averkij · Jan 17, 2025 · e71afad
1 parent 70d9bba

Showing 7 changed files with 270 additions and 310 deletions.
diff --git a/d/2025-01-17.html b/d/2025-01-17.html
diff --git a/d/2025-01-17.json b/d/2025-01-17.json
diff --git a/hf_papers.json b/hf_papers.json
@@ -4,9 +4,9 @@
         "en": "January 17",
         "zh": "1月17日"
     },
-    "time_utc": "2025-01-17 15:10",
+    "time_utc": "2025-01-17 16:11",
     "weekday": 4,
-    "issue_id": 1730,
+    "issue_id": 1731,
     "home_page_url": "https://huggingface.co/papers",
     "papers": [
         {
@@ -451,6 +451,53 @@
                 }
             }
         },
+        {
+            "id": "https://huggingface.co/papers/2501.09653",
+            "title": "The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models",
+            "url": "https://huggingface.co/papers/2501.09653",
+            "abstract": "The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.",
+            "score": 6,
+            "issue_id": 1730,
+            "pub_date": "2025-01-16",
+            "pub_date_card": {
+                "ru": "16 января",
+                "en": "January 16",
+                "zh": "1月16日"
+            },
+            "hash": "6d731a1519dc2727",
+            "authors": [
+                "Jonathan Katzy",
+                "Razvan Mihai Popescu",
+                "Arie van Deursen",
+                "Maliheh Izadi"
+            ],
+            "affiliations": [
+                "Delft University of Technology Delft, The Netherlands"
+            ],
+            "pdf_title_img": "assets/pdf/title_img/2501.09653.jpg",
+            "data": {
+                "categories": [
+                    "#low_resource",
+                    "#multilingual",
+                    "#open_source",
+                    "#data",
+                    "#dataset"
+                ],
+                "emoji": "🗃️",
+                "ru": {
+                    "title": "The Heap: чистый код для честной оценки языковых моделей",
+                    "desc": "Статья описывает создание нового набора данных для обучения языковых моделей в области программирования. Набор данных под названием 'The Heap' охватывает 57 языков программирования и был дедуплицирован относительно других открытых наборов данных. Это позволяет исследователям проводить объективные оценки больших языковых моделей без необходимости значительной предварительной очистки данных. Создание 'The Heap' решает проблему ограниченности доступного кода для исследования специфических поведений моделей и их оценки без риска загрязнения данных."
+                },
+                "en": {
+                    "title": "The Heap: A Clean Dataset for Fair Evaluation of Language Models",
+                    "desc": "This paper introduces The Heap, a comprehensive multilingual dataset that includes code from 57 programming languages. It addresses the challenge of data contamination in evaluating large language models by providing a deduplicated dataset, ensuring that the code is unique compared to existing open datasets. Researchers can utilize The Heap for downstream tasks without the burden of extensive data cleaning. This resource aims to facilitate fair assessments of model performance in coding tasks."
+                },
+                "zh": {
+                    "title": "公平评估大型语言模型的新数据集",
+                    "desc": "随着大型语言模型的流行，开发了大量的代码数据集来训练这些模型。然而，这导致可用于特定行为研究或评估大型语言模型的代码有限，且可能存在数据污染的问题。为了解决这个问题，我们发布了The Heap，这是一个覆盖57种编程语言的大型多语言数据集，经过去重处理，避免与其他开放代码数据集重复。这样，研究人员可以在不需要大量数据清理的情况下，公平地评估大型语言模型。"
+                }
+            }
+        },
         {
             "id": "https://huggingface.co/papers/2501.09503",
             "title": "AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation",
@@ -544,53 +591,6 @@
                 }
             }
         },
-        {
-            "id": "https://huggingface.co/papers/2501.09653",
-            "title": "The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models",
-            "url": "https://huggingface.co/papers/2501.09653",
-            "abstract": "The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.",
-            "score": 3,
-            "issue_id": 1730,
-            "pub_date": "2025-01-16",
-            "pub_date_card": {
-                "ru": "16 января",
-                "en": "January 16",
-                "zh": "1月16日"
-            },
-            "hash": "6d731a1519dc2727",
-            "authors": [
-                "Jonathan Katzy",
-                "Razvan Mihai Popescu",
-                "Arie van Deursen",
-                "Maliheh Izadi"
-            ],
-            "affiliations": [
-                "Delft University of Technology Delft, The Netherlands"
-            ],
-            "pdf_title_img": "assets/pdf/title_img/2501.09653.jpg",
-            "data": {
-                "categories": [
-                    "#low_resource",
-                    "#multilingual",
-                    "#open_source",
-                    "#data",
-                    "#dataset"
-                ],
-                "emoji": "🗃️",
-                "ru": {
-                    "title": "The Heap: чистый код для честной оценки языковых моделей",
-                    "desc": "Статья описывает создание нового набора данных для обучения языковых моделей в области программирования. Набор данных под названием 'The Heap' охватывает 57 языков программирования и был дедуплицирован относительно других открытых наборов данных. Это позволяет исследователям проводить объективные оценки больших языковых моделей без необходимости значительной предварительной очистки данных. Создание 'The Heap' решает проблему ограниченности доступного кода для исследования специфических поведений моделей и их оценки без риска загрязнения данных."
-                },
-                "en": {
-                    "title": "The Heap: A Clean Dataset for Fair Evaluation of Language Models",
-                    "desc": "This paper introduces The Heap, a comprehensive multilingual dataset that includes code from 57 programming languages. It addresses the challenge of data contamination in evaluating large language models by providing a deduplicated dataset, ensuring that the code is unique compared to existing open datasets. Researchers can utilize The Heap for downstream tasks without the burden of extensive data cleaning. This resource aims to facilitate fair assessments of model performance in coding tasks."
-                },
-                "zh": {
-                    "title": "公平评估大型语言模型的新数据集",
-                    "desc": "随着大型语言模型的流行，开发了大量的代码数据集来训练这些模型。然而，这导致可用于特定行为研究或评估大型语言模型的代码有限，且可能存在数据污染的问题。为了解决这个问题，我们发布了The Heap，这是一个覆盖57种编程语言的大型多语言数据集，经过去重处理，避免与其他开放代码数据集重复。这样，研究人员可以在不需要大量数据清理的情况下，公平地评估大型语言模型。"
-                }
-            }
-        },
         {
             "id": "https://huggingface.co/papers/2501.09038",
             "title": "Do generative video models learn physical principles from watching videos?",

diff --git a/index.html b/index.html
diff --git a/log.txt b/log.txt
@@ -1,3 +1,3 @@
-[17.01.2025 15:10] Read previous papers.
-[17.01.2025 15:10] Generating top page (month).
-[17.01.2025 15:10] Writing top page (month).
+[17.01.2025 16:11] Read previous papers.
+[17.01.2025 16:11] Generating top page (month).
+[17.01.2025 16:11] Writing top page (month).