Auto. Make Doomgrad HF Review on 14 January
actions-user committed Jan 14, 2025
1 parent 223cbaa commit 4ea6ad6
Showing 31 changed files with 2,515 additions and 838 deletions.
121 changes: 121 additions & 0 deletions assets/img_data/2501.06252.json
@@ -0,0 +1,121 @@
[
{
"header": "Abstract",
"images": []
},
{
"header": "1Introduction",
"images": [
{
"img": "https://arxiv.org/html/2501.06252/x1.png",
"caption": "Figure 1:Overview ofTransformer2superscriptTransformer2\\text{Transformer}^{2}Transformer start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.In the training phase, we tune the scales of the singular values of the weight matrices to generate a set of “expert” vectors, each of which specializes in one type of tasks. In the inference phase, a two-pass process is adopted where the first applies the task-specific expert and the second generates the answer.",
"position": 88
}
]
},
{
"header": "2Related works",
"images": []
},
{
"header": "3Methods",
"images": [
{
"img": "https://arxiv.org/html/2501.06252/extracted/6116234/images/cem_code.png",
"caption": "",
"position": 218
},
{
"img": "https://arxiv.org/html/2501.06252/x2.png",
"caption": "Figure 2:Method overview.Left) At training time, we employ SVF and RL to learn the “expert” vectorsz𝑧zitalic_z’s that scale the singular values of the weight matrices.\nRight) At inference time, we propose three distinct methods to adaptively select/combine the learned expert vectors.",
"position": 231
},
{
"img": "https://arxiv.org/html/2501.06252/x3.png",
"caption": "Figure 3:Prompt based adaptation.Self-adaptation prompt used byTransformer2superscriptTransformer2\\text{Transformer}^{2}Transformer start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPTto classify the task prompt into pre-defined categories.",
"position": 302
}
]
},
{
"header": "4Experiments",
"images": [
{
"img": "https://arxiv.org/html/2501.06252/x4.png",
"caption": "Figure 4:SVF learning curves.The dashed lines indicate the performance ofLlama3-8B-Instructon the test split of each task. SVF effectively fine-tunes to surpass the base performance. While we use the best validation score to select our checkpoint for evaluation (marked by red dots), we present the entire training curve without early stopping to demonstrate SVF’s learning capabilities. Tasks with only hundreds of training samples like Coding and Reasoning were stopped early. In our experiments, we update the parameters at the end of each epoch.",
"position": 343
},
{
"img": "https://arxiv.org/html/2501.06252/x5.png",
"caption": "Table 1:Fine-tuning results.LLM performance on the test splits of math, coding and reasoning. Normalized scores are in the parentheses.",
"position": 350
},
{
"img": "https://arxiv.org/html/2501.06252/x5.png",
"caption": "Figure 5:Results for the VLM domain.",
"position": 430
},
{
"img": "https://arxiv.org/html/2501.06252/x6.png",
"caption": "Figure 6:Confusion matrices.These matrices display the classification percentages, where rows represent the task classes (ground truth) and columns indicate the predicted categories. Some samples are misclassified as “Others,” which is reflected in rows where the totals do not sum to one.",
"position": 608
},
{
"img": "https://arxiv.org/html/2501.06252/x7.png",
"caption": "Figure 7:𝜶𝒌subscript𝜶𝒌\\bm{\\alpha_{k}}bold_italic_α start_POSTSUBSCRIPT bold_italic_k end_POSTSUBSCRIPTlearned weights.",
"position": 737
}
]
},
{
"header": "5Conclusion",
"images": []
},
{
"header": "6Author contributions",
"images": []
},
{
"header": "References",
"images": []
},
{
"header": "Appendix AImplementation details and hyper-parameters",
"images": [
{
"img": "https://arxiv.org/html/2501.06252/x8.png",
"caption": "Figure 8:Sample problem and answer.Math data sample used for LoRA instruction fine-tuning, text in blue is the unmasked solution.",
"position": 1352
}
]
},
{
"header": "Appendix BAdditional results",
"images": [
{
"img": "https://arxiv.org/html/2501.06252/x9.png",
"caption": "Figure 9:Training LoRA with policy gradient.The dashed line shows the performance ofLlama3-8B-Instructon the test split. LoRA collapses at the beginning of the training stage and fails to recover, leading to negative effects on test performance. We swept a wide range of learning rates(2×10−4,5×10−4,…,2×10−2,5×10−2)2superscript1045superscript104…21025superscript102(2\\times 10^{-4},5\\times 10^{-4},\\dots,2\\times 10{-2},5\\times 10^{-2})( 2 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT , 5 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT , … , 2 × 10 - 2 , 5 × 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ), and all learning curves were similar to the one presented.",
"position": 1665
}
]
},
{
"header": "Appendix CPCA on llama3 and mistral",
"images": [
{
"img": "https://arxiv.org/html/2501.06252/x10.png",
"caption": "Figure 10:PCA ofLlama3-8B-Instruct.We show the ratio of the variance captured by the topr𝑟ritalic_rsingular components on the y-axis, and the layer indices on the x-axis. Except for the Query, Key and Value projection matrices, smallr𝑟ritalic_rvalues only capture a tiny fraction of variance in singular values in the parameter matrices.",
"position": 1688
},
{
"img": "https://arxiv.org/html/2501.06252/x11.png",
"caption": "Figure 11:PCA ofMistral-7B-Instruct-v0.3.We show the ratio of the variance captured by the topr𝑟ritalic_rsingular components on the y-axis, and the layer indices on the x-axis. Except for the Query, Key and Value projection matrices, smallr𝑟ritalic_rvalues only capture a tiny fraction of variance in singular values in the parameter matrices.",
"position": 1691
}
]
},
{
"header": "Appendix DEfficiency considerations and improvements",
"images": []
}
]
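The captions in 2501.06252 above describe Transformer²'s SVF training: an "expert" vector z rescales the singular values of each frozen weight matrix and is learned with RL. Below is a minimal PyTorch sketch of that idea, not the paper's implementation; the names svf_forward and z are illustrative, and a real setup would precompute the SVD once rather than per call.

import torch

def svf_forward(W: torch.Tensor, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Decompose the frozen weight: W = U diag(S) Vh.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # SVF: scale the singular values elementwise by the learned expert vector z.
    W_adapted = U @ torch.diag(S * z) @ Vh
    return x @ W_adapted.T

W = torch.randn(64, 32)                 # frozen base weight (out x in)
z = torch.ones(32, requires_grad=True)  # one trainable scale per singular value
x = torch.randn(4, 32)                  # a batch of inputs
y = svf_forward(W, z, x)                # shape (4, 64)

Only z is trained, which matches the captions' description of learning one compact expert vector per task and selecting or combining them at inference.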
78 changes: 78 additions & 0 deletions assets/img_data/2501.06282.json
@@ -0,0 +1,78 @@
[
{
"header": "Abstract",
"images": []
},
{
"header": "1Introduction",
"images": [
{
"img": "https://arxiv.org/html/2501.06282/x1.png",
"caption": "Figure 1:Performance comparison between ourMinMo(∼similar-to\\sim∼8B parameters) and top-tier speech-text multimodal models, including Moshi(7B)(Défossez et al.,2024), Freeze-Omni(7.5B)(Wang et al.,2024b), GLM-4-Voice(9B)(Zeng et al.,2024), SeamlessM4T Large v2(2.3B)(Communication et al.,2023), NExT-GPT(12.42B)(Wu et al.,2024), speech-to-text model Qwen2-Audio(∼similar-to\\sim∼8B)(Chu et al.,2024), Whisper-large-v3(1.55B)(Radford et al.,2023), and others. We demonstrate capabilities of MinMo on automatic speech recognition (ASR), speech-to-text translation (S2TT), spoken question answering (SQA) encompasses both speech-to-text (S2T) and speech-to-speech (S2S), vocal sound classification (VSC), speech emotion recognition (SER), language identification (LID), age recognition and gender detection. ASR is evaluated using 1-WER%, with Fleurs & Common Voice results are averaged over 10 languages (zh, en, ja, ko, yue, de, fr, ru, es, it). S2TT is evaluated using BLEU, with CoVoST2 results averaged overen2zh, en2ja, zh/ja/de/fr/ru/es/it2entranslation directions. SQA is eavaluated using Accuracy. SER is evaluated using Weighted Accuracy.MinMo surpasses the previous SOTA models on all these tasks.",
"position": 113
},
{
"img": "https://arxiv.org/html/2501.06282/extracted/6124017/figure/minmo_example_1.png",
"caption": "(a)An example showcases MinMo’s capabilities, including speech-to-speech chat, speech-to-text translation, style-controllable speech synthesis, and full duplex interaction.",
"position": 159
},
{
"img": "https://arxiv.org/html/2501.06282/extracted/6124017/figure/minmo_example_1.png",
"caption": "(a)An example showcases MinMo’s capabilities, including speech-to-speech chat, speech-to-text translation, style-controllable speech synthesis, and full duplex interaction.",
"position": 162
},
{
"img": "https://arxiv.org/html/2501.06282/extracted/6124017/figure/minmo_example_2.png",
"caption": "(b)An example showcases MinMo’s capabilities, including speech-to-speech chat, audio event detection, speaker analysis and speech-to-text translation.",
"position": 167
}
]
},
{
"header": "2Related Work",
"images": []
},
{
"header": "3MinMo",
"images": [
{
"img": "https://arxiv.org/html/2501.06282/x2.png",
"caption": "Figure 3:The overall architecture of MinMo. Table1provides detailed descriptions of each module in this diagram.",
"position": 209
},
{
"img": "https://arxiv.org/html/2501.06282/extracted/6124017/figure/Speech2Text-Data.png",
"caption": "Figure 4:Detailed training data for the Speech-to-Text Alignment stage.Left:Data distribution forFull-Aligntraining.Right:Data distribution for instruction fine-tuning (SFT).",
"position": 587
}
]
},
{
"header": "4Experiments",
"images": []
},
{
"header": "5Conclusion",
"images": []
},
{
"header": "6Limitations",
"images": []
},
{
"header": "7Authors (alphabetical order of family name)",
"images": []
},
{
"header": "8Acknowledgment",
"images": []
},
{
"header": "References",
"images": []
},
{
"header": "Appendix APrompts for Voice Understanding Tasks",
"images": []
}
]
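All of the img_data files added in this commit share the schema visible in these diffs: a top-level list of sections, each with a "header" string and an "images" list whose entries carry "img", "caption", and "position" keys. A small sketch of reading one of them (the path comes from this commit; the printed format is just for illustration):

import json

with open("assets/img_data/2501.06282.json") as f:
    sections = json.load(f)

for section in sections:
    for image in section["images"]:
        # Each image entry pairs an arXiv HTML figure URL with its caption.
        print(f'{section["header"]}: {image["img"]} (position {image["position"]})')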
115 changes: 115 additions & 0 deletions assets/img_data/2501.06425.json
@@ -0,0 +1,115 @@
[
{
"header": "Abstract",
"images": []
},
{
"header": "1Introduction",
"images": [
{
"img": "https://arxiv.org/html/2501.06425/x1.png",
"caption": "Figure 1:Tensor Product Attention (TPA) in theTensor ProducTATTenTionTransformer (T6). Different from multi-head attention, in each layer, firstly the hidden state goes through different linear layers to get the latent factor matrices𝐀𝐀\\mathbf{A}bold_A’s and𝐁𝐁\\mathbf{B}bold_B’s for query, key, and value. We additionally apply RoPE to𝐁Qsubscript𝐁𝑄\\mathbf{B}_{Q}bold_B start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPTand𝐁Ksubscript𝐁𝐾\\mathbf{B}_{K}bold_B start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPTfor query and key. Then the multi-head query, key, and value vectors are attained by the tensor product of𝐀(⋅)subscript𝐀⋅\\mathbf{A}_{(\\cdot)}bold_A start_POSTSUBSCRIPT ( ⋅ ) end_POSTSUBSCRIPTand𝐁(⋅)subscript𝐁⋅\\mathbf{B}_{(\\cdot)}bold_B start_POSTSUBSCRIPT ( ⋅ ) end_POSTSUBSCRIPT. Finally, the output of TPA is produced by scaled dot-product attention followed by linear projection of concatenated results of multiple heads.",
"position": 102
},
{
"img": "https://arxiv.org/html/2501.06425/x2.png",
"caption": "(a)Training Loss",
"position": 137
},
{
"img": "https://arxiv.org/html/2501.06425/x2.png",
"caption": "(a)Training Loss",
"position": 140
},
{
"img": "https://arxiv.org/html/2501.06425/x3.png",
"caption": "(b)Validation Loss",
"position": 145
}
]
},
{
"header": "2Background",
"images": []
},
{
"header": "3Tensor Product Attention",
"images": []
},
{
"header": "4Experiments",
"images": [
{
"img": "https://arxiv.org/html/2501.06425/x4.png",
"caption": "(a)Training Loss",
"position": 1237
},
{
"img": "https://arxiv.org/html/2501.06425/x4.png",
"caption": "(a)Training Loss",
"position": 1240
},
{
"img": "https://arxiv.org/html/2501.06425/x5.png",
"caption": "(b)Validation Loss",
"position": 1245
},
{
"img": "https://arxiv.org/html/2501.06425/x6.png",
"caption": "(a)Validation Perplexity of Medium Models",
"position": 1252
},
{
"img": "https://arxiv.org/html/2501.06425/x6.png",
"caption": "(a)Validation Perplexity of Medium Models",
"position": 1255
},
{
"img": "https://arxiv.org/html/2501.06425/x7.png",
"caption": "(b)Validation Perplexity of Large Models",
"position": 1260
}
]
},
{
"header": "5Related Work",
"images": []
},
{
"header": "6Conclusion",
"images": []
},
{
"header": "References",
"images": []
},
{
"header": "Appendix AProofs of Theorems",
"images": [
{
"img": "https://arxiv.org/html/2501.06425/x8.png",
"caption": "(a)Training Loss",
"position": 3177
},
{
"img": "https://arxiv.org/html/2501.06425/x8.png",
"caption": "(a)Training Loss",
"position": 3180
},
{
"img": "https://arxiv.org/html/2501.06425/x9.png",
"caption": "(b)Validation Loss",
"position": 3185
},
{
"img": "https://arxiv.org/html/2501.06425/x10.png",
"caption": "(c)Validation Perplexity",
"position": 3190
}
]
},
{
"header": "Appendix BMore on Experiments",
"images": []
}
]
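The Figure 1 caption in 2501.06425 above describes TPA's core step: per token, latent factor matrices A and B produced by linear layers are combined by a tensor product to form the multi-head query, key, and value. Below is a hedged sketch of the query path only; the shapes and the rank R are illustrative assumptions rather than the paper's settings, and the RoPE applied to B_Q in the caption is omitted.

import torch

batch, seq, d_model = 2, 16, 256
heads, d_head, R = 8, 32, 4                  # R: factorization rank (assumed)

x = torch.randn(batch, seq, d_model)         # hidden states
to_A = torch.nn.Linear(d_model, heads * R)   # latent factor A for queries
to_B = torch.nn.Linear(d_model, R * d_head)  # latent factor B for queries

A = to_A(x).view(batch, seq, heads, R)
B = to_B(x).view(batch, seq, R, d_head)
# Tensor product of the factors: one (heads x d_head) query block per token.
Q = torch.einsum("bshr,bsrd->bshd", A, B) / R

Keys and values would be factorized the same way, after which standard scaled dot-product attention and an output projection follow, as the caption states.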