ipex xpu release for llama3 release (#2780)
jingxu10 authored Apr 18, 2024
1 parent 3ec89ff commit c1257be
Showing 51 changed files with 5,675 additions and 8 deletions.
6 changes: 4 additions & 2 deletions llm/llama3/cpu/_sources/index.md.txt
@@ -68,7 +68,7 @@ python run.py --help # for more detailed usages
|---|---|
| model id | "--model-name-or-path" or "-m" to specify the <LLAMA3_MODEL_ID_OR_LOCAL_PATH>, it is model id from Huggingface or downloaded local path |
| generation | default: beam search (beam size = 4), "--greedy" for greedy search |
- | input tokens | default: 32, provide fixed sizes for input prompt size, use "--input-tokens" for <INPUT_LENGTH> in [1024, 2048, 4096, 8192]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs|
+ | input tokens | provide fixed sizes for input prompt size, use "--input-tokens" for <INPUT_LENGTH> in [1024, 2048, 4096, 8192]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs|
| output tokens | default: 32, use "--max-new-tokens" to choose any other size |
| batch size | default: 1, use "--batch-size" to choose any other size |
| token latency | enable "--token-latency" to print out the first or next token latency |
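For reference, a minimal invocation combining the flags in this table might look as follows. This is a sketch only: the model id, core count, and NUMA node are illustrative assumptions, not values taken from this commit.

```bash
# Hypothetical example: greedy-search benchmark with a fixed 1024-token prompt.
# "56", "0-55", and node "0" assume a 56-core socket; adjust to your machine.
# The model id is a placeholder for any Llama 3 checkpoint or local path.
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark \
  -m meta-llama/Meta-Llama-3-8B \
  --greedy --input-tokens 1024 --max-new-tokens 32 --batch-size 1 --token-latency
```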
@@ -95,7 +95,8 @@ By default, for weight-only quantization, we use quantization with [Automatic Mi

- Command:
```bash
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results" --greedy --input-tokens <INPUT_LENGTH>
+ # Note: you can add "--group-size" to tune good accuracy, suggested range as one of [32, 64, 128, 256, 512].
```
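As a sketch of the note above, the same command with an explicit group size could look as follows; 128 is just one of the suggested values, not a recommendation from this commit.

```bash
# Illustrative variant: weight-only INT8 quantization with an explicit --group-size.
# All placeholders (<...>) are as in the command above; 128 is one suggested value.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> \
  python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> \
  --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp \
  --group-size 128 --output-dir "saved_results" --greedy --input-tokens <INPUT_LENGTH>
```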

#### 2.1.1.3 Notes:
@@ -134,6 +135,7 @@ For weight-only quantization with deepspeed, we quantize the model then run the
- Command:
```bash
deepspeed --bind_cores_to_rank run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --greedy --input-tokens <INPUT_LENGTH> --autotp --shard-model --output-dir "saved_results"
+ # Note: you can add "--group-size" to tune good accuracy, suggested range as one of [32, 64, 128, 256, 512].
```
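Likewise for the distributed case, a sketch with an explicit group size; 64 here is an arbitrary pick from the suggested range, and all other flags come from the command above.

```bash
# Illustrative variant: distributed weight-only INT8 run with an explicit --group-size.
deepspeed --bind_cores_to_rank run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> \
  --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp \
  --group-size 64 --greedy --input-tokens <INPUT_LENGTH> \
  --autotp --shard-model --output-dir "saved_results"
```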

#### 2.1.2.4 How to Shard Model weight files for Distributed Inference with DeepSpeed
2 changes: 1 addition & 1 deletion llm/llama3/cpu/genindex.html
@@ -95,7 +95,7 @@ <h1 id="index">Index</h1>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
- <jinja2.runtime.BlockReference object at 0x7f3596dc6260>
+ <jinja2.runtime.BlockReference object at 0x7f9b867d1de0>
<p></p><div><a href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html' data-cookie-notice='true'>Cookies</a> <a href='https://www.intel.com/content/www/us/en/privacy/intel-privacy-notice.html'>| Privacy</a> <a data-wap_ref='dns' id='wap_dns' href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html'>| Do Not Share My Personal Information</a> </div> <p></p> <div>&copy; Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), <a href='http://opensource.org/licenses/0BSD'>http://opensource.org/licenses/0BSD</a>. </div>


8 changes: 5 additions & 3 deletions llm/llama3/cpu/index.html
@@ -185,7 +185,7 @@ <h1>2. How To Run Llama 3 with ipex.llm<a class="headerlink" href="#how-to-run-l
</tr>
<tr>
<td>input tokens</td>
- <td>default: 32, provide fixed sizes for input prompt size, use "--input-tokens" for <INPUT_LENGTH> in [1024, 2048, 4096, 8192]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs</td>
+ <td>provide fixed sizes for input prompt size, use "--input-tokens" for <INPUT_LENGTH> in [1024, 2048, 4096, 8192]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs</td>
</tr>
<tr>
<td>output tokens</td>
@@ -229,7 +229,8 @@ <h4>2.1.1.2 Weight-only quantization (INT8):<a class="headerlink" href="#weight-
<ul class="simple">
<li><p>Command:</p></li>
</ul>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>&lt;physical<span class="w"> </span>cores<span class="w"> </span>num&gt;<span class="w"> </span>numactl<span class="w"> </span>-m<span class="w"> </span>&lt;node<span class="w"> </span>N&gt;<span class="w"> </span>-C<span class="w"> </span>&lt;physical<span class="w"> </span>cores<span class="w"> </span>list&gt;<span class="w"> </span>python<span class="w"> </span>run.py<span class="w"> </span>--benchmark<span class="w"> </span>-m<span class="w"> </span>&lt;LLAMA3_MODEL_ID_OR_LOCAL_PATH&gt;<span class="w"> </span>--ipex-weight-only-quantization<span class="w"> </span>--weight-dtype<span class="w"> </span>INT8<span class="w"> </span>--quant-with-amp<span class="w"> </span>--output-dir<span class="w"> </span><span class="s2">&quot;saved_results&quot;</span><span class="w"> </span>--greedy<span class="w"> </span>--input-tokens<span class="w"> </span>&lt;INPUT_LENGTH&gt;<span class="w"> </span>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>&lt;physical<span class="w"> </span>cores<span class="w"> </span>num&gt;<span class="w"> </span>numactl<span class="w"> </span>-m<span class="w"> </span>&lt;node<span class="w"> </span>N&gt;<span class="w"> </span>-C<span class="w"> </span>&lt;physical<span class="w"> </span>cores<span class="w"> </span>list&gt;<span class="w"> </span>python<span class="w"> </span>run.py<span class="w"> </span>--benchmark<span class="w"> </span>-m<span class="w"> </span>&lt;LLAMA3_MODEL_ID_OR_LOCAL_PATH&gt;<span class="w"> </span>--ipex-weight-only-quantization<span class="w"> </span>--weight-dtype<span class="w"> </span>INT8<span class="w"> </span>--quant-with-amp<span class="w"> </span>--output-dir<span class="w"> </span><span class="s2">&quot;saved_results&quot;</span><span class="w"> </span>--greedy<span class="w"> </span>--input-tokens<span class="w"> </span>&lt;INPUT_LENGTH&gt;
<span class="c1"># Note: you can add &quot;--group-size&quot; to tune good accuracy, suggested range as one of [32, 64, 128, 256, 512].</span>
</pre></div>
</div>
</section>
@@ -268,6 +269,7 @@ <h4>2.1.2.3 Weight-only quantization (INT8):<a class="headerlink" href="#id2" ti
<li><p>Command:</p></li>
</ul>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>deepspeed<span class="w"> </span>--bind_cores_to_rank<span class="w"> </span>run.py<span class="w"> </span>--benchmark<span class="w"> </span>-m<span class="w"> </span>&lt;LLAMA3_MODEL_ID_OR_LOCAL_PATH&gt;<span class="w"> </span>--ipex<span class="w"> </span>--ipex-weight-only-quantization<span class="w"> </span>--weight-dtype<span class="w"> </span>INT8<span class="w"> </span>--quant-with-amp<span class="w"> </span>--greedy<span class="w"> </span>--input-tokens<span class="w"> </span>&lt;INPUT_LENGTH&gt;<span class="w"> </span>--autotp<span class="w"> </span>--shard-model<span class="w"> </span>--output-dir<span class="w"> </span><span class="s2">&quot;saved_results&quot;</span>
<span class="c1"># Note: you can add &quot;--group-size&quot; to tune good accuracy, suggested range as one of [32, 64, 128, 256, 512].</span>
</pre></div>
</div>
</section>
@@ -303,7 +305,7 @@ <h2>Miscellaneous Tips<a class="headerlink" href="#miscellaneous-tips" title="Li
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
- <jinja2.runtime.BlockReference object at 0x7f3596b77550>
+ <jinja2.runtime.BlockReference object at 0x7f9b865a0c70>
<p></p><div><a href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html' data-cookie-notice='true'>Cookies</a> <a href='https://www.intel.com/content/www/us/en/privacy/intel-privacy-notice.html'>| Privacy</a> <a data-wap_ref='dns' id='wap_dns' href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html'>| Do Not Share My Personal Information</a> </div> <p></p> <div>&copy; Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), <a href='http://opensource.org/licenses/0BSD'>http://opensource.org/licenses/0BSD</a>. </div>


2 changes: 1 addition & 1 deletion llm/llama3/cpu/search.html
@@ -103,7 +103,7 @@
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
- <jinja2.runtime.BlockReference object at 0x7f3596b750f0>
+ <jinja2.runtime.BlockReference object at 0x7f9b865a26e0>
<p></p><div><a href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html' data-cookie-notice='true'>Cookies</a> <a href='https://www.intel.com/content/www/us/en/privacy/intel-privacy-notice.html'>| Privacy</a> <a data-wap_ref='dns' id='wap_dns' href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html'>| Do Not Share My Personal Information</a> </div> <p></p> <div>&copy; Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), <a href='http://opensource.org/licenses/0BSD'>http://opensource.org/licenses/0BSD</a>. </div>


Expand Down