ipex xpu release for llama3 release (#2780)
jingxu10 authored Apr 18, 2024
1 parent 3ec89ff commit c1257be
Showing 51 changed files with 5,675 additions and 8 deletions.
6 changes: 4 additions & 2 deletions llm/llama3/cpu/_sources/index.md.txt
@@ -68,7 +68,7 @@ python run.py --help # for more detailed usages
|---|---|
| model id | "--model-name-or-path" or "-m" to specify the <LLAMA3_MODEL_ID_OR_LOCAL_PATH>, it is model id from Huggingface or downloaded local path |
| generation | default: beam search (beam size = 4), "--greedy" for greedy search |
- | input tokens | default: 32, provide fixed sizes for input prompt size, use "--input-tokens" for <INPUT_LENGTH> in [1024, 2048, 4096, 8192]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs|
+ | input tokens | provide fixed sizes for input prompt size, use "--input-tokens" for <INPUT_LENGTH> in [1024, 2048, 4096, 8192]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs|
| output tokens | default: 32, use "--max-new-tokens" to choose any other size |
| batch size | default: 1, use "--batch-size" to choose any other size |
| token latency | enable "--token-latency" to print out the first or next token latency |
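For reference, a minimal invocation combining the flags in this table might look as follows. This is a sketch only: the model id, core count, and NUMA node are illustrative assumptions, not values taken from this commit.

```bash
# Hypothetical example: greedy-search benchmark with a fixed 1024-token prompt.
# "56", "0-55", and node "0" assume a 56-core socket; adjust to your machine.
# The model id is a placeholder for any Llama 3 checkpoint or local path.
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark \
  -m meta-llama/Meta-Llama-3-8B \
  --greedy --input-tokens 1024 --max-new-tokens 32 --batch-size 1 --token-latency
```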
@@ -95,7 +95,8 @@ By default, for weight-only quantization, we use quantization with [Automatic Mi

- Command:
```bash
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results" --greedy --input-tokens <INPUT_LENGTH>
+ # Note: you can add "--group-size" to tune good accuracy, suggested range as one of [32, 64, 128, 256, 512].
```
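As a sketch of the note above, the same command with an explicit group size could look as follows; 128 is just one of the suggested values, not a recommendation from this commit.

```bash
# Illustrative variant: weight-only INT8 quantization with an explicit --group-size.
# All placeholders (<...>) are as in the command above; 128 is one suggested value.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> \
  python run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> \
  --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp \
  --group-size 128 --output-dir "saved_results" --greedy --input-tokens <INPUT_LENGTH>
```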

#### 2.1.1.3 Notes:
@@ -134,6 +135,7 @@ For weight-only quantization with deepspeed, we quantize the model then run the
- Command:
```bash
deepspeed --bind_cores_to_rank run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --greedy --input-tokens <INPUT_LENGTH> --autotp --shard-model --output-dir "saved_results"
+ # Note: you can add "--group-size" to tune good accuracy, suggested range as one of [32, 64, 128, 256, 512].
```
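Likewise for the distributed case, a sketch with an explicit group size; 64 here is an arbitrary pick from the suggested range, and all other flags come from the command above.

```bash
# Illustrative variant: distributed weight-only INT8 run with an explicit --group-size.
deepspeed --bind_cores_to_rank run.py --benchmark -m <LLAMA3_MODEL_ID_OR_LOCAL_PATH> \
  --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp \
  --group-size 64 --greedy --input-tokens <INPUT_LENGTH> \
  --autotp --shard-model --output-dir "saved_results"
```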

#### 2.1.2.4 How to Shard Model weight files for Distributed Inference with DeepSpeed
2 changes: 1 addition & 1 deletion llm/llama3/cpu/genindex.html
@@ -95,7 +95,7 @@ <h1 id="index">Index</h1>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
- <jinja2.runtime.BlockReference object at 0x7f3596dc6260>
+ <jinja2.runtime.BlockReference object at 0x7f9b867d1de0>
<p></p><div><a href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html' data-cookie-notice='true'>Cookies</a> <a href='https://www.intel.com/content/www/us/en/privacy/intel-privacy-notice.html'>| Privacy</a> <a data-wap_ref='dns' id='wap_dns' href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html'>| Do Not Share My Personal Information</a> </div> <p></p> <div>&copy; Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), <a href='http://opensource.org/licenses/0BSD'>http://opensource.org/licenses/0BSD</a>. </div>


8 changes: 5 additions & 3 deletions llm/llama3/cpu/index.html
@@ -185,7 +185,7 @@ <h1>2. How To Run Llama 3 with ipex.llm<a class="headerlink" href="#how-to-run-l
</tr>
<tr>
<td>input tokens</td>
- <td>default: 32, provide fixed sizes for input prompt size, use "--input-tokens" for <INPUT_LENGTH> in [1024, 2048, 4096, 8192]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs</td>
+ <td>provide fixed sizes for input prompt size, use "--input-tokens" for <INPUT_LENGTH> in [1024, 2048, 4096, 8192]; if "--input-tokens" is not used, use "--prompt" to choose other strings as inputs</td>
</tr>
<tr>
<td>output tokens</td>
@@ -229,7 +229,8 @@ <h4>2.1.1.2 Weight-only quantization (INT8):<a class="headerlink" href="#weight-
<ul class="simple">
<li><p>Command:</p></li>
</ul>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>&lt;physical<span class="w"> </span>cores<span class="w"> </span>num&gt;<span class="w"> </span>numactl<span class="w"> </span>-m<span class="w"> </span>&lt;node<span class="w"> </span>N&gt;<span class="w"> </span>-C<span class="w"> </span>&lt;physical<span class="w"> </span>cores<span class="w"> </span>list&gt;<span class="w"> </span>python<span class="w"> </span>run.py<span class="w"> </span>--benchmark<span class="w"> </span>-m<span class="w"> </span>&lt;LLAMA3_MODEL_ID_OR_LOCAL_PATH&gt;<span class="w"> </span>--ipex-weight-only-quantization<span class="w"> </span>--weight-dtype<span class="w"> </span>INT8<span class="w"> </span>--quant-with-amp<span class="w"> </span>--output-dir<span class="w"> </span><span class="s2">&quot;saved_results&quot;</span><span class="w"> </span>--greedy<span class="w"> </span>--input-tokens<span class="w"> </span>&lt;INPUT_LENGTH&gt;<span class="w"> </span>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>&lt;physical<span class="w"> </span>cores<span class="w"> </span>num&gt;<span class="w"> </span>numactl<span class="w"> </span>-m<span class="w"> </span>&lt;node<span class="w"> </span>N&gt;<span class="w"> </span>-C<span class="w"> </span>&lt;physical<span class="w"> </span>cores<span class="w"> </span>list&gt;<span class="w"> </span>python<span class="w"> </span>run.py<span class="w"> </span>--benchmark<span class="w"> </span>-m<span class="w"> </span>&lt;LLAMA3_MODEL_ID_OR_LOCAL_PATH&gt;<span class="w"> </span>--ipex-weight-only-quantization<span class="w"> </span>--weight-dtype<span class="w"> </span>INT8<span class="w"> </span>--quant-with-amp<span class="w"> </span>--output-dir<span class="w"> </span><span class="s2">&quot;saved_results&quot;</span><span class="w"> </span>--greedy<span class="w"> </span>--input-tokens<span class="w"> </span>&lt;INPUT_LENGTH&gt;
<span class="c1"># Note: you can add &quot;--group-size&quot; to tune good accuracy, suggested range as one of [32, 64, 128, 256, 512].</span>
</pre></div>
</div>
</section>
@@ -268,6 +269,7 @@ <h4>2.1.2.3 Weight-only quantization (INT8):<a class="headerlink" href="#id2" ti
<li><p>Command:</p></li>
</ul>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>deepspeed<span class="w"> </span>--bind_cores_to_rank<span class="w"> </span>run.py<span class="w"> </span>--benchmark<span class="w"> </span>-m<span class="w"> </span>&lt;LLAMA3_MODEL_ID_OR_LOCAL_PATH&gt;<span class="w"> </span>--ipex<span class="w"> </span>--ipex-weight-only-quantization<span class="w"> </span>--weight-dtype<span class="w"> </span>INT8<span class="w"> </span>--quant-with-amp<span class="w"> </span>--greedy<span class="w"> </span>--input-tokens<span class="w"> </span>&lt;INPUT_LENGTH&gt;<span class="w"> </span>--autotp<span class="w"> </span>--shard-model<span class="w"> </span>--output-dir<span class="w"> </span><span class="s2">&quot;saved_results&quot;</span>
<span class="c1"># Note: you can add &quot;--group-size&quot; to tune good accuracy, suggested range as one of [32, 64, 128, 256, 512].</span>
</pre></div>
</div>
</section>
@@ -303,7 +305,7 @@ <h2>Miscellaneous Tips<a class="headerlink" href="#miscellaneous-tips" title="Li
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
- <jinja2.runtime.BlockReference object at 0x7f3596b77550>
+ <jinja2.runtime.BlockReference object at 0x7f9b865a0c70>
<p></p><div><a href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html' data-cookie-notice='true'>Cookies</a> <a href='https://www.intel.com/content/www/us/en/privacy/intel-privacy-notice.html'>| Privacy</a> <a data-wap_ref='dns' id='wap_dns' href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html'>| Do Not Share My Personal Information</a> </div> <p></p> <div>&copy; Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), <a href='http://opensource.org/licenses/0BSD'>http://opensource.org/licenses/0BSD</a>. </div>


2 changes: 1 addition & 1 deletion llm/llama3/cpu/search.html
@@ -103,7 +103,7 @@
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
- <jinja2.runtime.BlockReference object at 0x7f3596b750f0>
+ <jinja2.runtime.BlockReference object at 0x7f9b865a26e0>
<p></p><div><a href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html' data-cookie-notice='true'>Cookies</a> <a href='https://www.intel.com/content/www/us/en/privacy/intel-privacy-notice.html'>| Privacy</a> <a data-wap_ref='dns' id='wap_dns' href='https://www.intel.com/content/www/us/en/privacy/intel-cookie-notice.html'>| Do Not Share My Personal Information</a> </div> <p></p> <div>&copy; Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), <a href='http://opensource.org/licenses/0BSD'>http://opensource.org/licenses/0BSD</a>. </div>


Expand Down