
[Bug] Broken for Intel Macs since v0.15 (or earlier) #3078

Open
zxcat opened this issue Dec 31, 2024 · 5 comments
Labels
bug Confirmed bugs

Comments


zxcat commented Dec 31, 2024

🐛 Bug

On macOS Ventura, mlc_llm fails with:

InternalError: Check failed: (state != nil) is false: cannot get state: for function fused_dequantize1_NT_matmul5_kernel_2SC compilation failure
There is a call to an undefined label

in chat/REST mode.

I've tried every accessible mlc_ai/mlc_llm wheel pair: the current nightly, v0.18.1, v0.17.2, and v0.17.1 with the _cpu suffix, plus nightly 0.15 without the suffix, but the error is the same. I've also tried different models; sometimes the failing function is fused_dequantize1_NT_matmul1_… instead of fused_dequantize1_NT_matmul5_…, but the error persists.

On Catalina there was a different error: something about an unsupported Metal version 2.3.

To Reproduce

Steps to reproduce the behavior:

  1. Follow the install guide for macOS using "Option 1. Prebuilt Package"
  2. Verify installation
    • note: it shows strange warnings:
    python -c "import mlc_llm; print(mlc_llm)"
    
    [22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
    [22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
    [22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
    <module 'mlc_llm' from '/Volumes/Seagate/proj/mlc12/lib/python3.12/site-packages/mlc_llm/__init__.py'>
    
  3. Download the model (HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC)
  4. Run mlc_llm chat --overrides "prefill_chunk_size=4096" ./
    • note: I had to change prefill_chunk_size, because with the default value there is a "not enough GPU memory" error (a Python-API sketch of the same repro follows this list).
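
For reference, the same failure should be reachable without the CLI. The sketch below follows the MLC LLM Python-API quickstart (I have not run this exact script, it is only an outline): creating the engine JIT-compiles the same model library, and any prefill should hit the same Metal kernel.

```python
# Minimal Python-API repro sketch (follows the MLC LLM quickstart; untested here).
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)  # JIT-compiles the model lib, same as `mlc_llm chat`

# Any prefill should be enough to reach the fused_dequantize1_NT_matmul* kernel.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()
```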

The model compiles, but fails when the chat starts:

Details

mlc_llm chat --overrides "prefill_chunk_size=4096" ./

[2024-12-31 16:56:49] INFO auto_device.py:88: Not found device: cuda:0
[2024-12-31 16:56:50] INFO auto_device.py:88: Not found device: rocm:0
[2024-12-31 16:56:51] INFO auto_device.py:79: Found device: metal:0
[2024-12-31 16:56:52] INFO auto_device.py:88: Not found device: vulkan:0
[2024-12-31 16:56:54] INFO auto_device.py:88: Not found device: opencl:0
[2024-12-31 16:56:54] INFO auto_device.py:35: Using device: metal:0
[2024-12-31 16:56:54] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-12-31 16:56:54] INFO jit.py:118: Compiling using commands below:
[2024-12-31 16:56:54] INFO jit.py:119: /Volumes/Seagate/proj/env-mlc11/bin/python -m mlc_llm compile . --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides prefill_chunk_size=4096 --device metal:0 --output /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmphxc7xpnk/lib.dylib
[2024-12-31 16:56:55] INFO auto_config.py:70: Found model configuration: mlc-chat-config.json
[2024-12-31 16:56:55] INFO auto_target.py:91: Detecting target device: metal:0
[2024-12-31 16:56:55] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(1), "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
[2024-12-31 16:56:55] INFO auto_target.py:110: Found host LLVM triple: x86_64-apple-darwin22.6.0
[2024-12-31 16:56:55] INFO auto_target.py:111: Found host LLVM CPU: skylake
[2024-12-31 16:56:55] INFO auto_config.py:154: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
  --model-type      llama
  --target          {"thread_warp_size": runtime.BoxInt(1), "host": {"mtriple": "x86_64-apple-darwin22.6.0", "tag": "", "kind": "llvm", "mcpu": "skylake", "keys": ["cpu"]}, "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmphxc7xpnk/lib.dylib
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=4096;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None
[2024-12-31 16:56:55] INFO config.py:107: Overriding prefill_chunk_size from 8192 to 4096
[2024-12-31 16:56:55] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
[2024-12-31 16:56:55] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-12-31 16:56:58] INFO compile.py:164: Running optimizations using TVM Unity
[2024-12-31 16:56:58] INFO compile.py:185: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 4096, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}
[2024-12-31 16:56:59] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-12-31 16:57:03] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-12-31 16:57:09] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-12-31 16:57:29] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-12-31 16:57:38] INFO pipeline.py:54: Lowering to VM bytecode
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 32.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 18.50 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode_to_last_hidden_states`: 19.50 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 593.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 624.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_select_last_hidden_states`: 1.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 592.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify_to_last_hidden_states`: 624.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.14 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode_to_last_hidden_states`: 0.15 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 32.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `get_logits`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 592.01 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill_to_last_hidden_states`: 624.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-12-31 16:57:44] INFO pipeline.py:54: Compiling external modules
[2024-12-31 16:57:44] INFO pipeline.py:54: Compilation complete! Exporting to disk
[2024-12-31 16:57:50] INFO model_metadata.py:95: Total memory usage without KV cache:: 4932.13 MB (Parameters: 4308.13 MB. Temporary buffer: 624.00 MB)
[2024-12-31 16:57:50] INFO model_metadata.py:103: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-12-31 16:57:50] INFO compile.py:207: Generated: /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmphxc7xpnk/lib.dylib
[2024-12-31 16:57:50] INFO jit.py:126: Using compiled model lib: /Users/zxk/.cache/mlc_llm/model_lib/ef8e85d8f28ab72418b1cbacbca56dc8.dylib
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size 4096 is specified by user.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 8192, prefill chunk size 4096 is specified by user.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 7575, prefill chunk size 4096 is specified by user.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 8192, prefill chunk size is 4096.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 6775.170 MB (Parameters: 4308.133 MB. KVCache: 1153.041 MB. Temporary buffer: 1313.996 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out stats of last request (token/sec)
  /metrics            print out full engine metrics
  /reset              restart a fresh chat
  /set [overrides]    override settings in the generation config. For example,
                      `/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
                      Note: Separate stop words in the `stop` option with commas (,).
  Multi-line input: Use escape+enter to start a new line.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Volumes/Seagate/proj/env-mlc11/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/Volumes/Seagate/proj/env-mlc11/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
  File "/Volumes/Seagate/proj/env-mlc11/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  File "/Users/runner/work/package/package/tvm/src/runtime/metal/metal_module.mm", line 130
InternalError: Check failed: (state != nil) is false: cannot get state: for function fused_dequantize1_NT_matmul5_kernel_2SC compilation failure
There is a call to an undefined label

The same happens in serve mode and with other models.

Expected behavior

Chat works.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Metal
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): macOS Ventura 13.7.2
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): iMac 2020 + AMD Radeon Pro 5500 XT 8 Gb
  • How you installed MLC-LLM (conda, source): pip (tried every stable+nightly version)
  • How you installed TVM-Unity (pip, source): pip (tried every stable+nightly version)
  • Python version (e.g. 3.10): 3.11
  • GPU driver version (if applicable): -
  • CUDA/cuDNN version (if applicable): -
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): -
  • Any other relevant information:

Additional context

It seems that the problem is not new. The oldest mlc-ai/mlc-llm pair I was able to test is from September:

pip list | grep mlc

mlc-ai-nightly                           0.15.dev570
mlc-llm-nightly                          0.1.dev1524

The stable mlc_ai_cpu-0.15.1-cp311-cp311-macosx_10_15_x86_64 wheel (from August) has no matching mlc_llm_… wheel, so I cannot test it; it does not work with mlc_llm 0.17+.

zxcat added the bug (Confirmed bugs) label on Dec 31, 2024

zxcat commented Dec 31, 2024

I just tried stable mlc 0.18.1 with Python 3.12 in a fresh conda env, and the result is the same failure:

Details

$ conda create --prefix ./mlc12
$ conda activate ./mlc12
$ conda install python=3.12
$ python --version

Python 3.12.8

$ python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-cpu mlc-ai-cpu

…

$ python -c "import mlc_llm; print(mlc_llm)"

[22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
<module 'mlc_llm' from '/Volumes/Seagate/proj/mlc12/lib/python3.12/site-packages/mlc_llm/__init__.py'>

$ cd Llama-3-8B-Instruct-q4f16_1-MLC
$ mlc_llm chat --overrides "prefill_chunk_size=4096" ./

[22:07:25] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[22:07:25] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[22:07:25] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[2024-12-31 22:07:27] INFO auto_device.py:88: Not found device: cuda:0
[2024-12-31 22:07:29] INFO auto_device.py:88: Not found device: rocm:0
[2024-12-31 22:07:30] INFO auto_device.py:79: Found device: metal:0
[2024-12-31 22:07:31] INFO auto_device.py:88: Not found device: vulkan:0
[2024-12-31 22:07:32] INFO auto_device.py:88: Not found device: opencl:0
[2024-12-31 22:07:32] INFO auto_device.py:35: Using device: metal:0
[2024-12-31 22:07:32] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-12-31 22:07:32] INFO jit.py:118: Compiling using commands below:
[2024-12-31 22:07:32] INFO jit.py:119: /Volumes/Seagate/proj/mlc12/bin/python -m mlc_llm compile . --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides prefill_chunk_size=4096 --device metal:0 --output /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmpcolb4kr1/lib.dylib
[22:07:33] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[22:07:33] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[22:07:33] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
[2024-12-31 22:07:34] INFO auto_config.py:70: Found model configuration: mlc-chat-config.json
[2024-12-31 22:07:34] INFO auto_target.py:91: Detecting target device: metal:0
[2024-12-31 22:07:34] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(1), "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
[2024-12-31 22:07:34] INFO auto_target.py:110: Found host LLVM triple: x86_64-apple-darwin22.6.0
[2024-12-31 22:07:34] INFO auto_target.py:111: Found host LLVM CPU: skylake
[2024-12-31 22:07:34] INFO auto_config.py:154: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
  --model-type      llama
  --target          {"thread_warp_size": runtime.BoxInt(1), "host": {"mtriple": "x86_64-apple-darwin22.6.0", "tag": "", "kind": "llvm", "mcpu": "skylake", "keys": ["cpu"]}, "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmpcolb4kr1/lib.dylib
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=4096;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None
[2024-12-31 22:07:34] INFO config.py:107: Overriding prefill_chunk_size from 8192 to 4096
[2024-12-31 22:07:34] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
[2024-12-31 22:07:34] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-12-31 22:07:37] INFO compile.py:164: Running optimizations using TVM Unity
[2024-12-31 22:07:37] INFO compile.py:185: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 4096, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}
[2024-12-31 22:07:38] INFO pipeline.py:55: Running TVM Relax graph-level optimizations
[2024-12-31 22:07:43] INFO pipeline.py:55: Lowering to TVM TIR kernels
[2024-12-31 22:07:48] INFO pipeline.py:55: Running TVM TIR-level optimizations
[2024-12-31 22:08:08] INFO pipeline.py:55: Running TVM Dlight low-level optimizations
[2024-12-31 22:08:17] INFO pipeline.py:55: Lowering to VM bytecode
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 32.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 18.50 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode_to_last_hidden_states`: 19.50 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 593.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 624.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_select_last_hidden_states`: 1.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 592.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify_to_last_hidden_states`: 624.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.14 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode_to_last_hidden_states`: 0.15 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 32.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `get_logits`: 0.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 592.01 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill_to_last_hidden_states`: 624.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-12-31 22:08:21] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-12-31 22:08:23] INFO pipeline.py:55: Compiling external modules
[2024-12-31 22:08:23] INFO pipeline.py:55: Compilation complete! Exporting to disk
[2024-12-31 22:08:29] INFO model_metadata.py:94: Total memory usage without KV cache:: 4932.13 MB (Parameters: 4308.13 MB. Temporary buffer: 624.00 MB)
[2024-12-31 22:08:29] INFO model_metadata.py:102: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-12-31 22:08:29] INFO compile.py:207: Generated: /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmpcolb4kr1/lib.dylib
[2024-12-31 22:08:29] INFO jit.py:126: Using compiled model lib: /Users/zxk/.cache/mlc_llm/model_lib/ef8e85d8f28ab72418b1cbacbca56dc8.dylib
[22:08:30] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size 4096 is specified by user.
[22:08:30] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 8192, prefill chunk size 4096 is specified by user.
[22:08:30] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 7575, prefill chunk size 4096 is specified by user.
[22:08:30] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 8192, prefill chunk size is 4096.
[22:08:30] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 6775.170 MB (Parameters: 4308.133 MB. KVCache: 1153.041 MB. Temporary buffer: 1313.996 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out stats of last request (token/sec)
  /metrics            print out full engine metrics
  /reset              restart a fresh chat
  /set [overrides]    override settings in the generation config. For example,
                      `/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
                      Note: Separate stop words in the `stop` option with commas (,).
  Multi-line input: Use escape+enter to start a new line.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Volumes/Seagate/proj/mlc12/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/Volumes/Seagate/proj/mlc12/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
  File "/Volumes/Seagate/proj/mlc12/lib/python3.12/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  File "/Users/runner/work/package/package/tvm/src/runtime/metal/metal_module.mm", line 130
InternalError: Check failed: (state != nil) is false: cannot get state: for function fused_dequantize1_NT_matmul5_kernel_2SC compilation failure
There is a call to an undefined label

@MasterJH5574 (Member) commented:

Thank you @zxcat for bringing this up. The workflow runs smoothly on Apple Silicon Macs on our end. However, we don't have Intel Macs available to test with at the moment.


zxcat commented Jan 6, 2025

@MasterJH5574, I can try to gather more info on my end. Are there options to do so?
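
For instance, I could start by dumping the TVM build info and the Metal device limits TVM sees. The sketch below only uses TVM's public Device properties and the libinfo helper from the environment template; it's a starting point, not an official debug recipe.

```python
# Diagnostics sketch: print TVM build info and Metal device properties
# exposed by tvm.runtime.Device (a starting point, not an official recipe).
import tvm

for k, v in tvm.support.libinfo().items():
    print(f"{k}: {v}")

dev = tvm.metal(0)
print("exists:", dev.exist)
print("device name:", dev.device_name)
print("max threads per block:", dev.max_threads_per_block)
print("max shared memory per block:", dev.max_shared_memory_per_block)
```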

BTW, it seems #2995 is the same issue.


JakoDel commented Jan 13, 2025

Thanks for sharing the latest working build. It's a shame it's no longer functional, because there aren't many other backends that support Intel Macs. Maybe I'll check later which commit is the cause.

marekk1717 commented:

I've got exactly the same issue with DeepSeek-R1-Distill-Llama-8B-q4f16_1-MLC:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.11/3.11.11/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/Users/marekk/source/p311mlc/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 766, in run_closure
    _threading_Thread_run(self)
  File "/usr/local/Cellar/python@3.11/3.11.11/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
  File "/Users/marekk/source/p311mlc/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  File "/Users/runner/work/package/package/tvm/src/runtime/metal/metal_module.mm", line 130
InternalError: Check failed: (state != nil) is false: cannot get state: for function fused_dequantize1_NT_matmul5_kernel_2SC compilation failure
There is a call to an undefined label
