
[Question] Huge memory usage on iOS device using Qwen2.5-3B — is this normal performance? #3083

Open
eaaajay opened this issue Jan 6, 2025 · 5 comments
Labels
question Question about the usage

Comments


eaaajay commented Jan 6, 2025

❓ General Questions

I converted the Qwen2.5-3B model to MLC format. When running the model on an iPhone 13 Pro (iOS 18), the memory usage is very high, larger than the model size, as the following screenshot shows:
[screenshot: memory usage]

mlc-package-config.json file content:

{
    "device": "iphone",
    "model_list": [
        {
            "model": "/Users/wangyujie/mlc-llm/dist/qwen-2.5-0.5B-mlc",
            "model_id": "qwen2.5-0.5B-q4f16_1-MLC",
            "estimated_vram_bytes": 3316000000,
            "bundle_weight": true,
            "overrides": {
                           "context_window_size": 1024
                        }
        },
        {
            "model": "/Users/wangyujie/mlc-llm/dist/qwen-2.5-3B-mlc",
            "model_id": "qwen-2.5-3B-q4f16_1-MLC",
            "estimated_vram_bytes": 3316000000,
            "bundle_weight": true,
            "overrides": {
                           "context_window_size": 1024
                        }
        }
        
    ]
}
@eaaajay eaaajay added the question Question about the usage label Jan 6, 2025
@MasterJH5574
Member

Thank you for the question. In addition to the model weights, the memory consumption also includes the KV cache and some other temporary buffers. The memory usage you showed here looks reasonable to me.
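To get a feel for the numbers, here is a rough back-of-envelope KV-cache estimate. This is only a sketch, assuming an fp16 KV cache and the Qwen2.5-3B config values that appear later in this thread (36 layers, 2 KV heads, head_dim 128); the actual MLC runtime allocations will differ:

```python
# Rough KV-cache size estimate (a sketch; real MLC allocations differ).
# Config values taken from the mlc-chat-config.json shown in this thread.
NUM_LAYERS = 36
NUM_KV_HEADS = 2
HEAD_DIM = 128
BYTES_FP16 = 2  # q4f16_1 keeps the KV cache in fp16

def kv_cache_bytes(context_window_size: int) -> int:
    # 2x for the separate K and V tensors in every layer
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * context_window_size * BYTES_FP16

print(kv_cache_bytes(1024) / 2**20)   # ~36 MiB at the overridden 1024-token window
print(kv_cache_bytes(32768) / 2**20)  # ~1152 MiB at the default 32768-token window
```

With GQA (only 2 KV heads) the KV cache stays modest at a 1024-token window; most of the footprint is then the q4f16_1 weights (roughly 1.5–2 GB for a 3B model) plus temporary buffers.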


eaaajay commented Jan 6, 2025

> Thank you for the question. In addition to the model weights, the memory consumption also includes the KV cache and some other temporary buffers. The memory usage you showed here looks reasonable to me.

Thanks for your reply. Is there any way to reduce the memory usage? It is too high to use on an iPhone. I also tested other on-device LLM inference frameworks, and their memory usage is lower.

@MasterJH5574
Member

@ted1995 You can try reducing context_window_size and prefill_chunk_size in mlc-package-config.json. For example:

            "overrides": {
                           "context_window_size": 768,
                           "prefill_chunk_size": 256
                        }

@eaaajay eaaajay closed this as completed Jan 7, 2025

eaaajay commented Jan 7, 2025

> @ted1995 You can try reducing context_window_size and prefill_chunk_size in mlc-package-config.json. For example:

            "overrides": {
                           "context_window_size": 768,
                           "prefill_chunk_size": 256
                        }

@MasterJH5574 I have tried your overrides config, but the memory usage is still huge, as the following screenshot shows:
[screenshot: memory usage after overrides]

After the compile phase, I found that dist/bundle/qwen-2.5-3B-q4f16_1-MLC/mlc-chat-config.json contains:

{
  "version": "0.1.0",
  "model_type": "qwen2",
  "quantization": "q4f16_1",
  "model_config": {
    "hidden_act": "silu",
    "hidden_size": 2048,
    "intermediate_size": 11008,
    "num_attention_heads": 16,
    "num_hidden_layers": 36,
    "num_key_value_heads": 2,
    "rms_norm_eps": 1e-06,
    "rope_theta": 1000000.0,
    "vocab_size": 151936,
    "tie_word_embeddings": true,
    "context_window_size": 32768,
    "prefill_chunk_size": 8192,
    "tensor_parallel_shards": 1,
    "head_dim": 128,
    "dtype": "float32",
    "max_batch_size": 128
  },
  "vocab_size": 151936,
  "context_window_size": 32768,
  "sliding_window_size": -1,
  "prefill_chunk_size": 8192,
  "attention_sink_size": -1,
  "tensor_parallel_shards": 1,
  "pipeline_parallel_stages": 1,
  "temperature": 0.7,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "repetition_penalty": 1.05,
  "top_p": 0.8,
  "tokenizer_files": [
    "tokenizer.json",
    "vocab.json",
    "merges.txt",
    "tokenizer_config.json"
  ],
  "tokenizer_info": {
    "token_postproc_method": "byte_level",
    "prepend_space_in_encode": false,
    "strip_space_in_decode": false
  },
  "conv_template": {
    "name": "qwen2",
    "system_template": "<|im_start|>system\n{system_message}<|im_end|>\n",
    "system_message": "You are a helpful assistant.",
    "system_prefix_token_ids": null,
    "add_role_after_system_message": true,
    "roles": {
      "user": "<|im_start|>user",
      "assistant": "<|im_start|>assistant"
    },
    "role_templates": {
      "user": "{user_message}",
      "assistant": "{assistant_message}",
      "tool": "{tool_message}"
    },
    "messages": [],
    "seps": [
      "<|im_end|>\n"
    ],
    "role_content_sep": "\n",
    "role_empty_sep": "\n",
    "stop_str": [
      "<|endoftext|>",
      "<|im_end|>"
    ],
    "stop_token_ids": [
      151643,
      151645
    ],
    "function_string": "",
    "use_function_calling": false
  },
  "pad_token_id": 151643,
  "bos_token_id": 151643,
  "eos_token_id": [
    151645,
    151643
  ]
}

The context_window_size and prefill_chunk_size in the JSON file are as follows, not my override values of 768 and 256. Is this causing the high memory usage?

"context_window_size": 32768,
"prefill_chunk_size": 8192,
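For intuition on why these two settings matter: the KV cache scales with context_window_size, and the temporary prefill buffers scale with prefill_chunk_size. The following is a hypothetical sizing comparison (not MLC's actual allocator), assuming fp16 activations and the config values above (36 layers, 2 KV heads, head_dim 128, hidden_size 2048, intermediate_size 11008):

```python
# Hypothetical sizing comparison (a sketch, not MLC's actual allocator).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 36, 2, 128
HIDDEN, INTERMEDIATE = 2048, 11008
FP16 = 2  # bytes per element

def kv_cache(ctx: int) -> int:
    # 2x for the separate K and V tensors in every layer
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * ctx * FP16

def prefill_activations(chunk: int) -> int:
    # one hidden-state plus one MLP intermediate activation per chunk, fp16
    return chunk * (HIDDEN + INTERMEDIATE) * FP16

for ctx, chunk in [(32768, 8192), (768, 256)]:
    total_mib = (kv_cache(ctx) + prefill_activations(chunk)) / 2**20
    print(f"ctx={ctx:5d} chunk={chunk:4d} -> ~{total_mib:.0f} MiB beyond weights")
```

Under these assumptions the defaults (32768 / 8192) account for well over 1 GiB beyond the weights, while the overridden values (768 / 256) drop that to a few tens of MiB, which is why the overrides taking effect matters so much here.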

@MasterJH5574

@eaaajay eaaajay reopened this Jan 9, 2025
@MasterJH5574
Member

@eaaajay No, our runtime logic will prioritize the values in "overrides".
