EXL2 low bpw draft model #77
SinanAkkoyun
started this conversation in General · 0 replies
Hey! I was wondering whether one could skip training a draft model for speculative sampling altogether by quantizing the target model itself to an aggressively low bpw and using that quant as the drafter? (Rough sketch of what I mean below.)
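For clarity, here is roughly the loop I have in mind: the standard speculative-sampling accept/reject scheme (Leviathan et al., 2023), where the drafter is just the same model requantized at a very low bpw. This is a minimal sketch, not the exllamav2 API; the `logits()` interface and model handles are hypothetical.

```python
# Minimal sketch of speculative sampling with a low-bpw quant of the
# same model acting as the draft model. `target` and `draft` are assumed
# to expose a hypothetical logits(tokens) -> [seq, vocab] call.
import torch

def speculative_step(target, draft, tokens, k=4):
    """Draft k tokens cheaply, then verify them with one target pass."""
    drafted, q_probs = [], []
    ctx = tokens
    for _ in range(k):
        q = torch.softmax(draft.logits(ctx)[-1], dim=-1)  # draft distribution
        t = torch.multinomial(q, 1)
        drafted.append(t.item())
        q_probs.append(q)
        ctx = torch.cat([ctx, t])

    # One target forward pass verifies all k drafted tokens at once.
    p_all = torch.softmax(target.logits(ctx), dim=-1)
    accepted = []
    for i, t in enumerate(drafted):
        p = p_all[len(tokens) - 1 + i]  # target dist. before drafted token i
        q = q_probs[i]
        if torch.rand(()) < (p[t] / q[t]).clamp(max=1.0):
            accepted.append(t)          # accept the draft token
        else:
            # Rejected: resample from the residual distribution max(p - q, 0).
            resid = torch.clamp(p - q, min=0)
            accepted.append(torch.multinomial(resid / resid.sum(), 1).item())
            break
    # (A production loop would also sample one bonus token from the final
    # target distribution when all k drafts are accepted.)
    return accepted
```

The appeal is that the drafter needs no training at all, only a second quantization pass of weights you already have; the acceptance rate then depends on how much the low-bpw quant's output distribution drifts from the full-precision one.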
I was also wondering (though that might be difficult to do) whether one could, in theory, look at the forward-pass "through-network" activations over a given dataset and disable the inactive paths by setting them to zero, skipping those multiplications entirely, somewhat like having a lower parameter count (see the pruning sketch below). I don't fully understand your quantization method, so by "akin to a sparse network" you probably already mean what I am asking, but I still want to know whether it would be possible to "quantize a 34B model so hard that it has the latency of TinyLlama".
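Something like this is what I imagine by "disabling paths": a minimal, hypothetical PyTorch sketch of activation-based structured pruning. To be clear, this is not EXL2's actual quantization method, just an illustration of the idea.

```python
# Sketch: score each output channel of a linear layer by its mean
# activation magnitude over a calibration set, then zero the weights
# feeding the quietest channels. Structured (whole-row) zeros can
# actually be skipped at inference time, unlike scattered per-weight zeros.
import torch
import torch.nn as nn

@torch.no_grad()
def prune_linear_by_activation(layer: nn.Linear, calib_inputs, keep_frac=0.5):
    score = torch.zeros(layer.out_features)
    for x in calib_inputs:                       # x: [batch, in_features]
        score += layer(x).abs().mean(dim=0)      # per-channel activity
    k = int(layer.out_features * (1 - keep_frac))
    dead = score.topk(k, largest=False).indices  # least-active channels
    layer.weight[dead] = 0                       # zero whole output rows
    if layer.bias is not None:
        layer.bias[dead] = 0
    return dead
```

Note that the zeros only translate into latency if the rows are then physically removed (and the next layer's input dimension shrunk accordingly); low-bpw quantization by itself keeps every parameter and only shrinks its storage, which is why I'm unsure the two approaches compose into "34B at TinyLlama latency".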