Would Appreciate imatrix quants, such as IQ3_M
As titled.
Static IQ3_M quants are usually terrible, so I queued weighted/imatrix quants for this model so you can use i1-IQ3_M. They should be done in a few hours.
You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#L3.3-70B-Magnum-v4-SE-Cirrus-x1-ModelStock-GGUF for quants to appear.
Once done, the weighted/imatrix quants will be available under https://huggingface.co/mradermacher/L3.3-70B-Magnum-v4-SE-Cirrus-x1-ModelStock-i1-GGUF
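For reference, a weighted/imatrix quant is made by first measuring which weights matter most on a calibration text and then feeding that importance matrix to the quantizer. Below is a minimal sketch of that workflow, assuming llama.cpp's llama-imatrix and llama-quantize tools; the file names and calibration text are placeholders, not the actual pipeline used for these quants.

```python
# Minimal sketch of the weighted/imatrix quantization workflow, assuming
# llama.cpp's llama-imatrix and llama-quantize binaries are on PATH.
# File paths and the calibration corpus are placeholders.
import subprocess

SRC_GGUF = "L3.3-70B-Magnum-v4-SE-Cirrus-x1-ModelStock.f16.gguf"  # full-precision source
CALIB = "calibration.txt"          # calibration corpus (assumed placeholder)
IMATRIX = "imatrix.dat"            # importance-matrix output
OUT_GGUF = "model.i1-IQ3_M.gguf"   # weighted IQ3_M result

# 1) Measure per-tensor activation importance on the calibration text.
subprocess.run(
    ["llama-imatrix", "-m", SRC_GGUF, "-f", CALIB, "-o", IMATRIX],
    check=True,
)

# 2) Quantize to IQ3_M, weighting the rounding decisions by the importance matrix.
subprocess.run(
    ["llama-quantize", "--imatrix", IMATRIX, SRC_GGUF, OUT_GGUF, "IQ3_M"],
    check=True,
)
```

The importance matrix is what separates an i1-IQ3_M from a static IQ3_M: the same target format, but the quantizer spends its limited precision where the calibration data says it matters most.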
Thanks! I use IQ3_M since it's the limit of what I can run with my GPU+CPU split (~6 GB VRAM + 28 GB RAM). Through testing I've found that, despite quantization supposedly being terrible (and there is a reduction in quality), logic is still better for a quantized 70B than for a 34B or 13B at higher precision, which is contrary to a lot of papers on this kind of comparison. I suspect this is due to case-by-case survivability of logic (perhaps larger structures of logic survive quantization and still outperform? I'm no expert, not even a little, so I don't know).
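For the curious, a quick back-of-the-envelope check shows why IQ3_M is about the ceiling for a ~6 GB + 28 GB split: at roughly 3.7 bits per weight (an approximate figure for IQ3_M), a 70B model's weights alone come to about 32 GB. A small sketch follows; the bits-per-weight value and the n_gpu_layers count are assumptions to tune, not measured numbers.

```python
# Back-of-the-envelope memory check for running a quantized 70B on a
# ~6 GB GPU + ~28 GB RAM split. The bits-per-weight figure for IQ3_M is an
# approximation; exact file sizes depend on the tensor mix and metadata.
PARAMS = 70e9          # parameter count
BPW_IQ3_M = 3.7        # approx. bits per weight for IQ3_M (assumed)

weights_gb = PARAMS * BPW_IQ3_M / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # roughly 32 GB, before KV cache etc.

# Partial offload with llama-cpp-python: put as many layers on the GPU as
# ~6 GB allows and leave the rest on the CPU. The layer count is a guess to
# tune downward until VRAM stays under budget.
# from llama_cpp import Llama
# llm = Llama(
#     model_path="model.i1-IQ3_M.gguf",
#     n_gpu_layers=12,   # tune until VRAM usage sits just under ~6 GB
#     n_ctx=4096,
# )
```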
As long as they are weighted/imatrix quants and not static quants, they are great for their size. The issue is just static IQ3_M quants, which is why we don't provide them.
logic is still better for a quantized 70B than for a 34B or 13B at higher precision
I've never heard anything to the contrary; at least at this time, this seems to be the rule rather than the exception. imatrix is pretty orthogonal to this, though.
I remember reading a paper about this, talking about how there is more loss than described when it comes to quantization, but then again papers like that will have very different standards than me, so what they might see as unacceptable losses I might see as negligible. Personally (from here on I'm speaking from my own experience; I have no education in this, so I might come off as ignorant, because I am), I really think it varies with each model (at least each base model). I'm sure quantization probably affects MoE models more than non-MoE models, since each expert probably gets hurt more than it would if it were quantized on its own. And larger logic structures in a model likely survive better. Maybe quantization lowers the temperature requirement (or raises it)? I can just imagine how it could be seen as problematic for a commercial, needs-to-be-accurate (but won't be, lmao), small portable AI. They are probably considering heavy quantization on very small models so they can stick them into a phone, and I can definitely imagine why that doesn't work out.