exl2

#1 opened by Handgun1773

If you find the time, exl2 quants would be really appreciated. I tried locally, but my 3060 OOMs at the 47th layer :(

 -- Linear: model.layers.47.mlp.up_proj -> 0.1:8b_128g/0.9:6b_128g s4, 6.23 bpw
 -- Linear: model.layers.47.mlp.down_proj -> 0.15:8b_128g/0.85:6b_128g s4, 6.35 bpw
 -- Module quantized, rfn_error: 0.006383
 -- Layer: model.norm (RMSNorm)
 -- Module quantized, rfn_error: 0.000000
 -- Layer: lm_head (Linear)
 -- Linear: lm_head -> 0.15:8b_128g/0.85:6b_128g s4, 6.33 bpw
 !! Out of memory (Q), moving to device 1
Traceback (most recent call last):
  File "/path/to/convert.py", line 1, in <module>
    import exllamav2.conversion.convert_exl2
  File "/path/to/exllamav2/exllamav2/conversion/convert_exl2.py", line 296, in <module>
    quant(job, save_job, model)
  File "/path/to/exllamav2/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/exllamav2/exllamav2/conversion/quantize.py", line 424, in quant
    quant_lm_head(job, module, hidden_states, quantizers, attn_params, rtn)
  File "/path/to/exllamav2/exllamav2/conversion/quantize.py", line 209, in quant_lm_head
    quant_linear(job, module, q, qp.get_dict(), drop = True, rtn = rtn)
  File "/path/to/exllamav2/exllamav2/conversion/quantize.py", line 66, in quant_linear
    lq.quantize(keep_qweight = True, apply = True)
  File "/path/to/adaptivegptq.py", line 534, in quantize
    raise e
  File "/path/to/adaptivegptq.py", line 494, in quantize
    error = weights.clone()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.90 GiB. GPU 0 has a total capacity of 11.75 GiB of which 2.70 GiB is free. Including non-PyTorch memory, this process has 9.05 GiB memory in use. Of the allocated memory 8.83 GiB is allocated by PyTorch, and 43.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
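
For what it's worth, the error message itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to work around fragmentation. A minimal sketch of a retry with that setting, assuming exllamav2's convert.py arguments (-i / -o / -cf / -b) and with placeholder paths:

```python
# Minimal sketch (placeholder paths): retry the exl2 conversion with the
# allocator setting suggested in the OOM message to reduce fragmentation.
import os
import subprocess

env = dict(os.environ, PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True")

subprocess.run(
    [
        "python", "convert.py",
        "-i", "/path/to/source-model",   # unquantized model directory (placeholder)
        "-o", "/path/to/work-dir",       # scratch directory for the job (placeholder)
        "-cf", "/path/to/output-exl2",   # where the finished quant is written (placeholder)
        "-b", "5.0",                     # target bits per weight
    ],
    env=env,
    check=True,
)
```

No guarantee it fits in 12 GB either way; it is just the cheapest thing to try before splitting the job across devices.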

Hmm, I'm having trouble in a different way... I'll need to debug it; some tokenizer issue.

Thank you very much for the upload!!!

FYI, on my setup, it constantly crashes my tabbyAPI Docker container.
I've now overridden the config.json with the one from your Qwen 14B Instruct and it seems to work; the quality feels great too.
Here is the error when I try to run the 5_0 branch from your repo:

tabbyapi  | ERROR:      File "/app/endpoints/OAI/utils/completion.py", line 153, in load_inline_model
tabbyapi  | ERROR:        await model.load_model(model_path)
tabbyapi  | ERROR:      File "/app/common/model.py", line 101, in load_model
tabbyapi  | ERROR:        async for _ in load_model_gen(model_path, **kwargs):
tabbyapi  | ERROR:      File "/app/common/model.py", line 80, in load_model_gen
tabbyapi  | ERROR:        async for module, modules in load_status:
tabbyapi  | ERROR:      File "/app/backends/exllamav2/model.py", line 542, in load_gen
tabbyapi  | ERROR:        await self.create_generator()
tabbyapi  | ERROR:      File "/app/backends/exllamav2/model.py", line 721, in create_generator
tabbyapi  | ERROR:        self.generator = ExLlamaV2DynamicGeneratorAsync(
tabbyapi  | ERROR:      File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/dynamic_async.py", line 16, in __init__
tabbyapi  | ERROR:        self.generator = ExLlamaV2DynamicGenerator(*args, **kwargs)
tabbyapi  | ERROR:      File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/dynamic.py", line 400, in __init__
tabbyapi  | ERROR:        assert self.max_chunk_size % self.page_size == 0, \
tabbyapi  | ERROR:    AssertionError: max_chunk_size must be multiple of 256, received None

I can open an issue on the exl2 model page if you wish.
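
Since swapping in the other config.json fixed it, a quick way to narrow down what the loader was missing is to diff the two files; the assertion suggests a chunk/context-size value ends up None. A minimal sketch, with placeholder paths:

```python
# Minimal sketch (placeholder paths): compare the crashing config.json with the
# working one copied from the Qwen 14B Instruct repo and report which keys differ.
import json

def load(path):
    with open(path) as f:
        return json.load(f)

broken = load("/path/to/crashing-model/config.json")      # config that trips the assertion
working = load("/path/to/qwen-14b-instruct/config.json")  # config that loads fine

missing = sorted(set(working) - set(broken))
extra = sorted(set(broken) - set(working))
changed = sorted(k for k in set(working) & set(broken) if working[k] != broken[k])

print("keys missing from the crashing config:", missing)
print("keys only in the crashing config:", extra)
print("keys whose values differ:", changed)
```

If the difference turns out to be a context-length field, that would explain max_chunk_size coming through as None, but that is only a guess; whatever key shows up there is the thing to fix in the repo instead of carrying the override.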
