70B with multiple A5000s
Are two A5000s with 24GB each enough to run the 70B model?
Yes, that will work. I recommend using ExLlama for maximum performance. You need to load less of the model on GPU 1 - a recommended split is 17.2GB on GPU 1 and 24GB on GPU 2, which leaves room for the context on GPU 1.
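For reference, here is a minimal sketch of that split using the exllama loader directly - it assumes the turboderp/exllama repo's modules are importable (e.g. run from the repo root) and uses placeholder file names; in text-generation-webui the same split can, I believe, be passed via its --gpu-split option:

```python
# Minimal sketch: split the 70B GPTQ weights across two 24GB GPUs with exllama.
# Assumes the turboderp/exllama repo modules are on the Python path and the
# model files are downloaded locally; all paths below are placeholders.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/path/to/Llama-2-70B-GPTQ"                # placeholder local path

config = ExLlamaConfig(f"{model_dir}/config.json")
config.model_path = f"{model_dir}/model.safetensors"   # placeholder weights file name
config.set_auto_map("17.2,24")                         # ~17.2GB on GPU 1, 24GB on GPU 2

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(f"{model_dir}/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=20))
```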
@TheBloke how do I spread the workload across multiple GPUs? The default example is:

```python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           inject_fused_attention=False,  # Required for Llama 2 70B model at this time.
                                           use_safetensors=True,
                                           trust_remote_code=False,
                                           device="cuda:0",
                                           use_triton=use_triton,
                                           quantize_config=None)
```

How do I define this split?
This is probably a dumb question, but using ExLlama or ExLlama HF isn't enough to run this on a 4090, is it?
Maybe it would work if I could split it with my 11900K, but I don't know how to do that.
@TheBloke can you please help with this?
@neo-benjamin
Add the `max_memory` parameter.
Reference: https://huggingface.co./TheBloke/Llama-2-70B-GPTQ/discussions/9
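For example, the earlier call could be adapted roughly like this - a minimal sketch, where the GiB values simply mirror the split suggested above and `model_name_or_path` / `model_basename` / `use_triton` are the same placeholders as in the original snippet (when `max_memory` is given, drop the fixed `device` argument so the layers can be spread across devices):

```python
from auto_gptq import AutoGPTQForCausalLM

# Per-device memory caps: shard the weights across both GPUs and leave
# headroom on GPU 0 for the context. The exact values are illustrative.
max_memory = {0: "17GiB", 1: "24GiB", "cpu": "30GiB"}

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           max_memory=max_memory,
                                           inject_fused_attention=False,  # Required for Llama 2 70B model at this time.
                                           use_safetensors=True,
                                           trust_remote_code=False,
                                           use_triton=use_triton,
                                           quantize_config=None)
```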