I get one-letter responses....
I'm trying to use it like this:
import transformers
from torch import cuda, bfloat16

model_id = 'ehartford/WizardLM-7B-Uncensored'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load the large model with less GPU memory;
# this requires the bitsandbytes library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = 'hf_...'  # token redacted, substitute your own
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)

# need the tokenizer too
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model.eval()
print(f"Model loaded on {device}")
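A sanity check I should probably try is generating straight from the model to rule out the pipeline itself. Just a sketch using the objects loaded above; the falcon prompt is a placeholder, not one of my real prompts:

import torch

# encode a test prompt and move it to the model's device
inputs = tokenizer("What is a falcon?\n\n### Response:", return_tensors='pt').to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))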
And then....
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.9,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=256,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this, output begins repeating
)
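One thing I'm unsure about: as far as I know, generation is greedy by default, so temperature does nothing unless sampling is switched on. Maybe I need do_sample=True, something like this (same pipeline, one extra flag):

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    do_sample=True,  # temperature only applies when sampling is enabled
    temperature=0.9,
    max_new_tokens=256,
    repetition_penalty=1.1
)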
And then:
generate_text("""Yo T how's it goin? You got any a dem no show jobs?
Response:""")
Gives:
[{'generated_text': "Yo T how's it goin? You got any a dem no show jobs?\n\n### Response:H"}]
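The "### Response:" that shows up in the output makes me wonder whether the prompt template matters here; I think the WizardLM format is the instruction, a blank line, then "### Response:", so matching it exactly might be worth a try (a sketch, assuming that template):

prompt = """Yo T how's it goin? You got any a dem no show jobs?

### Response:"""
print(generate_text(prompt))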
I'm probably doing something very dumb, I had the 13B model working a while back....