nm-testing/OLMoE-1B-7B-0924-Instruct-FP8

compressed-tensors


```
lm_eval --model vllm --model_args pretrained=/home/mgoin/code/llm-compressor/examples/quantizing_moe/OLMoE-1B-7B-0924-Instruct-FP8,tensor_parallel_size=1,trust_remote_code=True --tasks gsm8k --num_fewshot 5 --batch_size auto
vllm (pretrained=/home/mgoin/code/llm-compressor/examples/quantizing_moe/OLMoE-1B-7B-0924-Instruct-FP8,tensor_parallel_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3510|±  |0.0131|
|     |       |strict-match    |     5|exact_match|↑  |0.3389|±  |0.0130|
```
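
The `Stderr` column can be sanity-checked against the `Value` column. The sketch below assumes that each score is the mean of per-example 0/1 results over the 1,319-problem GSM8K test split and that the standard error is computed from the sample standard deviation (dividing by n - 1). Neither assumption is stated in the log above, and `binomial_stderr` is a hypothetical helper, not part of `lm_eval`.

```python
import math


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of the mean of n 0/1 outcomes with observed mean p.

    Uses the sample standard deviation (n - 1 in the denominator).
    """
    return math.sqrt(p * (1.0 - p) / (n - 1))


# Assumption: GSM8K test split size; not taken from the log above.
N_GSM8K_TEST = 1319

flexible = binomial_stderr(0.3510, N_GSM8K_TEST)
strict = binomial_stderr(0.3389, N_GSM8K_TEST)

print(f"flexible-extract: {flexible:.4f}")
print(f"strict-match:     {strict:.4f}")
```

Under these assumptions the computed values round to the reported 0.0131 and 0.0130, which suggests the two scores were measured on the full test split (`limit: None`).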

## Creation
```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

# Select a Mixture of Experts model for quantization.
MODEL_ID = "allenai/OLMoE-1B-7B-0924-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select the calibration dataset.
# It's recommended to use more calibration samples for MoE models so that
# every expert is hit.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 2048


# Load the dataset and take a fixed random subset for calibration.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


# Render each conversation to plain text with the model's chat template.
def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Define an llmcompressor recipe for FP8 W8A8 quantization.
# Since the MoE gate layers are sensitive to quantization, we add them to the
# ignore list so they remain at full precision.
recipe = [
    QuantizationModifier(
        targets="Linear",
        scheme="FP8",
        ignore=["lm_head", "re:.*mlp.gate$"],
    ),
]

SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8"

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    save_compressed=True,
    output_dir=SAVE_DIR,
)


print("========== SAMPLE GENERATION ==============")
SAMPLE_INPUT = ["I love quantization because"]
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
inputs = tokenizer(SAMPLE_INPUT, return_tensors="pt", padding=True).to(model.device)
output = model.generate(**inputs, max_length=50)
text_output = tokenizer.batch_decode(output)
print(text_output)
```
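
The `ignore` entry `"re:.*mlp.gate$"` is meant to exclude only the MoE router layers while leaving the expert projections quantized. The sketch below illustrates which names such a pattern would select, assuming that the `re:` prefix marks a Python regular expression matched from the start of the module name. The module names are illustrative guesses at the model's layout rather than names read from the checkpoint, and `is_ignored` is a hypothetical helper, not an llmcompressor function.

```python
import re

# Pattern from the recipe's ignore list, with the "re:" prefix stripped.
IGNORE_PATTERN = r".*mlp.gate$"

# Illustrative module names (assumed, not read from the checkpoint).
module_names = [
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.mlp.experts.0.up_proj",
    "model.layers.0.self_attn.q_proj",
]


def is_ignored(name: str, pattern: str = IGNORE_PATTERN) -> bool:
    """Return True if the module name matches the ignore pattern."""
    return re.match(pattern, name) is not None


ignored = [name for name in module_names if is_ignored(name)]
print(ignored)
```

The trailing `$` is what keeps names such as `...experts.0.gate_proj` out of the ignore list, since they only contain `gate` as a prefix of the final component. The `.` between `mlp` and `gate` is unescaped, so it matches any single character; `.*mlp\.gate$` would be a stricter variant if that ever matters.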