This is an FP8-Dynamic quantization of QwQ-32B.
QwQ-32B is a medium-sized reasoning model in the Qwen series with 32.5 billion parameters, designed for tasks that demand advanced reasoning and problem solving. It outperforms conventional instruction-tuned models by a significant margin on challenging downstream tasks such as complex mathematical problems and standardized multiple-choice questions. Its transformer architecture incorporates RoPE, SwiGLU, and RMSNorm, and supports context lengths of up to 131,072 tokens.
Evaluations
The quantized model achieves an accuracy recovery of 100.0% relative to the original QwQ-32B on the benchmarks below.
| English | QwQ-32B | QwQ-32B-FP8-Dynamic (this) |
|---|---|---|
| Avg. | 74.05 | 74.05 |
| ARC | 72.7 | 72.8 |
| Hellaswag | 75.4 | 75.3 |
Evaluation was performed with the LM Evaluation Harness using limit=1000. We did not check for data contamination.
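For reference, here is a minimal sketch of how such a run could be reproduced with the lm-evaluation-harness Python API. The harness version, backend, and exact task names are assumptions, not the configuration used for the numbers above:

```python
# Sketch only: assumes lm-eval >= 0.4 with the Hugging Face backend installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                            # HF transformers backend
    model_args="pretrained=cortecs/QwQ-32B-FP8-Dynamic",   # model under test
    tasks=["arc_challenge", "hellaswag"],                  # assumed task names
    limit=1000,                                            # 1000 samples per task
)
print(results["results"])
```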
Usage
Install vLLM and run the server:
```shell
python -m vllm.entrypoints.openai.api_server --model cortecs/QwQ-32B-FP8-Dynamic --max-model-len 131072 --gpu-memory-utilization 0.95
```
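Alternatively, the model can be run without a server through vLLM's offline inference API. This is a minimal sketch assuming a recent vLLM release; the sampling parameters are illustrative, not prescribed by this card:

```python
from vllm import LLM, SamplingParams

# Mirror the server flags: long context and high GPU memory utilization.
llm = LLM(
    model="cortecs/QwQ-32B-FP8-Dynamic",
    max_model_len=131072,
    gpu_memory_utilization=0.95,
)

# Illustrative sampling settings (assumed values).
params = SamplingParams(temperature=0.6, max_tokens=256)

outputs = llm.generate(["San Francisco is a"], params)
print(outputs[0].outputs[0].text)
```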
Access the model:
```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cortecs/QwQ-32B-FP8-Dynamic",
    "prompt": "San Francisco is a"
  }'
```
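The same endpoint can also be queried from Python with the openai client. This is a minimal sketch assuming the openai package is installed and the server above is running on localhost:8000:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the API key can be any placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="cortecs/QwQ-32B-FP8-Dynamic",
    prompt="San Francisco is a",
    max_tokens=128,  # illustrative value
)
print(response.choices[0].text)
```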