
Shortened LLM Model Card

Shortened LLM is a depth-pruned version of large language models for efficient text generation.

Compression Method

  • After identifying unimportant Transformer blocks, we perform one-shot pruning.
  • In retraining pruned models for quality recovery, continued pretraining (CPT) on a large corpus markedly outperforms LoRA-based tuning, particularly at severe pruning ratios.
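
The one-shot step can be illustrated with a minimal sketch. It assumes each Transformer block already has a scalar importance score (higher means more important). The scores, block names, and function names below are hypothetical, and the sketch does not show how the PPL or Taylor+ criteria compute those scores.

```python
from typing import List, Sequence


def select_blocks_to_prune(importance: Sequence[float], num_to_remove: int) -> List[int]:
    """Return the indices of the least important blocks, in ascending order."""
    if not 0 <= num_to_remove <= len(importance):
        raise ValueError("num_to_remove must be between 0 and the number of blocks")
    # Rank block indices from least to most important (ties broken by index).
    ranked = sorted(range(len(importance)), key=lambda i: (importance[i], i))
    return sorted(ranked[:num_to_remove])


def prune_once(blocks: Sequence[str], importance: Sequence[float], num_to_remove: int) -> List[str]:
    """One-shot pruning: drop every selected block in a single pass, keeping order."""
    if len(blocks) != len(importance):
        raise ValueError("blocks and importance must have the same length")
    removed = set(select_blocks_to_prune(importance, num_to_remove))
    return [block for i, block in enumerate(blocks) if i not in removed]


# Hypothetical scores for a toy 8-block model (assumption, not measured values).
scores = [9.1, 7.4, 1.2, 0.8, 3.5, 0.9, 6.0, 8.8]
blocks = [f"block_{i}" for i in range(len(scores))]
kept = prune_once(blocks, scores, num_to_remove=3)
print(kept)  # ['block_0', 'block_1', 'block_4', 'block_6', 'block_7']
```

All selected blocks are removed at once, with no re-scoring between removals, which is what distinguishes one-shot pruning from an iterative scheme. Retraining (CPT or LoRA) would follow as a separate step.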

Models from Aggressive Pruning & CPT Retraining (arXiv-v2):

| Source Model | Pruning Ratio | Pruning Criterion | HF Models Link |
|---|---|---|---|
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 45% | PPL | nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl |
| Vicuna-v1.3-7B | 60% | PPL | nota-ai/cpt_st-vicuna-v1.3-2.7b-ppl |
| Vicuna-v1.3-7B | 80% | PPL | nota-ai/cpt_st-vicuna-v1.3-1.5b-ppl |
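
A hedged sketch of how one of the checkpoints listed above might be loaded for text generation. It assumes the repositories load through the standard `transformers` Auto classes, which this card does not state explicitly. The `generate` helper and the `CPT_MODELS` mapping are illustrative names, not part of the release.

```python
# Repository IDs copied from the list above, keyed by pruning ratio.
CPT_MODELS = {
    "20%": "nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl",
    "45%": "nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl",
    "60%": "nota-ai/cpt_st-vicuna-v1.3-2.7b-ppl",
    "80%": "nota-ai/cpt_st-vicuna-v1.3-1.5b-ppl",
}


def generate(repo_id: str, prompt: str, max_new_tokens: int = 32) -> str:
    """Load a checkpoint and return the prompt plus a generated continuation."""
    # Imported lazily so the mapping above is usable without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Use FP16 on GPU; fall back to FP32 on CPU, where half precision is poorly supported.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=dtype).to(device)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights, so it is left commented out):
# print(generate(CPT_MODELS["45%"], "Depth pruning removes"))
```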
Results:
  • Evaluated with EleutherAI/lm-evaluation-harness version 3326c54
[Figure: results]

Experimental Setup for CPT of Pruned Vicuna-7B

  • Dataset: SlimPajama-627B
  • Hardware: 8 NVIDIA H100 GPUs. Training budget per model size:
    • 5.5B parameters: 37B tokens (6 days)
    • 3.7B parameters: 74B tokens (8 days)
    • 2.7B parameters: 150B tokens (12 days)
    • 1.5B parameters: 271B tokens (11 days)
  • Optimizer: AdamW with (β1, β2) = (0.9, 0.95), a learning rate of 0.0001, and a weight decay of 0.1.
  • Global batch size: 512 (micro-batch size of 2 × 32 gradient accumulation steps × 8 GPUs).
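
The arithmetic in the setup above can be checked with a small sketch. The `CPTConfig` class and `tokens_per_second` helper are hypothetical names introduced for illustration. The throughput figure assumes the reported days are wall-clock training time, which the card does not confirm.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CPTConfig:
    """Hyperparameters as listed in the setup; field names are illustrative."""
    learning_rate: float = 1e-4
    betas: Tuple[float, float] = (0.9, 0.95)
    weight_decay: float = 0.1
    micro_batch_size: int = 2
    grad_accum_steps: int = 32
    num_gpus: int = 8

    @property
    def global_batch_size(self) -> int:
        # Sequences consumed per optimizer step across all GPUs.
        return self.micro_batch_size * self.grad_accum_steps * self.num_gpus


def tokens_per_second(total_tokens: int, days: float) -> float:
    """Average throughput, assuming `days` is wall-clock training time."""
    return total_tokens / (days * 24 * 60 * 60)


# (training tokens, training days) per pruned model size, from the list above.
RUNS = {
    "5.5B": (37_000_000_000, 6),
    "3.7B": (74_000_000_000, 8),
    "2.7B": (150_000_000_000, 12),
    "1.5B": (271_000_000_000, 11),
}

config = CPTConfig()
print(config.global_batch_size)  # 512

for size, (tokens, days) in RUNS.items():
    print(f"{size}: {tokens_per_second(tokens, days):,.0f} tokens/s on average")
```

Smaller models see more tokens in a comparable time window, which is consistent with fewer blocks costing less compute per token.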
Learning curve:

[Figure: results] Zero-shot performance over the course of training for models pruned from Vicuna-v1.3-7B at different pruning ratios. For each model size, the CPT duration was limited to two weeks; additional training could further improve quality.

Models from Moderate Pruning & LoRA Retraining (arXiv-v1):

| Source Model | Pruning Ratio | Pruning Criterion | HF Models Link |
|---|---|---|---|
| LLaMA-1-7B | 20% | PPL | nota-ai/st-llama-1-5.5b-ppl |
| LLaMA-1-7B | 20% | Taylor+ | nota-ai/st-llama-1-5.5b-taylor |
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 20% | Taylor+ | nota-ai/st-vicuna-v1.3-5.5b-taylor |
| Vicuna-v1.3-13B | 21% | PPL | nota-ai/st-vicuna-v1.3-10.5b-ppl |
| Vicuna-v1.3-13B | 21% | Taylor+ | nota-ai/st-vicuna-v1.3-10.5b-taylor |
Results:
  • Evaluated with EleutherAI/lm-evaluation-harness version 3326c54
[Figure: results]

License

  • All rights related to this repository and the compressed models are reserved by Nota Inc.
  • The intended use is strictly limited to research and non-commercial projects.

Acknowledgments

Citation

@article{kim2024shortened,
  title={Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}
@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}

Model: nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl (Safetensors, 3.7B params, FP16)