arxiv:2403.17887

The Unreasonable Ineffectiveness of the Deeper Layers

Published on Mar 26
· Submitted by akhaliq on Mar 27
#1 Paper of the day

Abstract

We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
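
For intuition, here is a minimal sketch (not the authors' code; function and variable names are hypothetical) of the block-selection idea: for a fixed block size n, measure how little the hidden state changes across a candidate block of n layers, using the angular distance the comments below refer to, and mark the block where that distance is smallest for removal.

```python
# Hidden states can be collected from a Hugging Face model with
# output_hidden_states=True (e.g. last-token states over a small calibration set).
import torch
import torch.nn.functional as F

def angular_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Angular distance in [0, 1] between hidden-state vectors along the last dim."""
    cos = F.cosine_similarity(x, y, dim=-1).clamp(-1 + eps, 1 - eps)
    return torch.arccos(cos) / torch.pi

def choose_block_to_prune(hidden_states: list[torch.Tensor], n: int) -> int:
    """hidden_states[l]: [num_samples, hidden] states at layer boundary l
    (embedding output plus one entry per layer). Returns the starting index
    of the most redundant block of n consecutive layers."""
    num_layers = len(hidden_states) - 1
    scores = torch.stack([
        angular_distance(hidden_states[l], hidden_states[l + n]).mean()
        for l in range(num_layers - n + 1)
    ])
    return int(scores.argmin())
```

Healing then amounts to a short parameter-efficient (QLoRA) finetune of the truncated model, as described in the abstract.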

Community

  • See DistilBERT for more on this 😂

hahahaha


Thanks for sharing your work! I was able to demonstrate the model healing process here, using ShortGPT's block influence metric for layer removal/pruning.
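
For reference, ShortGPT's block influence (BI) score is, roughly, one minus the average cosine similarity between a layer's input and output hidden states; the sketch below is illustrative only, not ShortGPT's API.

```python
import torch
import torch.nn.functional as F

def block_influence(layer_input: torch.Tensor, layer_output: torch.Tensor) -> torch.Tensor:
    """Rough BI score for one layer: 1 - mean cosine similarity between its input
    and output hidden states (shape [num_samples, hidden]). Low BI means the layer
    changes the representation little and is a candidate for pruning."""
    return 1.0 - F.cosine_similarity(layer_input, layer_output, dim=-1).mean()
```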

@gromovand @kushaltirumala @hassansh @pglo @danintheory Super cool! Any plans to release the code?


I attempted to implement their angular distance and healing here; let me know if you catch anything wrong. I hope it helps!

https://github.com/arcee-ai/PruneMe
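
The healing step itself is just a short QLoRA finetune of the truncated model, i.e. a 4-bit quantized base with LoRA adapters trained briefly on a small corpus. A minimal sketch with transformers, bitsandbytes, and peft (the model path and hyperparameters are placeholders, not PruneMe's settings):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the already layer-pruned model in 4-bit (NF4); "healing" is a brief
# finetune of the attached LoRA adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/pruned-model",      # placeholder: the truncated checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16,                        # illustrative rank, not the repo's setting
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then run a short supervised finetuning loop (e.g. with trl's SFTTrainer) to heal.
```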

We tried to replicate the results, and they seem to hold: deeper layers can be removed and the model can still generate text.


Cool! Is this the same as @shivr's implementation?

Working on reproducing this and similar pruning criteria here:
https://github.com/melisa-writer/short-transformers
Linear approximation of the last token is there, along with angular distances, BI score, etc.

The goal of the library: choose your distance (layer importance metric), get a cropped model. 🚀 A generic cropping sketch follows below.
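
Independent of any particular library, the "get a cropped model" step for a Llama-style Hugging Face checkpoint can be sketched as follows (this is not short-transformers' API; the checkpoint and layer indices are only illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

# Example checkpoint; the block 21..29 below is purely illustrative. Use whatever
# block your chosen importance metric selects.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

drop = set(range(21, 30))
kept = [layer for i, layer in enumerate(model.model.layers) if i not in drop]
model.model.layers = torch.nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

# Note: on recent transformers versions, each remaining layer's
# self_attn.layer_idx may also need re-indexing for KV-cache bookkeeping.
model.save_pretrained("llama-2-7b-pruned")   # then heal with a short QLoRA finetune
```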

Revolutionary Layer Pruning: Are Deeper Layers Overrated?

By Arxflix


Models citing this paper: 30


Datasets citing this paper: 0


Spaces citing this paper: 1

Collections including this paper: 30