arxiv:2403.17887

The Unreasonable Ineffectiveness of the Deeper Layers

Published on Mar 26

· Submitted by

akhaliq on Mar 27

#1 Paper of the day

Upvote

Authors:

Andrey Gromov ,

Kushal Tirumala ,

Hassan Shapourian ,

Paolo Glorioso ,

Daniel A. Roberts

Abstract

We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.

View arXiv page View PDF Add to collection

Community

marcusinthesky

Mar 27

See DistillBert for more on this 😂

Ksgk-fy

Mar 27

hahahaha

librarian-bot

Mar 28

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

shivr

Mar 30

•

edited Mar 30

Thanks for sharing your work! I was able to demonstrate the model healing process here while using ShortGPT's block influence metric for layer removal/pruning.

tcapelle

Apr 1

•

edited Apr 1

Shameless plug:
https://www.nvidia.com/gtc/session-catalog/?search=Capelle%20#/session/1696005958393001R1sy

mrfakename

Apr 1

@gromovand @kushaltirumala @hassansh @pglo @danintheory super cool! any plans to release the code?

shivr

Apr 4

I attempted to implement their angular distance and healing here. Let me know if you catch anything wrong, hope it can help!

Shamane

Apr 11

https://github.com/arcee-ai/PruneMe

We tried to replicate the results. It seems true. Deeper layers can be removed, and still, we can get a model that can generate text.

mrfakename

Apr 11

Cool! Is this the same as @shivr 's implementation?

melisa

May 27

Working on reproducing this and similar pruning criteria here:
https://github.com/melisa-writer/short-transformers
Linear approximation of the last token is there, along with angular distances, bi score etc.

The goal of the library: choose your distance (layer importance metric), get cropped model. :rocket: