---
language: en
license: apache-2.0
library_name: peft
---

# Shears Adapter Card: shears-llama-7b-50-cs-super-adapter

The super-adapter network fine-tuned with Shears on a sparsified LLaMA-7B using commonsense reasoning datasets.

We release the super-network so that users can apply their own search algorithms and evaluation metrics to extract subnetworks tailored to their specific needs.

## Paper Abstract
Recently, several approaches have successfully demonstrated that weight-sharing Neural Architecture Search (NAS) can effectively explore a search space of elastic low-rank adapters (LoRA), allowing parameter-efficient fine-tuning (PEFT) and compression of large language models. In this paper, we introduce a novel approach called Shears, demonstrating how the integration of cost-effective sparsity and a proposed Neural Low-rank adapter Search (NLS) algorithm can further improve the efficiency of PEFT approaches. Results demonstrate the benefits of Shears compared to other methods, reaching high sparsity levels while improving or only slightly reducing accuracy, using a single GPU for a couple of hours.

## Model Details

### Note
Please note, we only provide the model adapter and do not provide a copy of the base [yahma/llama-7b-hf](https://huggingface.co./yahma/llama-7b-hf) model or its sparsified version. Using this adapter requires downloading the base model separately and following [these instructions](#sparsified-base-model) to sparsify it.

### Information

- **Adapter name:** shears-llama-7b-50-cs-super-adapter
- **Base model:** Sparsified [LLaMA-7B](https://huggingface.co./yahma/llama-7b-hf)
- **Sparsity:** 50%
- **Domain:** Commonsense
- **Subnetwork version:** Super
- **NNCF Configuration:** [nncf_shears_llama.json](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears/nncf_config/nncf_shears_llama.json)

### Sparsified Base Model

Shears employs [Wanda](https://arxiv.org/abs/2306.11695), a simple but effective pruning approach, to sparsify the language model that serves as the base model.
Clone the [Wanda](https://github.com/locuslab/wanda) repo:

```bash
git clone https://github.com/locuslab/wanda.git && cd wanda && git checkout 8e8fc87 && cd ..
```

The following command sparsifies LLaMA-7B with Wanda to 50% unstructured sparsity:

```bash
python wanda/main.py \
    --model yahma/llama-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save wanda_out \
    --save_model shears-llama-7b-50-base
```
- `--model`: The model identifier on the Hugging Face Hub, or a local path.
- `--sparsity_ratio`: The fraction of weights to prune (0.5 = 50%).
- `--save_model`: The directory where the pruned language model will be saved.

Refer to our [repo](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears#setup) for the environment setup required to run this command.
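
As a quick sanity check, the realized sparsity of the saved model can be measured with a short script. This is a minimal sketch, assuming the pruned model was saved to `shears-llama-7b-50-base` as above; since Wanda leaves the embedding and output matrices unpruned, the measured ratio lands slightly below 50%:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the pruned model saved by the Wanda command above
# ("shears-llama-7b-50-base" is the local --save_model path).
model = AutoModelForCausalLM.from_pretrained("shears-llama-7b-50-base", torch_dtype=torch.float16)

total, zeros = 0, 0
for name, param in model.named_parameters():
    if param.dim() == 2:  # pruning targets 2-D weight matrices, not biases or norms
        total += param.numel()
        zeros += (param == 0).sum().item()
print(f"Unstructured sparsity over 2-D weights: {zeros / total:.2%}")
```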

### Adapter Configuration

- **LoRA rank:** 32
- **LoRA alpha:** 64
- **LoRA target modules:** q_proj, k_proj, v_proj, up_proj, down_proj
- **LoRA rank search space:** [32, 24, 16] (for each LoRA module)
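
For reference, the maximal configuration above corresponds roughly to the following plain PEFT `LoraConfig`. This is a sketch only: the elastic rank search space `[32, 24, 16]` is managed by NNCF during Shears training and is not expressible in a standard `LoraConfig`:

```python
from peft import LoraConfig

# Sketch of a plain LoRA config at the super-adapter's maximal rank.
# The per-module elastic rank choices [32, 24, 16] come from the NLS
# search handled by NNCF, not from this config.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```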

### Training Hyperparameters

- **Batch size:** 16
- **Learning rate:** 3e-4
- **Epochs:** 5
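
Mapped onto `transformers`, these settings would look roughly like the sketch below; the per-device batch-size mapping and the output path are assumptions, and Shears' actual training script wraps the process with NNCF:

```python
from transformers import TrainingArguments

# Hyperparameters from this card expressed as TrainingArguments
# (illustrative; output_dir is a hypothetical path).
training_args = TrainingArguments(
    output_dir="shears-llama-7b-50-cs",
    per_device_train_batch_size=16,
    learning_rate=3e-4,
    num_train_epochs=5,
)
```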

### Training Data

Unified commonsense reasoning dataset: [commonsense_170k.json](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/ft-training_set/commonsense_170k.json).
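
To inspect the data before training, a minimal sketch (assuming `commonsense_170k.json` has been downloaded locally from the link above):

```python
import json

# Inspect the unified commonsense dataset (local path is an assumption;
# download commonsense_170k.json from the LLM-Adapters repo first).
with open("commonsense_170k.json") as f:
    data = json.load(f)

print(f"{len(data)} training examples")
print(sorted(data[0].keys()))  # print the record schema rather than assuming it
```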

### Evaluation Data
[BoolQ](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/boolq/test.json), [PIQA](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/piqa/test.json), [SIQA](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/social_i_qa/test.json), [HellaSwag](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/hellaswag/test.json), [WinoGrande](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/winogrande/test.json), [ARC-e](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/ARC-Easy/test.json), [ARC-c](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/ARC-Challenge/test.json), [OBQA](https://github.com/AGI-Edgerunners/LLM-Adapters/blob/main/dataset/openbookqa/test.json).



## How to use

Refer to the illustrative example provided in [load_and_explore_supernet.ipynb](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears/search/load_and_explore_supernet.ipynb) for a comprehensive walkthrough. This notebook shows how to load a Shears super-network directly and extract diverse subnetworks from it, so users can apply their own search algorithms and evaluation metrics to extract subnetworks tailored to their specific requirements.

Moreover, since the super-network is itself the maximal subnetwork, it can also be loaded directly:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_prompt(instruction):
    # Standard Alpaca-style prompt template
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""

# Load the locally sparsified base model produced by the Wanda command above,
# then attach the Shears super-adapter from the Hub
base_model = AutoModelForCausalLM.from_pretrained("shears-llama-7b-50-base")
model = PeftModel.from_pretrained(base_model, "IntelLabs/shears-llama-7b-50-cs-super-adapter")
model.eval()

# Count non-zero parameters to verify the sparsity of the combined model
non_zero_params = sum((param.data != 0).sum().item() for _, param in model.named_parameters())
print(f"Number of all non-zero parameters: {non_zero_params}")

tokenizer = AutoTokenizer.from_pretrained("shears-llama-7b-50-base")

instruction = "Please choose the correct answer to the question: A cactus stem is used to store\n\nAnswer1: fruit "
        "Answer2: liquid Answer3: food Answer4: spines\n\nAnswer format: answer1/answer2/answer3/answer4"
prompt = generate_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256,
        use_cache=True,
        num_beams=4,
    )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
print(output)

```

## Evaluation Results

Results for the heuristic subnetwork discovered within the super-network:

| Model                | Sparsity  | BoolQ   | PIQA   | SIQA   | HellaSwag  | WinoG  | ARC-e  | ARC-c   | OBQA   | Average  |
|----------------------|-----------|---------|--------|--------|------------|--------|--------|---------|--------|----------|
| ChatGPT              | -         | 73.1    | 85.4   | 68.5   | 78.5       | 66.1   | 89.8   | 79.9    | 74.8   | 77.0     |
| LLaMA-7B-LoRA        | -         | 68.9    | 80.7   | 77.4   | 78.1       | 78.8   | 77.8   | 61.3    | 74.8   | 74.7     |
| [**LLaMA-7B-Shears**](https://huggingface.co./IntelLabs/shears-llama-7b-50-cs-heuristic-adapter)    | **50%**   | 67.3    | 79.1   | 77.5   | 73.3       | 77.7   | 74.4   | 57.9    | 72.8   | 72.5     |

## Model Sources

- **Repository:** [https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears)
- **Paper:** [Shears: Unstructured Sparsity with Neural Low-rank Adapter Search](https://arxiv.org/abs/2404.10934)

## Ethical Considerations

Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See [Intel’s Global Human Rights Principles](https://www.intel.com/content/dam/www/central-libraries/us/en/documents/policy-human-rights.pdf). Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.

| Ethical Considerations | Description | 
| ----------- | ----------- | 
| Data | The adapter was trained using the commonsense_170k.json data mixture as described above. |
| Human life | The model is not intended to inform decisions central to human life or flourishing. | 
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm. |
| Use cases | - | 

## Citation

```bibtex
@inproceedings{munoz2024shears,
  title = {Shears: Unstructured Sparsity with Neural Low-rank Adapter Search},
  author={J. Pablo Munoz and Jinjie Yuan and Nilesh Jain},
  booktitle={The 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2024)},
  year={2024}
}
```

## License

Apache-2.0