Kenkentron committed · Commit b9b4bd8 · verified · 1 parent: 80121ee

Add an update to point to paper and v0.2

Files changed (1):
README.md (+1, -0)
README.md CHANGED
@@ -6,6 +6,7 @@ tags:
 - text-generation-inference
 - peft
 ---
+**Update:** Please refer to BrainGPT-7B-v0.2 for a model consistent with the paper (https://www.nature.com/articles/s41562-024-02046-9, Fig. 5).
 
 ## Training details:
 We fine-tuned Llama2-7b-chat using LoRA, with a batch size of 1 and a chunk size of 2048. Training used the AdamW optimizer with a learning rate of 2e-5 and 8 gradient accumulation steps. We trained for a single epoch with a warm-up ratio of 0.03, a weight decay of 0.001, and a cosine learning rate scheduler. LoRA adapters with rank 8, alpha 32, and dropout 0.1 were applied after all self-attention blocks and fully-connected layers, giving a total of 17,891,328 trainable parameters, roughly 0.26% of the base model's parameters. To optimize training performance, we used bf16 mixed precision and data parallelism across 4 Nvidia A100 (80GB) GPUs hosted on the Microsoft Azure platform. One epoch of training took roughly 42 GPU hours.
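
For readers who want to see how the hyperparameters above map onto the `transformers` and `peft` APIs, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' training script: the target module names assume the standard Llama-2 layer naming, and the output path, optimizer variant (`adamw_torch`), and data handling are hypothetical.

```python
# Minimal sketch of the LoRA fine-tuning setup described in the card,
# using Hugging Face transformers + peft. Module names, output path, and
# data handling are assumptions, not the authors' exact code.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

base = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA adapters: rank 8, alpha 32, dropout 0.1, applied to the
# self-attention projections and fully-connected (MLP) layers.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed module list
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the card reports ~0.26% trainable

# Hyperparameters from the card: AdamW, lr 2e-5, 8 gradient accumulation
# steps, one epoch, warm-up ratio 0.03, weight decay 0.001, cosine
# schedule, bf16 mixed precision.
args = TrainingArguments(
    output_dir="braingpt-lora",      # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    warmup_ratio=0.03,
    weight_decay=0.001,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,
)
# The training text would be packed into 2048-token chunks and passed to a
# Trainer, launched (e.g. with torchrun) across 4 GPUs for data parallelism.
```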