{"cells":[{"cell_type":"markdown","id":"lw1cWgq-DI5k","metadata":{"id":"lw1cWgq-DI5k"},"source":["# Fine-tune FLAN-T5 using `bitsandbytes`, `peft` & `transformers` ๐Ÿค—"]},{"cell_type":"markdown","id":"kBFPA3-aDT7H","metadata":{"id":"kBFPA3-aDT7H"},"source":["In this notebook we will see how to properly use `peft` , `transformers` & `bitsandbytes` to fine-tune `flan-t5-large` in a google colab!\n","\n","We will finetune the model on [`financial_phrasebank`](https://huggingface.co./datasets/financial_phrasebank) dataset, that consists of pairs of text-labels to classify financial-related sentences, if they are either `positive`, `neutral` or `negative`.\n","\n","Note that you could use the same notebook to fine-tune `flan-t5-xl` as well, but you would need to shard the models first to avoid CPU RAM issues on Google Colab, check [these weights](https://huggingface.co./ybelkada/flan-t5-xl-sharded-bf16)."]},{"cell_type":"markdown","source":["## TODO #1\n","\n","`google/flan-t5-large` ๋ชจ๋ธ์€ ๋ฌด์—‡์„ ๋ชฉํ‘œ๋กœ ๋งŒ๋“ค์–ด์กŒ๊ณ  ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋Šฅ์€ ๋ฌด์—‡์ธ์ง€ ์กฐ์‚ฌํ•˜์‹œ์˜ค\n","- ๋งˆํฌ๋‹ค์šด ์Šคํƒ€์ผ๋กœ ์ž‘์„ฑํ•˜์‹œ์˜ค"],"metadata":{"id":"5TXx1vj8kJSu"},"id":"5TXx1vj8kJSu"},{"cell_type":"markdown","source":["## 'google/flan-t5-large' ๋ชจ๋ธ ๊ฐœ์š”\n","\n","- 'google/flan-t5-large' ๋ชจ๋ธ์€ T5 ์•„ํ‚คํ…์ฒ˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค.\n","- T5 ๋ชจ๋ธ์€ ํ…์ŠคํŠธ ์ž…๋ ฅ์„ ๋ฐ›์•„ ์ถœ๋ ฅ์„ ์ƒ์„ฑํ•˜๋Š” ์‹œํ€€์Šค ํˆฌ ์‹œํ€€์Šค ๋ชจ๋ธ๋กœ, NLP ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.\n","- ์ด ๋ชจ๋ธ์€ \"๋ชจ๋“  ๊ฒƒ์€ ํ…์ŠคํŠธ\"๋ผ๋Š” ์ ‘๊ทผ์„ ๋”ฐ๋ฅด๋ฉฐ ์ž…๋ ฅ ํ…์ŠคํŠธ์™€ ์ถœ๋ ฅ ํ…์ŠคํŠธ๋ฅผ ๋™์ผํ•œ ํ˜•์‹์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.\n","\n","## ๊ธฐ๋Œ€ ๊ธฐ๋Šฅ๊ณผ ํ™œ์šฉ\n","\n","- 'google/flan-t5-large' ๋ชจ๋ธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ NLP ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:\n"," - ํ…์ŠคํŠธ ์ƒ์„ฑ: ์ž…๋ ฅ ํ…์ŠคํŠธ๋กœ๋ถ€ํ„ฐ ๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.\n"," - ์š”์•ฝ: ๊ธด ๋ฌธ์„œ๋‚˜ ํ…์ŠคํŠธ๋ฅผ ๊ฐ„๊ฒฐํ•œ ์š”์•ฝ์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.\n"," - ๋ฒˆ์—ญ: ๋‹ค๊ตญ์–ด ๋ฒˆ์—ญ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋ฉฐ ์ž…๋ ฅ ํ…์ŠคํŠธ๋ฅผ ๋‹ค๋ฅธ ์–ธ์–ด๋กœ ๋ฒˆ์—ญํ•ฉ๋‹ˆ๋‹ค.\n"," - ์งˆ๋ฌธ ์‘๋‹ต: ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต๋ณ€์„ ์ƒ์„ฑํ•˜๊ณ , ์ง€๋ฌธ๊ณผ ์งˆ๋ฌธ์„ ์ดํ•ดํ•˜์—ฌ ๋‹ต๋ณ€ํ•ฉ๋‹ˆ๋‹ค.\n"," - ๋ฌธ์žฅ ๋ถ„๋ฅ˜: ์ฃผ์–ด์ง„ ๋ฌธ์žฅ์„ ์นดํ…Œ๊ณ ๋ฆฌ ๋˜๋Š” ํด๋ž˜์Šค๋กœ ๋ถ„๋ฅ˜ํ•ฉ๋‹ˆ๋‹ค.\n","\n","'google/flan-t5-large' ๋ชจ๋ธ์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ NLP ์ž‘์—…์„ ์ž๋™ํ™”ํ•˜๊ณ  ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด, ๋ชจ๋ธ์˜ ํŠน์ • ๊ธฐ๋Šฅ๊ณผ ์ž‘์—…์— ๋”ฐ๋ฅธ ์„ค์ • ๋ฐ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•„์š”ํ•˜๋ฉฐ, ์ด๋ฅผ ํ†ตํ•ด ์ •ํ™•ํ•˜๊ณ  ํšจ์œจ์ ์ธ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค."],"metadata":{"id":"gNdrvxdIM83V"},"id":"gNdrvxdIM83V"},{"cell_type":"markdown","id":"ShAuuHCDDkvk","metadata":{"id":"ShAuuHCDDkvk"},"source":["## Install requirements"]},{"cell_type":"code","execution_count":null,"id":"DRQ4ZrJTDkSy","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"DRQ4ZrJTDkSy","outputId":"3b98c09a-6889-4cdc-dddf-a7bb231b1f1d"},"outputs":[{"output_type":"stream","name":"stdout","text":[" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n"," Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n"," Preparing metadata (pyproject.toml) ... 
\u001b[?25l\u001b[?25hdone\n"]}],"source":["!pip install -q bitsandbytes datasets accelerate\n","!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main"]},{"cell_type":"markdown","id":"QBdCIrizDxFw","metadata":{"id":"QBdCIrizDxFw"},"source":["## Import model and tokenizer"]},{"cell_type":"code","execution_count":null,"id":"dd3c5acc","metadata":{"id":"dd3c5acc"},"outputs":[],"source":["# Select CUDA device index\n","import os\n","import torch\n","\n","os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n","\n","from datasets import load_dataset\n","from transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n","\n","model_name = \"google/flan-t5-large\"\n","\n","model = AutoModelForSeq2SeqLM.from_pretrained(model_name, load_in_8bit=True)\n","tokenizer = AutoTokenizer.from_pretrained(model_name)"]},{"cell_type":"markdown","id":"VwcHieQzD_dl","metadata":{"id":"VwcHieQzD_dl"},"source":["## Prepare model for training"]},{"cell_type":"markdown","id":"4o3ePxrjEDzv","metadata":{"id":"4o3ePxrjEDzv"},"source":["Some pre-processing needs to be done before training such an int8 model using `peft`, therefore let's import an utiliy function `prepare_model_for_int8_training` that will:\n","- Casts all the non `int8` modules to full precision (`fp32`) for stability\n","- Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states\n","- Enable gradient checkpointing for more memory-efficient training"]},{"cell_type":"code","execution_count":null,"id":"1629ebcb","metadata":{"id":"1629ebcb"},"outputs":[],"source":["from peft import prepare_model_for_int8_training\n","\n","model = prepare_model_for_int8_training(model)"]},{"cell_type":"markdown","id":"iCpAgawAEieu","metadata":{"id":"iCpAgawAEieu"},"source":["## Load your `PeftModel`\n","\n","Here we will use LoRA (Low-Rank Adaptators) to train our model"]},{"cell_type":"code","execution_count":null,"id":"17566ae3","metadata":{"id":"17566ae3"},"outputs":[],"source":["from peft import LoraConfig, get_peft_model, TaskType\n","\n","\n","def print_trainable_parameters(model):\n"," \"\"\"\n"," Prints the number of trainable parameters in the model.\n"," \"\"\"\n"," trainable_params = 0\n"," all_param = 0\n"," for _, param in model.named_parameters():\n"," all_param += param.numel()\n"," if param.requires_grad:\n"," trainable_params += param.numel()\n"," print(\n"," f\"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}\"\n"," )\n","\n","\n","lora_config = LoraConfig(\n"," r=16, lora_alpha=32, target_modules=[\"q\", \"v\"], lora_dropout=0.05, bias=\"none\", task_type=\"SEQ_2_SEQ_LM\"\n",")\n","\n","\n","model = get_peft_model(model, lora_config)\n","print_trainable_parameters(model)"]},{"cell_type":"markdown","id":"mGkwIgNXyS7U","metadata":{"id":"mGkwIgNXyS7U"},"source":["As you can see, here we are only training 0.6% of the parameters of the model! 
{"cell_type":"markdown","source":["## TODO #2\n","\n","Briefly investigate the principle by which the number of trainable parameters is reduced so drastically, to about 0.6% as shown above.\n","- Write your answer in Markdown."],"metadata":{"id":"9kkyrzsakn2b"},"id":"9kkyrzsakn2b"},{"cell_type":"markdown","source":["## Shrinking the set of trainable parameters and improving memory efficiency\n","\n","The technique used in the code above reduces the number of parameters that actually have to be trained while improving memory efficiency and keeping the model trainable, which makes it practical to fine-tune large models in memory-constrained environments. The key idea is LoRA: the original model weights are frozen, and only small low-rank update matrices injected next to the targeted projections (here the `q` and `v` attention projections) receive gradients, so the trainable parameter count collapses to a small fraction of the total (see the sketch below for the rough arithmetic).\n","\n","- **Reading the parameter printout and the memory gain**:\n","  - The `print_trainable_parameters` function prints the number of trainable parameters in the model.\n","  - In the output, \"trainable params\" is the number of parameters that are actually trainable.\n","  - \"all params\" is the total number of parameters in the model.\n","  - \"trainable%\" is the percentage of parameters that are trainable.\n","  - When \"trainable%\" comes out very low, only a small part of the model's parameters is kept trainable, so the memory needed for gradients and optimizer state shrinks accordingly.\n","\n","This approach is effective when fine-tuning a large model in a memory-constrained environment: it uses memory efficiently while still allowing the model to be trained and used."],"metadata":{"id":"Yd8VN8RGNCmH"},"id":"Yd8VN8RGNCmH"},
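{"cell_type":"markdown","source":["As a rough back-of-the-envelope check of why the trainable fraction is so small, the sketch below compares the size of a full weight update with the size of a LoRA update for a single projection. The hidden size of 1024 is an assumption used only for illustration."],"metadata":{"id":"todo2-example-md"},"id":"todo2-example-md"},{"cell_type":"code","execution_count":null,"id":"todo2-example-code","metadata":{"id":"todo2-example-code"},"outputs":[],"source":["# Back-of-the-envelope sketch: LoRA replaces a full d_out x d_in weight update with two\n","# low-rank factors B (d_out x r) and A (r x d_in), so only r * (d_in + d_out) parameters\n","# are trained per adapted projection. Hidden size 1024 is assumed for illustration.\n","d_in = d_out = 1024  # assumed hidden size, for illustration only\n","r = 16               # LoRA rank used in lora_config above\n","\n","full_update = d_in * d_out\n","lora_update = r * (d_in + d_out)\n","print(f\"full weight update : {full_update} params\")\n","print(f\"LoRA update (r={r}): {lora_update} params\")\n","print(f\"ratio: {100 * lora_update / full_update:.2f}% per adapted matrix\")\n","# Only the q and v projections are adapted and the base weights stay frozen,\n","# which is why the overall trainable share ends up around the 0.6% reported above."]},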
{"cell_type":"markdown","source":["## TODO #3\n","\n","If the `load_in_8bit=True` option is not used when loading the model, the original weights are loaded.\n","\n","Compare the model structure in that case with the structure when `load_in_8bit=True` is used, and investigate what the differences are.\n","- Write your answer in Markdown."],"metadata":{"id":"wgvqtHnFlNAl"},"id":"wgvqtHnFlNAl"},{"cell_type":"markdown","source":["## Comparing model loading with `load_in_8bit=True` versus `load_in_8bit=False`\n","\n","Loading the model with the `load_in_8bit=True` option changes the model structure compared to loading it without the option; the main differences are as follows:\n","\n","1. **Data type of the model parameters**:\n","   - With `load_in_8bit=True`: the model parameters are stored in 8-bit precision, i.e. each weight is represented with far fewer bits, which lets the model use much less memory. When the model is printed, the quantized linear layers show up as `bitsandbytes` 8-bit linear modules (e.g. `Linear8bitLt`) instead of plain `torch.nn.Linear` layers (see the sketch below).\n","   - Without `load_in_8bit=True`: the model parameters are typically stored as 32-bit or 16-bit floating point numbers, so each weight takes more memory and the overall memory requirements increase.\n","\n","2. **Memory requirements**:\n","   - With `load_in_8bit=True`: the amount of memory the model uses drops, so it can run more efficiently and fit on smaller GPUs.\n","   - Without `load_in_8bit=True`: the amount of memory the model uses is considerably larger.\n","\n","3. **Performance and accuracy**:\n","   - With `load_in_8bit=True`: the 8-bit precision of the parameters can slightly degrade the model's quality, so prediction accuracy may drop.\n","   - Without `load_in_8bit=True`: the parameters are loaded in their original precision, so the model's quality can be higher.\n","\n","In short, `load_in_8bit=True` improves memory efficiency at the cost of a possible drop in model quality, so the trade-off should be weighed before using it."],"metadata":{"id":"m08rbbKxPAby"},"id":"m08rbbKxPAby"},
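{"cell_type":"markdown","source":["One quick way to see the structural difference is to look at the classes of the modules after loading. The cell below is an illustrative sketch: with `load_in_8bit=True` the linear projections are expected to appear as `bitsandbytes` 8-bit modules rather than plain `torch.nn.Linear`; the exact class names and module paths may differ across `bitsandbytes`/`transformers` versions."],"metadata":{"id":"todo3-example-md"},"id":"todo3-example-md"},{"cell_type":"code","execution_count":null,"id":"todo3-example-code","metadata":{"id":"todo3-example-code"},"outputs":[],"source":["# Illustrative sketch: count the module classes inside the currently loaded (8-bit) model.\n","# Class names such as Linear8bitLt and the module paths may differ between library versions.\n","from collections import Counter\n","\n","layer_types = Counter(type(module).__name__ for module in model.modules())\n","print(layer_types.most_common(10))\n","\n","# For a non-quantized comparison you could reload the base model without load_in_8bit=True\n","# and print the same counter; plain torch.nn.Linear layers would be expected instead."]},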
{"cell_type":"markdown","id":"HsG0x6Z7FwjZ","metadata":{"id":"HsG0x6Z7FwjZ"},"source":["## Load and process data\n","\n","Here we will use the [`financial_phrasebank`](https://huggingface.co./datasets/financial_phrasebank) dataset to fine-tune our model for sentiment classification of financial sentences. We will load the `sentences_allagree` configuration, which, according to the dataset card, corresponds to the split with 100% annotator agreement."]},{"cell_type":"code","execution_count":null,"id":"242cdfae","metadata":{"id":"242cdfae"},"outputs":[],"source":["# loading dataset\n","dataset = load_dataset(\"financial_phrasebank\", \"sentences_allagree\")\n","dataset = dataset[\"train\"].train_test_split(test_size=0.1)\n","dataset[\"validation\"] = dataset[\"test\"]\n","del dataset[\"test\"]\n","\n","classes = dataset[\"train\"].features[\"label\"].names\n","dataset = dataset.map(\n","    lambda x: {\"text_label\": [classes[label] for label in x[\"label\"]]},\n","    batched=True,\n","    num_proc=1,\n",")"]},{"cell_type":"markdown","id":"qzwyi-Z9yzRF","metadata":{"id":"qzwyi-Z9yzRF"},"source":["Let's also apply some pre-processing to the input data: the labels need to be pre-processed, and the tokens corresponding to `pad_token_id` need to be set to `-100` so that the `CrossEntropy` loss associated with the model correctly ignores these tokens."]},{"cell_type":"code","execution_count":null,"id":"6b7ea44c","metadata":{"id":"6b7ea44c"},"outputs":[],"source":["# data preprocessing\n","text_column = \"sentence\"\n","label_column = \"text_label\"\n","max_length = 128\n","\n","\n","def preprocess_function(examples):\n","    inputs = examples[text_column]\n","    targets = examples[label_column]\n","    model_inputs = tokenizer(inputs, max_length=max_length, padding=\"max_length\", truncation=True, return_tensors=\"pt\")\n","    labels = tokenizer(targets, max_length=3, padding=\"max_length\", truncation=True, return_tensors=\"pt\")\n","    labels = labels[\"input_ids\"]\n","    labels[labels == tokenizer.pad_token_id] = -100\n","    model_inputs[\"labels\"] = labels\n","    return model_inputs\n","\n","\n","processed_datasets = dataset.map(\n","    preprocess_function,\n","    batched=True,\n","    num_proc=1,\n","    remove_columns=dataset[\"train\"].column_names,\n","    load_from_cache_file=False,\n","    desc=\"Running tokenizer on dataset\",\n",")\n","\n","train_dataset = processed_datasets[\"train\"]\n","eval_dataset = processed_datasets[\"validation\"]"]},{"cell_type":"markdown","source":["## TODO #4\n","\n","Briefly investigate the structure of the Hub dataset `financial_phrasebank` used in the loading/processing above, and how this set is used for fine-tuning.\n","- Write your answer in Markdown."],"metadata":{"id":"zmh21tjCm01z"},"id":"zmh21tjCm01z"},{"cell_type":"markdown","source":["## Fine-tuning an NLP model with the 'financial_phrasebank' dataset\n","\n","A rough outline of how the dataset is used in this notebook is as follows:\n","\n","1. **Dataset loading and processing**:\n","   - The `financial_phrasebank` dataset contains finance-related sentences and is loaded through the Hugging Face Datasets library. Each example consists of a `sentence` string and a `label` (`negative`, `neutral` or `positive`); the `sentences_allagree` configuration keeps only sentences with full annotator agreement (see the inspection sketch below).\n","   - The data is split into training and validation sets, and the labels are mapped to their text form (`text_label`) and tokenized so they are ready for model training.\n","\n","2. **Model fine-tuning**:\n","   - `TrainingArguments` defines the training configuration, including the learning rate, batch size, number of epochs, and the save and evaluation schedule.\n","   - The `Trainer` class fine-tunes the model, taking the model, the training configuration, the training dataset and the validation dataset.\n","   - Calling `trainer.train()` trains the model.\n","\n","3. **Model inference**:\n","   - The trained model is then used for evaluation and inference.\n","   - `model.eval()` puts the model in inference mode, and an input sentence is defined.\n","   - The input sentence is tokenized and passed to the model to generate an output.\n","   - The model's output is decoded to obtain the prediction.\n","\n","4. **Printing the results**:\n","   - The input sentence and the model's prediction are printed.\n","\n","This gives a simple example of fine-tuning an NLP model on finance-related text and using it to make predictions. Through fine-tuning, the model adapts to a specific dataset and task, which enables higher performance and accuracy."],"metadata":{"id":"PzXUprxPPbI9"},"id":"PzXUprxPPbI9"},
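{"cell_type":"markdown","source":["To make the description above concrete, the short sketch below reloads the configuration used in this notebook and prints its features, label names and one example row. Exact example counts for the train/validation splits above differ from this raw view because of the random `train_test_split`."],"metadata":{"id":"todo4-example-md"},"id":"todo4-example-md"},{"cell_type":"code","execution_count":null,"id":"todo4-example-code","metadata":{"id":"todo4-example-code"},"outputs":[],"source":["# Quick look at the raw dataset used above: split sizes, features, label names and a sample row.\n","from datasets import load_dataset\n","\n","raw = load_dataset(\"financial_phrasebank\", \"sentences_allagree\")\n","print(raw)                                   # available splits and their sizes\n","print(raw[\"train\"].features)                 # 'sentence' (string) and 'label' (ClassLabel)\n","print(raw[\"train\"].features[\"label\"].names)  # expected: ['negative', 'neutral', 'positive']\n","print(raw[\"train\"][0])                       # one example row"]},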
{"cell_type":"markdown","id":"bcNTdVypGEPb","metadata":{"id":"bcNTdVypGEPb"},"source":["## Train our model!\n","\n","Let's now train our model by running the cells below.\n","Note that for T5, since some layers are kept in `float32` for stability purposes, there is no need to call autocast on the trainer."]},{"cell_type":"code","execution_count":null,"id":"69c756ac","metadata":{"id":"69c756ac"},"outputs":[],"source":["from transformers import TrainingArguments, Trainer\n","\n","training_args = TrainingArguments(\n","    \"temp\",\n","    evaluation_strategy=\"epoch\",\n","    learning_rate=1e-3,\n","    gradient_accumulation_steps=1,\n","    auto_find_batch_size=True,\n","    num_train_epochs=1,\n","    save_steps=100,\n","    save_total_limit=8,\n",")\n","trainer = Trainer(\n","    model=model,\n","    args=training_args,\n","    train_dataset=train_dataset,\n","    eval_dataset=eval_dataset,\n",")\n","model.config.use_cache = False  # silence the warnings. Please re-enable for inference!"]},{"cell_type":"code","execution_count":null,"id":"ab52b651","metadata":{"id":"ab52b651"},"outputs":[],"source":["trainer.train()"]},{"cell_type":"markdown","id":"r98VtofiGXtO","metadata":{"id":"r98VtofiGXtO"},"source":["## Qualitatively test our model"]},{"cell_type":"markdown","id":"NIm7z3UNzGPP","metadata":{"id":"NIm7z3UNzGPP"},"source":["Let's run a quick qualitative evaluation of the model by taking a sample from the dataset that corresponds to a positive label. Run generation just as you would with a regular `transformers` model:"]},{"cell_type":"code","execution_count":null,"id":"c95d6173","metadata":{"id":"c95d6173"},"outputs":[],"source":["model.eval()\n","input_text = \"In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 .\"\n","inputs = tokenizer(input_text, return_tensors=\"pt\")\n","\n","outputs = model.generate(input_ids=inputs[\"input_ids\"], max_new_tokens=10)\n","\n","print(\"input sentence: \", input_text)\n","print(\" output prediction: \", tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))"]},{"cell_type":"markdown","source":["## TODO #5\n","\n","Create your own Hugging Face account and go through the upload/download/verification steps below using your own account.\n","\n","Afterwards, write down the model id of the Hugging Face Hub repository you uploaded.\n","- Write your answer in Markdown."],"metadata":{"id":"ubwn2Qdbl3Fb"},"id":"ubwn2Qdbl3Fb"},{"cell_type":"markdown","source":[],"metadata":{"id":"hK-Mdl4VgKcN"},"id":"hK-Mdl4VgKcN"},{"cell_type":"markdown","id":"9QqBlwzoGZ3f","metadata":{"id":"9QqBlwzoGZ3f"},"source":["## Share your adapters on 🤗 Hub"]},{"cell_type":"markdown","id":"NT-C8SjcKqUx","metadata":{"id":"NT-C8SjcKqUx"},"source":["Once you have trained your adapter, you can easily share it on the Hub using the `push_to_hub` method. Note that only the adapter weights and config will be pushed."]},{"cell_type":"code","execution_count":null,"id":"bcbfa1f9","metadata":{"id":"bcbfa1f9"},"outputs":[],"source":["from huggingface_hub import notebook_login\n","\n","notebook_login()"]},{"cell_type":"code","execution_count":null,"id":"rFKJ4vHNGkJw","metadata":{"id":"rFKJ4vHNGkJw"},"outputs":[],"source":["model.push_to_hub(\"yysspp/flan-t5-large-financial-phrasebank-lora\", use_auth_token=True)"]},{"cell_type":"markdown","id":"xHuDmbCYJ89f","metadata":{"id":"xHuDmbCYJ89f"},"source":["## Load your adapter from the Hub"]},{"cell_type":"markdown","id":"ANFo6DdfKlU3","metadata":{"id":"ANFo6DdfKlU3"},"source":["You can load the base model together with the adapter in a few lines of code! Check the snippet below to load the adapter from the Hub and run the example evaluation!"]},
{"cell_type":"code","execution_count":null,"id":"j097aaPWJ-9u","metadata":{"id":"j097aaPWJ-9u"},"outputs":[],"source":["import torch\n","from peft import PeftModel, PeftConfig\n","from transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n","\n","peft_model_id = \"yysspp/flan-t5-large-financial-phrasebank-lora\"\n","config = PeftConfig.from_pretrained(peft_model_id)\n","\n","model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, torch_dtype=\"auto\", device_map=\"auto\")\n","tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)\n","\n","# Load the LoRA adapter on top of the base model\n","model = PeftModel.from_pretrained(model, peft_model_id)"]},{"cell_type":"code","execution_count":null,"id":"jmjwWYt0KI_I","metadata":{"id":"jmjwWYt0KI_I"},"outputs":[],"source":["model.eval()\n","input_text = \"In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 .\"\n","inputs = tokenizer(input_text, return_tensors=\"pt\")\n","\n","outputs = model.generate(input_ids=inputs[\"input_ids\"], max_new_tokens=10)\n","\n","print(\"input sentence: \", input_text)\n","print(\" output prediction: \", tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))"]}],"metadata":{"accelerator":"GPU","colab":{"provenance":[],"gpuType":"T4","toc_visible":true},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.11"},"vscode":{"interpreter":{"hash":"1219a10c7def3e2ad4f431cfa6f49d569fcc5949850132f23800e792129eefbb"}}},"nbformat":4,"nbformat_minor":5}