# Song Finder: Fine-Tuning and Using the Gemma-2B Model with LoRA on Hugging Face

---

### Project Overview

This project demonstrates how to fine-tune and use the **Gemma-2B** model for **Question-Answering (QA) tasks**. The fine-tuning process integrates **LoRA (Low-Rank Adaptation)** to reduce memory usage during training. The dataset is a custom-generated set of question-answer pairs about song lyrics; the original CSV data was converted into JSON for fine-tuning. The fine-tuned model can then generate answers to new questions posed in a similar format.

---

### Dataset Details

### 1. **Data Collection**:
- The dataset was collected by **web crawling song lyrics websites**, resulting in approximately **100,000 song entries**. Each entry contains metadata such as **song lyrics**, **song ID**, and **genre ID**.

### 2. **Preprocessing**:
- After collection, we performed **missing value handling** to clean the dataset. Only the information usable as **answers (A)** was retained at this stage.

### 3. **Q-A Pair Generation**:
- Since only **answer (A) information** was available, the corresponding **question (Q) information** was generated through the following process:
  1. **Data Transformation to Tags**: Song data was transformed into **tags** representing metadata, genre, and other characteristics.
  2. **Extracting Tags for Q Generation**: Relevant tags were extracted to create questions (Q) corresponding to the answers (A).
  3. **Creating Varied Wording Lists**: For each tag, **lists of possible wording variations** were created so that a variety of question forms could be generated, training the model on diverse question formats for each answer (see the sketch after this list).
  4. **Phonological Variations**: For song lyrics, different **phonological transformations** (e.g., consonant/vowel shifts, casual speech) were applied to simulate real-world variation in the generated questions.
  5. **OpenAI API for Q-A Generation**: After these transformations, the **OpenAI API** was used to generate **question-answer pairs**, with song metadata and lyrics provided as context for generating the question (Q). The resulting Q-A pairs were stored in a **JSON file** for model fine-tuning.
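To make step 3 concrete, here is a minimal sketch of how varied question wordings might be generated from a song's tags. The tag names and wording lists are hypothetical, for illustration only, and are not the project's actual lists:

```python
import random

# Hypothetical wording-variation lists keyed by tag type (illustrative only)
WORDINGS = {
    "year": [
        "that was released in {year}",
        "that came out around {year}",
    ],
    "lyrics": [
        "with the lyrics '{lyrics}'",
        "that goes '{lyrics}'",
    ],
}

def generate_question(tags: dict) -> str:
    """Build one question variant from a song's tag dictionary."""
    parts = [
        random.choice(WORDINGS[tag]).format(**{tag: value})
        for tag, value in tags.items()
        if tag in WORDINGS
    ]
    return "What is the name of the song " + " ".join(parts) + "?"

# Each call can produce a differently worded question for the same answer
print(generate_question({"year": "2022", "lyrics": "You got me for days"}))
```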
### 4. **Data Format**:
- The final set of **{'question', 'answer'} pairs** was stored in a **JSON file** with the following structure (the conversion code below produces a JSON array):

```json
[
    ...
    {
        "q_script": ...,
        "a_script": ...
    },
    {
        "q_script": "What is the name of the song that was released in 2022 with the lyrics 'You got me for days'?",
        "a_script": "The song is 'You Got Me' by Alan Walker, released in 2022."
    },
    {
        "q_script": ...,
        "a_script": ...
    },
    ...
]
```

- This dataset was originally collected in **CSV format** but was **converted into JSON** to make fine-tuning with the Gemma-2B model easier.

---

### Key Features:

- **Gemma-2B Model**: A pre-trained causal language model that generates answers based on input questions.
- **LoRA Integration**: LoRA reduces the number of trainable parameters, lowering the memory needed to fine-tune large models like Gemma-2B.
- **Custom Dataset**: The dataset consists of question-answer pairs created from song metadata, lyrics, and other related information.
- **Complex Queries**: This project demonstrates how to use the model to handle complex queries, such as identifying a song from specific lyrics or metadata.

---

### Files Included

- **Training Script**: A Python script for fine-tuning the Gemma-2B model using Keras and LoRA.
- **Dataset File**: JSON file of question-answer pairs used for training.
- **Inference Example**: Example prompts to query the fine-tuned model and generate answers.

---

### How to Run

### 1. Install the Required Libraries:

Ensure you have the required libraries installed:

```bash
pip install transformers keras-nlp pandas datasets huggingface_hub
```

### 2. Load the Dataset

The original dataset was in CSV format and is converted into JSON for fine-tuning. The following code loads the CSV file and writes the JSON file:

```python
import json

import pandas as pd

# Load the CSV dataset
csv_file = "qa_dataset.csv"
df = pd.read_csv(csv_file)

# Convert the rows to a list of {'q_script': ..., 'a_script': ...} records
json_file = "qa_dataset.json"
data = df.to_dict(orient="records")

# Save the records as a JSON array
with open(json_file, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

print(f"JSON file saved: {json_file}")
```

### 3. Fine-Tuning the Model

Use the following code to fine-tune the **Gemma-2B** model with **LoRA**:

```python
import json

import keras
from keras_nlp.models import GemmaCausalLM

# Load the fine-tuning dataset from the JSON file created above
with open("qa_dataset.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Format each pair into a single training string
formatted_data = [
    f"Question: {item['q_script']} Answer: {item['a_script']}" for item in data
]

# Load the Gemma-2B model
gemma_model_id = "gemma2_instruct_2b_en"
gemma_lm = GemmaCausalLM.from_preset(gemma_model_id)

# Enable LoRA for the model
gemma_lm.backbone.enable_lora(rank=4)

# Set the sequence length and compile the model
gemma_lm.preprocessor.sequence_length = 128
optimizer = keras.optimizers.AdamW(learning_rate=5e-5, weight_decay=0.01)
gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Fine-tune the model on the formatted data
gemma_lm.fit(formatted_data, epochs=1, batch_size=1)
```

### 4. Querying the Model

Once the model is fine-tuned, you can query it with new questions:

```python
# Format a question into the prompt template used during fine-tuning
def ask_question(query: str) -> str:
    template = "Question: {question} Answer: {answer}"
    return template.format(question=query, answer="")

# Ask a question and generate an answer
prompt = ask_question("How can I install Python 3 on an AWS EC2 instance?")
print(gemma_lm.generate(prompt, max_length=512))
```

---

### LoRA Integration

**LoRA (Low-Rank Adaptation)** is used in this project to make fine-tuning tractable. By reducing the number of trainable parameters, LoRA allows large models like Gemma-2B to be fine-tuned efficiently, even on devices with limited memory.
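As a rough illustration of why this saves memory, the sketch below (assuming the same `gemma2_instruct_2b_en` preset used above, and enough memory to load it) compares trainable parameter counts before and after enabling LoRA:

```python
import numpy as np
from keras_nlp.models import GemmaCausalLM

def n_trainable(model) -> int:
    """Total number of trainable parameters in the model."""
    return sum(int(np.prod(w.shape)) for w in model.trainable_weights)

# Fresh copy of the preset, before any LoRA modification
model = GemmaCausalLM.from_preset("gemma2_instruct_2b_en")
print("Trainable params (full fine-tuning):", n_trainable(model))

# enable_lora freezes the base weights and adds small rank-4 adapter weights
model.backbone.enable_lora(rank=4)
print("Trainable params (LoRA, rank=4):", n_trainable(model))
```

With LoRA enabled, only the low-rank adapter weights are updated during training, so the gradients and optimizer state occupy far less memory.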
---

### Uploading the Fine-Tuned Model to Hugging Face

Once the model is fine-tuned, you can upload it to the Hugging Face Hub using the following steps:

### 1. Install Hugging Face CLI

```bash
pip install huggingface_hub
huggingface-cli login
```

### 2. Save and Push the Model to Hugging Face

```python
from huggingface_hub import HfApi

# Save the fine-tuned keras-nlp model (weights, config, tokenizer) as a local preset
model_name = "my-fine-tuned-gemma"
gemma_lm.save_to_preset(model_name)

# Push the saved files to the Hugging Face Hub
api = HfApi()
api.upload_folder(
    folder_path=model_name,
    repo_id="username/my-fine-tuned-gemma",
    repo_type="model"
)
```

---

### Example Inference

Here’s an example of querying the fine-tuned model:

```python
# Define a prompt with a specific query (Korean). Translation: "It's frustrating
# because I can't remember what the song title was. I think it came out around
# 2016. It seems like a pop style. It's probably a song by The Chainsmokers."
prompt = ask_question("노래 제목이 도대체 뭐였는지 기억 안 나서 답답해요. 2016년경에 발표된 곡 같아요. 대중 음악 스타일 맞는 것 같아. 아마도 The Chainsmokers의 곡일 거야.")

# Generate the response
print(gemma_lm.generate(prompt, max_length=1012))
```

Output (Korean; translation below):

```
질문하신 노래는 ‘The Chainsmokers’의 ‘Closer (Feat. Halsey)’입니다. 해당 노래는 ‘일렉트로니카’ 장르의 노래입니다. ‘2016.11.05’ 발매되었습니다. ‘Collage EP’ 앨범에 수록되었습니다.
```

Translation: "The song you asked about is 'Closer (Feat. Halsey)' by The Chainsmokers. It is in the 'electronica' genre. It was released on 2016.11.05 and is included on the 'Collage EP' album."

---

### Acknowledgments

We would like to thank:

- The Google ML Bootcamp team for providing this opportunity.
- Hugging Face for providing excellent tools and models.
- The Keras team for Keras NLP support and integration.
- All contributors to the **Gemma** and **LoRA** projects.

---

### Contributors

This project was developed by:

- **Seohyun Kang**
- **Sujin Kim**
- **Mingyu Jo**