--- base_model: t5-small library_name: transformers license: apache-2.0 tags: - generated_from_trainer model-index: - name: en-af-sql-training-1727527893 results: [] datasets: - b-mc2/sql-create-context - Clinton/Text-to-sql-v1 - knowrohit07/know_sql language: - af - en pipeline_tag: text2text-generation metrics: - Exact Match - TSED (Tree Similarity of Editing Distance) - SQAM (SQL Query Analysis Metric) - BLEU score --- # en-af-sql-training-1727527893 This model is a fine-tuned version of [t5-small](https://huggingface.co./t5-small) on three datasets: b-mc2/sql-create-context, Clinton/Text-to-sql-v1, knowrohit07/know-sql. It achieves the following results on the evaluation set: - Loss: 0.0210 ## Model description This is a fine-tuned Afrikaans-to-SQL model. The pretrained [t5-small](https://huggingface.co./t5-small) was used to train our SQL model. ## Training and Evaluation Datasets As mentioned, to train the model we used a combination of three dataset which we split into training, testing, and validation sets. THe dataset can be found by following these links: - [b-mc2/sql-create-context](https://huggingface.co./datasets/b-mc2/sql-create-context) - [Clinton/Text-to-sql-v1](https://huggingface.co./datasets/Clinton/Text-to-sql-v1) - [knowrohit07/know-sql](https://huggingface.co./datasets/knowrohit07/know_sql) We did a 80-10-10 split on each dataset and then combined them into a single `DatasetDict` object with `train`, `test,` and `validation` sets. ```json DatasetDict({ train: Dataset({ features: ['answer', 'question', 'context', 'afr question'], num_rows: 118692 }) test: Dataset({ features: ['answer', 'question', 'context', 'afr question'], num_rows: 14838 }) validation: Dataset({ features: ['answer', 'question', 'context', 'afr question'], num_rows: 14838 }) }) ``` The pretrained model was then fine-tuned on the dataset splits. Rather than using only the `question`, the model also takes in the schema context such that it can generate more accurate queries for a given database. *Input prompt* ```python Table context: CREATE TABLE table_55794 ( "Home team" text, "Home team score" text, "Away team" text, "Away team score" text, "Venue" text, "Crowd" real, "Date" text ) Question: Watter tuisspan het'n span mebbourne? Answer: ``` *Expected Output* ```sql SELECT "Home team score" FROM table_55794 WHERE "Away team" = 'melbourne' ``` ## Intended uses & limitations This model takes in a single prompt (similar to the one above) that is tokenized and it then uses the `input_ids` to generate an output SQL query. However the prompt must be structured in a specific way. The `prompt` must start with the table/schema description followed by the question followed by an empty answer. Below we illustrate an example on how to use it. Furthermore, our combined dataset looks as follows: *Tokenized Dataset* ```json DatasetDict({ train: Dataset({ features: ['input_ids', 'labels'], num_rows: 118692 }) test: Dataset({ features: ['input_ids', 'labels'], num_rows: 14838 }) validation: Dataset({ features: ['input_ids', 'labels'], num_rows: 14838 }) }) ``` *Usage* ```python from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments # Load the model and tokenizer from Hugging Face Hub repo_name = "JsteReubsSoftware/en-af-sql-training-1727527893" en_af_sql_model = AutoModelForSeq2SeqLM.from_pretrained(repo_name, torch_dtype=torch.bfloat16) en_af_sql_model = en_af_sql_model.to('cuda') tokenizer = AutoTokenizer.from_pretrained(repo_name) question = "Watter tuisspan het'n span mebbourne?" context = "CREATE TABLE table_55794 ( "Home team" text, "Home team score" text, "Away team" text, "Away team score" text, "Venue" text, "Crowd" real, "Date" text )" prompt = f"""Tables: {context} Question: {question} Answer: """ inputs = tokenizer(prompt, return_tensors='pt') inputs = inputs.to('cuda') output = tokenizer.decode( en_af_sql_model.generate( inputs["input_ids"], max_new_tokens=200, )[0], skip_special_tokens=True ) print("Predicted SQL Query:") print(output) ``` ## Training procedure ### Training hyperparameters The following hyperparameters were used during training: - learning_rate: 0.005 - train_batch_size: 32 - eval_batch_size: 32 - seed: 42 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 - lr_scheduler_type: linear - num_epochs: 2 We used the following in our program: ```python output_dir = f'./en-af-sql-training-{str(int(time.time()))}' training_args = TrainingArguments( output_dir=output_dir, learning_rate=5e-3, num_train_epochs=2, per_device_train_batch_size=16, # batch size per device during training per_device_eval_batch_size=16, # batch size for evaluation weight_decay=0.01, logging_steps=50, evaluation_strategy='steps', # evaluation strategy to adopt during training eval_steps=500, # number of steps between evaluation ) trainer = Trainer( model=finetuned_model, args=training_args, train_dataset=tokenized_datasets['train'], eval_dataset=tokenized_datasets['validation'], ) ``` ### Training results | Training Loss | Epoch | Step | Validation Loss | |:-------------:|:------:|:----:|:---------------:| | 0.0573 | 0.1348 | 500 | 0.0452 | | 0.0424 | 0.2695 | 1000 | 0.0364 | | 0.037 | 0.4043 | 1500 | 0.0323 | | 0.0356 | 0.5391 | 2000 | 0.0287 | | 0.0328 | 0.6739 | 2500 | 0.0269 | | 0.0281 | 0.8086 | 3000 | 0.0255 | | 0.0286 | 0.9434 | 3500 | 0.0238 | | 0.0269 | 1.0782 | 4000 | 0.0233 | | 0.0247 | 1.2129 | 4500 | 0.0225 | | 0.0245 | 1.3477 | 5000 | 0.0217 | | 0.0226 | 1.4825 | 5500 | 0.0214 | | 0.0245 | 1.6173 | 6000 | 0.0211 | | 0.024 | 1.7520 | 6500 | 0.0210 | | 0.0249 | 1.8868 | 7000 | 0.0210 | ### Testing results After our model was trained and validated, we evaluated the model using four evaluation metrics. - *Exact Match Accuracy:* This measured the accuracy of our model predicting the exact same SQL query as the target query. - *TSED score:* This metric ranges from 0 to 1 and was proposed by [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper. It allows us to estimate the execution performance of the output query, allowing us to estimate the model's execution accuracy. - *SQAM accuracy:* Similar to TSED, we can used this to estimate the output query's execution accuracy (also see [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper). - *BLEU score:* This helps us measure the similarity between the output query and the target query. The following were the obtained results over the testing set (14838 records): - Exact Match = 35.98 % - TSED score: 0.897 - SQAM score: 74.31 % - BLEU score: 0.762 ### Citing this work: ```json @misc{jstereubssoftware_2024_Afr2SQL, title = {en-af-sql fine-tuned model}, author = {JsteReubsSoftware}, year = {2024}, url = {https://huggingface.co./JsteReubsSoftware/en-af-sql-training-1727527893} } ``` ### Framework versions - Transformers 4.44.2 - Pytorch 2.4.0 - Datasets 3.0.0 - Tokenizers 0.19.1