Space: motherduckdb/DuckDB-NSQL-7B

tdoehmen committed on Jan 29

Commit a3ab7c7 (1 parent: dcec0ff)

added azure endpoint

Files changed (2)

MODEL_README.md +0 -156
app.py +40 -3

MODEL_README.md DELETED

@@ -1,156 +0,0 @@
----
-license: llama2
-inference:
-  parameters:
-    do_sample: false
-    max_length: 200
-widget:
-- text: "CREATE TABLE stadium (\n    stadium_id number,\n    location text,\n    name text,\n    capacity number,\n)\n\n-- Using valid SQLite, answer the following questions for the tables provided above.\n\n-- how many stadiums in total?\n\nSELECT"
-  example_title: "Number stadiums"
-- text: "CREATE TABLE work_orders ( ID NUMBER, CREATED_AT TEXT, COST FLOAT, INVOICE_AMOUNT FLOAT, IS_DUE BOOLEAN, IS_OPEN BOOLEAN, IS_OVERDUE BOOLEAN, COUNTRY_NAME TEXT, )\n\n-- Using valid SQLite, answer the following questions for the tables provided above.\n\n-- how many work orders are open?\n\nSELECT"
-  example_title: "Open work orders"
-- text: "CREATE TABLE stadium ( stadium_id number, location text, name text, capacity number, highest number, lowest number, average number )\n\nCREATE TABLE singer ( singer_id number, name text, country text, song_name text, song_release_year text, age number, is_male others )\n\nCREATE TABLE concert ( concert_id number, concert_name text, theme text, stadium_id text, year text )\n\nCREATE TABLE singer_in_concert ( concert_id number, singer_id text )\n\n-- Using valid SQLite, answer the following questions for the tables provided above.\n\n-- What is the maximum, the average, and the minimum capacity of stadiums ?\n\nSELECT"
-  example_title: "Stadium capacity"
----
-# DucKDB-NSQL-7B
-## Model Description
-NSQL is a family of autoregressive open-source large foundation models (FMs) designed specifically for SQL generation tasks.
-In this repository we are introducing a new member of NSQL, DuckDB-NSQL. It's based on Meta's original [Llama-2 7B model](https://huggingface.co/meta-llama/Llama-2-7b) and further pre-trained on a dataset of general SQL queries and then fine-tuned on a dataset composed of DuckDB text-to-SQL pairs.
-## Training Data
-The general SQL queries are the SQL subset from [The Stack](https://huggingface.co/datasets/bigcode/the-stack), containing 1M training samples. The samples we transpiled to DuckDB SQL, using [sqlglot](https://github.com/tobymao/sqlglot). The labeled text-to-SQL pairs come [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL) that were also transpiled to DuckDB SQL, and 200k synthetically generated DuckDB SQL queries, based on the DuckDB v.0.9.2 documentation.
-## Evaluation Data
-We evaluate our models on a DuckDB-specific benchmark that contains 75 text-to-SQL pairs. The benchmark is available [here](https://github.com/NumbersStationAI/DuckDB-NSQL/).
-## Training Procedure
-DuckDB-NSQL was trained using cross-entropy loss to maximize the likelihood of sequential inputs. For finetuning on text-to-SQL pairs, we only compute the loss over the SQL portion of the pair. The model is trained using 80GB A100s, leveraging data and model parallelism. We pre-trained for 3 epochs and fine-tuned for 10 epochs.
-## Intended Use and Limitations
-The model was designed for text-to-SQL generation tasks from given table schema and natural language prompts. The model works best with the prompt format defined below and outputs.
-In contrast to existing text-to-SQL models, the SQL generation is not contrained to `SELECT` statements, but can generate any valid DuckDB SQL statement, including statements for official DuckDB extensions.
-## How to Use
-Example 1:
-```python
-import torch
-from transformers import AutoTokenizer, AutoModelForCausalLM
-tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
-model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
-text = """CREATE TABLE stadium (
-    stadium_id number,
-    location text,
-    name text,
-    capacity number,
-    highest number,
-    lowest number,
-    average number
-)
-CREATE TABLE singer (
-    singer_id number,
-    name text,
-    country text,
-    song_name text,
-    song_release_year text,
-    age number,
-    is_male others
-)
-CREATE TABLE concert (
-    concert_id number,
-    concert_name text,
-    theme text,
-    stadium_id text,
-    year text
-)
-CREATE TABLE singer_in_concert (
-    concert_id number,
-    singer_id text
-)
--- Using valid DuckDB SQL, answer the following questions for the tables provided above.
--- What is the maximum, the average, and the minimum capacity of stadiums ?
-SELECT"""
-input_ids = tokenizer(text, return_tensors="pt").input_ids
-generated_ids = model.generate(input_ids, max_length=500)
-print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
-```
-Example 2:
-```python
-import torch
-from transformers import AutoTokenizer, AutoModelForCausalLM
-tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
-model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
-text = """CREATE TABLE stadium (
-    stadium_id number,
-    location text,
-    name text,
-    capacity number,
-)
--- Using valid DuckDB SQL, answer the following questions for the tables provided above.
--- how many stadiums in total?
-SELECT"""
-input_ids = tokenizer(text, return_tensors="pt").input_ids
-generated_ids = model.generate(input_ids, max_length=500)
-print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
-```
-Example 3:
-```python
-import torch
-from transformers import AutoTokenizer, AutoModelForCausalLM
-tokenizer = AutoTokenizer.from_pretrained("motherduckdb/nsql-duckdb-7B")
-model = AutoModelForCausalLM.from_pretrained("motherduckdb/nsql-duckdb-7B", torch_dtype=torch.bfloat16)
-text = """CREATE TABLE work_orders (
-    ID NUMBER,
-    CREATED_AT TEXT,
-    COST FLOAT,
-    INVOICE_AMOUNT FLOAT,
-    IS_DUE BOOLEAN,
-    IS_OPEN BOOLEAN,
-    IS_OVERDUE BOOLEAN,
-    COUNTRY_NAME TEXT,
-)
--- Using valid DuckDB SQL, answer the following questions for the tables provided above.
--- how many work orders are open?
-SELECT"""
-input_ids = tokenizer(text, return_tensors="pt").input_ids
-generated_ids = model.generate(input_ids, max_length=500)
-print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
-```
-For more information (e.g., run with your local database), please find examples in [this repository](https://github.com/NumbersStationAI/DuckDB-NSQL).
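
Note on the removed "Training Procedure" text: computing the loss "only over the SQL portion of the pair" is commonly done by masking the prompt positions in the label sequence. The following is a minimal pure-Python sketch of that idea, not the authors' training code. The helper names (`mask_prompt_labels`, `masked_cross_entropy`) are hypothetical, the logits are assumed to be already aligned with their target tokens (the usual one-position shift of causal language models is left out), and the value -100 mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
import math

IGNORE_INDEX = -100  # label value marking positions excluded from the loss


def mask_prompt_labels(token_ids, prompt_len):
    """Copy token ids as labels, ignoring the first `prompt_len` positions."""
    return [IGNORE_INDEX] * prompt_len + list(token_ids[prompt_len:])


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: one list of unnormalized scores per position (length = vocab size)
    labels: one target token id per position, or IGNORE_INDEX
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt token (schema or question): contributes nothing
        # log-softmax with max subtraction for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0


# Toy sequence: 3 prompt tokens followed by 2 SQL tokens, vocabulary of 6.
token_ids = [5, 1, 4, 2, 0]
labels = mask_prompt_labels(token_ids, prompt_len=3)
uniform_logits = [[0.0] * 6 for _ in token_ids]
loss = masked_cross_entropy(uniform_logits, labels)
```

With uniform logits every unmasked position costs ln(6), so the mean over the two SQL tokens is also ln(6); the three prompt positions do not affect the result.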

app.py CHANGED

@@ -3,12 +3,24 @@ import requests
 import subprocess
 import re
 import sys
+import urllib.request
+import json
+import os
+import ssl
+import time
 PROMPT_TEMPLATE = """### Instruction:\n{instruction}\n\n### Input:\n{input}\n### Question:\n{question}\n\n### Response (use duckdb shorthand if possible):\n"""
 INSTRUCTION_TEMPLATE = """Your task is to generate valid duckdb SQL to answer the following question{has_schema}"""  # noqa: E501
 ERROR_MESSAGE = ":red[ Quack! Much to our regret, SQL generation has gone a tad duck-side-down.\nThe model is currently not able to craft a correct SQL query for this request. \nSorry my duck friend. ]\n\n:red[If the question is about your own database, make sure to set the correct schema. Otherwise, try to rephrase your request. ]\n\n```sql\n{sql_query}\n```\n\n```sql\n{error_msg}\n```"
 STOP_TOKENS = ["###", ";", "--", "```"]
+def allowSelfSignedHttps(allowed):
+    # bypass the server certificate verification on client side
+    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
+        ssl._create_default_https_context = ssl._create_unverified_context
+allowSelfSignedHttps(True) # this line is needed if you use self-signed certificate in your scoring service.
 def generate_prompt(question, schema):
     input = ""
@@ -34,10 +46,35 @@ def generate_prompt(question, schema):
     )
     return prompt
+def generate_sql_azure(question, schema):
+    prompt = generate_prompt(question, schema)
+    start = time.time()
+    data={
+        "input_data": {
+            "input_string": [prompt],
+            "parameters":{
+                "top_p": 0.9,
+                "temperature": 0.1,
+                "max_new_tokens": 200,
+                "do_sample": True
+            }
+        }
+    }
+    body = str.encode(json.dumps(data))
+    url = 'https://motherduck-eu-west2-xbdfd.westeurope.inference.ml.azure.com/score'
+    headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ st.secrets['azure_ai_token']), 'azureml-model-deployment': 'motherduckdb-duckdb-nsql-7b-v-1' }
+    req = urllib.request.Request(url, body, headers)
+    raw_resp = urllib.request.urlopen(req)
+    resp = json.loads(raw_resp.read().decode("utf-8"))[0]["0"]
+    sql_query = resp[len(prompt):]
+    print(time.time()-start)
+    return sql_query
 def generate_sql(question, schema):
     prompt = generate_prompt(question, schema)
+    start = time.time()
     s = requests.Session()
     api_base = "https://text-motherduck-sql-fp16-4vycuix6qcp2.octoai.run"
     url = f"{api_base}/v1/completions"
@@ -52,7 +89,7 @@ def generate_sql(question, schema):
     headers = {"Authorization": f"Bearer {st.secrets['octoml_token']}"}
     with s.post(url, json=body, headers=headers) as resp:
         sql_query = resp.json()["choices"][0]["text"]
+    print(time.time()-start)
     return sql_query
@@ -192,7 +229,7 @@ text_prompt = st.text_input(
 )
 if text_prompt:
-    sql_query = generate_sql(text_prompt, schema)
+    sql_query = generate_sql_azure(text_prompt, schema)
     valid, msg = validate_sql(sql_query, schema)
     if not valid:
         st.markdown(ERROR_MESSAGE.format(sql_query=sql_query, error_msg=msg))
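
Review note on `generate_sql_azure`: the commit turns off certificate verification for the whole process via `allowSelfSignedHttps(True)` and slices the prompt off the response by length alone. The sketch below shows one way the request construction and response parsing could be separated so they can be exercised without a network call while leaving default TLS verification in place. It is only an illustration under stated assumptions: `build_score_request` and `extract_sql` are hypothetical names, the response shape `[{"0": "<prompt + completion>"}]` is inferred from the `[0]["0"]` indexing in the diff, and truncating at `STOP_TOKENS` on the client side is an assumption about the intended behavior, not something the commit does.

```python
import json
import urllib.request

STOP_TOKENS = ["###", ";", "--", "```"]  # same values as in app.py


def build_score_request(prompt, url, token, deployment):
    """Build the POST request for the scoring endpoint without sending it."""
    payload = {
        "input_data": {
            "input_string": [prompt],
            "parameters": {
                "top_p": 0.9,
                "temperature": 0.1,
                "max_new_tokens": 200,
                "do_sample": True,
            },
        }
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
        "azureml-model-deployment": deployment,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def extract_sql(raw_body, prompt, stop_tokens=STOP_TOKENS):
    """Return the completion from a body shaped like [{"0": "<prompt + completion>"}]."""
    text = json.loads(raw_body)[0]["0"]
    # Strip the prompt only when the endpoint actually echoed it back.
    completion = text[len(prompt):] if text.startswith(prompt) else text
    # Cut at the earliest stop token, if any appears (assumed client-side behavior).
    positions = [completion.find(tok) for tok in stop_tokens]
    cut = min((p for p in positions if p != -1), default=len(completion))
    return completion[:cut].strip()
```

Sending the request would then be a call such as `urllib.request.urlopen(req, timeout=30)`, where the 30-second timeout is an arbitrary illustrative value; the original call passes no timeout, so a stalled endpoint would block the Streamlit app.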