SageLite/SageLite-s · Hugging Face

SageLite-s

Model Description

SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training:

MLM Pretraining: Standard masked language model (MLM) pretraining on mixed code and text data (The-Stack-v2 and Falcon-refinedweb).
Contrastive Pre-Finetuning: Learning from a large number of positive pairs mined from web data and GitHub.
Contrastive Fine-Tuning: Fine-tuning on a small amount of synthetic data.
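The two contrastive stages both learn from positive pairs. The exact objective is not stated here, so the following is only a minimal sketch of one common choice, an InfoNCE loss with in-batch negatives. The function names, the temperature value, and the toy vectors are assumptions for illustration, not SageLite's training code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed objective).

    queries[i] and positives[i] form a positive pair; every other
    positives[j] in the batch acts as a negative for queries[i].
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true positive.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (hypothetical values, not model outputs).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # each positive is close to its query
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # positives swapped, pairs misaligned

loss_aligned = info_nce_loss(queries, aligned)
loss_shuffled = info_nce_loss(queries, shuffled)
```

The loss is small when each query is closest to its own positive and grows when the pairs are misaligned, which is the signal that pulls matching code and text together in embedding space.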

Training Data

This checkpoint is trained on both The-Stack-v2 and Falcon-refinedweb. Supported languages (10 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.

How to Use

This checkpoint consists of an encoder (80M parameters) that produces 768-dimensional code embeddings. It can be loaded with the Hugging Face Transformers library and uses the StarCoder tokenizer.

import torch
from transformers import AutoModel, AutoTokenizer

# Specify the checkpoint
checkpoint = "SageLite/SageLite-s"
# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
embedding = model(inputs)[0]  # Extract the embedding
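For retrieval, embeddings are usually compared with cosine similarity. The sketch below assumes the model output is a per-token matrix that still needs pooling into one vector; whether `model(inputs)[0]` is already pooled is not stated here, so skip `mean_pool` if it is. The helpers `mean_pool` and `rank_by_cosine` are hypothetical names, and the 3-dimensional vectors are stand-ins for real 768-dimensional outputs.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Stand-in token-level outputs (hypothetical values, 3-D instead of 768-D).
query_tokens = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
candidates_tokens = [
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],  # unrelated snippet
    [[0.9, 0.1, 0.0], [1.0, 0.0, 0.0]],  # similar snippet
]

query_vec = mean_pool(query_tokens)
candidate_vecs = [mean_pool(tokens) for tokens in candidates_tokens]
ranking = rank_by_cosine(query_vec, candidate_vecs)
```

With real model outputs, convert each tensor to a list (or port the helpers to tensor operations) before ranking; the ordering logic stays the same.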

Code Retrieval Performance

1. Code2Code Search

Model Name	# Params	Embed Dim	Python	Java	JS	TS	C#	C	Ruby	PHP	Go	Avg
OpenAI-Code-01	NA	3072	21.92	8.90	4.90	5.70	3.15	11.58	26.25	16.60	9.40	12.04
OpenAI-Text-3-Small	NA	1536	25.18	12.61	8.00	9.44	5.46	15.86	30.70	23.33	11.20	15.57
OpenAI-Text-3-Large	NA	3072	40.57	25.33	20.09	22.00	11.84	31.90	42.54	41.84	21.75	28.65
CodeSage-v2-Small	130M	1024	45.60	33.65	39.96	47.78	19.19	30.55	40.12	55.39	30.96	38.13
CodeSage-v2-Base	356M	1024	55.86	42.89	45.29	54.58	23.90	38.52	56.02	64.56	42.88	47.17
CodeSage-v2-Large	1.3B	2048	61.11	47.09	51.18	60.67	28.04	43.40	60.74	67.87	43.86	51.55
SageLite-s	80M	768	47.93	30.83	35.15	37.64	18.14	30.53	42.89	50.70	21.69	35.06
SageLite-l	850M	1536	64.46	45.53	50.80	54.71	30.66	47.46	61.01	68.68	39.25	51.40
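The final column appears to be the unweighted mean of the nine per-language scores. As a quick check under that assumption, the snippet below recomputes it for the SageLite-s row using the values copied from the table.

```python
# Per-language Code2Code scores for SageLite-s, copied from the table.
sagelite_s = {
    "Python": 47.93, "Java": 30.83, "JS": 35.15, "TS": 37.64, "C#": 18.14,
    "C": 30.53, "Ruby": 42.89, "PHP": 50.70, "Go": 21.69,
}

# Unweighted mean over languages, rounded to two decimals like the table.
avg = round(sum(sagelite_s.values()) / len(sagelite_s), 2)
```

The result matches the 35.06 reported in the table, so each language contributes equally to the average regardless of benchmark size.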

2. NL2Code Search

Model Name	# Params	CoSQA	AdvTest	Python	Java	JS	PHP	Go	Ruby	Avg
OpenAI-Code-01	NA	52.20	36.03	63.13	67.85	62.30	57.47	85.22	69.28	61.69
OpenAI-Text-3-Small	NA	52.48	34.10	62.62	65.87	60.28	54.85	81.96	67.57	59.97
OpenAI-Text-3-Large	NA	55.21	46.83	70.81	72.89	68.12	59.58	87.60	75.22	67.03
CodeSage-v2-Small	130M	52.39	47.28	68.79	68.13	65.77	60.20	80.26	72.46	64.41
CodeSage-v2-Base	356M	50.74	52.00	70.46	70.89	69.61	62.81	82.37	73.71	66.57
CodeSage-v2-Large	1.3B	53.18	56.31	74.18	72.33	72.49	65.26	84.67	76.61	69.38
SageLite-s	80M	56.49	42.32	67.59	66.62	62.32	58.87	79.36	70.75	63.04
SageLite-l	850M	59.76	55.55	74.25	71.76	69.35	61.62	84.09	77.14	69.19

Text Retrieval Performance (MTEB Retrieval)

Dataset	SageLite-s	SageLite-l
ArguAna	57.75	60.71
CQADupstackWordpressRetrieval	32.42	38.63
FiQA2018	34.85	46.73
NFCorpus	29.97	33.70
QuoraRetrieval	85.35	87.50
SCIDOCS	18.99	21.38
SciFact	68.43	69.05
Touche2020	24.41	21.43
TRECCOVID	70.88	76.08
FEVER	71.72	73.64
HotpotQA	58.81	62.96
NQ	48.26	54.48
DBPedia	34.83	40.69
ClimateFEVER	25.69	26.20
MSMARCO	35.01	36.55
Average	46.49	49.98
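To see where the two checkpoints differ, the per-dataset scores can be compared directly. The snippet below is a small sketch that recomputes both averages from the table values and lists the datasets on which the smaller model scores higher.

```python
# (SageLite-s, SageLite-l) scores copied from the table.
scores = {
    "ArguAna": (57.75, 60.71),
    "CQADupstackWordpressRetrieval": (32.42, 38.63),
    "FiQA2018": (34.85, 46.73),
    "NFCorpus": (29.97, 33.70),
    "QuoraRetrieval": (85.35, 87.50),
    "SCIDOCS": (18.99, 21.38),
    "SciFact": (68.43, 69.05),
    "Touche2020": (24.41, 21.43),
    "TRECCOVID": (70.88, 76.08),
    "FEVER": (71.72, 73.64),
    "HotpotQA": (58.81, 62.96),
    "NQ": (48.26, 54.48),
    "DBPedia": (34.83, 40.69),
    "ClimateFEVER": (25.69, 26.20),
    "MSMARCO": (35.01, 36.55),
}

# Unweighted means over the 15 datasets, rounded like the table.
avg_s = round(sum(s for s, _ in scores.values()) / len(scores), 2)
avg_l = round(sum(l for _, l in scores.values()) / len(scores), 2)

# Datasets where the smaller checkpoint scores higher than the larger one.
small_wins = [name for name, (s, l) in scores.items() if s > l]
```

Both recomputed averages agree with the table, and Touche2020 is the only dataset where SageLite-s comes out ahead of SageLite-l.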
