Commit a18f941 by yuewang-sf (1 parent: 3565f6e)

update model files

Changed files:
- README.md +67 -0
- added_tokens.json +5 -0
- config.json +45 -0
- configuration_codet5p_matching.py +76 -0
- merges.txt +0 -0
- modeling_codet5p_matching.py +28 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +56 -0
- tokenizer.json +0 -0
- tokenizer_config.json +64 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -1,3 +1,70 @@
 ---
 license: bsd-3-clause
 ---
+
+# CodeT5+ 220M Bimodal Models
+
+## Model description
+
+[CodeT5+](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) is a new family of open code large language models
+with an encoder-decoder architecture that can flexibly operate in different modes (i.e., _encoder-only_, _decoder-only_,
+and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
+It is introduced in the paper:
+
+[CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
+by [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (*
+indicates equal contribution).
+
+Compared to the original CodeT5 family (base: `220M`, large: `770M`), CodeT5+ is pretrained with a diverse set of
+pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code
+matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
+Additionally, it employs a simple yet effective _compute-efficient pretraining_ method to initialize the model
+components with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen) to efficiently scale
+up the model (i.e., to `2B`, `6B`, and `16B`), and adopts a "shallow encoder and deep decoder" architecture.
+Furthermore, it is instruction-tuned to align with natural language instructions (see our InstructCodeT5+ 16B)
+following [Code Alpaca](https://github.com/sahil280114/codealpaca).
+
+## How to use
+
+This model can be loaded with `AutoModel` and uses the [CodeT5](https://github.com/salesforce/CodeT5) tokenizer with three added special tokens (`[ENC]`, `[TDEC]`, `[CDEC]`).
+This checkpoint consists of a CodeT5+ 220M model, a projection layer, and an `itm_head` layer for text-code matching.
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+checkpoint = "Salesforce/codet5p-220m-bimodal"
+device = "cuda"  # for GPU usage or "cpu" for CPU usage
+
+tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
+model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
+```
+
+## Pretraining data
+
+This checkpoint is trained on the stricter permissive subset of the deduplicated version of
+the [github-code dataset](https://huggingface.co/datasets/codeparrot/github-code).
+The data is preprocessed by retaining only permissively licensed code ("mit", "apache-2", "bsd-3-clause", "bsd-2-clause",
+"cc0-1.0", "unlicense", "isc").
+Supported languages (9 in total) are as follows:
+`c`, `c++`, `c-sharp`, `go`, `java`, `javascript`, `php`, `python`, `ruby`.
+
+## Training procedure
+
+This checkpoint is trained first on unimodal code data in the first pretraining stage and then on bimodal text-code
+pair data using the proposed mixture of pretraining tasks.
+Please refer to the paper for more details.
+
+## Evaluation results
+
+Please refer to the paper and the official GitHub repo for more details.
+
+## BibTeX entry and citation info
+
+```bibtex
+@article{wang2023codet5plus,
+  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
+  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
+  journal={arXiv preprint},
+  year={2023}
+}
+```
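The license and language filtering described under "Pretraining data" in the README could be sketched roughly as below. This is only an illustration, not the authors' preprocessing pipeline: the record layout (`license` and `language` keys) and the helper name `keep_record` are assumptions, while the license and language strings are copied from the README text.

```python
# Hypothetical sketch of the README's data filter. The record keys
# ("license", "language") are assumed for illustration only.
PERMISSIVE_LICENSES = {
    "mit", "apache-2", "bsd-3-clause", "bsd-2-clause",
    "cc0-1.0", "unlicense", "isc",
}
SUPPORTED_LANGUAGES = {
    "c", "c++", "c-sharp", "go", "java", "javascript", "php", "python", "ruby",
}


def keep_record(record: dict) -> bool:
    """Return True if the record has a listed license and a listed language."""
    license_id = str(record.get("license", "")).lower()
    language = str(record.get("language", "")).lower()
    return license_id in PERMISSIVE_LICENSES and language in SUPPORTED_LANGUAGES


samples = [
    {"license": "mit", "language": "python"},
    {"license": "gpl-3.0", "language": "python"},
    {"license": "isc", "language": "haskell"},
]
kept = [r for r in samples if keep_record(r)]
print(len(kept))
```

Only the first sample passes both checks; the other two fail on license and language respectively.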
added_tokens.json
ADDED
@@ -0,0 +1,5 @@
+{
+  "[CDEC]": 32102,
+  "[ENC]": 32100,
+  "[TDEC]": 32101
+}
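The three added token ids line up with the `vocab_size` of 32103 declared in `config.json`. A small sanity check along these lines can confirm that the ids are contiguous and occupy the top of the vocabulary; the helper name `base_vocab_size` is made up for this sketch and is not part of the commit.

```python
import json

# Contents of added_tokens.json as shown in this commit.
ADDED_TOKENS_JSON = '{"[CDEC]": 32102, "[ENC]": 32100, "[TDEC]": 32101}'
VOCAB_SIZE = 32103  # "vocab_size" from config.json in this commit


def base_vocab_size(added: dict, vocab_size: int) -> int:
    """Check the added ids are contiguous and end at vocab_size - 1.

    Returns the size of the vocabulary before the tokens were added.
    """
    ids = sorted(added.values())
    if ids != list(range(ids[0], ids[0] + len(ids))):
        raise ValueError(f"added token ids are not contiguous: {ids}")
    if ids[-1] != vocab_size - 1:
        raise ValueError(f"last added id {ids[-1]} does not match vocab_size {vocab_size}")
    return ids[0]


added_tokens = json.loads(ADDED_TOKENS_JSON)
print(base_vocab_size(added_tokens, VOCAB_SIZE))  # 32100
```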
config.json
ADDED
@@ -0,0 +1,45 @@
+{
+  "_name_or_path": "Salesforce/codet5p-220m-bimodal",
+  "architectures": [
+    "CodeT5pBimodalModel"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_codet5p_bimodal.CodeT5pBimodalConfig",
+    "AutoModel": "modeling_codet5p_bimodal.CodeT5pBimodalModel"
+  },
+  "bos_token_id": 1,
+  "d_ff": 3072,
+  "d_kv": 64,
+  "d_model": 768,
+  "embed_dim": 256,
+  "decoder_start_token_id": 0,
+  "dense_act_fn": "relu",
+  "dropout_rate": 0.1,
+  "eos_token_id": 2,
+  "feed_forward_proj": "relu",
+  "gradient_checkpointing": false,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_factor": 1.0,
+  "is_encoder_decoder": true,
+  "is_gated_act": false,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_epsilon": 1e-06,
+  "model_type": "codet5p_bimodal",
+  "n_positions": 512,
+  "num_decoder_layers": 12,
+  "num_heads": 12,
+  "num_layers": 12,
+  "output_past": true,
+  "pad_token_id": 0,
+  "relative_attention_max_distance": 128,
+  "relative_attention_num_buckets": 32,
+
+  "torch_dtype": "float32",
+  "transformers_version": "4.30.2",
+  "use_cache": true,
+  "vocab_size": 32103
+}
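One thing worth checking in this commit is that `auto_map` in `config.json` points at modules named `*_codet5p_bimodal`, while the Python files added here are named `*_codet5p_matching`. The sketch below compares the two using only what this diff shows; the helper name `missing_auto_map_modules` is hypothetical, and whether other revisions of the repository add the referenced files is not visible from this commit.

```python
# "auto_map" entries copied from config.json in this commit.
AUTO_MAP = {
    "AutoConfig": "configuration_codet5p_bimodal.CodeT5pBimodalConfig",
    "AutoModel": "modeling_codet5p_bimodal.CodeT5pBimodalModel",
}

# File list copied from the commit header.
COMMIT_FILES = [
    "README.md",
    "added_tokens.json",
    "config.json",
    "configuration_codet5p_matching.py",
    "merges.txt",
    "modeling_codet5p_matching.py",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
]


def missing_auto_map_modules(auto_map: dict, files: list) -> list:
    """Return module files referenced by auto_map that are absent from files."""
    present = set(files)
    missing = []
    for target in auto_map.values():
        # "module_name.ClassName" -> "module_name.py"
        filename = target.rsplit(".", 1)[0] + ".py"
        if filename not in present and filename not in missing:
            missing.append(filename)
    return missing


print(missing_auto_map_modules(AUTO_MAP, COMMIT_FILES))
```

For the values in this commit, both referenced module files are reported as missing, which suggests the file names and the `auto_map` entries should be reconciled before loading with `trust_remote_code=True`.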
configuration_codet5p_matching.py
ADDED
@@ -0,0 +1,76 @@
+# coding=utf-8
+# Copyright 2023 Salesforce authors, The EleutherAI, and HuggingFace Teams. All rights reserved.
+
+""" CodeT5+ embedding model configuration"""
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+
+logger = logging.get_logger(__name__)
+
+
+class CodeT5pMatchingConfig(PretrainedConfig):
+    model_type = "codet5p_matching"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    attribute_map = {"hidden_size": "d_model", "num_attention_heads": "num_heads", "num_hidden_layers": "num_layers"}
+
+    def __init__(
+            self,
+            vocab_size=32103,
+            d_model=768,
+            embed_dim=256,
+            d_kv=64,
+            d_ff=3072,
+            num_layers=12,
+            num_decoder_layers=None,
+            num_heads=12,
+            relative_attention_num_buckets=32,
+            relative_attention_max_distance=128,
+            dropout_rate=0.1,
+            layer_norm_epsilon=1e-6,
+            initializer_factor=1.0,
+            feed_forward_proj="relu",
+            is_encoder_decoder=False,
+            use_cache=True,
+            pad_token_id=0,
+            eos_token_id=2,
+            **kwargs
+    ):
+        self.vocab_size = vocab_size
+        self.d_model = d_model
+        self.embed_dim = embed_dim
+        self.d_kv = d_kv
+        self.d_ff = d_ff
+        self.num_layers = num_layers
+        self.num_decoder_layers = (
+            num_decoder_layers if num_decoder_layers is not None else self.num_layers
+        )  # default = symmetry
+        self.num_heads = num_heads
+        self.relative_attention_num_buckets = relative_attention_num_buckets
+        self.relative_attention_max_distance = relative_attention_max_distance
+        self.dropout_rate = dropout_rate
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.initializer_factor = initializer_factor
+        self.feed_forward_proj = feed_forward_proj
+        self.use_cache = use_cache
+
+        act_info = self.feed_forward_proj.split("-")
+        self.dense_act_fn = act_info[-1]
+        self.is_gated_act = act_info[0] == "gated"
+
+        if len(act_info) > 1 and act_info[0] != "gated" or len(act_info) > 2:
+            raise ValueError(
+                f"`feed_forward_proj`: {feed_forward_proj} is not a valid activation function of the dense layer."
+                "Please make sure `feed_forward_proj` is of the format `gated-{ACT_FN}` or `{ACT_FN}`, e.g. "
+                "'gated-gelu' or 'relu'"
+            )
+
+        # for backwards compatibility
+        if feed_forward_proj == "gated-gelu":
+            self.dense_act_fn = "gelu_new"
+
+        super().__init__(
+            pad_token_id=pad_token_id,
+            eos_token_id=eos_token_id,
+            is_encoder_decoder=is_encoder_decoder,
+            **kwargs,
+        )
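The `feed_forward_proj` handling in the config constructor is easier to follow when pulled out on its own. The standalone function below mirrors that parsing logic for illustration; the function name `parse_feed_forward_proj` and its tuple return value are assumptions of this sketch, not something defined in the commit.

```python
from typing import Tuple


def parse_feed_forward_proj(feed_forward_proj: str) -> Tuple[str, bool]:
    """Mirror the config's parsing: return (dense_act_fn, is_gated_act)."""
    act_info = feed_forward_proj.split("-")
    dense_act_fn = act_info[-1]
    is_gated_act = act_info[0] == "gated"

    # Accept "{ACT_FN}" or "gated-{ACT_FN}"; reject anything else.
    if (len(act_info) > 1 and act_info[0] != "gated") or len(act_info) > 2:
        raise ValueError(
            f"`feed_forward_proj`: {feed_forward_proj} is not a valid "
            "activation function of the dense layer."
        )

    # Backwards-compatibility special case kept from the config class.
    if feed_forward_proj == "gated-gelu":
        dense_act_fn = "gelu_new"

    return dense_act_fn, is_gated_act


print(parse_feed_forward_proj("relu"))        # ('relu', False)
print(parse_feed_forward_proj("gated-gelu"))  # ('gelu_new', True)
```

With the `"relu"` value used in this commit's `config.json`, the result matches the stored `dense_act_fn` of `"relu"` and `is_gated_act` of `false`.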
merges.txt
ADDED
The diff for this file is too large to render.
modeling_codet5p_matching.py
ADDED
@@ -0,0 +1,28 @@
+# coding=utf-8
+# Copyright 2023 Salesforce authors, The EleutherAI, and HuggingFace Teams. All rights reserved.
+""" PyTorch CodeT5+ matching models.
+The implementation is based on transformers.models.t5.modeling_t5 by adding a projection layer on T5EncoderModel
+"""
+
+from typing import Optional, Tuple, Union
+import torch
+from torch import nn
+import torch.nn.functional as F
+from transformers import T5ForConditionalGeneration
+from transformers.modeling_outputs import (
+    BaseModelOutput,
+)
+from .configuration_codet5p_matching import CodeT5pMatchingConfig
+
+
+class CodeT5pMatchingModel(T5ForConditionalGeneration):
+    config_class = CodeT5pMatchingConfig
+
+    authorized_missing_keys = [
+        r"encoder.embed_tokens.weight",
+    ]
+
+    def __init__(self, config: CodeT5pMatchingConfig):
+        super().__init__(config)
+        self.proj = nn.Linear(config.d_model, config.embed_dim)
+        self.itm_head = nn.Linear(config.d_model, 2)
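The constructor adds two `nn.Linear` layers on top of `T5ForConditionalGeneration`: `proj` (from `d_model` to `embed_dim`) and `itm_head` (from `d_model` to 2 classes). How they are used in a forward pass is not part of this diff, so the sketch below only works out how many parameters each head contributes, assuming the default `bias=True` of `nn.Linear` and the dimensions from `config.json`. The helper `linear_param_count` is hypothetical.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameters of a dense layer: weight matrix plus optional bias vector."""
    return in_features * out_features + (out_features if bias else 0)


D_MODEL = 768    # "d_model" in config.json
EMBED_DIM = 256  # "embed_dim" in config.json

proj_params = linear_param_count(D_MODEL, EMBED_DIM)  # self.proj
itm_params = linear_param_count(D_MODEL, 2)           # self.itm_head

print(proj_params)  # 196864
print(itm_params)   # 1538
```

Both heads together add fewer than 0.2M parameters, which is small next to the 220M backbone.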
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:42fb839e42789ccaa6e7ed10b6dd8b6906c09bf1e4281e20e5c9eedbea60de6c
+size 892417313
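The `pytorch_model.bin` entry is a Git LFS pointer rather than the weights themselves. A minimal parser for the three fields shown above might look like the following; the function name `parse_lfs_pointer` and the returned dictionary keys are assumptions of this sketch. The parameter estimate divides the file size by 4 bytes (the config declares `float32`) and is only a rough upper bound, since the file also contains serialization overhead.

```python
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:42fb839e42789ccaa6e7ed10b6dd8b6906c09bf1e4281e20e5c9eedbea60de6c
size 892417313
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of a Git LFS pointer into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / (1024 ** 2)
approx_params = pointer["size_bytes"] // 4  # rough bound, assumes float32 only

print(f"{size_mib:.1f} MiB, at most about {approx_params / 1e6:.0f}M float32 values")
```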
special_tokens_map.json
ADDED
@@ -0,0 +1,56 @@
+{
+  "additional_special_tokens": [
+    "[ENC]",
+    "[TDEC]",
+    "[CDEC]"
+  ],
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,64 @@
+{
+  "add_prefix_space": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": {
+    "__type": "AddedToken",
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "errors": "replace",
+  "mask_token": {
+    "__type": "AddedToken",
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "model_max_length": 512,
+  "pad_token": {
+    "__type": "AddedToken",
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "__type": "AddedToken",
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "tokenizer_class": "RobertaTokenizer",
+  "trim_offsets": true,
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
vocab.json
ADDED
The diff for this file is too large to render.