yuewang-sf committed
Commit a18f941 · 1 Parent(s): 3565f6e

update model files

README.md CHANGED
@@ -1,3 +1,70 @@
  ---
  license: bsd-3-clause
  ---
+
+ # CodeT5+ 220M Bimodal Models
+
+ ## Model description
+
+ [CodeT5+](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) is a new family of open code large language models
+ with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_,
+ and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
+ It is introduced in the paper:
+
+ [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
+ by [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (*
+ indicates equal contribution).
+
+ Compared to the original CodeT5 family (base: `220M`, large: `770M`), CodeT5+ is pretrained with a diverse set of
+ pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code
+ matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
+ Additionally, it employs a simple yet effective _compute-efficient pretraining_ method to initialize the model
+ components with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen) to efficiently scale
+ up the model (i.e. `2B`, `6B`, `16B`), and adopts a "shallow encoder and deep decoder" architecture.
+ Furthermore, it is instruction-tuned to align with natural language instructions (see our InstructCodeT5+ 16B)
+ following [Code Alpaca](https://github.com/sahil280114/codealpaca).
+
+ ## How to use
+
+ This model can be easily loaded using the `AutoModel` functionality and employs the [CodeT5](https://github.com/salesforce/CodeT5) tokenizer with three special tokens added (`[ENC]`, `[TDEC]`, `[CDEC]`).
+ This checkpoint consists of a CodeT5+ 220M model, a projection layer, and an `itm_head` layer for text-code matching.
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ checkpoint = "Salesforce/codet5p-220m-bimodal"
+ device = "cuda"  # use "cpu" to run on CPU instead
+
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
+ model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
+ ```
+
+ ## Pretraining data
+
+ This checkpoint is trained on the stricter permissive subset of the deduplicated version of
+ the [github-code dataset](https://huggingface.co/datasets/codeparrot/github-code).
+ The data is preprocessed by retaining only permissively licensed code ("mit", "apache-2", "bsd-3-clause", "bsd-2-clause",
+ "cc0-1.0", "unlicense", "isc").
+ Supported languages (9 in total) are as follows:
+ `c`, `c++`, `c-sharp`, `go`, `java`, `javascript`, `php`, `python`, `ruby`.
+
+ ## Training procedure
+
+ This checkpoint is first trained on unimodal code data in the first pretraining stage and then on bimodal text-code
+ pair data using the proposed mixture of pretraining tasks.
+ Please refer to the paper for more details.
+
+ ## Evaluation results
+
+ Please refer to the paper and the official GitHub repo for more details.
+
+ ## BibTeX entry and citation info
+
+ ```bibtex
+ @article{wang2023codet5plus,
+   title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
+   author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
+   journal={arXiv preprint},
+   year={2023}
+ }
+ ```
added_tokens.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "[CDEC]": 32102,
+   "[ENC]": 32100,
+   "[TDEC]": 32101
+ }
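The three added tokens take ids 32100 to 32102, which lines up with `"vocab_size": 32103` in config.json. The check below is a minimal sketch of that arithmetic; the helper name `expected_vocab_size` is hypothetical, and it assumes (not stated in this commit) that the base tokenizer fills ids 0 to 32099 without gaps.

```python
import json

# Contents of added_tokens.json as added in this commit.
ADDED_TOKENS_JSON = '{"[CDEC]": 32102, "[ENC]": 32100, "[TDEC]": 32101}'


def expected_vocab_size(added_tokens_json: str) -> int:
    """Return (highest added id + 1) after checking the added ids are contiguous.

    Assumption: the base vocabulary occupies every id below the lowest added id.
    """
    ids = sorted(json.loads(added_tokens_json).values())
    if ids != list(range(ids[0], ids[0] + len(ids))):
        raise ValueError(f"added token ids are not contiguous: {ids}")
    return ids[-1] + 1


vocab_size = expected_vocab_size(ADDED_TOKENS_JSON)
print(vocab_size)
```

If the embedding matrix were smaller than this value, looking up `[CDEC]` would index past its last row, so the two numbers need to agree.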
config.json ADDED
@@ -0,0 +1,45 @@
+ {
+   "_name_or_path": "Salesforce/codet5p-220m-bimodal",
+   "architectures": [
+     "CodeT5pBimodalModel"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_codet5p_bimodal.CodeT5pBimodalConfig",
+     "AutoModel": "modeling_codet5p_bimodal.CodeT5pBimodalModel"
+   },
+   "bos_token_id": 1,
+   "d_ff": 3072,
+   "d_kv": 64,
+   "d_model": 768,
+   "embed_dim": 256,
+   "decoder_start_token_id": 0,
+   "dense_act_fn": "relu",
+   "dropout_rate": 0.1,
+   "eos_token_id": 2,
+   "feed_forward_proj": "relu",
+   "gradient_checkpointing": false,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "is_gated_act": false,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "codet5p_bimodal",
+   "n_positions": 512,
+   "num_decoder_layers": 12,
+   "num_heads": 12,
+   "num_layers": 12,
+   "output_past": true,
+   "pad_token_id": 0,
+   "relative_attention_max_distance": 128,
+   "relative_attention_num_buckets": 32,
+
+   "torch_dtype": "float32",
+   "transformers_version": "4.30.2",
+   "use_cache": true,
+   "vocab_size": 32103
+ }
configuration_codet5p_matching.py ADDED
@@ -0,0 +1,76 @@
+ # coding=utf-8
+ # Copyright 2023 Salesforce authors, The EleutherAI, and HuggingFace Teams. All rights reserved.
+
+ """ CodeT5+ text-code matching model configuration"""
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+
+ logger = logging.get_logger(__name__)
+
+
+ class CodeT5pMatchingConfig(PretrainedConfig):
+     model_type = "codet5p_matching"
+     keys_to_ignore_at_inference = ["past_key_values"]
+     attribute_map = {"hidden_size": "d_model", "num_attention_heads": "num_heads", "num_hidden_layers": "num_layers"}
+
+     def __init__(
+         self,
+         vocab_size=32103,
+         d_model=768,
+         embed_dim=256,
+         d_kv=64,
+         d_ff=3072,
+         num_layers=12,
+         num_decoder_layers=None,
+         num_heads=12,
+         relative_attention_num_buckets=32,
+         relative_attention_max_distance=128,
+         dropout_rate=0.1,
+         layer_norm_epsilon=1e-6,
+         initializer_factor=1.0,
+         feed_forward_proj="relu",
+         is_encoder_decoder=False,
+         use_cache=True,
+         pad_token_id=0,
+         eos_token_id=2,
+         **kwargs
+     ):
+         self.vocab_size = vocab_size
+         self.d_model = d_model
+         self.embed_dim = embed_dim
+         self.d_kv = d_kv
+         self.d_ff = d_ff
+         self.num_layers = num_layers
+         self.num_decoder_layers = (
+             num_decoder_layers if num_decoder_layers is not None else self.num_layers
+         )  # default = symmetry
+         self.num_heads = num_heads
+         self.relative_attention_num_buckets = relative_attention_num_buckets
+         self.relative_attention_max_distance = relative_attention_max_distance
+         self.dropout_rate = dropout_rate
+         self.layer_norm_epsilon = layer_norm_epsilon
+         self.initializer_factor = initializer_factor
+         self.feed_forward_proj = feed_forward_proj
+         self.use_cache = use_cache
+
+         # `feed_forward_proj` is either "{ACT_FN}" or "gated-{ACT_FN}"
+         act_info = self.feed_forward_proj.split("-")
+         self.dense_act_fn = act_info[-1]
+         self.is_gated_act = act_info[0] == "gated"
+
+         if (len(act_info) > 1 and act_info[0] != "gated") or len(act_info) > 2:
+             raise ValueError(
+                 f"`feed_forward_proj`: {feed_forward_proj} is not a valid activation function of the dense layer. "
+                 "Please make sure `feed_forward_proj` is of the format `gated-{ACT_FN}` or `{ACT_FN}`, e.g. "
+                 "'gated-gelu' or 'relu'"
+             )
+
+         # for backwards compatibility
+         if feed_forward_proj == "gated-gelu":
+             self.dense_act_fn = "gelu_new"
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             eos_token_id=eos_token_id,
+             is_encoder_decoder=is_encoder_decoder,
+             **kwargs,
+         )
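The `feed_forward_proj` handling in the configuration class can be read in isolation. The function below is a minimal standalone sketch of the same rules; the name `parse_feed_forward_proj` is hypothetical and does not exist in this commit or in `transformers`.

```python
from typing import Tuple


def parse_feed_forward_proj(feed_forward_proj: str) -> Tuple[str, bool]:
    """Return (dense_act_fn, is_gated_act) using the rules from the config class."""
    act_info = feed_forward_proj.split("-")

    # Valid forms are "{ACT_FN}" or "gated-{ACT_FN}"; anything else is rejected.
    if (len(act_info) > 1 and act_info[0] != "gated") or len(act_info) > 2:
        raise ValueError(
            f"`feed_forward_proj`: {feed_forward_proj} is not of the format "
            "`gated-{ACT_FN}` or `{ACT_FN}`"
        )

    dense_act_fn = act_info[-1]
    is_gated_act = act_info[0] == "gated"

    # Backwards-compatibility special case kept from the config class.
    if feed_forward_proj == "gated-gelu":
        dense_act_fn = "gelu_new"

    return dense_act_fn, is_gated_act


print(parse_feed_forward_proj("relu"))
print(parse_feed_forward_proj("gated-gelu"))
```

With the `"relu"` value shipped in config.json this yields an ungated ReLU feed-forward block, which agrees with the `dense_act_fn` and `is_gated_act` fields stored there.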
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
modeling_codet5p_matching.py ADDED
@@ -0,0 +1,28 @@
+ # coding=utf-8
+ # Copyright 2023 Salesforce authors, The EleutherAI, and HuggingFace Teams. All rights reserved.
+ """ PyTorch CodeT5+ matching models.
+ The implementation is based on transformers.models.t5.modeling_t5 by adding a projection layer and a text-code matching head on T5ForConditionalGeneration
+ """
+
+ from typing import Optional, Tuple, Union
+ import torch
+ from torch import nn
+ import torch.nn.functional as F
+ from transformers import T5ForConditionalGeneration
+ from transformers.modeling_outputs import (
+     BaseModelOutput,
+ )
+ # Relative import so the file resolves when loaded with `trust_remote_code=True`
+ from .configuration_codet5p_matching import CodeT5pMatchingConfig
+
+
+ class CodeT5pMatchingModel(T5ForConditionalGeneration):
+     config_class = CodeT5pMatchingConfig
+
+     authorized_missing_keys = [
+         r"encoder.embed_tokens.weight",
+     ]
+
+     def __init__(self, config: CodeT5pMatchingConfig):
+         super().__init__(config)
+         # Linear projection from d_model to embed_dim
+         self.proj = nn.Linear(config.d_model, config.embed_dim)
+         # Two-way classification head for text-code matching
+         self.itm_head = nn.Linear(config.d_model, 2)
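To get a sense of how little the two extra heads add on top of the T5 backbone, their parameter counts can be worked out by hand. This is a rough sketch, assuming the `d_model=768` and `embed_dim=256` values from config.json and the default `bias=True` of `nn.Linear`; the helper `linear_param_count` is hypothetical.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameters of a dense layer: weight matrix plus optional bias vector."""
    return in_features * out_features + (out_features if bias else 0)


D_MODEL = 768    # "d_model" in config.json
EMBED_DIM = 256  # "embed_dim" in config.json

proj_params = linear_param_count(D_MODEL, EMBED_DIM)  # self.proj
itm_head_params = linear_param_count(D_MODEL, 2)      # self.itm_head

print(proj_params)      # 196864
print(itm_head_params)  # 1538
```

Together the heads come to just under 0.2M parameters, so nearly all of the checkpoint is the encoder-decoder itself.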
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:42fb839e42789ccaa6e7ed10b6dd8b6906c09bf1e4281e20e5c9eedbea60de6c
+ size 892417313
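The Git LFS pointer stores only the size and hash of the real weights file, but the size already permits a quick sanity check against the "220M" in the model name. The sketch below assumes every stored value is a 4-byte float32 (per `"torch_dtype": "float32"` in config.json) and ignores serialization overhead, so the result is an upper bound rather than an exact count; `parse_lfs_pointer` is a hypothetical helper.

```python
# Pointer file contents as added in this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:42fb839e42789ccaa6e7ed10b6dd8b6906c09bf1e4281e20e5c9eedbea60de6c
size 892417313
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())


fields = parse_lfs_pointer(POINTER)
size_bytes = int(fields["size"])

# Assumption: 4 bytes per parameter (float32), no overhead accounted for.
max_params = size_bytes // 4
print(f"{max_params / 1e6:.1f}M parameters at most")
```

The estimate of roughly 223M is in line with a 220M-parameter backbone plus the added embeddings and heads.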
special_tokens_map.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "additional_special_tokens": [
+     "[ENC]",
+     "[TDEC]",
+     "[CDEC]"
+   ],
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
+ {
+   "add_prefix_space": false,
+   "bos_token": {
+     "__type": "AddedToken",
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": {
+     "__type": "AddedToken",
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "__type": "AddedToken",
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "errors": "replace",
+   "mask_token": {
+     "__type": "AddedToken",
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "model_max_length": 512,
+   "pad_token": {
+     "__type": "AddedToken",
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "__type": "AddedToken",
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": {
+     "__type": "AddedToken",
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff