hhou435 committed on
Commit 629fe90
1 Parent(s): 530746d
README.md CHANGED
@@ -1,77 +1,157 @@
  ---
- language: All languages
- datasets: ISML datasets (80 thousand hours of unlabeled data) + Babel datasets (2 thousand hours of unlabeled data)

- # Chinese W2v-conformer
  ## Model description
- This is the set of speech W2v-conformer models pre-trained by UER-py. You can download the model from the [UER-py Github page](https://github.com/dbiir/UER-py/).

  ## How to use
- You can use the model directly for speech recognition:
  ```python
- >>> import yaml
- >>> from wenet.dataset.dataset import CollateFunc, AudioDataset
- >>> from wenet.transformer.asr_model import ASRModel
- >>> from wenet.transformer.encoder import ConformerEncoder
- >>> from wenet.transformer.decoder import TransformerDecoder
- >>> from wenet.transformer.ctc import CTC
- >>> from wenet.utils.executor import Executor
- >>> from wenet.utils.checkpoint import save_checkpoint, load_checkpoint
- >>> # args.config, args.checkpoint, input_dim and vocab_size are assumed to be
- >>> # provided by the surrounding training script (see wenet/bin/train.py below).
- >>> with open(args.config, 'r') as fin:
- ...     configs = yaml.load(fin, Loader=yaml.FullLoader)
- >>> encoder = ConformerEncoder(input_dim, **configs['encoder_conf'])
- >>> decoder = TransformerDecoder(vocab_size, encoder.output_size(), **configs['decoder_conf'])
- >>> ctc = CTC(vocab_size, encoder.output_size())
- >>> model = ASRModel(
- ...     vocab_size=vocab_size,
- ...     encoder=encoder,
- ...     decoder=decoder,
- ...     ctc=ctc,
- ...     **configs['model_conf'],
- ... )
- >>> infos = load_checkpoint(model, args.checkpoint)
  ```

  ## Training data
- ISML datasets (80 thousand hours of unlabeled data) and Babel datasets (2 thousand hours of unlabeled data) are used as training data.

  ## Training procedure
- The model is pre-trained with wav2vec 2.0 (https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 70 epochs with a batch size of 128 and use the same hyper-parameters across model sizes.
- The downstream models are fine-tuned:

- Stage 1:

  ```
- python wenet/bin/train.py --gpu 0,1,2,3,4,5,6,7 \
-     --config $train_config \
-     --train_data train.data \
-     --cv_data dev.data \
-     ${checkpoint:+--checkpoint $checkpoint} \
-     --model_dir $dir \
-     --ddp.init_method $init_method \
-     --ddp.world_size 7 \
-     --ddp.dist_backend nccl \
-     --num_workers 2
  ```

- ### BibTeX entry and citation info
  ```
- @article{baevski2020wav2vec,
-   title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
-   author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
-   journal={arXiv preprint arXiv:2006.11477},
-   year={2020}
- }

- @article{zhang2020pushing,
-   title={Pushing the limits of semi-supervised learning for automatic speech recognition},
-   author={Zhang, Yu and Qin, James and Park, Daniel S and Han, Wei and Chiu, Chung-Cheng and Pang, Ruoming and Le, Quoc V and Wu, Yonghui},
-   journal={arXiv preprint arXiv:2010.10504},
-   year={2020}
  }

- @article{zhang2021wenet,
-   title={WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit},
-   author={Zhang, Binbin and Wu, Di and Yang, Chao and Chen, Xiaoyu and Peng, Zhendong and Wang, Xiangming and Yao, Zhuoyuan and Wang, Xiong and Yu, Fan and Xie, Lei and others},
-   journal={arXiv preprint arXiv:2102.01547},
-   year={2021}
  }
  ```
  [base]:https://huggingface.co/uer/albert-base-chinese-cluecorpussmall
 
  ---
+ language: Chinese
+ datasets: CLUECorpusSmall
+ widget:
+ - text: "中国的首都是[MASK]京"
+ ---
+
+ # Chinese ALBERT

  ## Model description
+
+ This is the set of Chinese ALBERT models pre-trained by UER-py. You can download the model either from the [UER-py Github page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:
+
+ |                  |               Link               |
+ | ---------------- | :------------------------------: |
+ | **ALBERT-Base**  |  [**L=12/H=768 (Base)**][base]   |
+ | **ALBERT-Large** | [**L=24/H=1024 (Large)**][large] |

  ## How to use
+
+ You can use the model directly with a pipeline for masked language modeling (fill-mask):
+
+ ```python
+ >>> from transformers import BertTokenizer, AlbertForMaskedLM, FillMaskPipeline
+ >>> tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ >>> model = AlbertForMaskedLM.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ >>> unmasker = FillMaskPipeline(model, tokenizer)
+ >>> unmasker("中国的首都是[MASK]京。")
+ [
+     {'sequence': '中 国 的 首 都 是 北 京 。',
+      'score': 0.8528032898902893,
+      'token': 1266,
+      'token_str': '北'},
+     {'sequence': '中 国 的 首 都 是 南 京 。',
+      'score': 0.07667620480060577,
+      'token': 1298,
+      'token_str': '南'},
+     {'sequence': '中 国 的 首 都 是 东 京 。',
+      'score': 0.020440367981791496,
+      'token': 691,
+      'token_str': '东'},
+     {'sequence': '中 国 的 首 都 是 维 京 。',
+      'score': 0.010197942145168781,
+      'token': 5335,
+      'token_str': '维'},
+     {'sequence': '中 国 的 首 都 是 汴 京 。',
+      'score': 0.0075391442514956,
+      'token': 3745,
+      'token_str': '汴'}
+ ]
+ ```
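Equivalently (an illustrative sketch, not part of this commit), the high-level `pipeline` helper in `transformers` can resolve the same checkpoint by name, since the `config.json` added below declares `BertTokenizer` as the tokenizer class:

```python
from transformers import pipeline

# Sketch: the fill-mask pipeline loads the model and tokenizer from the Hub by repo name.
unmasker = pipeline("fill-mask", model="uer/albert-base-chinese-cluecorpussmall")
print(unmasker("中国的首都是[MASK]京。")[0]["token_str"])  # top prediction, expected to be 北
```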
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
  ```python
+ from transformers import BertTokenizer, AlbertModel
+ tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ model = AlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ text = "用你喜欢的任何文本替换我。"
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ ```

+ and in TensorFlow:
+
+ ```python
+ from transformers import BertTokenizer, TFAlbertModel
+ tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ model = TFAlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
+ text = "用你喜欢的任何文本替换我。"
+ encoded_input = tokenizer(text, return_tensors='tf')
+ output = model(encoded_input)
  ```
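The `output` above contains per-token hidden states; if a single sentence vector is wanted, one common option (a minimal sketch under that assumption, not part of this commit) is to mean-pool the last hidden states over non-padding tokens:

```python
import torch
from transformers import BertTokenizer, AlbertModel

tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
model = AlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")

encoded_input = tokenizer("用你喜欢的任何文本替换我。", return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input)

# Average the last hidden states over non-padding tokens to get one 768-dimensional vector.
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_vector = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```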

  ## Training data
+
+ [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.
+
  ## Training procedure
+
+ The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for an additional 250,000 steps with a sequence length of 512. We use the same hyper-parameters across model sizes.
+
+ Taking the case of ALBERT-Base as an example:
+
+ Stage 1:
+
  ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
+                       --seq_length 128 --processes_num 32 --target albert
  ```

  ```
+ python3 pretrain.py --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/albert/base_config.json \
+                     --output_model_path models/cluecorpussmall_albert_base_seq128_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
+                     --learning_rate 1e-4 --batch_size 64 \
+                     --factorized_embedding_parameterization --parameter_sharing \
+                     --embedding word_pos_seg --encoder transformer --mask fully_visible --target albert
+ ```

+ Stage 2:
+
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
+                       --seq_length 512 --processes_num 32 --target albert
+ ```
+
+ ```
+ python3 pretrain.py --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
+                     --pretrained_model_path models/cluecorpussmall_albert_base_seq128_model.bin-1000000 \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/albert/base_config.json \
+                     --output_model_path models/cluecorpussmall_albert_base_seq512_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 250000 --save_checkpoint_steps 100000 --report_steps 50000 \
+                     --learning_rate 1e-4 --batch_size 64 \
+                     --factorized_embedding_parameterization --parameter_sharing \
+                     --embedding word_pos_seg --encoder transformer --mask fully_visible --target albert
+ ```
+
+ Finally, we convert the pre-trained model into Huggingface's format:
+
+ ```
+ python3 scripts/convert_albert_from_uer_to_huggingface.py --input_model_path cluecorpussmall_albert_base_seq512_model.bin-250000 \
+                                                           --output_model_path pytorch_model.bin
+ ```
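As a quick sanity check after conversion (an illustrative sketch, assuming `pytorch_model.bin`, `config.json` and `vocab.txt` sit in the current directory, as they do in this repository), the converted weights should load cleanly with `transformers`:

```python
from transformers import AlbertForMaskedLM, BertTokenizer

# Load the converted checkpoint and the vocabulary from the local directory.
model = AlbertForMaskedLM.from_pretrained(".")
tokenizer = BertTokenizer.from_pretrained(".")
print(sum(p.numel() for p in model.parameters()))  # parameter count of the converted model
```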
+
+ ### BibTeX entry and citation info
+
+ ```
+ @article{lan2019albert,
+   title={ALBERT: A lite BERT for self-supervised learning of language representations},
+   author={Lan, Zhenzhong and Chen, Mingda and Goodman, Sebastian and Gimpel, Kevin and Sharma, Piyush and Soricut, Radu},
+   journal={arXiv preprint arXiv:1909.11942},
+   year={2019}
  }

+ @article{zhao2019uer,
+   title={UER: An Open-Source Toolkit for Pre-training Models},
+   author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
+   journal={EMNLP-IJCNLP 2019},
+   pages={241},
+   year={2019}
  }
  ```
  [base]:https://huggingface.co/uer/albert-base-chinese-cluecorpussmall
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "_name_or_path": "albert",
+   "architectures": [
+     "AlbertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0,
+   "bos_token_id": 2,
+   "classifier_dropout_prob": 0.1,
+   "embedding_size": 128,
+   "eos_token_id": 3,
+   "hidden_act": "relu",
+   "hidden_dropout_prob": 0,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "inner_group_num": 1,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "albert",
+   "num_attention_heads": 12,
+   "num_hidden_groups": 1,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "tokenizer_class": "BertTokenizer",
+   "transformers_version": "4.6.0",
+   "type_vocab_size": 2,
+   "vocab_size": 21128
+ }
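The configuration pairs a small factorized embedding (`embedding_size` 128) with a 768-dimensional, 12-layer encoder whose layers share one set of weights (`num_hidden_groups` 1). A short illustrative sketch (not part of the commit) of rebuilding an equivalent, randomly initialized model from these fields:

```python
from transformers import AlbertConfig, AlbertModel

# Architecturally relevant fields copied from the config.json above.
config = AlbertConfig(
    vocab_size=21128,
    embedding_size=128,
    hidden_size=768,
    num_hidden_layers=12,
    num_hidden_groups=1,      # all 12 layers reuse one set of weights
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="relu",
    max_position_embeddings=512,
    type_vocab_size=2,
)
model = AlbertModel(config)   # randomly initialized; use from_pretrained(...) for the released weights
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```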
69.pt → pytorch_model.bin RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:eee2e05f3ca00624ab4a5bac31ca538f05d1c2ccd975f85074dc4c3ad13793b4
- size 562887224
+ oid sha256:4e90c5f6b64fda667d9a10a8065878a4790515a0df171e361787354b25526141
+ size 40325143
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:00b2f0b8fa2b513f5dde4fe14f25978c459e1381cb7ff0fd259fc98c4a6b4d61
+ size 51528256
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512}
train_conformer_large_w2v.yaml DELETED
@@ -1,119 +0,0 @@
- # network architecture
- # encoder related
- encoder: conformer
- encoder_conf:
-     output_size: 512     # dimension of attention
-     attention_heads: 8
-     linear_units: 2048   # the number of units of position-wise feed forward
-     num_blocks: 18       # the number of encoder blocks
-     dropout_rate: 0.1
-     positional_dropout_rate: 0.0
-     attention_dropout_rate: 0.0
-     input_layer: conv2d6 # encoder input type, you can choose conv2d, conv2d6 and conv2d8
-     normalize_before: true
-     cnn_module_kernel: 15
-     use_cnn_module: True
-     activation_type: 'swish'
-     macaron_style: True
-     pos_enc_layer_type: 'rel_pos'
-     selfattention_layer_type: 'abs_selfattn'
-     nonorm: False
-     cnn_prev: True
-     cnn_after: False
-
- # decoder related
- decoder: transformer
- decoder_conf:
-     attention_heads: 4
-     linear_units: 2048
-     num_blocks: 1
-     dropout_rate: 0.0
-     positional_dropout_rate: 0.0
-     self_attention_dropout_rate: 0.0
-     src_attention_dropout_rate: 0.0
-
- # hybrid CTC/attention
- model_conf:
-     ctc_weight: 1.0
-     lsm_weight: 0.1    # label smoothing option
-     length_normalized_loss: false
-
- raw_wav: False
- data_save: True
- use_gc: True
-
- w2v_encoder: True
- pretrain: True
- random_pretrain: False
- wav2vec: True
- w2v_coef: 1.0
-
- mpc_didi_ver: False
- wav2mpc: False
- wav2mpc_reduction: False
- mpc_mask_loss: False
- mpc_coef: 0.0
-
- mask: True
- quantize_targets: True
- project_targets: True
- latent_vars: 320
- w2v_reduct: True
- w2v_ext_loss: True
- w2v_loss_weights: [0.1, 0]
-
- w2v_mask_prob: 0.65
- mpc_prob: 0.5
-
- remove_valbest: False
-
- model:
-     method: 'npc'                   # Accepts npc/apc/vqapc
-     paras:
-         kernel_size: 15             # Receptive field size (R) = kernel_size + 2*(n_blocks)
-         mask_size: 5                # Desired input mask size (M_in) as described in NPC paper
-         n_blocks: 4                 # Number of ConvBlocks stacked in NPC model
-         hidden_size: 512            # Dimension of feature of all layers
-         dropout: 0.1                # Dropout in ConvBlock
-         residual: True              # Residual connection in ConvBlock
-         batch_norm: True            # Apply BatchNorm in ConvBlock
-         activate: 'relu'            # Activation function of ConvBlock
-         disable_cross_layer: False  # Apply Masked ConvBlock at last layer only
-         vq:
-             codebook_size: [64, 64, 64, 64]  # Codebook size of each group in VQ-layer
-             code_dim: [128, 128, 128, 128]   # Dim of each group summing up to hidden_size
-             gumbel_temperature: 1.0          # Temperature of Gumbel Softmax in VQ-layer
-
- collate_conf:
-     spec_aug: false
-
- # specaugmentation related
- spec_aug_conf:
-     num_time_mask: 2
-     num_freq_mask: 2
-     max_time_mask: 50
-     max_freq_mask: 10
-     max_time_warp: 80
-     gauss_mask_for_time: False
-     warp_for_time: False
-
- # dataset related
- dataset_conf:
-     max_length: 4500
-     min_length: 80
-     max_frames_in_batch: 16000
-     batch_type: 'dynamic' # static or dynamic
-     batch_size: 20
-     sort: true
-
- grad_clip: 10
- accum_grad: 2
- max_epoch: 180
- log_interval: 100
-
- optim: adam
- optim_conf:
-     lr: 0.001
- scheduler: warmuplr    # pytorch v1.1.0+ required
- scheduler_conf:
-     warmup_steps: 10000
vocab.txt ADDED
The diff for this file is too large to render. See raw diff