---
language:
- zh
tags:
- t5
- pytorch
- zh
- Text2Text-Generation
license: "apache-2.0"
widget:
- text: "对联:丹枫江冷人初去"
---

# T5 for Chinese Couplet (t5-chinese-couplet) Model

A T5 model for Chinese couplet generation.

Performance of `t5-chinese-couplet` on the couplet **test** data, illustrated with a sample prediction:

|prefix|input_text|target_text|pred|
|:-- |:--- |:--- |:-- |
|对联:|春回大地,对对黄莺鸣暖树|日照神州,群群紫燕衔新泥|福至人间,家家紫燕舞和风|

On the Couplet test set, the generated lines match the input in character count, align with it in part of speech and in wording, and have the form of a couplet. They do not yet achieve well-matched semantic parallelism or follow the tonal pattern rules (平仄, level and oblique tones).

The T5 network architecture (original T5):

![arch](t5.png)

## Usage

This model is open-sourced as part of the text generation project [textgen](https://github.com/shibing624/textgen), which supports T5 models. It can be called as follows:

Install package:
```shell
pip install -U textgen
```

```python
from textgen import T5Model

# Load the couplet model from the Hugging Face Hub
model = T5Model("t5", "shibing624/t5-chinese-couplet")
# The input is the first line of the couplet, preceded by the "对联:" prefix
r = model.predict(["对联:丹枫江冷人初去"])
print(r)  # ['白石矶寒客不归']
```

## Usage (HuggingFace Transformers)

Without [textgen](https://github.com/shibing624/textgen), you can use the model like this: pass your input through the transformer model, then decode the generated sentence.

Install package:
```
pip install transformers
```

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("shibing624/t5-chinese-couplet")
model = T5ForConditionalGeneration.from_pretrained("shibing624/t5-chinese-couplet")


def batch_generate(input_texts, max_length=64):
    # padding=True lets a batch of inputs with different lengths form one tensor
    features = tokenizer(input_texts, padding=True, return_tensors='pt')
    outputs = model.generate(input_ids=features['input_ids'],
                             attention_mask=features['attention_mask'],
                             max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


r = batch_generate(["对联:丹枫江冷人初去"])
print(r)
```

output:
```shell
['白石矶寒客不归']
```

Model files:
```
t5-chinese-couplet
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
├── spiece.model
└── vocab.txt
```

### Training dataset

#### Chinese couplet dataset

- Data: [couplet dataset (GitHub)](https://github.com/wb14123/couplet-dataset), [cleaned couplet dataset (GitHub)](https://github.com/v-zich/couplet-clean-dataset)
- Related resources:
  - [Hugging Face](https://huggingface.co./)
  - Langboat Chinese [Mengzi T5 pretrained model](https://huggingface.co./Langboat/mengzi-t5-base) and [paper](https://arxiv.org/pdf/2110.06696.pdf)
  - [textgen](https://github.com/shibing624/textgen)

Data format:

```text
head -n 1 couplet_files/couplet/train/in.txt
晚 风 摇 树 树 还 挺

head -n 1 couplet_files/couplet/train/out.txt
晨 露 润 花 花 更 红
```

To train a T5 model, see [https://github.com/shibing624/textgen/blob/main/docs/%E5%AF%B9%E8%81%94%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B%E5%AF%B9%E6%AF%94.md](https://github.com/shibing624/textgen/blob/main/docs/%E5%AF%B9%E8%81%94%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B%E5%AF%B9%E6%AF%94.md)

## Citation

```latex
@software{textgen,
  author = {Xu Ming},
  title = {textgen: Implementation of Text Generation models},
  year = {2022},
  url = {https://github.com/shibing624/textgen},
}
```
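
The data format above stores each couplet line as space-separated characters, while the inference examples pass unspaced text with a `对联:` prefix. The following is a minimal sketch of how the two files could be paired into prompt/target strings. The helper name `build_pairs` is hypothetical, and reusing the inference prefix for training data is an assumption, not the documented textgen pipeline; see the training guide linked above for the actual procedure.

```python
def build_pairs(in_lines, out_lines, prefix="对联:"):
    """Pair first and second couplet lines (hypothetical helper, not part of textgen).

    Each input line holds space-separated characters, as in in.txt / out.txt.
    The prefix is assumed to match the one used in the inference examples.
    """
    pairs = []
    for src, tgt in zip(in_lines, out_lines):
        # Remove the spaces between characters and the trailing newline
        src = "".join(src.split())
        tgt = "".join(tgt.split())
        if src and tgt:  # skip empty lines
            pairs.append((prefix + src, tgt))
    return pairs


# Sample lines copied from the data format shown in this card
pairs = build_pairs(["晚 风 摇 树 树 还 挺\n"], ["晨 露 润 花 花 更 红\n"])
print(pairs)
```

In practice the two lists would come from reading `in.txt` and `out.txt` line by line; `zip` stops at the shorter file, so mismatched line counts are silently truncated rather than reported.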