|
|
|
|
|
# Background
|
|
|
GPT-2 uses byte-level BPE, while BERT's tokenizer (WordPiece) works at the character / Unicode level rather than the byte level.
|
|
|
|
|
- BPE on a Unicode (code point) sequence

- BPE on a UTF-8 byte sequence (see the sketch below)
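The difference is only in what the atomic symbols are before any merges are learned. A minimal illustration in Python (the example string is arbitrary):

```python
text = "词表 vocab"

# BPE on a Unicode sequence: the base alphabet is code points, so every
# distinct character (including rare CJK ones) needs its own vocabulary entry.
print(list(text))                   # ['词', '表', ' ', 'v', 'o', 'c', 'a', 'b']

# BPE on a UTF-8 byte sequence: the base alphabet is just the 256 byte values,
# so any text is representable and an <unk> token is never needed.
print(list(text.encode("utf-8")))   # [232, 175, 141, 232, 161, 168, 32, 118, 111, 99, 97, 98]
```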
|
|
|
Source: https://huggingface.co./gpt2/tree/main
|
|
|
### Problems with BPE
|
|
|
|
|
- Running BPE directly on raw text lets strings like `dog.` and `dog!` get merged into single tokens, so the vocabulary ends up holding many variants of the same word.
|
|
|
Problems specific to byte-level BPE:
|
|
|
- BPE glues the leading space onto the word that follows it, e.g. `bpe.decode(bpes[1:2]) = ' world'`. For an NER task, does that mean the space ends up inside the labeled span?

- BPE treats 'world' and ' world' as two completely different tokens, which seems undesirable (see the sketch after this list).

- Case: similarly, 'World' and 'world' are unrelated tokens.
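Both points are easy to check with the Hugging Face `GPT2Tokenizer`; a quick sketch, assuming the pretrained `gpt2` checkpoint and without asserting which exact ids come out:

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

ids = tok.encode("hello world")
print(tok.convert_ids_to_tokens(ids))  # ['hello', 'Ġworld'] - the space is glued onto 'world'
print(repr(tok.decode(ids[1:2])))      # ' world'  - decoding the second token returns the space too

# 'world' with and without a leading space map to different ids
print(tok.encode("world"), tok.encode(" world"))
```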
|
|
|
|
|
### How this is addressed
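As far as I understand it, GPT-2 avoids the `dog.` / `dog!` merges by pre-splitting the text with a regular expression before any BPE merges are applied, so letters, digits and punctuation never share an initial chunk, and a leading space stays attached to the word behind it. A sketch using the pattern from OpenAI's published encoder.py (it needs the third-party `regex` module because of `\p{L}` / `\p{N}`):

```python
import regex as re

# Pre-tokenization pattern from OpenAI's encoder.py: contractions,
# optional-space + letters, optional-space + digits, optional-space + other
# symbols, and runs of whitespace.
pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(pat.findall("The dog. The dog!"))
# ['The', ' dog', '.', ' The', ' dog', '!'] - the punctuation is split off,
# so BPE can never merge 'dog' with the '.' or '!' that follows it.
```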
|
|
|
|
|
|
|
### GPT-2's byte-level BPE
|
|
|
|
|
|
|
# Downloads
|
|
|
### Official
|
|
|
### huggingface (same as the official release)
|
|
|
- [vocab.json](https://huggingface.co./gpt2-large/resolve/main/vocab.json): 50257 key-value pairs (see the sanity check after this list). Also at https://huggingface.co./gpt2/resolve/main/vocab.json
|
- [merges.txt](https://huggingface.co./gpt2-large/resolve/main/merges.txt): 50001 lines. Also at https://huggingface.co./gpt2/resolve/main/merges.txt
|
- Does merges.txt contain every possible combination? https://github.com/huggingface/transformers/issues/4777
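The counts line up once you know the layout: merges.txt is a `#version` header line plus 50000 merge rules, and vocab.json holds 256 byte tokens + 50000 merged tokens + `<|endoftext|>` = 50257. A quick sanity check, assuming the `gpt2` checkpoint on the Hub:

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")  # fetches vocab.json and merges.txt

print(len(tok))                          # 50257 = 256 byte tokens + 50000 merges + 1 special token
print(tok.eos_token, tok.eos_token_id)   # <|endoftext|> 50256
```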
|
|
|
### fairseq (same as the official release)
|
|
|
- vocab.bpe: 50001 lines
|
- encoder.json: 50257 key-value pairs
|
- dict.txt: 50260 lines, purely numeric; generated by fairseq-preprocess (see the usage sketch after the download links). https://github.com/pytorch/fairseq/issues/1186
|
|
|
|
|
- https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json |
|
- https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe |
|
- https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt |
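fairseq ships a copy of OpenAI's encoder, so encoder.json / vocab.bpe are consumed together, and dict.txt only comes into play later, when fairseq-preprocess remaps the BPE ids onto its own dictionary. A rough sketch (the module path `fairseq.data.encoders.gpt2_bpe_utils` is from memory, so treat the exact import as an assumption):

```python
# Sketch only: the import path and a local fairseq install are assumptions.
from fairseq.data.encoders.gpt2_bpe_utils import get_encoder

bpe = get_encoder("encoder.json", "vocab.bpe")   # the two files downloaded above

ids = bpe.encode("What's up with the tokenizer?")
print(ids)                # the same GPT-2 BPE ids that vocab.json would give
print(bpe.decode(ids))    # back to: What's up with the tokenizer?

# dict.txt is not used here; fairseq-preprocess reads it afterwards to map
# these ids onto fairseq's own dictionary indices.
```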
|
|
|
|
|
# Related questions
|
|
|
### What is Ġ?
|
|
|
It's a feature of byte-level BPE (an encoded space character).
|
Ġ stands for a space; in some dumps you will see Ä instead of Ġ (most likely the same character mis-decoded as Latin-1).
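The character comes from GPT-2's byte-to-unicode table: every byte value gets a printable stand-in, and bytes that are not already printable (space, newline, control bytes) are shifted into the U+0100 range. A sketch of that table as published in OpenAI's encoder.py (reproduced from memory, so double-check against the original):

```python
def bytes_to_unicode():
    """Map every byte value 0-255 to a printable unicode character."""
    # Printable bytes keep their own character...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # ...and every remaining byte is assigned chr(256 + n) instead.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])    # 'Ġ' (U+0120): what a leading space becomes
print(byte_encoder[ord("\n")])   # 'Ċ' (U+010A): what a newline becomes
```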
|
|
|
|
|
```
What's up with the tokenizer?

# after BPE
['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']

# after looking the pieces up in vocab.json
[ 2061, 338, 510, 351, 262, 11241, 7509, 30]

# after remapping through dict.txt (fairseq only)
[ other ids ]
```
|
Question: 'up' gets a Ġ, so why doesn't 'What'? (Ġ encodes the leading space, and 'What' sits at the very start of the string with no space in front of it; see the check below.)
|
|
|
|
|
- https://github.com/pytorch/fairseq/issues/1716 |
|
- https://github.com/huggingface/transformers/issues/1083 |
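One way to check this directly (same Hugging Face tokenizer and `gpt2` checkpoint as above):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

print(tok.tokenize("What's up with the tokenizer?"))
# ['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']

# Prepend a space and the first token now carries a Ġ as well:
print(tok.tokenize(" What's up with the tokenizer?"))
# ['ĠWhat', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
```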
|
|
|
|
|
|
|
|
|
|
|
|
|
|