update tokenization_note.md

main
yangapku 1 year ago
parent 42bd2fa694
commit dcfc400881

@@ -204,7 +204,7 @@ from transformers import AutoTokenizer
>>> tokenizer("我是一只猫")
{'input_ids': [151854], 'token_type_ids': [0], 'attention_mask': [1]}
```
Note: You need the latest tokenizer code, i.e., after 2023-10-08, to use the `extra_vocab_file` argument.
Otherwise, you need to manually append the content of `qwen_extra.tiktoken` to `qwen.tiktoken` (whose path depends on your configuration).
Certainly, you will need to finetune the model for the new tokens to work.
@@ -213,20 +213,20 @@ Certainly, you will need to finetune the model for the new tokens to work.
### Caveats
The tokenizer of Qwen operates directly on UTF-8 byte sequences, unlike others, e.g., SentencePiece, which operates on Unicode codepoints/characters and falls back to UTF-8 byte sequences for unknown ones (IIRC).
The catch is that if the frequencies are computed on limited data, Unicode codepoint boundaries may not be correctly recognized.
In theory, it could be a problem for fine-tuned models using the expanded vocabulary with limited data.
For example, it could happen that `b'\x80\xe5'` might be merged first for the UTF-8 byte sequence `b'\xe4\xb8\x80\xe5\x8f\xaa'` of the string `一只`, across the Unicode codepoints of `一` (`b'\xe4\xb8\x80'`) and `只` (`b'\xe5\x8f\xaa'`).
Normally, this works just fine for known tokens, but for actually unknown words, unusual merges may happen, which the pre-trained model may not understand well.
Our advice is that, to be safe, you should gather the Unicode codepoints from all the words you need to add, and also add them to the file with frequencies higher than the sum of the frequencies of the corresponding words.
But since Qwen already covers most Chinese characters, it could be okay to just add the Chinese words alone.
For curious minds, you will also notice that in the given example, `一只` is a token and `只猫` is also learned as a new token.
The reason is that `是一` is also a token in Qwen and has higher merging priority than `一只`, such that the merging path for `是|一|只|猫` is `是一|只|猫 -> 是一|只猫 -> 是一只猫` (omitting the UTF-8 byte merges).
This is the characteristic of plain BPE: it is based solely on distribution, meaning it does not know which bytes can form a valid Unicode codepoint, character, or meaningful word.
The byproduct is that text may be sub-tokenized differently in different contexts, even for words containing only ASCII characters.
```python

@@ -128,3 +128,121 @@ To disable this check for all special tokens, pass `disallowed_special=()`.
{'input_ids': [1350, 445, 151643, 899], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```
## Vocabulary Expansion

> Special note: Please read this section carefully, make sure you understand every step, and accept the potential consequences.
> Since the vocabulary extension is provided by you, differences in how it is produced may lead to specific incompatibilities, so proceed with caution.

The tokenizer of the Qwen series models extracts tokens from text based on the BPE scheme.
Starting from UTF-8 encoded bytes, each byte can be a token; pairs of tokens are merged into new tokens until no further merges are possible.
Since the vocabulary also records how tokens are merged, directly adding a word to the vocabulary may not work for Qwen's tokenizer, i.e., the added word may not be reachable by merging existing tokens.
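
To get a feel for why this matters, here is a minimal sketch of byte-level BPE encoding (illustrative only, with a hypothetical merge table; not Qwen's actual implementation): a token is only reachable through a chain of recorded merges, so a word pasted into the vocabulary without its merges can never be produced.

```python
# Minimal sketch of byte-level BPE encoding (the merge table below is hypothetical).
def bpe_encode(text: str, merges: dict[tuple[bytes, bytes], int]) -> list[bytes]:
    # Start from individual UTF-8 bytes: each byte is a token on its own.
    parts = [bytes([b]) for b in text.encode("utf-8")]
    while True:
        # Find the adjacent pair with the highest merge priority (lowest rank).
        best = min(
            ((merges[pair], i) for i, pair in enumerate(zip(parts, parts[1:])) if pair in merges),
            default=None,
        )
        if best is None:                                # no recorded merge applies any more
            return parts
        _, i = best
        parts[i:i + 2] = [parts[i] + parts[i + 1]]      # apply that merge

# Hypothetical merge table: smaller rank = higher merge priority.
toy_merges = {
    (b"\xe4", b"\xb8"): 0,
    (b"\xe4\xb8", b"\x80"): 1,                          # -> bytes of 一
    (b"\xe5", b"\x8f"): 2,
    (b"\xe5\x8f", b"\xaa"): 3,                          # -> bytes of 只
    (b"\xe4\xb8\x80", b"\xe5\x8f\xaa"): 4,              # 一 + 只 -> 一只
}
print(bpe_encode("一只", toy_merges))   # [b'\xe4\xb8\x80\xe5\x8f\xaa'], i.e. 一只 becomes one token
print(bpe_encode("一猫", toy_merges))   # 猫 has no merges here, so its bytes stay unmerged
```
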
Therefore, please follow the steps below to obtain the merge information:

1. Prepare a plain text file, e.g., named `qwen_extra_vocab.txt`, containing one word to add and its frequency per line, separated by a tab character `\t`.

   Here is an example file:
   ```
   我是一只猫 20
   你是一只猫 10
   他是一只猫 5
   一只 200
   一只猫 100
   夸张的 比喻手法 20
   ```
   The frequency is required; it is used to compute the merge priorities.
2. Prepare the base vocabulary file, e.g., `qwen.tiktoken`, and determine the starting index for the newly added tokens.

   The Qwen vocabulary contains 151,643 regular tokens and 208 special tokens.
   For simplicity, the starting index can be set to 151,851 (the default; the sketch after these steps shows where this number comes from).
   You could overwrite the unused special tokens instead, but then you would need to modify the tokenizer code accordingly.
3. Run the following command:

   ```
   python add_merges.py qwen.tiktoken qwen_extra.tiktoken qwen_extra_vocab.txt
   ```

   The code of `add_merges.py` can be found in the [GitHub repository](examples/add_merges.py).
   Based on the provided `qwen_extra_vocab.txt`, the script learns how the new tokens should be merged.
   The new tokens and their indices are stored in the file `qwen_extra.tiktoken`.
   You can adjust the paths as needed.
   Since this is a pure Python implementation, expect it to take a while if you add a very large number of words.
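
For reference, here is a minimal sketch of how the default starting index 151,851 is obtained and of the file format the steps above assume (using the `qwen_extra_vocab.txt` name from above; `add_merges.py` itself may perform additional checks):

```python
# Minimal sketch: derive the default starting index and sanity-check the word list.
NUM_REGULAR_TOKENS = 151_643   # regular tokens in the Qwen vocabulary
NUM_SPECIAL_TOKENS = 208       # special tokens
print(NUM_REGULAR_TOKENS + NUM_SPECIAL_TOKENS)   # 151851, the default starting index

with open("qwen_extra_vocab.txt", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        word, sep, freq = line.rstrip("\n").partition("\t")
        if sep != "\t" or not freq.isdigit():
            raise ValueError(f"line {line_no}: expected '<word>\\t<frequency>', got {line!r}")
```
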
Please note that, due to pre-tokenization, some words cannot be added as tokens.
If you add such words, you will receive warnings:
```
WARNING - 夸张的 比喻手法 would be pre-tokenized to ['夸张的', ' 比喻手法'], and thus cannot be added to vocabulary
WARNING - word 一只 is already a token b'\xe4\xb8\x80\xe5\x8f\xaa', skipping
INFO - number of existing merges: 151643
INFO - number of words for expanding: 4
DEBUG - (b'\xe4\xb8\x80\xe5\x8f\xaa', b'\xe7\x8c\xab') (一只猫) is selected as the next merge with freq 100
DEBUG - (b'\xe5\x8f\xaa', b'\xe7\x8c\xab') (只猫) is selected as the next merge with freq 35
DEBUG - (b'\xe6\x98\xaf\xe4\xb8\x80', b'\xe5\x8f\xaa\xe7\x8c\xab') (是一只猫) is selected as the next merge with freq 35
DEBUG - (b'\xe6\x88\x91', b'\xe6\x98\xaf\xe4\xb8\x80\xe5\x8f\xaa\xe7\x8c\xab') (我是一只猫) is selected as the next merge with freq 20
DEBUG - (b'\xe4\xbd\xa0', b'\xe6\x98\xaf\xe4\xb8\x80\xe5\x8f\xaa\xe7\x8c\xab') (你是一只猫) is selected as the next merge with freq 10
DEBUG - (b'\xe4\xbb\x96', b'\xe6\x98\xaf\xe4\xb8\x80\xe5\x8f\xaa\xe7\x8c\xab') (他是一只猫) is selected as the next merge with freq 5
INFO - number of newly learned merges: 6
```
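
The reason is that text is first split by a regular expression before any BPE merging, and only the resulting pieces can become tokens. The sketch below checks candidate words against a cl100k-style pattern; treat this pattern as an assumption for illustration only, the authoritative one is the pattern in the tokenizer code.

```python
# Minimal sketch: check whether a candidate word survives pre-tokenization in one piece.
# The pattern is a cl100k-style regex used here as an assumption, not Qwen's definitive one.
import regex  # pip install regex (supports \p{L}, \p{N})

PAT = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

for word in ["我是一只猫", "夸张的 比喻手法"]:
    pieces = regex.findall(PAT, word)
    print(f"{word!r}: pre-tokenized to {pieces}, can be added: {len(pieces) == 1}")
# '我是一只猫' stays in one piece; '夸张的 比喻手法' splits at the space,
# consistent with the warning in the log above.
```
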
`qwen_extra.tiktoken` will contain the following content:
```
5LiA5Y+q54yr 151851
5Y+q54yr 151852
5piv5LiA5Y+q54yr 151853
5oiR5piv5LiA5Y+q54yr 151854
5L2g5piv5LiA5Y+q54yr 151855
5LuW5piv5LiA5Y+q54yr 151856
```
You can use the expanded vocabulary as follows:
``` python
from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True, extra_vocab_file="qwen_extra.tiktoken")
>>> len(tokenizer)
151857
>>> tokenizer("我是一只猫")
{'input_ids': [151854], 'token_type_ids': [0], 'attention_mask': [1]}
```
Note: You need the tokenizer code released after 2023-10-08 to pass the `extra_vocab_file` argument. Otherwise, you can copy and paste the content of `qwen_extra.tiktoken` after the content of `qwen.tiktoken`.
You will need to fine-tune the model for the new tokens to take effect.
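
If you are on older tokenizer code, the manual route amounts to concatenating the two files (a minimal sketch; back up `qwen.tiktoken` first):

```python
# Append the learned extra vocabulary to the base vocabulary file.
# Make a backup of qwen.tiktoken before doing this.
with open("qwen_extra.tiktoken", encoding="utf-8") as extra, \
     open("qwen.tiktoken", "a", encoding="utf-8") as base:
    base.write(extra.read())
```
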
### Caveats

Qwen's tokenizer operates directly on UTF-8 encoded byte sequences, which is quite different from other tokenizers such as SentencePiece, which starts from Unicode codepoints (roughly, characters) and only falls back to UTF-8 bytes for unknown ones.
A potential issue with starting from bytes is that, if the frequency information is inaccurate (for example, computed on very little data), the boundaries of Unicode codepoints encoded as UTF-8 bytes may not be respected.
In theory, unexpected problems could also arise for a model fine-tuned on insufficient data with the expanded vocabulary.
For example (not an actual case), for the UTF-8 byte sequence `b'\xe4\xb8\x80\xe5\x8f\xaa'` of `一只`, the two middle bytes `b'\x80\xe5'` might be merged into a token first, crossing the codepoint boundary between `一` (`b'\xe4\xb8\x80'`) and `只` (`b'\xe5\x8f\xaa'`).
This has little effect on known tokens (they will eventually be merged into `一只` anyway), but for unknown ones, unusual merges/tokens may be produced.
Such token sequences may be unfamiliar to the pre-trained model.
Our advice is that, to be safe, you should first gather all the Unicode codepoints in the words to be added and specify their frequencies separately, each higher than the sum of the frequencies of the words it appears in.
However, since Qwen's tokenizer already covers most Chinese characters, for Chinese words it is usually fine not to add the character frequencies.
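
Concretely, the advice above can be sketched as follows (using the `qwen_extra_vocab.txt` from the earlier example; the `+ 1` is just one way to make each character's frequency strictly higher than the sum for the words containing it):

```python
# Minimal sketch: give every codepoint (character) occurring in the new words a
# frequency higher than the summed frequencies of the words it appears in.
from collections import Counter

char_freq = Counter()
word_lines = []
with open("qwen_extra_vocab.txt", encoding="utf-8") as f:
    for line in f:
        word, _, freq = line.rstrip("\n").partition("\t")
        word_lines.append((word, int(freq)))
        for char in set(word):
            if not char.isspace():
                char_freq[char] += int(freq)

# Emit the characters first, then the original words.
for char, total in char_freq.items():
    print(f"{char}\t{total + 1}")
for word, freq in word_lines:
    print(f"{word}\t{freq}")
```
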
You may have noticed that, in the given example, `一只` is already a known token, yet `只猫` is still learned as a new token, so the two "overlap".
The reason is that `是一` is also a known token in Qwen with a higher frequency/priority than `一只`, so for the span `是|一|只|猫` the merge order is `是一|只|猫 -> 是一|只猫 -> 是一只猫` (omitting the UTF-8 byte-level merges).
This is a characteristic of plain BPE: it is based purely on distribution and has no knowledge of which bytes can form a valid Unicode codepoint, a valid character, or a word.
A byproduct is that the same text may be tokenized differently in different contexts, even for text containing only ASCII characters.
```python
>>> tokenizer.tokenize("Panda")
[b'P', b'anda']
>>> tokenizer.tokenize(" Panda")
[b' Panda']
>>> tokenizer.tokenize("Pandas")
[b'P', b'andas']
>>> tokenizer.tokenize(" Pandas")
[b' Pand', b'as']
```
This simply means that, in the data used to learn the BPE merges, those combinations were more frequent.
If you have a massive training corpus, this will not be a problem.