Note: You need the latest tokenizer code, i.e., after 2023-10-08, to use the `extra_vocab_file` argument.
Otherwise, you need to manually append the content of `qwen_extra.tiktoken` to `qwen.tiktoken` (whose path varies with your configuration).
Of course, you will need to finetune the model for the new tokens to work.
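For reference, loading the tokenizer with the extra vocabulary might look like the following sketch (the model name and file path are placeholders for your own setup):

```python
from transformers import AutoTokenizer

# Placeholders: point these at your own checkpoint and extra vocabulary file.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B",
    trust_remote_code=True,
    extra_vocab_file="qwen_extra.tiktoken",
)
```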
### Caveats
The tokenizer of Qwen operates directly on UTF-8 byte sequences, unlike others such as SentencePiece, which operates on Unicode codepoints/characters and falls back to UTF-8 byte sequences only for unknown ones (IIRC).
The catch is that if the frequencies are computed on limited data, Unicode codepoint boundaries may not be correctly recognized.
In theory, it could be a problem for fine-tuned models using the expanded vocabulary with limited data.
For example, it could happen that `b'\x80\xe5'` is merged first in the UTF-8 byte sequence `b'\xe4\xb8\x80\xe5\x8f\xaa'` of the string `一只`, crossing the boundary between the Unicode codepoints `一` (`b'\xe4\xb8\x80'`) and `只` (`b'\xe5\x8f\xaa'`).
Normally, this works just fine for known tokens, but for actually unknown words, unusual merges may happen, which may not be well understood by the pre-trained model.
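To make the byte sequences concrete, here is a small illustration using nothing but standard UTF-8 encoding (no tokenizer involved):

```python
# '一只' is six UTF-8 bytes, three per character.
print("一只".encode("utf-8"))  # b'\xe4\xb8\x80\xe5\x8f\xaa'
print("一".encode("utf-8"))    # b'\xe4\xb8\x80'
print("只".encode("utf-8"))    # b'\xe5\x8f\xaa'

# A merge learned from limited data, e.g. (b'\x80', b'\xe5'), would pair the last byte
# of '一' with the first byte of '只', crossing the codepoint boundary.
```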
Our advice is that, to be safe, you should gather the Unicode codepoints from all the words you need to add, and also add them to the file with frequencies higher than the sum of the frequencies of the corresponding words.
But since Qwen's vocabulary already contains most Chinese words, it is probably okay to add just the Chinese words alone.
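A minimal sketch of that bookkeeping, assuming you start from a plain mapping of new words to frequencies (the names and the tab-separated output here are illustrative; adapt them to whatever format your vocabulary-extension workflow expects):

```python
from collections import defaultdict

# Hypothetical new words with their frequencies.
word_freqs = {"一只猫": 128, "是一只猫": 64}

# Total, per codepoint, the frequencies of the words that contain it.
codepoint_freqs = defaultdict(int)
for word, freq in word_freqs.items():
    for ch in word:
        codepoint_freqs[ch] += freq

# Add each codepoint with a frequency strictly higher than that sum, so its own
# byte merges are learned before any merge that could cross codepoint boundaries.
entries = dict(word_freqs)
for ch, freq in codepoint_freqs.items():
    entries[ch] = max(entries.get(ch, 0), freq + 1)

for token, freq in entries.items():
    print(f"{token}\t{freq}")
```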
For curious minds: you may also notice that in the given example, `一只` is a token, yet `只猫` is also learned as a new token.
The reason is that `是一` is also a token in Qwen and has higher merging priority than `一只`, such that the merging path for `是|一|只|猫` is `是一|只|猫 -> 是一|只猫 -> 是一只猫` (omitting the UTF-8 byte merges).
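You can observe this yourself by tokenizing both strings with the expanded tokenizer; a sketch (the checkpoint and extra vocabulary path are placeholders, and the exact splits depend on the merges you actually added):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B",                           # placeholder checkpoint
    trust_remote_code=True,
    extra_vocab_file="qwen_extra.tiktoken",   # placeholder path
)

# Compare how the same characters are grouped in the two contexts: '一只' alone
# versus inside '是一只猫', where the higher-priority merge '是一' applies first.
for text in ("一只", "是一只猫"):
    print(text, "->", tokenizer.tokenize(text))
```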
This is a characteristic of plain BPE: it is based solely on distribution, meaning it has no knowledge of which bytes can form a valid Unicode codepoint, character, or meaningful word.
The byproduct is that text may be sub-tokenized differently in different contexts, even for words containing only ASCII characters.