Merge branch 'QwenLM:main' into add_ja-readme

commit 81185f0b3f by Ikko Eltociear Ashimine (committed via GitHub)

@@ -1,41 +1,49 @@
name: 🐞 Bug
description: File a bug/issue
description: 提交错误报告 | File a bug/issue
title: "[BUG] <title>"
labels: ["Bug"]
body:
- type: checkboxes
attributes:
label: Is there an existing issue for this?
description: Please search to see if an issue already exists for the bug you encountered.
label: 是否已有关于该错误的issue | Is there an existing issue for this?
description: |
请先搜索您遇到的错误是否在已有的issues中提到过。
Please search to see if an issue already exists for the bug you encountered.
options:
- label: I have searched the existing issues
- label: 我已经搜索过已有的issues | I have searched the existing issues
required: true
- type: textarea
attributes:
label: Current Behavior
description: A concise description of what you're experiencing.
label: 当前行为 | Current Behavior
description: |
准确描述遇到的行为。
A concise description of what you're experiencing.
validations:
required: false
- type: textarea
attributes:
label: Expected Behavior
description: A concise description of what you expected to happen.
label: 期望行为 | Expected Behavior
description: |
准确描述预期的行为。
A concise description of what you expected to happen.
validations:
required: false
- type: textarea
attributes:
label: Steps To Reproduce
description: Steps to reproduce the behavior.
label: 复现方法 | Steps To Reproduce
description: |
复现当前行为的详细步骤。
Steps to reproduce the behavior.
placeholder: |
1. In this environment...
1. With this config...
1. Run '...'
1. See error...
2. With this config...
3. Run '...'
4. See error...
validations:
required: false
- type: textarea
attributes:
label: Environment
label: 运行环境 | Environment
description: |
examples:
- **OS**: Ubuntu 20.04
@@ -54,8 +62,12 @@ body:
required: false
- type: textarea
attributes:
label: Anything else?
label: 备注 | Anything else?
description: |
您可以在这里补充其他关于该问题背景信息的描述、链接或引用等。
您可以通过点击高亮此区域然后拖动文件的方式上传图片或日志文件。
Links? References? Anything that will give us more context about the issue you are encountering!
Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.

@@ -1,5 +1,5 @@
name: "💡 Feature Request"
description: Create a new ticket for a new feature request
description: 创建新功能请求 | Create a new ticket for a new feature request
title: "💡 [REQUEST] - <title>"
labels: [
"question"
@@ -8,39 +8,48 @@ body:
- type: input
id: start_date
attributes:
label: "Start Date"
description: Start of development
label: "起始日期 | Start Date"
description: |
起始开发日期
Start of development
placeholder: "month/day/year"
validations:
required: false
- type: textarea
id: implementation_pr
attributes:
label: "Implementation PR"
description: Pull request used
label: "实现PR | Implementation PR"
description: |
实现该功能的Pull request
Pull request used
placeholder: "#Pull Request ID"
validations:
required: false
- type: textarea
id: reference_issues
attributes:
label: "Reference Issues"
description: Common issues
label: "相关Issues | Reference Issues"
description: |
与该功能相关的issues
Common issues
placeholder: "#Issues IDs"
validations:
required: false
- type: textarea
id: summary
attributes:
label: "Summary"
description: Provide a brief explanation of the feature
placeholder: Describe in a few lines your feature request
label: "摘要 | Summary"
description: |
简要描述新功能的特点
Provide a brief explanation of the feature
placeholder: |
Describe in a few lines your feature request
validations:
required: true
- type: textarea
id: basic_example
attributes:
label: "Basic Example"
label: "基本示例 | Basic Example"
description: Indicate here some basic examples of your feature.
placeholder: A few specific words about your feature request.
validations:
@@ -48,16 +57,22 @@ body:
- type: textarea
id: drawbacks
attributes:
label: "Drawbacks"
description: What are the drawbacks/impacts of your feature request ?
placeholder: Identify the drawbacks and impacts while being neutral on your feature request
label: "缺陷 | Drawbacks"
description: |
该新功能有哪些缺陷/可能造成哪些影响?
What are the drawbacks/impacts of your feature request?
placeholder: |
Identify the drawbacks and impacts while being neutral on your feature request
validations:
required: true
- type: textarea
id: unresolved_question
attributes:
label: "Unresolved questions"
description: What questions still remain unresolved ?
placeholder: Identify any unresolved issues.
label: "未解决问题 | Unresolved questions"
description: |
有哪些尚未解决的问题?
What questions still remain unresolved?
placeholder: |
Identify any unresolved issues.
validations:
required: false

@@ -50,4 +50,4 @@ If you are commercially using the Materials, and your product or service has mor
9. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.

@@ -49,4 +49,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
SOFTWARE.

@@ -1,4 +1,5 @@
<br>
<p align="center">
<img src="assets/logo.jpg" width="400"/>
<p>
@@ -50,7 +51,7 @@ In general, Qwen-7B outperforms the baseline models of a similar model size, and
<p>
<br>
For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical memo by clicking [here](techmemo-draft.md).
For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical memo by clicking [here](tech_memo.md).
## Requirements
@@ -73,6 +74,7 @@ If your device supports fp16 or bf16, we recommend installing [flash-attention](
```bash
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Below are optional. Installing them might be slow.
pip install csrc/layer_norm
pip install csrc/rotary
```
@@ -87,8 +89,7 @@ To use Qwen-7B-Chat for the inference, all you need to do is to input a few line
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Note: For tokenizer usage, please refer to examples/tokenizer_showcase.ipynb.
# The default behavior now has injection attack prevention off.
# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# use bf16
@@ -109,7 +110,7 @@ print(response)
# 你好!很高兴为你提供帮助。
# 第二轮对话 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
@@ -147,7 +148,7 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto",
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to('cuda:0')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
@@ -184,6 +185,10 @@ response, history = results['response'], results['history']
print(f'Response: {response}')
```
## Tokenizer
Our tokenizer, based on tiktoken, differs from other tokenizers such as the sentencepiece tokenizer. Take special care with special tokens, especially when fine-tuning. For more detailed information on the tokenizer and its use in fine-tuning, please refer to the [documentation](tokenization_note.md).
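As a quick sketch of what that note covers, you can designate a pad token and inspect the special-token mapping before fine-tuning (a minimal example, assuming only what the note itself documents):
```python
from transformers import AutoTokenizer

# Reuse <|endoftext|> as the pad token, as suggested in the tokenization note.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B", trust_remote_code=True, pad_token="<|endoftext|>"
)

# Mapping from the surface forms of the special tokens (str) to their IDs.
print(tokenizer.special_tokens)
```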
## Quantization
We provide examples to show how to load models in `NF4` and `Int8`. For starters, make sure you have installed `bitsandbytes`. Note that the requirements for `bitsandbytes` are:
@@ -232,14 +237,14 @@ We provide a CLI demo example in `cli_demo.py`, which supports streaming output
## Tool Usage
Qwen-7B-Chat is specifically optimized for tool usage, including API, database, models, etc., so that users can build their own Qwen-7B-based LangChain, Agent, and Code Interpreter. In the soon-to-be-released internal evaluation benchmark for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance.
Qwen-7B-Chat is specifically optimized for tool usage, including API, database, models, etc., so that users can build their own Qwen-7B-based LangChain, Agent, and Code Interpreter. In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance.
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
|-------------|------------------------|-----------------------|-----------------------|
| GPT-4 | 95% | **0.90** | 15% |
| GPT-3.5 | 85% | 0.88 | 75% |
| **Qwen-7B** | **99%** | 0.89 | **8.5%** |
| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md). The use of tools can enable the model to better perform tasks.
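As an illustration of the format those examples use, here is a minimal, self-contained sketch (the sample text is hand-written, not model output) that slices the latest Action / Action Input pair out of a ReAct response, mirroring the parsing in `eval/evaluate_plugin.py`:
```python
# A hand-written ReAct-style response, used only for illustration.
response = (
    "Thought: I should call the weather tool.\n"
    "Action: weather_api\n"
    "Action Input: {\"location\": \"Beijing\"}\n"
    "Observation:"
)

# Slice out each field by its marker, as the evaluation script does.
action = response[response.find("Action:") + len("Action:"):
                  response.find("Action Input:")].strip()
action_input = response[response.find("Action Input:") + len("Action Input:"):
                        response.find("Observation:")].strip()
print(action)        # weather_api
print(action_input)  # {"location": "Beijing"}
```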
@@ -288,4 +293,3 @@ Researchers and developers are free to use the codes and model weights of both Q
## Contact Us
If you are interested to leave a message to either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.

@@ -1,4 +1,5 @@
<br>
<p align="center">
<img src="assets/logo.jpg" width="400"/>
<p>
@@ -50,7 +51,7 @@ Qwen-7B在多个全面评估自然语言理解与生成、数学运算解题、
<p>
<br>
更多的实验结果和细节请查看我们的技术备忘录。点击[这里](techmemo-draft.md)。
更多的实验结果和细节请查看我们的技术备忘录。点击[这里](tech_memo.md)。
## 要求
@@ -73,6 +74,7 @@ pip install -r requirements.txt
```bash
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# 下方安装可选,安装可能比较缓慢。
pip install csrc/layer_norm
pip install csrc/rotary
```
@@ -87,7 +89,7 @@ pip install csrc/rotary
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# 请注意:分词器默认行为已更改为默认关闭特殊token攻击防护。相关使用指引请见examples/tokenizer_showcase.ipynb
# 请注意:分词器默认行为已更改为默认关闭特殊token攻击防护。
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
@@ -108,7 +110,7 @@ print(response)
# 你好!很高兴为你提供帮助。
# 第二轮对话 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
@@ -147,7 +149,7 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto",
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to('cuda:0')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
@@ -184,6 +186,13 @@ response, history = results['response'], results['history']
print(f'Response: {response}')
```
## Tokenization
> 注:作为术语的“tokenization”,在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
基于tiktoken的tokenizer,有别于其他分词器,比如sentencepiece tokenizer。尤其在微调阶段,需要特别注意特殊token的使用。关于tokenizer的更多信息以及微调时涉及的相关使用,请参阅[文档](tokenization_note_zh.md)。
## 量化
如希望使用更低精度的量化模型,如4比特和8比特的模型,我们提供了简单的示例来说明如何快速使用量化模型。在开始前,确保你已经安装了`bitsandbytes`。请注意,`bitsandbytes`的安装要求是:
@@ -232,13 +241,13 @@ model = AutoModelForCausalLM.from_pretrained(
## 工具调用
Qwen-7B-Chat针对包括API、数据库、模型等工具在内的调用进行了优化。用户可以开发基于Qwen-7B的LangChain、Agent甚至Code Interpreter。我们在内部的即将开源的评测数据集上测试模型的工具调用能力并发现Qwen-7B-Chat能够取得稳定的表现。
Qwen-7B-Chat针对包括API、数据库、模型等工具在内的调用进行了优化。用户可以开发基于Qwen-7B的LangChain、Agent甚至Code Interpreter。我们在开源的[评测数据集](eval/EVALUATION.md)上测试模型的工具调用能力,并发现Qwen-7B-Chat能够取得稳定的表现。
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
| ------------- | ------------------------- | ------------------------ | ------------------------ |
| GPT-4 | 95% | **0.90** | 15% |
| GPT-3.5 | 85% | 0.88 | 75% |
| **Qwen-7B** | **99%** | 0.89 | **8.5%** |
| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
我们提供了文档说明如何根据ReAct Prompting的原则写作你的prompt。
@@ -289,4 +298,3 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct
## 联系我们
如果你想给我们的研发团队和产品团队留言,请通过邮件(qianwen_opensource@alibabacloud.com)联系我们。

@@ -177,7 +177,7 @@ def main():
# Run chat.
set_seed(seed)
try:
for response in model.chat(tokenizer, query, history=history, stream=True):
for response in model.chat_stream(tokenizer, query, history=history):
_clear_screen()
print(f"\nUser: {query}")
print(f"\nQwen-7B: {response}")

@@ -1,83 +0,0 @@
# Copyright (c) Alibaba Cloud.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import torch
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.trainer_utils import set_seed
def _load_model_tokenizer(args):
tokenizer = AutoTokenizer.from_pretrained(
args.checkpoint_path, trust_remote_code=True,
)
print("load tokenizer")
if args.cpu_only:
device_map = "cpu"
max_memory = None
else:
device_map = "auto"
max_memory_str = f"{int(torch.cuda.mem_get_info()[0] / 1024 ** 3) - 2}GB"
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory_str for i in range(n_gpus)}
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint_path,
device_map=device_map,
max_memory=max_memory,
trust_remote_code=True,
).eval()
return model, tokenizer
def demo_qwen_pretrain(args):
model, tokenizer = _load_model_tokenizer(args)
inputs = tokenizer(
"蒙古国的首都是乌兰巴托Ulaanbaatar\n冰岛的首都是雷克雅未克Reykjavik\n埃塞俄比亚的首都是",
return_tensors="pt",
)
inputs = inputs.to(model.device)
pred = model.generate(inputs=inputs["input_ids"])
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
def demo_qwen_chat(args):
model, tokenizer = _load_model_tokenizer(args)
queries = [
"请问把大象关冰箱总共要几步?",
"1+3=?",
"请将下面这句话翻译为英文:在哪里跌倒就在哪里趴着",
]
history = None
for turn_idx, query in enumerate(queries, start=1):
response, history = model.chat(
tokenizer,
query,
history=history,
)
print(f"===== Turn {turn_idx} ====")
print("Query:", query, end="\n")
print("Response:", response, end="\n")
def main():
parser = argparse.ArgumentParser(description="Test HF checkpoint.")
parser.add_argument("-c", "--checkpoint-path", type=str, help="Checkpoint path")
parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed")
parser.add_argument("--cpu-only", action="store_true", help="Run demo with CPU only")
args = parser.parse_args()
set_seed(args.seed)
if "chat" in args.checkpoint_path.lower():
demo_qwen_chat(args)
else:
demo_qwen_pretrain(args)
if __name__ == "__main__":
main()

@@ -49,9 +49,9 @@ evaluate_functional_correctness HumanEval_res.jsonl
python evaluate_chat_humaneval.py -f HumanEval.jsonl -o HumanEval_res_chat.jsonl
evaluate_functional_correctness HumanEval_res_chat.jsonl
```
When installing the `human-eval` package, please note the following disclaimer:
This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.
- GSM8K
@@ -64,3 +64,20 @@ python evaluate_gsm8k.py
python evaluate_chat_gsm8k.py # zeroshot
python evaluate_chat_gsm8k.py --use-fewshot # fewshot
```
- PLUGIN
This script is used to reproduce the results of the ReAct and Hugging Face Agent in the Tool Usage section of the README document.
```Shell
# Qwen-7B-Chat
mkdir data;
cd data;
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_positive.jsonl;
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_negative.jsonl;
cd ..;
pip install json5;
pip install jsonlines;
pip install rouge_score;
python evaluate_plugin.py --eval-react-positive --eval-react-negative --eval-hfagent
```

@@ -0,0 +1,308 @@
import argparse
import json
import os
import pprint
import json5
import jsonlines
from rouge_score import rouge_scorer
from tqdm import tqdm
from transformers import Agent, AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
from transformers.tools.evaluate_agent import evaluate_agent
from transformers.trainer_utils import set_seed
data_root_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
'data')
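# Tool Selection check: the predicted action must match the gold action,
# ignoring case and surrounding whitespace.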
def is_callable(response, golden):
return response['action'].strip().lower() == golden['action'].strip(
).lower()
def process_res(response):
# parse response
    response += '\n'  # pad with a newline: if a marker is missing, find() returns -1 and the slice only drops this padding
thought = response[:response.find('Action:')].strip()
action = response[response.find('Action:') +
len('Action:'):response.find('Action Input:')].strip()
action_input = response[response.find('Action Input:') +
len('Action Input:'):response.find('Observation:'
)].strip()
    # TODO: This parsing result is incorrect if the response contains multiple Actions. To be fixed in the future.
observation = response[response.find('Observation:') +
len('Observation:'):response.rfind('Thought:'
)].strip()
thought_last = response[response.rfind('Thought:') +
len('Thought:'):response.find('Final Answer:'
)].strip()
final_answer = response[response.find('Final Answer:') +
len('Final Answer:'):].strip()
try:
action_input = json.dumps(json5.loads(action_input),
ensure_ascii=False,
sort_keys=True)
    except Exception:
        # Leave action_input as the raw string if it is not valid JSON5.
        pass
res_dict = {
'thought': thought,
'action': action,
'action_input': action_input,
'observation': observation,
'thought_last': thought_last,
'final_answer': final_answer
}
return res_dict
class _DummyTokenizer:
def tokenize(self, text: str):
return text.split()
def _get_tokenized_string(tokenizer, text_list):
token_ids_list, tokenized_string_list = [], []
for text in text_list:
assert tokenizer is not None
token_ids = tokenizer.encode(text)
tokens_bytes = tokenizer.convert_ids_to_tokens(token_ids)
tokens = [
token.decode('utf-8', errors='replace') for token in tokens_bytes
]
tokenized_string = ' '.join(tokens)
token_ids_list.append(token_ids)
tokenized_string_list.append(tokenized_string)
return token_ids_list, tokenized_string_list
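# Score Tool Selection for one sample: parse the generated and the gold ReAct
# responses and compare their Action fields.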
def eval_action(job):
response = job['gen'][0]
golden = job['response']
if 'Action:' in response:
response, golden = process_res(response), process_res(golden)
if is_callable(response, golden):
return True
return False
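# Score Tool Input for one sample: compute Rouge-L between the generated and
# the gold Action Input, tokenized with the model tokenizer.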
def eval_action_input(job, tokenizer):
response = job['gen'][0]
golden = job['response']
response, golden = process_res(response), process_res(golden)
query = job['prompt']
job = {}
job['prompt'] = query
job['gen'] = response['action_input']
job['response'] = golden['action_input']
job['_gen_tok'], job['_gen_tok_str'] = _get_tokenized_string(
tokenizer, [response['action_input']])
job['_reference_tok'], job['_reference_tok_str'] = _get_tokenized_string(
tokenizer, [golden['action_input']])
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'],
tokenizer=_DummyTokenizer())
score = scorer.score(job['_reference_tok_str'][0], job['_gen_tok_str'][0])
rouge = score['rougeL'].fmeasure
return rouge
class QWenAgent(Agent):
"""
Agent that uses QWen model and tokenizer to generate code.
Example:
```py
agent = QWenAgent()
agent.run("Draw me a picture of rivers and lakes.")
```
"""
def __init__(self,
chat_prompt_template=None,
run_prompt_template=None,
additional_tools=None,
tokenizer=None,
model=None):
if tokenizer and model:
self.tokenizer = tokenizer
self.model = model
else:
checkpoint = 'Qwen/Qwen-7B-Chat'
self.tokenizer = AutoTokenizer.from_pretrained(
checkpoint, trust_remote_code=True)
self.model = AutoModelForCausalLM.from_pretrained(
checkpoint, device_map='auto',
trust_remote_code=True).cuda().eval()
self.model.generation_config = GenerationConfig.from_pretrained(
                checkpoint, trust_remote_code=True)  # different generation lengths, top_p, and other hyperparameters can be specified here
self.model.generation_config.do_sample = False # greedy
super().__init__(
chat_prompt_template=chat_prompt_template,
run_prompt_template=run_prompt_template,
additional_tools=additional_tools,
)
def generate_one(self, prompt, stop):
# "Human:" 和 "Assistant:" 曾为通义千问的特殊保留字,需要替换为 "_HUMAN_:" 和 "_ASSISTANT_:"。这一问题将在未来版本修复。
prompt = prompt.replace('Human:',
'_HUMAN_:').replace('Assistant:',
'_ASSISTANT_:')
stop = [
item.replace('Human:', '_HUMAN_:').replace('Assistant:',
'_ASSISTANT_:')
for item in stop
]
result, _ = self.model.chat(self.tokenizer, prompt, history=None)
for stop_seq in stop:
if result.endswith(stop_seq):
result = result[:-len(stop_seq)]
result = result.replace('_HUMAN_:',
'Human:').replace('_ASSISTANT_:', 'Assistant:')
return result
def load_models_tokenizer(args):
tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path,
trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(args.checkpoint_path,
device_map='auto',
trust_remote_code=True,
bf16=True,
use_flash_attn=True).eval()
model.generation_config = GenerationConfig.from_pretrained(
args.checkpoint_path, trust_remote_code=True)
model.generation_config.do_sample = False # use greedy decoding
return model, tokenizer
def load_jobs(filename):
jobs = []
with jsonlines.open(os.path.join(data_root_path, filename),
mode='r') as reader:
for job in reader:
jobs.append(job)
return jobs
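# Run (or reload from cache) model inference for every ReAct prompt in the
# given jsonl file, storing the generation under each job's 'gen' key.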
def react_inference(filename, model, tokenizer):
filename_cache = filename + '.cache'
if os.path.exists(os.path.join(data_root_path, filename_cache)):
jobs = load_jobs(filename=filename_cache)
print('Loaded from', filename_cache)
else:
with open(os.path.join(data_root_path, filename_cache), 'w') as f:
jobs = load_jobs(filename=filename)
print('Inference:', filename)
for job in tqdm(jobs):
response, history = model.chat(tokenizer,
job['prompt'],
history=None)
job['gen'] = [response]
f.writelines(json.dumps(job, ensure_ascii=False) + '\n')
print(filename_cache, 'is saved.')
return jobs
def main(args):
print('loading model weights')
if args.checkpoint_path is not None:
model, tokenizer = load_models_tokenizer(args)
else:
model, tokenizer = None, None
print('model loaded')
result = {}
# eval react positive
if args.eval_react_positive:
print('eval react positive ...')
acc_count = 0
rouge_mean = 0
jobs = react_inference(filename=args.eval_react_positive_filename,
model=model,
tokenizer=tokenizer)
for job in jobs:
if eval_action(job):
acc_count += 1
rouge = eval_action_input(job, tokenizer)
rouge_mean += (rouge / len(jobs))
scores = {
'action_right_rate': acc_count / len(jobs),
'action_input_rouge': rouge_mean,
}
result.update({'react_positive': scores})
# eval react negative
if args.eval_react_negative:
print('eval react negative ...')
bad_count = 0
jobs = react_inference(filename=args.eval_react_negative_filename,
model=model,
tokenizer=tokenizer)
for job in jobs:
if '\nAction:' in job['gen'][0]:
bad_count += 1
scores = {'bad_rate': bad_count / len(jobs)}
result.update({'react_negative': scores})
# eval hfagent
if args.eval_hfagent:
print('eval hfagent ...')
agent = QWenAgent(model=model, tokenizer=tokenizer)
scores = evaluate_agent(agent, verbose=False, return_errors=False)
result.update({'hfagent': scores})
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(result)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Test HF checkpoint.')
parser.add_argument('-c',
'--checkpoint-path',
type=str,
help='Checkpoint path',
default='Qwen/Qwen-7B-Chat')
parser.add_argument('-s',
'--seed',
type=int,
default=1234,
help='Random seed')
"""Provide extra arguments required for tasks."""
group = parser.add_argument_group(title='Evaluation options')
group.add_argument('--eval-react-positive',
action='store_true',
default=False,
help='Eval react positive.')
group.add_argument('--eval-react-positive-filename',
type=str,
default='exam_plugin_v1_react_positive.jsonl',
help='Eval react positive filename.')
group.add_argument('--eval-react-negative',
action='store_true',
default=False,
help='Eval react negative.')
group.add_argument('--eval-react-negative-filename',
type=str,
default='exam_plugin_v1_react_negative.jsonl',
help='Eval react negative filename.')
group.add_argument('--eval-hfagent',
action='store_true',
default=False,
help='Eval hfagent.')
args = parser.parse_args()
set_seed(args.seed)
main(args)

@@ -242,4 +242,4 @@ def parse_latest_plugin_call(text: str) -> Tuple[str, str]:
return '', ''
```
此外,如果输出的 Action Input 内容是一段表示 JSON 对象的文本,我们建议使用 `json5` 包的 `json5.loads(...)` 方法加载。
此外,如果输出的 Action Input 内容是一段表示 JSON 对象的文本,我们建议使用 `json5` 包的 `json5.loads(...)` 方法加载。
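下面是这一建议的一个最小示例(示例字符串仅作说明):
```python
import json5

# json5 可以容忍单引号、未加引号的键和结尾逗号,严格的 json.loads 会直接报错。
action_input = "{'location': '北京', unit: 'celsius',}"
print(json5.loads(action_input))
```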

@@ -4,3 +4,4 @@ tiktoken
einops
transformers_stream_generator==0.0.4
bitsandbytes
scipy

@@ -311,13 +311,13 @@ LLMs have shown capability in coordinating multiple external systems to achieve
Qwen supports calling plugins/tools/APIs through [ReAct Prompting](https://arxiv.org/abs/2210.03629).
ReAct is also one of the main approaches used by the [LangChain](https://python.langchain.com/) framework.
For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md).
In the soon-to-be-released evaluation benchmark for assessing tool usage capabilities, Qwen's performance is as follows:
In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, Qwen's performance is as follows:
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
| :---------- | --------------------------: | -------------------------: | -------------------------: |
| GPT-4 | 95% | **0.90** | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| **Qwen-7B** | **99%** | 0.89 | **8.5%** |
| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
> The plugins that appear in the evaluation set do not appear in the training set of Qwen.
> This benchmark evaluates the accuracy of the model in selecting the correct plugin from multiple candidate plugins, the rationality of the parameters passed into the plugin, and the false positive rate.

@@ -0,0 +1,127 @@
# Tokenization
Qwen-7B applies BPE tokenization to UTF-8 bytes via the `tiktoken` package.
There are two types of tokens in Qwen-7B, i.e., the regular tokens (of type `bytes`) in BPE and the special/control tokens (of type `str`).
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
```
## Regular tokens
The regular tokens are BPE tokens learned from byte sequences of texts encoded using the UTF-8 encoding.
While this allows tokenization of all texts and leaves no unknown tokens, it may fall back to single bytes when tokenizing uncommon text.
You may then encounter UTF-8 decoding errors; since the `errors` argument defaults to `replace`, incomplete generations may contain the replacement character (�).
You can change this behavior by passing `errors="ignore"` to the `decode` function for a single call, or to the `from_pretrained` function to make it persistent.
For more options of `errors`, please refer to [the Python documentation](https://docs.python.org/3/library/stdtypes.html#bytes.decode).
```python
>>> tokenizer.decode([51461])
' �'
>>> tokenizer.convert_ids_to_tokens([51461])
[b' \xe6\xa0']
>>> b' \xe6\xa0'.decode("utf-8", errors='replace')
' �'
>>> tokenizer.decode([51461, 117])
' 根'
>>> tokenizer.convert_ids_to_tokens([51461, 117])
[b' \xe6\xa0', b'\xb9']
>>> b' \xe6\xa0\xb9'.decode("utf-8", errors='replace')
' 根'
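>>> # with errors='ignore', the dangling byte is dropped instead of replaced
>>> b' \xe6\xa0'.decode("utf-8", errors='ignore')
' '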
```
The mapping from regular tokens (in `bytes`) to their IDs can be retrieved from `tokenizer.get_vocab()`.
We do not support and do not recommend adding regular tokens to the vocabulary.
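As a small sketch building on the decoding example above, a regular token's ID can be looked up directly in the vocabulary:
```python
# Keys of the vocabulary are bytes objects for regular tokens.
vocab = tokenizer.get_vocab()
print(vocab[b' \xe6\xa0'])  # 51461, matching the example above
```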
## Special tokens
The special tokens signify special functions to the model, e.g., reaching the end of a document.
In theory, they do not exist in the input texts and only appear after the input texts are processed.
Their surface forms, e.g., `<|endoftext|>` for the end of a document, are only meant for ease of reference.
The special tokens currently in use are `<|endoftext|>` in Qwen-7B, and `<|endoftext|>`, `<|im_start|>`, and `<|im_end|>` in Qwen-7B-Chat; they have fixed meanings to the corresponding model and should not be used for anything else.
For other purposes, we keep extra special tokens from `<|extra_0|>` to `<|extra_204|>`, and you can use them as you wish.
The mapping from the surface forms of the special tokens (in `str`) to their IDs can be retrieved from `tokenizer.special_tokens`.
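For instance, using the ID that appears in the tokenization examples below:
```python
# Surface form (str) to ID for a special token.
print(tokenizer.special_tokens['<|endoftext|>'])  # 151643
```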
The concepts of `bos`, `eos`, `unk`, `pad`, `mask`, `sep`, and the like are not applicable to our pretrained models (Qwen-7B and Qwen-7B-Chat).
The `pad` token, however, is a different story: since the model in theory never sees or computes this token, you may use any known token for it.
But to be safe, we limit the value of special tokens specified in the initialization of the tokenizer to the known special tokens.
You may specify special tokens in fine-tuning, or in any other framework that requires them, like this
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True, pad_token='<|endoftext|>')
```
> WARNING: For our pretrained models, setting `bos`, `eos`, `unk`, and the like makes no sense.
> Unknown behavior may result if you set them without fine-tuning that teaches the model their meanings.
> In particular, you should not use `<|endoftext|>` as `eos`, unless you are sure that the end of a sentence and the end of a document, which may contain many sentences, are the same in your scenario.
## Injection attack prevention
As special tokens differ from regular tokens, what happens if the surface form of a special token appears in the input text?
For example, note that a piece of text like this
```
print("<|endoftext|>")
```
should be tokenized as
```
ids:[1350, 9639, 91, 8691, 723, 427, 91, 82598]
tokens: [b'print', b'("<', b'|', b'endo', b'ft', b'ext', b'|', b'>")']
```
not
```
ids: [1350, 445, 151643, 899]
tokens: [b'print', b'("', '<|endoftext|>', b'")']
```
Our default used to be the correct one: the surface forms of special tokens were treated just like regular text, and special tokens were to be taken care of by developers after tokenization.
However, this conflicts with the (albeit unsafe) practice in the community and adds an extra step for developers reusing existing code.
The default behavior has therefore been changed to parse the surface forms of all the known special tokens as special tokens.
To enable injection prevention, pass `allowed_special=set()` to the calls of the tokenizer:
```python
>>> tokenizer('print("<|endoftext|>")', allowed_special=set())
{'input_ids': [1350, 9639, 91, 8691, 723, 427, 91, 82598], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```
You can control the behavior in a fine-grained manner by passing a set of `str` as `allowed_special`
```python
>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'})
{'input_ids': [1350, 9639, 91, 15460, 62, 15, 91, 82598, 151643], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
You can also make the tokenizer raise errors if the surface forms of certain special tokens are encountered in the input texts by passing a collection of `str` as `disallowed_special`
```python
>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'}, disallowed_special=('<|extra_0|>', ))
...
ValueError: Encountered text corresponding to disallowed special token '<|extra_0|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|extra_0|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|extra_0|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
```
For more information on `allowed_special` and `disallowed_special`, please refer to [the `tiktoken` documentation](https://github.com/openai/tiktoken/blob/095924e02c85617df6889698d94515f91666c7ea/tiktoken/core.py#L75).
The new default is the same as
```python
>>> tokenizer('print("<|endoftext|>")', allowed_special="all", disallowed_special=())
{'input_ids': [1350, 445, 151643, 899], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```

@@ -0,0 +1,130 @@
# Tokenization
> 注:作为术语的“tokenization”,在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
Qwen-7B采用UTF-8字节级别的BPE tokenization方式,并依赖`tiktoken`这一高效的软件包执行分词。
Qwen-7B中有两类token,即源于BPE、`bytes`类型的普通token,和特殊指定、`str`类型的特殊token。
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
```
## 普通token
普通token源于BPE是在UTF-8编码的文本字节序列上学习得到的。
尽管基于字节序列的方式保证了所有文本均可被tokenize且没有未登录token问题,但处理罕见文本时有可能回退到字节级别的编码。
由于从字节序列解码为文本时,`errors`参数设为`replace`,处理不完整的token序列可能会遇到UTF-8解码错误,表象是生成中包含“替换字符”(�)。
这一行为可以通过将`errors`参数设为`ignore`来规避。
一次性修改可以传入tokenizer的`decode`函数,持久性修改可以传入tokenizer的初始化函数;请注意,`decode`的配置优先级更高。
`errors`的可选值,请参阅[Python文档](https://docs.python.org/3/library/stdtypes.html#bytes.decode)。
```python
>>> tokenizer.decode([51461])
' �'
>>> tokenizer.convert_ids_to_tokens([51461])
[b' \xe6\xa0']
>>> b' \xe6\xa0'.decode("utf-8", errors='replace')
' �'
>>> tokenizer.decode([51461, 117])
' 根'
>>> tokenizer.convert_ids_to_tokens([51461, 117])
[b' \xe6\xa0', b'\xb9']
>>> b' \xe6\xa0\xb9'.decode("utf-8", errors='replace')
' 根'
```
`bytes`类型的普通token到id的映射可以通过`tokenizer.get_vocab()`获取。
尚不支持也不推荐向tokenizer增加普通token。
## 特殊token
特殊token用以给模型传递特殊信号,如到达文本末尾。
理论上,输入文本中不包含特殊token,它们仅在tokenization后由开发者手动加入。
特殊token的字面表达,如表示文本结束的`<|endoftext|>`,仅便于指代特殊token,不意味着它们在输入文本空间中。
目前,训练中使用的、已经有固定含义的、不应做它用的特殊token,Qwen-7B中有`<|endoftext|>`,Qwen-7B-Chat中有`<|endoftext|>`、`<|im_start|>`以及`<|im_end|>`。
但词表中也留有供扩展的特殊token位,可用`<|extra_0|>`到`<|extra_204|>`来指代。
`str`类型的特殊token字面表达到id的映射可以通过`tokenizer.special_tokens`获取。
对于提供的模型参数(Qwen-7B和Qwen-7B-Chat)而言,诸如`bos`、`eos`、`unk`、`pad`、`mask`、`sep`等的特殊token的概念并不适用。
特例是`pad`:由于这个token理论上并不参与模型计算,所以可以使用任意token表达这一概念。
但保险起见,目前可在tokenizer初始化时设定的特殊token,仅可使用已知的特殊token字面表达,即`<|endoftext|>`、`<|im_start|>`、`<|im_end|>`和`<|extra_0|>`到`<|extra_204|>`。
对于微调或者其它需要这些token才能运行的框架,可以如下配置:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True, pad_token='<|endoftext|>')
```
> 注意: 对于提供的训练好的模型,设置诸如`bos`、`eos`、`unk`之类的没有意义,即模型不需要这些概念。
> 如果设置了这些token,但没有通过相应的微调让模型理解其含义,未知行为可能被触发。
> 特别地,不应混淆`<|endoftext|>`和`eos`的概念,除非应用场景中它们的实际含义是一致的,即句子末尾等价于文本末尾。
**注入攻击防御**
由于特殊token和普通token概念上的差异,如果输入文本中含有特殊token的字面表达,该如何处理?
以下面文本为例
```
print("<|endoftext|>")
```
其正确的tokenization为
```
ids:[1350, 9639, 91, 8691, 723, 427, 91, 82598]
tokens: [b'print', b'("<', b'|', b'endo', b'ft', b'ext', b'|', b'>")']
```
不是
```
ids: [1350, 445, 151643, 899]
tokens: [b'print', b'("', '<|endoftext|>', b'")']
```
默认行为曾是正确的,即输入文本中任何字符一律按普通token处理,特殊token应由开发者在tokenization后人工处理。
然而,这与社区中的实践似有差异,为开发者复用代码增加了额外适配步骤。
默认行为已被调整为从输入文本中解析特殊token的字面表达。
如需启用注入攻击防御,请传入参数`allowed_special=set()`
```python
>>> tokenizer('print("<|endoftext|>")', allowed_special=set())
{'input_ids': [1350, 9639, 91, 8691, 723, 427, 91, 82598], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```
这一行为可以更精细地调控,将`allowed_special`设为`str`的集合即可:
```python
>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'})
{'input_ids': [1350, 9639, 91, 15460, 62, 15, 91, 82598, 151643], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
如果希望输入中遇到特殊token的字面表达时获得更直接的提醒,通过配置`disallowed_special`,可以让tokenizer直接触发异常:
```python
>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'}, disallowed_special=('<|extra_0|>', ))
...
ValueError: Encountered text corresponding to disallowed special token '<|extra_0|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|extra_0|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|extra_0|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
```
更多关于`allowed_special`和`disallowed_special`的信息, 请参阅[`tiktoken`代码](https://github.com/openai/tiktoken/blob/095924e02c85617df6889698d94515f91666c7ea/tiktoken/core.py#L75).
新的默认行为与以下设定等价
```python
>>> tokenizer('print("<|endoftext|>")', allowed_special="all", disallowed_special=())
{'input_ids': [1350, 445, 151643, 899], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```