Merge branch 'QwenLM:main' into add_ja-readme

commit 81185f0b3f by Ikko Eltociear Ashimine (committed via GitHub)

@@ -1,41 +1,49 @@
name: 🐞 Bug
description: File a bug/issue
description: 提交错误报告 | File a bug/issue
title: "[BUG] <title>"
labels: ["Bug"]
body:
- type: checkboxes
attributes:
label: Is there an existing issue for this?
description: Please search to see if an issue already exists for the bug you encountered.
label: 是否已有关于该错误的issue | Is there an existing issue for this?
description: |
请先搜索您遇到的错误是否在已有的issues中提到过。
Please search to see if an issue already exists for the bug you encountered.
options:
- label: I have searched the existing issues
- label: 我已经搜索过已有的issues | I have searched the existing issues
required: true
- type: textarea
attributes:
label: Current Behavior
description: A concise description of what you're experiencing.
label: 当前行为 | Current Behavior
description: |
准确描述遇到的行为。
A concise description of what you're experiencing.
validations:
required: false
- type: textarea
attributes:
label: Expected Behavior
description: A concise description of what you expected to happen.
label: 期望行为 | Expected Behavior
description: |
准确描述预期的行为。
A concise description of what you expected to happen.
validations:
required: false
- type: textarea
attributes:
label: Steps To Reproduce
description: Steps to reproduce the behavior.
label: 复现方法 | Steps To Reproduce
description: |
复现当前行为的详细步骤。
Steps to reproduce the behavior.
placeholder: |
1. In this environment...
1. With this config...
1. Run '...'
1. See error...
2. With this config...
3. Run '...'
4. See error...
validations:
required: false
- type: textarea
attributes:
label: Environment
label: 运行环境 | Environment
description: |
examples:
- **OS**: Ubuntu 20.04
@@ -54,8 +62,12 @@ body:
required: false
- type: textarea
attributes:
label: Anything else?
label: 备注 | Anything else?
description: |
您可以在这里补充其他关于该问题背景信息的描述、链接或引用等。
您可以通过点击高亮此区域然后拖动文件的方式上传图片或日志文件。
Links? References? Anything that will give us more context about the issue you are encountering!
Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.

@@ -1,5 +1,5 @@
name: "💡 Feature Request"
description: Create a new ticket for a new feature request
description: 创建新功能请求 | Create a new ticket for a new feature request
title: "💡 [REQUEST] - <title>"
labels: [
"question"
@@ -8,39 +8,48 @@ body:
- type: input
id: start_date
attributes:
label: "Start Date"
description: Start of development
label: "起始日期 | Start Date"
description: |
起始开发日期
Start of development
placeholder: "month/day/year"
validations:
required: false
- type: textarea
id: implementation_pr
attributes:
label: "Implementation PR"
description: Pull request used
label: "实现PR | Implementation PR"
description: |
实现该功能的Pull request
Pull request used
placeholder: "#Pull Request ID"
validations:
required: false
- type: textarea
id: reference_issues
attributes:
label: "Reference Issues"
description: Common issues
label: "相关Issues | Reference Issues"
description: |
与该功能相关的issues
Common issues
placeholder: "#Issues IDs"
validations:
required: false
- type: textarea
id: summary
attributes:
label: "Summary"
description: Provide a brief explanation of the feature
placeholder: Describe in a few lines your feature request
label: "摘要 | Summary"
description: |
简要描述新功能的特点
Provide a brief explanation of the feature
placeholder: |
Describe in a few lines your feature request
validations:
required: true
- type: textarea
id: basic_example
attributes:
label: "Basic Example"
label: "基本示例 | Basic Example"
description: Indicate here some basic examples of your feature.
placeholder: A few specific words about your feature request.
validations:
@@ -48,16 +57,22 @@ body:
- type: textarea
id: drawbacks
attributes:
label: "Drawbacks"
description: What are the drawbacks/impacts of your feature request ?
placeholder: Identify the drawbacks and impacts while being neutral on your feature request
label: "缺陷 | Drawbacks"
description: |
该新功能有哪些缺陷/可能造成哪些影响?
What are the drawbacks/impacts of your feature request?
placeholder: |
Identify the drawbacks and impacts while being neutral on your feature request
validations:
required: true
- type: textarea
id: unresolved_question
attributes:
label: "Unresolved questions"
description: What questions still remain unresolved ?
placeholder: Identify any unresolved issues.
label: "未解决问题 | Unresolved questions"
description: |
有哪些尚未解决的问题?
What questions still remain unresolved?
placeholder: |
Identify any unresolved issues.
validations:
required: false

@@ -50,4 +50,4 @@ If you are commercially using the Materials, and your product or service has mor
9. Governing Law and Jurisdiction.
a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.

@@ -49,4 +49,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
SOFTWARE.

@@ -1,4 +1,5 @@
<br>
<p align="center">
<img src="assets/logo.jpg" width="400"/>
<p>
@@ -50,7 +51,7 @@ In general, Qwen-7B outperforms the baseline models of a similar model size, and
<p>
<br>
For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical memo by clicking [here](techmemo-draft.md).
For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical memo by clicking [here](tech_memo.md).
## Requirements
@@ -73,6 +74,7 @@ If your device supports fp16 or bf16, we recommend installing [flash-attention](
```bash
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Below are optional. Installing them might be slow.
pip install csrc/layer_norm
pip install csrc/rotary
```
@@ -87,8 +89,7 @@ To use Qwen-7B-Chat for the inference, all you need to do is to input a few line
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Note: For tokenizer usage, please refer to examples/tokenizer_showcase.ipynb.
# The default behavior now has injection attack prevention off.
# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# use bf16
@@ -109,7 +110,7 @@ print(response)
# 你好!很高兴为你提供帮助。
# 第二轮对话 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
@@ -147,7 +148,7 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto",
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to('cuda:0')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
@@ -184,6 +185,10 @@ response, history = results['response'], results['history']
print(f'Response: {response}')
```
## Tokenizer
Our tokenizer, based on tiktoken, differs from other tokenizers such as the sentencepiece tokenizer. Take special care with special tokens, especially when fine-tuning. For more detailed information on the tokenizer and its use in fine-tuning, please refer to the [documentation](tokenization_note.md).
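As a quick sketch of what that note covers, you can designate a pad token and inspect the special-token mapping before fine-tuning (a minimal example, assuming only what the note itself documents):
```python
from transformers import AutoTokenizer

# Reuse <|endoftext|> as the pad token, as suggested in the tokenization note.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B", trust_remote_code=True, pad_token="<|endoftext|>"
)

# Mapping from the surface forms of the special tokens (str) to their IDs.
print(tokenizer.special_tokens)
```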
## Quantization
We provide examples to show how to load models in `NF4` and `Int8`. For starters, make sure you have installed `bitsandbytes`. Note that the requirements for `bitsandbytes` are:
@@ -232,14 +237,14 @@ We provide a CLI demo example in `cli_demo.py`, which supports streaming output
## Tool Usage
Qwen-7B-Chat is specifically optimized for tool usage, including API, database, models, etc., so that users can build their own Qwen-7B-based LangChain, Agent, and Code Interpreter. In the soon-to-be-released internal evaluation benchmark for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance.
Qwen-7B-Chat is specifically optimized for tool usage, including API, database, models, etc., so that users can build their own Qwen-7B-based LangChain, Agent, and Code Interpreter. In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance.
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
|-------------|------------------------|-----------------------|-----------------------|
| GPT-4 | 95% | **0.90** | 15% |
| GPT-3.5 | 85% | 0.88 | 75% |
| **Qwen-7B** | **99%** | 0.89 | **8.5%** |
| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md). The use of tools can enable the model to better perform tasks.
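As an illustration of the format those examples use, here is a minimal, self-contained sketch (the sample text is hand-written, not model output) that slices the latest Action / Action Input pair out of a ReAct response, mirroring the parsing in `eval/evaluate_plugin.py`:
```python
# A hand-written ReAct-style response, used only for illustration.
response = (
    "Thought: I should call the weather tool.\n"
    "Action: weather_api\n"
    "Action Input: {\"location\": \"Beijing\"}\n"
    "Observation:"
)

# Slice out each field by its marker, as the evaluation script does.
action = response[response.find("Action:") + len("Action:"):
                  response.find("Action Input:")].strip()
action_input = response[response.find("Action Input:") + len("Action Input:"):
                        response.find("Observation:")].strip()
print(action)        # weather_api
print(action_input)  # {"location": "Beijing"}
```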
@@ -288,4 +293,3 @@ Researchers and developers are free to use the codes and model weights of both Q
## Contact Us
If you are interested to leave a message to either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.

@@ -1,4 +1,5 @@
<br>
<p align="center">
<img src="assets/logo.jpg" width="400"/>
<p>
@@ -50,7 +51,7 @@ Qwen-7B在多个全面评估自然语言理解与生成、数学运算解题、
<p>
<br>
更多的实验结果和细节请查看我们的技术备忘录。点击[这里](techmemo-draft.md)。
更多的实验结果和细节请查看我们的技术备忘录。点击[这里](tech_memo.md)。
## 要求
@@ -73,6 +74,7 @@ pip install -r requirements.txt
```bash
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# 下方安装可选,安装可能比较缓慢。
pip install csrc/layer_norm
pip install csrc/rotary
```
@@ -87,7 +89,7 @@ pip install csrc/rotary
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# 请注意:分词器默认行为已更改为默认关闭特殊token攻击防护。相关使用指引请见examples/tokenizer_showcase.ipynb
# 请注意:分词器默认行为已更改为默认关闭特殊token攻击防护。
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
@@ -108,7 +110,7 @@ print(response)
# 你好!很高兴为你提供帮助。
# 第二轮对话 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
@@ -147,7 +149,7 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto",
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to('cuda:0')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
@@ -184,6 +186,13 @@ response, history = results['response'], results['history']
print(f'Response: {response}')
```
## Tokenization
> 注:作为术语的“tokenization”,在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
基于tiktoken的tokenizer,有别于其他分词器,比如sentencepiece tokenizer。尤其在微调阶段,需要特别注意特殊token的使用。关于tokenizer的更多信息以及微调时涉及的相关使用,请参阅[文档](tokenization_note_zh.md)。
## 量化
如希望使用更低精度的量化模型,如4比特和8比特的模型,我们提供了简单的示例来说明如何快速使用量化模型。在开始前,确保你已经安装了`bitsandbytes`。请注意,`bitsandbytes`的安装要求是:
@@ -232,13 +241,13 @@ model = AutoModelForCausalLM.from_pretrained(
## 工具调用
Qwen-7B-Chat针对包括API、数据库、模型等工具在内的调用进行了优化。用户可以开发基于Qwen-7B的LangChain、Agent甚至Code Interpreter。我们在内部的即将开源的评测数据集上测试模型的工具调用能力并发现Qwen-7B-Chat能够取得稳定的表现。
Qwen-7B-Chat针对包括API、数据库、模型等工具在内的调用进行了优化。用户可以开发基于Qwen-7B的LangChain、Agent甚至Code Interpreter。我们在开源的[评测数据集](eval/EVALUATION.md)上测试模型的工具调用能力,并发现Qwen-7B-Chat能够取得稳定的表现。
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
| ------------- | ------------------------- | ------------------------ | ------------------------ |
| GPT-4 | 95% | **0.90** | 15% |
| GPT-3.5 | 85% | 0.88 | 75% |
| **Qwen-7B** | **99%** | 0.89 | **8.5%** |
| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
我们提供了文档说明如何根据ReAct Prompting的原则写作你的prompt。
@@ -289,4 +298,3 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct
## 联系我们
如果你想给我们的研发团队和产品团队留言,请通过邮件(qianwen_opensource@alibabacloud.com)联系我们。

@@ -177,7 +177,7 @@ def main():
# Run chat.
set_seed(seed)
try:
for response in model.chat(tokenizer, query, history=history, stream=True):
for response in model.chat_stream(tokenizer, query, history=history):
_clear_screen()
print(f"\nUser: {query}")
print(f"\nQwen-7B: {response}")

@@ -1,83 +0,0 @@
# Copyright (c) Alibaba Cloud.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import torch
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.trainer_utils import set_seed
def _load_model_tokenizer(args):
tokenizer = AutoTokenizer.from_pretrained(
args.checkpoint_path, trust_remote_code=True,
)
print("load tokenizer")
if args.cpu_only:
device_map = "cpu"
max_memory = None
else:
device_map = "auto"
max_memory_str = f"{int(torch.cuda.mem_get_info()[0] / 1024 ** 3) - 2}GB"
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory_str for i in range(n_gpus)}
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint_path,
device_map=device_map,
max_memory=max_memory,
trust_remote_code=True,
).eval()
return model, tokenizer
def demo_qwen_pretrain(args):
model, tokenizer = _load_model_tokenizer(args)
inputs = tokenizer(
"蒙古国的首都是乌兰巴托Ulaanbaatar\n冰岛的首都是雷克雅未克Reykjavik\n埃塞俄比亚的首都是",
return_tensors="pt",
)
inputs = inputs.to(model.device)
pred = model.generate(inputs=inputs["input_ids"])
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
def demo_qwen_chat(args):
model, tokenizer = _load_model_tokenizer(args)
queries = [
"请问把大象关冰箱总共要几步?",
"1+3=?",
"请将下面这句话翻译为英文:在哪里跌倒就在哪里趴着",
]
history = None
for turn_idx, query in enumerate(queries, start=1):
response, history = model.chat(
tokenizer,
query,
history=history,
)
print(f"===== Turn {turn_idx} ====")
print("Query:", query, end="\n")
print("Response:", response, end="\n")
def main():
parser = argparse.ArgumentParser(description="Test HF checkpoint.")
parser.add_argument("-c", "--checkpoint-path", type=str, help="Checkpoint path")
parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed")
parser.add_argument("--cpu-only", action="store_true", help="Run demo with CPU only")
args = parser.parse_args()
set_seed(args.seed)
if "chat" in args.checkpoint_path.lower():
demo_qwen_chat(args)
else:
demo_qwen_pretrain(args)
if __name__ == "__main__":
main()

@@ -49,9 +49,9 @@ evaluate_functional_correctness HumanEval_res.jsonl
python evaluate_chat_humaneval.py -f HumanEval.jsonl -o HumanEval_res_chat.jsonl
evaluate_functional_correctness HumanEval_res_chat.jsonl
```
When installing the `human-eval` package, please note the following disclaimer:
This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.
- GSM8K
@@ -64,3 +64,20 @@ python evaluate_gsm8k.py
python evaluate_chat_gsm8k.py # zeroshot
python evaluate_chat_gsm8k.py --use-fewshot # fewshot
```
- PLUGIN
This script is used to reproduce the results of the ReAct and Hugging Face Agent in the Tool Usage section of the README document.
```Shell
# Qwen-7B-Chat
mkdir data;
cd data;
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_positive.jsonl;
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_negative.jsonl;
cd ..;
pip install json5;
pip install jsonlines;
pip install rouge_score;
python evaluate_plugin.py --eval-react-positive --eval-react-negative --eval-hfagent
```

@@ -0,0 +1,308 @@
import argparse
import json
import os
import pprint
import json5
import jsonlines
from rouge_score import rouge_scorer
from tqdm import tqdm
from transformers import Agent, AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
from transformers.tools.evaluate_agent import evaluate_agent
from transformers.trainer_utils import set_seed
data_root_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
'data')
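# Tool Selection check: the predicted action must match the gold action,
# ignoring case and surrounding whitespace.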
def is_callable(response, golden):
return response['action'].strip().lower() == golden['action'].strip(
).lower()
def process_res(response):
# parse response
    response += '\n'  # pad with a newline: if a marker is missing, find() returns -1 and the slice only drops this padding
thought = response[:response.find('Action:')].strip()
action = response[response.find('Action:') +
len('Action:'):response.find('Action Input:')].strip()
action_input = response[response.find('Action Input:') +
len('Action Input:'):response.find('Observation:'
)].strip()
    # TODO: This parsing result is incorrect if the response contains multiple Actions. To be fixed in the future.
observation = response[response.find('Observation:') +
len('Observation:'):response.rfind('Thought:'
)].strip()
thought_last = response[response.rfind('Thought:') +
len('Thought:'):response.find('Final Answer:'
)].strip()
final_answer = response[response.find('Final Answer:') +
len('Final Answer:'):].strip()
try:
action_input = json.dumps(json5.loads(action_input),
ensure_ascii=False,
sort_keys=True)
    except Exception:
        # Leave action_input as the raw string if it is not valid JSON5.
        pass
res_dict = {
'thought': thought,
'action': action,
'action_input': action_input,
'observation': observation,
'thought_last': thought_last,
'final_answer': final_answer
}
return res_dict
class _DummyTokenizer:
def tokenize(self, text: str):
return text.split()
def _get_tokenized_string(tokenizer, text_list):
token_ids_list, tokenized_string_list = [], []
for text in text_list:
assert tokenizer is not None
token_ids = tokenizer.encode(text)
tokens_bytes = tokenizer.convert_ids_to_tokens(token_ids)
tokens = [
token.decode('utf-8', errors='replace') for token in tokens_bytes
]
tokenized_string = ' '.join(tokens)
token_ids_list.append(token_ids)
tokenized_string_list.append(tokenized_string)
return token_ids_list, tokenized_string_list
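# Score Tool Selection for one sample: parse the generated and the gold ReAct
# responses and compare their Action fields.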
def eval_action(job):
response = job['gen'][0]
golden = job['response']
if 'Action:' in response:
response, golden = process_res(response), process_res(golden)
if is_callable(response, golden):
return True
return False
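# Score Tool Input for one sample: compute Rouge-L between the generated and
# the gold Action Input, tokenized with the model tokenizer.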
def eval_action_input(job, tokenizer):
response = job['gen'][0]
golden = job['response']
response, golden = process_res(response), process_res(golden)
query = job['prompt']
job = {}
job['prompt'] = query
job['gen'] = response['action_input']
job['response'] = golden['action_input']
job['_gen_tok'], job['_gen_tok_str'] = _get_tokenized_string(
tokenizer, [response['action_input']])
job['_reference_tok'], job['_reference_tok_str'] = _get_tokenized_string(
tokenizer, [golden['action_input']])
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'],
tokenizer=_DummyTokenizer())
score = scorer.score(job['_reference_tok_str'][0], job['_gen_tok_str'][0])
rouge = score['rougeL'].fmeasure
return rouge
class QWenAgent(Agent):
"""
Agent that uses QWen model and tokenizer to generate code.
Example:
```py
agent = QWenAgent()
agent.run("Draw me a picture of rivers and lakes.")
```
"""
def __init__(self,
chat_prompt_template=None,
run_prompt_template=None,
additional_tools=None,
tokenizer=None,
model=None):
if tokenizer and model:
self.tokenizer = tokenizer
self.model = model
else:
checkpoint = 'Qwen/Qwen-7B-Chat'
self.tokenizer = AutoTokenizer.from_pretrained(
checkpoint, trust_remote_code=True)
self.model = AutoModelForCausalLM.from_pretrained(
checkpoint, device_map='auto',
trust_remote_code=True).cuda().eval()
self.model.generation_config = GenerationConfig.from_pretrained(
                checkpoint, trust_remote_code=True)  # different generation lengths, top_p, and other hyperparameters can be specified here
self.model.generation_config.do_sample = False # greedy
super().__init__(
chat_prompt_template=chat_prompt_template,
run_prompt_template=run_prompt_template,
additional_tools=additional_tools,
)
def generate_one(self, prompt, stop):
# "Human:" 和 "Assistant:" 曾为通义千问的特殊保留字,需要替换为 "_HUMAN_:" 和 "_ASSISTANT_:"。这一问题将在未来版本修复。
prompt = prompt.replace('Human:',
'_HUMAN_:').replace('Assistant:',
'_ASSISTANT_:')
stop = [
item.replace('Human:', '_HUMAN_:').replace('Assistant:',
'_ASSISTANT_:')
for item in stop
]
result, _ = self.model.chat(self.tokenizer, prompt, history=None)
for stop_seq in stop:
if result.endswith(stop_seq):
result = result[:-len(stop_seq)]
result = result.replace('_HUMAN_:',
'Human:').replace('_ASSISTANT_:', 'Assistant:')
return result
def load_models_tokenizer(args):
tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path,
trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(args.checkpoint_path,
device_map='auto',
trust_remote_code=True,
bf16=True,
use_flash_attn=True).eval()
model.generation_config = GenerationConfig.from_pretrained(
args.checkpoint_path, trust_remote_code=True)
model.generation_config.do_sample = False # use greedy decoding
return model, tokenizer
def load_jobs(filename):
jobs = []
with jsonlines.open(os.path.join(data_root_path, filename),
mode='r') as reader:
for job in reader:
jobs.append(job)
return jobs
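# Run (or reload from cache) model inference for every ReAct prompt in the
# given jsonl file, storing the generation under each job's 'gen' key.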
def react_inference(filename, model, tokenizer):
filename_cache = filename + '.cache'
if os.path.exists(os.path.join(data_root_path, filename_cache)):
jobs = load_jobs(filename=filename_cache)
print('Loaded from', filename_cache)
else:
with open(os.path.join(data_root_path, filename_cache), 'w') as f:
jobs = load_jobs(filename=filename)
print('Inference:', filename)
for job in tqdm(jobs):
response, history = model.chat(tokenizer,
job['prompt'],
history=None)
job['gen'] = [response]
f.writelines(json.dumps(job, ensure_ascii=False) + '\n')
print(filename_cache, 'is saved.')
return jobs
def main(args):
print('loading model weights')
if args.checkpoint_path is not None:
model, tokenizer = load_models_tokenizer(args)
else:
model, tokenizer = None, None
print('model loaded')
result = {}
# eval react positive
if args.eval_react_positive:
print('eval react positive ...')
acc_count = 0
rouge_mean = 0
jobs = react_inference(filename=args.eval_react_positive_filename,
model=model,
tokenizer=tokenizer)
for job in jobs:
if eval_action(job):
acc_count += 1
rouge = eval_action_input(job, tokenizer)
rouge_mean += (rouge / len(jobs))
scores = {
'action_right_rate': acc_count / len(jobs),
'action_input_rouge': rouge_mean,
}
result.update({'react_positive': scores})
# eval react negative
if args.eval_react_negative:
print('eval react negative ...')
bad_count = 0
jobs = react_inference(filename=args.eval_react_negative_filename,
model=model,
tokenizer=tokenizer)
for job in jobs:
if '\nAction:' in job['gen'][0]:
bad_count += 1
scores = {'bad_rate': bad_count / len(jobs)}
result.update({'react_negative': scores})
# eval hfagent
if args.eval_hfagent:
print('eval hfagent ...')
agent = QWenAgent(model=model, tokenizer=tokenizer)
scores = evaluate_agent(agent, verbose=False, return_errors=False)
result.update({'hfagent': scores})
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(result)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Test HF checkpoint.')
parser.add_argument('-c',
'--checkpoint-path',
type=str,
help='Checkpoint path',
default='Qwen/Qwen-7B-Chat')
parser.add_argument('-s',
'--seed',
type=int,
default=1234,
help='Random seed')
"""Provide extra arguments required for tasks."""
group = parser.add_argument_group(title='Evaluation options')
group.add_argument('--eval-react-positive',
action='store_true',
default=False,
help='Eval react positive.')
group.add_argument('--eval-react-positive-filename',
type=str,
default='exam_plugin_v1_react_positive.jsonl',
help='Eval react positive filename.')
group.add_argument('--eval-react-negative',
action='store_true',
default=False,
help='Eval react negative.')
group.add_argument('--eval-react-negative-filename',
type=str,
default='exam_plugin_v1_react_negative.jsonl',
help='Eval react negative filename.')
group.add_argument('--eval-hfagent',
action='store_true',
default=False,
help='Eval hfagent.')
args = parser.parse_args()
set_seed(args.seed)
main(args)

@@ -242,4 +242,4 @@ def parse_latest_plugin_call(text: str) -> Tuple[str, str]:
return '', ''
```
此外,如果输出的 Action Input 内容是一段表示 JSON 对象的文本,我们建议使用 `json5` 包的 `json5.loads(...)` 方法加载。
此外,如果输出的 Action Input 内容是一段表示 JSON 对象的文本,我们建议使用 `json5` 包的 `json5.loads(...)` 方法加载。
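下面是这一建议的一个最小示例(示例字符串仅作说明):
```python
import json5

# json5 可以容忍单引号、未加引号的键和结尾逗号,严格的 json.loads 会直接报错。
action_input = "{'location': '北京', unit: 'celsius',}"
print(json5.loads(action_input))
```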

@@ -4,3 +4,4 @@ tiktoken
einops
transformers_stream_generator==0.0.4
bitsandbytes
scipy

@@ -311,13 +311,13 @@ LLMs have shown capability in coordinating multiple external systems to achieve
Qwen supports calling plugins/tools/APIs through [ReAct Prompting](https://arxiv.org/abs/2210.03629).
ReAct is also one of the main approaches used by the [LangChain](https://python.langchain.com/) framework.
For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md).
In the soon-to-be-released evaluation benchmark for assessing tool usage capabilities, Qwen's performance is as follows:
In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, Qwen's performance is as follows:
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
| :---------- | --------------------------: | -------------------------: | -------------------------: |
| GPT-4 | 95% | **0.90** | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| **Qwen-7B** | **99%** | 0.89 | **8.5%** |
| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
> The plugins that appear in the evaluation set do not appear in the training set of Qwen.
> This benchmark evaluates the accuracy of the model in selecting the correct plugin from multiple candidate plugins, the rationality of the parameters passed into the plugin, and the false positive rate.

@@ -0,0 +1,127 @@
# Tokenization
Qwen-7B applies BPE tokenization to UTF-8 bytes via the `tiktoken` package.
There are two types of tokens in Qwen-7B, i.e., the regular tokens (of type `bytes`) in BPE and the special/control tokens (of type `str`).
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
```
## Regular tokens
The regular tokens are BPE tokens learned from byte sequences of texts encoded using the UTF-8 encoding.
While this allows tokenization of all texts and leaves no unknown tokens, it may fall back to single bytes when tokenizing uncommon text.
You may then encounter UTF-8 decoding errors; since the `errors` argument defaults to `replace`, incomplete generations may contain the replacement character (�).
You can change this behavior by passing `errors="ignore"` to the `decode` function for a single call, or to the `from_pretrained` function to make it persistent.
For more options of `errors`, please refer to [the Python documentation](https://docs.python.org/3/library/stdtypes.html#bytes.decode).
```python
>>> tokenizer.decode([51461])
' �'
>>> tokenizer.convert_ids_to_tokens([51461])
[b' \xe6\xa0']
>>> b' \xe6\xa0'.decode("utf-8", errors='replace')
' �'
>>> tokenizer.decode([51461, 117])
' 根'
>>> tokenizer.convert_ids_to_tokens([51461, 117])
[b' \xe6\xa0', b'\xb9']
>>> b' \xe6\xa0\xb9'.decode("utf-8", errors='replace')
' 根'
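>>> # with errors='ignore', the dangling byte is dropped instead of replaced
>>> b' \xe6\xa0'.decode("utf-8", errors='ignore')
' '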
```
The mapping from regular tokens (in `bytes`) to their IDs can be retrieved from `tokenizer.get_vocab()`.
We do not support and do not recommend adding regular tokens to the vocabulary.
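As a small sketch building on the decoding example above, a regular token's ID can be looked up directly in the vocabulary:
```python
# Keys of the vocabulary are bytes objects for regular tokens.
vocab = tokenizer.get_vocab()
print(vocab[b' \xe6\xa0'])  # 51461, matching the example above
```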
## Special tokens
The special tokens signify special functions to the model, e.g., reaching the end of a document.
In theory, they do not exist in the input texts and only appear after the input texts are processed.
Their surface forms, e.g., `<|endoftext|>` for the end of a document, are only meant for ease of reference.
The special tokens currently in use are `<|endoftext|>` in Qwen-7B, and `<|endoftext|>`, `<|im_start|>`, and `<|im_end|>` in Qwen-7B-Chat; they have fixed meanings to the corresponding model and should not be used for anything else.
For other purposes, we keep extra special tokens from `<|extra_0|>` to `<|extra_204|>`, and you can use them as you wish.
The mapping from the surface forms of the special tokens (in `str`) to their IDs can be retrieved from `tokenizer.special_tokens`.
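For instance, using the ID that appears in the tokenization examples below:
```python
# Surface form (str) to ID for a special token.
print(tokenizer.special_tokens['<|endoftext|>'])  # 151643
```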
The concepts of `bos`, `eos`, `unk`, `pad`, `mask`, `sep`, and the like are not applicable to our pretrained models (Qwen-7B and Qwen-7B-Chat).
The `pad` token, however, is a different story: since the model in theory never sees or computes this token, you may use any known token for it.
But to be safe, we limit the value of special tokens specified in the initialization of the tokenizer to the known special tokens.
You may specify special tokens in fine-tuning, or in any other framework that requires them, like this
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True, pad_token='<|endoftext|>')
```
> WARNING: For our pretrained models, setting `bos`, `eos`, `unk`, and the like makes no sense.
> Unknown behavior may result if you set them without fine-tuning that teaches the model their meanings.
> In particular, you should not use `<|endoftext|>` as `eos`, unless you are sure that the end of a sentence and the end of a document, which may contain many sentences, are the same in your scenario.
## Injection attack prevention
As special tokens differ from regular tokens, what happens if the surface form of a special token appears in the input text?
For example, note that a piece of text like this
```
print("<|endoftext|>")
```
should be tokenized as
```
ids:[1350, 9639, 91, 8691, 723, 427, 91, 82598]
tokens: [b'print', b'("<', b'|', b'endo', b'ft', b'ext', b'|', b'>")']
```
not
```
ids: [1350, 445, 151643, 899]
tokens: [b'print', b'("', '<|endoftext|>', b'")']
```
Our default used to be the correct one: the surface forms of special tokens were treated just like regular text, and special tokens were to be taken care of by developers after tokenization.
However, this conflicts with the (albeit unsafe) practice in the community and adds an extra step for developers reusing existing code.
The default behavior has therefore been changed to parse the surface forms of all the known special tokens as special tokens.
To enable injection prevention, pass `allowed_special=set()` to the calls of the tokenizer:
```python
>>> tokenizer('print("<|endoftext|>")', allowed_special=set())
{'input_ids': [1350, 9639, 91, 8691, 723, 427, 91, 82598], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```
You can control the behavior in a fine-grained manner by passing a set of `str` as `allowed_special`
```python
>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'})
{'input_ids': [1350, 9639, 91, 15460, 62, 15, 91, 82598, 151643], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
You can also make the tokenizer raise errors if the surface forms of certain special tokens are encountered in the input texts by passing a collection of `str` as `disallowed_special`
```python
>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'}, disallowed_special=('<|extra_0|>', ))
...
ValueError: Encountered text corresponding to disallowed special token '<|extra_0|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|extra_0|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|extra_0|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
```
For more information on `allowed_special` and `disallowed_special`, please refer to [the `tiktoken` documentation](https://github.com/openai/tiktoken/blob/095924e02c85617df6889698d94515f91666c7ea/tiktoken/core.py#L75).
The new default is the same as
```python
>>> tokenizer('print("<|endoftext|>")', allowed_special="all", disallowed_special=())
{'input_ids': [1350, 445, 151643, 899], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```

@@ -0,0 +1,130 @@
# Tokenization
> 注:作为术语的“tokenization”,在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
Qwen-7B采用UTF-8字节级别的BPE tokenization方式,并依赖`tiktoken`这一高效的软件包执行分词。
Qwen-7B中有两类token,即源于BPE、`bytes`类型的普通token,和特殊指定、`str`类型的特殊token。
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
```
## 普通token
普通token源于BPE是在UTF-8编码的文本字节序列上学习得到的。
尽管基于字节序列的方式保证了所有文本均可被tokenize且没有未登录token问题,但处理罕见文本时有可能回退到字节级别的编码。
由于从字节序列解码为文本时,`errors`参数设为`replace`,处理不完整的token序列可能会遇到UTF-8解码错误,表象是生成中包含“替换字符”(�)。
这一行为可以通过将`errors`参数设为`ignore`来规避。
一次性修改可以传入tokenizer的`decode`函数,持久性修改可以传入tokenizer的初始化函数;请注意,`decode`的配置优先级更高。
`errors`的可选值,请参阅[Python文档](https://docs.python.org/3/library/stdtypes.html#bytes.decode)。
```python
>>> tokenizer.decode([51461])
' �'
>>> tokenizer.convert_ids_to_tokens([51461])
[b' \xe6\xa0']
>>> b' \xe6\xa0'.decode("utf-8", errors='replace')
' �'
>>> tokenizer.decode([51461, 117])
' 根'
>>> tokenizer.convert_ids_to_tokens([51461, 117])
[b' \xe6\xa0', b'\xb9']
>>> b' \xe6\xa0\xb9'.decode("utf-8", errors='replace')
' 根'
```
`bytes`类型的普通token到id的映射可以通过`tokenizer.get_vocab()`获取。
尚不支持也不推荐向tokenizer增加普通token。
## 特殊token
特殊token用以给模型传递特殊信号,如到达文本末尾。
理论上,输入文本中不包含特殊token,它们仅在tokenization后由开发者手动加入。
特殊token的字面表达,如表示文本结束的`<|endoftext|>`,仅便于指代特殊token,不意味着它们在输入文本空间中。
目前,训练中使用的、已经有固定含义的、不应做它用的特殊token,Qwen-7B中有`<|endoftext|>`,Qwen-7B-Chat中有`<|endoftext|>`、`<|im_start|>`以及`<|im_end|>`。
但词表中也留有供扩展的特殊token位,可用`<|extra_0|>`到`<|extra_204|>`来指代。
`str`类型的特殊token字面表达到id的映射可以通过`tokenizer.special_tokens`获取。
对于提供的模型参数(Qwen-7B和Qwen-7B-Chat)而言,诸如`bos`、`eos`、`unk`、`pad`、`mask`、`sep`等的特殊token的概念并不适用。
特例是`pad`:由于这个token理论上并不参与模型计算,所以可以使用任意token表达这一概念。
但保险起见,目前可在tokenizer初始化时设定的特殊token,仅可使用已知的特殊token字面表达,即`<|endoftext|>`、`<|im_start|>`、`<|im_end|>`和`<|extra_0|>`到`<|extra_204|>`。
对于微调或者其它需要这些token才能运行的框架,可以如下配置:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True, pad_token='<|endoftext|>')
```
> 注意: 对于提供的训练好的模型,设置诸如`bos`、`eos`、`unk`之类的没有意义,即模型不需要这些概念。
> 如果设置了这些token,但没有通过相应的微调让模型理解其含义,未知行为可能被触发。
> 特别地,不应混淆`<|endoftext|>`和`eos`的概念,除非应用场景中它们的实际含义是一致的,即句子末尾等价于文本末尾。
**注入攻击防御**
由于特殊token和普通token概念上的差异,如果输入文本中含有特殊token的字面表达,该如何处理?
以下面文本为例
```
print("<|endoftext|>")
```
其正确的tokenization为
```
ids:[1350, 9639, 91, 8691, 723, 427, 91, 82598]
tokens: [b'print', b'("<', b'|', b'endo', b'ft', b'ext', b'|', b'>")']
```
不是
```
ids: [1350, 445, 151643, 899]
tokens: [b'print', b'("', '<|endoftext|>', b'")']
```
默认行为曾是正确的,即输入文本中任何字符一律按普通token处理,特殊token应由开发者在tokenization后人工处理。
然而,这与社区中的实践似有差异,为开发者复用代码增加了额外适配步骤。
默认行为已被调整为从输入文本中解析特殊token的字面表达。
如需启用注入攻击防御,请传入参数`allowed_special=set()`
```python
>>> tokenizer('print("<|endoftext|>")', allowed_special=set())
{'input_ids': [1350, 9639, 91, 8691, 723, 427, 91, 82598], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```
这一行为可以更精细地调控,将`allowed_special`设为`str`的集合即可:
```python
>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'})
{'input_ids': [1350, 9639, 91, 15460, 62, 15, 91, 82598, 151643], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
如果希望输入中遇到特殊token的字面表达时获得更直接的提醒,通过配置`disallowed_special`,可以让tokenizer直接触发异常:
```python
>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'}, disallowed_special=('<|extra_0|>', ))
...
ValueError: Encountered text corresponding to disallowed special token '<|extra_0|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|extra_0|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|extra_0|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
```
更多关于`allowed_special`和`disallowed_special`的信息, 请参阅[`tiktoken`代码](https://github.com/openai/tiktoken/blob/095924e02c85617df6889698d94515f91666c7ea/tiktoken/core.py#L75).
新的默认行为与以下设定等价
```python
>>> tokenizer('print("<|endoftext|>")', allowed_special="all", disallowed_special=())
{'input_ids': [1350, 445, 151643, 899], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```