commit ba2d85a13b28ed1ee0dde2d6c3e4d5a55dc5964c Author: JustinLin610 Date: Thu Aug 3 12:57:53 2023 +0800 first commit diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..39e9066 --- /dev/null +++ b/.gitignore @@ -0,0 +1,11 @@ +__pycache__ +*.so +build +.coverage_* +*.egg-info +*~ +.vscode/ +.idea/ +.DS_Store + +/private/ diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..82f7683 --- /dev/null +++ b/LICENSE @@ -0,0 +1,54 @@ +Tongyi Qianwen LICENSE AGREEMENT + +Tongyi Qianwen Release Date: August 3, 2023 + +By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately. + +1. Definitions + a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement. + b. "We"(or "Us") shall mean Alibaba Cloud. + c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use. + d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You. + e. "Tongyi Qianwen" shall mean the large language models (including Qwen-7b model and Qwen-7b-Chat model ), and software and +algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us. + f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement. + g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files. + h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, + and conversions to other media types. + +2. Grant of Rights +You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials. + +3. Redistribution +You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: + a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement; + b. You shall cause any modified files to carry prominent notices stating that You changed the files; + c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies:"Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and + d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement. + +4. Restrictions +If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization. + +5. Rules of use + a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials. + b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof). + +6. Intellectual Property + a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications. + b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials. + c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought. + +7. Disclaimer of Warranty and Limitation of Liability + + a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Model or to grant any license thereto. + b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM. + c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED. + d. You will indemnify and hold armless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials. + +8. Survival and Termination. + a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. + b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement. + +9. Governing Law and Jurisdiction. + a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. + b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement. \ No newline at end of file diff --git a/NOTICE b/NOTICE new file mode 100644 index 0000000..421295c --- /dev/null +++ b/NOTICE @@ -0,0 +1,27 @@ +------------- LICENSE FOR NVIDIA Megatron-LM code -------------- + +Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + * Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + * Neither the name of NVIDIA CORPORATION nor the names of its + contributors may be used to endorse or promote products derived + from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file diff --git a/README.md b/README.md new file mode 100644 index 0000000..0bb589c --- /dev/null +++ b/README.md @@ -0,0 +1,244 @@ +

+ +

+
+ +

+ Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  |  Demo  |  Report +

+
+ +

+ 中文  |  English +

+

+ +We opensource **Qwen-7B** and **Qwen-7B-Chat** on both **🤖 ModelScope** and **🤗 Hugging Face** (Click the logos on top to the repos with codes and checkpoints). This repo includes the brief introduction to Qwen-7B, the usage guidance, and also a technical memo [link](tech_memo.md) that provides more information. + +Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. The features of the Qwen-7B series include: + +1. **Trained with high-quality pretraining data**. We have pretrained Qwen-7B on a self-constructed large-scale high-quality dataset of over 2.2 trillion tokens. The dataset includes plain texts and codes, and it covers a wide range of domains, including general domain data and professional damain data. +2. **Strong performance**. In comparison with the models of the similar model size, we outperform the competitors on a series of benchmark datasets, which evaluates natural language understanding, mathematics, coding, etc. +3. **Better support of languages**. Our tokenizer, based on a large vocabulary of over 150K tokens, is a more efficient one compared with other tokenizers. It is friendly to many languages, and it is helpful for users to further finetune `Qwen-7B` for the extension of understanding a certain language. +4. **Support of 8K Context Length**. Both Qwen-7B and Qwen-7B-Chat supports the context length of 8K, which allows inputs with long contexts. +5. **Support of Plugins**. Qwen-7B-Chat is trained with plugin-related alignment data, and thus it is capable of using tools, including APIs, models, databases, etc., and it is capable of playing as an agent. + +## News + +* 2023.8.3 We release both Qwen-7B and Qwen-7B-Chat on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance. + +## Performance + +In general, Qwen-7B outperforms the baseline models of a similar model size, and even outperform larger models of around 13B parameters, on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, HumanEval, and WMT22, etc., which evaluate the models' capabilities on natural language understanding, mathematic problem solving, coding, etc. See the results below. + +| Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | +| :---------------- | -------------- | -------------: | -------------: | -------------: | -------------: | +| LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | +| LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | +| Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | +| InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | +| Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | +| LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | +| LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | +| ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | +| **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | + +For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our techinical memo by clicking [here](techmemo-draft.md). + +## Quickstart + +Below, we provide simple examples to show how to use Qwen-7B with 🤖 ModelScope and 🤗 Transformers. + +Before running the code, make sure you have setup the environment and installed the required packages. Make sure the pytorch version is higher than `1.12`, and then install the dependent libraries. + +```bash +pip install transformers==4.31.0 accelerate tiktoken einops +``` + +We recommend installing `flash-attention` for higher efficiency and lower memory usage. + +```bash +git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention +cd flash-attention && pip install . +pip install csrc/layer_norm +pip install csrc/rotary +``` + +Now you can start with ModelScope or Transformers. + +#### 🤗 Transformers + +To use Qwen-7B for the inference, all you need to do is to input a few lines of codes as demonstrated below: + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig + +tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval() +model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参 + +inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt') +inputs = inputs.to('cuda:0') +pred = model.generate(**inputs) +print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)) +# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)... +``` + +Running Qwen-7B-Chat is also simple. We provide you with an example of IPython to show how to interactive with the model. + +```ipython +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> from transformers.generation import GenerationConfig + +>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) +>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval() +>>> model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参 + +>>> # 第一轮对话 1st dialogue turn +>>> response, history = model.chat(tokenizer, "你好", history=None) +>>> print(response) +你好!很高兴为你提供帮助。 +>>> # 第二轮对话 2nd dialogue turn +>>> response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history) +>>> print(response) +这是一个关于一个年轻人奋斗创业最终取得成功的故事。 + +故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。 + +为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。 + +毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。 + +最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。 + +李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。 +>>> # 第三轮对话 3rd dialogue turn +>>> response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history) +>>> print(response) +《奋斗创业:一个年轻人的成功之路》 +``` + +#### 🤖 ModelScope + +ModelScope is an opensource platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below: + +``` +import os +from modelscope.pipelines import pipeline +from modelscope.utils.constant import Tasks +from modelscope import snapshot_download + +model_id = 'QWen/qwen-7b-chat' +revision = 'v1.0.0' + +model_dir = snapshot_download(model_id, revision) + +pipe = pipeline( +task=Tasks.chat, model=model_dir, device_map='auto') +history = None + +text = '浙江的省会在哪里?' +results = pipe(text, history=history) +response, history = results['response'], results['history'] +print(f'Response: {response}') +text = '它有什么好玩的地方呢?' +results = pipe(text, history=history) +response, history = results['response'], results['history'] +print(f'Response: {response}') +``` + +## Quantization + +To load the model in lower precision, e.g., 4 bits and 8 bits, we provide examples to show how to load by adding quantization configuration: + +```python +from transformers import BitsAndBytesConfig + +# quantization configuration for NF4 (4 bits) +quantization_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type='nf4', + bnb_4bit_compute_dtype=torch.bfloat16 +) + +# quantization configuration for Int8 (8 bits) +quantization_config = BitsAndBytesConfig(load_in_8bit=True) + +model = AutoModelForCausalLM.from_pretrained( + args.checkpoint_path, + device_map="cuda:0", + quantization_config=quantization_config, + max_memory=max_memory, + trust_remote_code=True, +).eval() +``` + +With this method, it is available to load Qwen-7B in `NF4` and `Int8`, which saves you memory usage. We provide related statistics of model performance below. We find that the quantization downgrades the effectiveness slightly but significantly increases inference efficiency and reduces memory costs. + +| Precision | MMLU | Memory | +| :---------: | -------: | -------: | +| BF16 | 56.7 | 16.2G | +| Int8 | 52.8 | 10.1G | +| NF4 | 48.9 | 7.4G | + +## Tool Usage + +Qwen-7B-Chat is specifically optimized for tool usage, including API, database, models, etc., so that users can build their own Qwen-7B-based LangChain, Agent, and Code Interpreter. In the soon-to-be-released internal evaluation benchmark for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance. +[](https://) + +| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ | +|-------------|------------------------|-----------------------|-----------------------| +| GPT-4 | 95% | **0.90** | 15% | +| GPT-3.5 | 85% | 0.88 | 75% | +| **Qwen-7B** | **99%** | 0.89 | **8.5%** | + +For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md). The use of tools can enable the model to better perform tasks. + +Additionally, we provide experimental results to show its capabilities of playing as an agent. See [Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) for more information. Its performance on the run-mode benchmark provided by Hugging Face is as follows: + +| Model | Tool Selection↑ | Tool Used↑ | Code↑ | +|-|-|-|-| +|GPT-4 | **100** | **100** | **97.41** | +|GPT-3.5 | 95.37 | 96.30 | 87.04 | +|StarCoder-15.5B | 87.04 | 87.96 | 68.89 | +| **Qwen-7B** | 90.74 | 92.59 | 74.07 | + +## Long-Context Understanding + +To extend the context length and break the botteneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, LogN attention scaling, to extend the context length to over 8K tokens. We conduct language modeling experiments on the arXiv dataset with the PPL evaluation and find that Qwen-7B can reach outstanding performance in the scenario of long context. Results are demonstrated below: + + + + + + + + + + + + + + + + + + + + +
ModelSequence Length
102420484096819216384
Qwen-7B4.233.7839.35469.812645.09
+ dynamic_ntk4.233.783.593.665.71
+ dynamic_ntk + logn4.233.783.583.564.62
+ dynamic_ntk + logn + window_attn4.233.783.583.494.32
+ +## Reproduction + +For your reproduction of the model performance on benchmark datasets, we provide scripts for you to reproduce the results and improve your own model. Check [eval/EVALUATION.md](eval/EVALUATION.md) for more information. + +## License Agreement + +Researchers and developers are free to use the codes and model weights of both Qwen-7B and Qwen-7B-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details. + +## Contact Us + +If you are interested to leave a message to either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com. + diff --git a/README_CN.md b/README_CN.md new file mode 100644 index 0000000..c95b909 --- /dev/null +++ b/README_CN.md @@ -0,0 +1,245 @@ +

+ +

+
+ +

+ Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  |  Demo  |  Report +

+
+ +

+ 中文  |  English +

+

+ +我们在🤖 **ModelScope**以及🤗 **Hugging Face**均开源了`Qwen-7B`系列模型。请在本文档顶部点击相关链接查看仓库信息。本仓库主要包括`Qwen-7B`的简介、使用指南、技术备忘等内容。想了解更多关于模型的信息,请点击[链接](tech_memo.md)查看我们的技术备忘录。 + +通义千问-7B(`Qwen-7B`) 是阿里云研发的通义千问大模型系列的70亿参数规模的模型。`Qwen-7B`是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。同时,在`Qwen-7B`的基础上,我们使用对齐机制打造了基于大语言模型的AI助手`Qwen-7B-Chat`。`Qwen-7B`系列模型的特点包括: + +1. **大规模高质量预训练数据**:我们使用了超过2.2万亿token的自建大规模预训练数据集进行语言模型的预训练。数据集包括文本和代码等多种数据类型,覆盖通用领域和专业领域。 +2. **优秀的模型性能**:相比同规模的开源模型,`Qwen-7B`在多个评测数据集上具有显著优势,甚至超出12-13B等更大规模的模型。评测评估的能力范围包括自然语言理解与生成、数学运算解题、代码生成等。 +3. **更好地支持多语言**:基于更大词表的分词器在分词上更高效,同时它对其他语言表现更加友好。用户可以在`Qwen-7B`的基础上更方便地训练特定语言的7B语言模型。 +4. **8K的上下文长度**:`Qwen-7B`及`Qwen-7B-Chat`均能支持8K的上下文长度, 允许用户输入更长的prompt。 +5. **支持插件调用**:`Qwen-7B-Chat`针对插件调用相关的对齐数据做了特定优化,当前模型能有效调用插件以及升级为Agent。 + +## 新闻 + +* 2023年8月3日 在魔搭社区(ModelScope)和Hugging Face同步推出`Qwen-7B`和`Qwen-7B-Chat`模型。同时,我们发布了技术备忘录,介绍了相关的训练细节和模型表现。 + +## 评测表现 + +`Qwen-7B`在多个全面评估自然语言理解与生成、数学运算解题、代码生成等能力的评测数据集上,包括MMLU、C-Eval、GSM8K、HumanEval、WMT22等,均超出了同规模大语言模型的表现,甚至超出了如12-13B参数等更大规模的语言模型。 + +| Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | +| :------------- | ---------- | ---------: | ---------: | ----------: | --------------: | +| LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | +| LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | +| Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | +| InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | +| Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | +| LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | +| LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | +| ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | +| **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | + +更多的实验结果和细节请查看我们的技术备忘录。点击[这里](techmemo-draft.md)。 + +## 快速使用 + +我们提供简单的示例来说明如何利用🤖 ModelScope和🤗 Transformers快速使用`Qwen-7B`和`Qwen-7B-Chat`。 + +在开始前,请确保你已经配置好环境并安装好相关的代码包。最重要的是,确保你的pytorch版本高于`1.12`,然后安装相关的依赖库。 + +```bash +pip install transformers==4.31.0 accelerate tiktoken einops +``` + +我们还推荐安装`flash-attention`来提高你的运行效率以及降低显存占用。 + +```bash +git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention +cd flash-attention && pip install . +pip install csrc/layer_norm +pip install csrc/rotary +``` + +接下来你可以开始使用Transformers或者ModelScope来使用我们的模型。 + +#### 🤗 Transformers + +如希望使用`Qwen-7B`进行推理,所需要写的只是如下所示的数行代码: + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig + +tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval() +model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参 + +inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt') +inputs = inputs.to('cuda:0') +pred = model.generate(**inputs) +print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)) +# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)... +``` + +运行Qwen-7B-Chat同样非常简单。下面是一个IPython的示例来展示如何交互式地使用`Qwen-7B-Chat`。 + +```ipython +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> from transformers.generation import GenerationConfig + +>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) +>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval() +>>> model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参 + +>>> # 第一轮对话 1st dialogue turn +>>> response, history = model.chat(tokenizer, "你好", history=None) +>>> print(response) +你好!很高兴为你提供帮助。 +>>> # 第二轮对话 2nd dialogue turn +>>> response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history) +>>> print(response) +这是一个关于一个年轻人奋斗创业最终取得成功的故事。 + +故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。 + +为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。 + +毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。 + +最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。 + +李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。 +>>> # 第三轮对话 3rd dialogue turn +>>> response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history) +>>> print(response) +《奋斗创业:一个年轻人的成功之路》 +``` + +#### 🤖 ModelScope + +魔搭(ModelScope)是开源的模型即服务共享平台,为泛AI开发者提供灵活、易用、低成本的一站式模型服务产品。使用ModelScope同样非常简单,代码如下所示: + +``` +import os +from modelscope.pipelines import pipeline +from modelscope.utils.constant import Tasks +from modelscope import snapshot_download + +model_id = 'QWen/qwen-7b-chat' +revision = 'v1.0.0' + +model_dir = snapshot_download(model_id, revision) + +pipe = pipeline( +task=Tasks.chat, model=model_dir, device_map='auto') +history = None + +text = '浙江的省会在哪里?' +results = pipe(text, history=history) +response, history = results['response'], results['history'] +print(f'Response: {response}') +text = '它有什么好玩的地方呢?' +results = pipe(text, history=history) +response, history = results['response'], results['history'] +print(f'Response: {response}') +``` + +## 量化 + +如希望使用更低精度的量化模型,如4比特和8比特的模型,我们提供了简单的示例来说明如何快速使用量化模型: + +```python +from transformers import BitsAndBytesConfig + +# quantization configuration for NF4 (4 bits) +quantization_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type='nf4', + bnb_4bit_compute_dtype=torch.bfloat16 +) + +# quantization configuration for Int8 (8 bits) +quantization_config = BitsAndBytesConfig(load_in_8bit=True) + +model = AutoModelForCausalLM.from_pretrained( + args.checkpoint_path, + device_map="cuda:0", + quantization_config=quantization_config, + max_memory=max_memory, + trust_remote_code=True, +).eval() +``` + +上述方法可以让我们将模型量化成`NF4`和`Int8`精度的模型进行读取,帮助我们节省显存开销。我们也提供了相关性能数据。我们发现尽管模型在效果上存在损失,但模型的显存开销大幅降低。 + +| Precision | MMLU | Memory | +| :---------: | -------: | -----: | +| BF16 | 56.7 | 16.2G | +| Int8 | 52.8 | 10.1G | +| NF4 | 48.9 | 7.4G | + +## 工具调用 + +`Qwen-7B-Chat` 针对包括API、数据库、模型等工具在内的调用进行了优化。用户可以开发基于`Qwen-7B`的LangChain、Agent甚至Code Interpreter。我们在内部的即将开源的评测数据集上测试模型的工具调用能力,并发现`Qwen-7B-Chat`能够取得稳定的表现。 + +| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ | +| ------------- | ------------------------- | ------------------------ | ------------------------ | +| GPT-4 | 95% | **0.90** | 15% | +| GPT-3.5 | 85% | 0.88 | 75% | +| **Qwen-7B** | **99%** | 0.89 | **8.5%** | + +我们提供了文档说明如何根据ReAct Prompting的原则写作你的prompt。 + +For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md)。 + +此外,我们还提供了实验结果表明我们的模型扮演Agent的能力。请阅读相关文档[链接](https://huggingface.co/docs/transformers/transformers_agents)了解更多信息。模型在Hugging Face提供的评测数据集上表现如下: + +| Model | Tool Selection↑ | Tool Used↑ | Code↑ | +| ----------------- | ------------------ | ------------- | ----------- | +| GPT-4 | **100** | **100** | **97.41** | +| GPT-3.5 | 95.37 | 96.30 | 87.04 | +| StarCoder-15.5B | 87.04 | 87.96 | 68.89 | +| **Qwen-7B** | 90.74 | 92.59 | 74.07 | + +## 长文本理解 + +我们引入了NTK插值、窗口注意力、LogN注意力缩放等技术来提升模型的上下文长度并突破训练序列长度的限制。我们的模型已经突破8K的序列长度。通过arXiv数据集上的语言模型实验,我们发现`Qwen-7B`能够在长序列的设置下取得不错的表现。 + + + + + + + + + + + + + + + + + + + + +
ModelSequence Length
102420484096819216384
Qwen-7B4.233.7839.35469.812645.09
+ dynamic_ntk4.233.783.593.665.71
+ dynamic_ntk + logn4.233.783.583.564.62
+ dynamic_ntk + logn + local_attn4.233.783.583.494.32
+ +## 复现 + +我们提供了评测脚本以供复现我们的实验结果。注意,由于内部代码和开源代码存在少许差异,评测结果可能与汇报结果存在细微的结果不一致。请阅读[eval/EVALUATION.md](eval/EVALUATION.md)了解更多信息。 + +## 使用协议 + +研究人员与开发者可使用`Qwen-7B`和`Qwen-7B-Chat`或进行二次开发。我们同样允许商业使用,具体细节请查看[LICENSE](LICENSE)。 + +## 联系我们 + +如果你想给我们的研发团队和产品团队留言,请通过邮件(qianwen_opensource@alibabacloud.com)联系我们。 + diff --git a/assets/logo.jpg b/assets/logo.jpg new file mode 100644 index 0000000..6d11f3a Binary files /dev/null and b/assets/logo.jpg differ diff --git a/assets/qwen_tokenizer.png b/assets/qwen_tokenizer.png new file mode 100644 index 0000000..a6b0366 Binary files /dev/null and b/assets/qwen_tokenizer.png differ diff --git a/assets/react_showcase_001.png b/assets/react_showcase_001.png new file mode 100644 index 0000000..474c59f Binary files /dev/null and b/assets/react_showcase_001.png differ diff --git a/assets/react_showcase_002.png b/assets/react_showcase_002.png new file mode 100644 index 0000000..eef8ce6 Binary files /dev/null and b/assets/react_showcase_002.png differ diff --git a/assets/react_tutorial_001.png b/assets/react_tutorial_001.png new file mode 100644 index 0000000..b9629be Binary files /dev/null and b/assets/react_tutorial_001.png differ diff --git a/assets/react_tutorial_002.png b/assets/react_tutorial_002.png new file mode 100644 index 0000000..1d9ede6 Binary files /dev/null and b/assets/react_tutorial_002.png differ diff --git a/assets/tokenizer.pdf b/assets/tokenizer.pdf new file mode 100644 index 0000000..f33e7e5 Binary files /dev/null and b/assets/tokenizer.pdf differ diff --git a/assets/tokenizer.png b/assets/tokenizer.png new file mode 100644 index 0000000..b16c0cd Binary files /dev/null and b/assets/tokenizer.png differ diff --git a/assets/wanx_colorful_black.png b/assets/wanx_colorful_black.png new file mode 100644 index 0000000..47332ae Binary files /dev/null and b/assets/wanx_colorful_black.png differ diff --git a/demo.py b/demo.py new file mode 100644 index 0000000..2950f05 --- /dev/null +++ b/demo.py @@ -0,0 +1,81 @@ +# Copyright (c) Alibaba Cloud. +# +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. + +import torch +import argparse +from pathlib import Path +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig +from transformers.trainer_utils import set_seed + + +def demo_qwen_pretrain(args): + tokenizer = AutoTokenizer.from_pretrained( + args.checkpoint_path, trust_remote_code=True + ) + print("load tokenizer") + max_memory = f"{int(torch.cuda.mem_get_info()[0] / 1024 ** 3) - 2}GB" + + n_gpus = torch.cuda.device_count() + max_memory = {i: max_memory for i in range(n_gpus)} + model = AutoModelForCausalLM.from_pretrained( + args.checkpoint_path, + device_map="cuda:0", + max_memory=max_memory, + trust_remote_code=True, + ).eval() + inputs = tokenizer( + "蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是", + return_tensors="pt", + ) + inputs = inputs.to(model.device) + pred = model.generate(inputs=inputs["input_ids"]) + print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)) + + +def demo_qwen_chat(args): + tokenizer = AutoTokenizer.from_pretrained( + args.checkpoint_path, trust_remote_code=True + ) + print("load tokenizer") + max_memory = f"{int(torch.cuda.mem_get_info()[0] / 1024 ** 3) - 2}GB" + + n_gpus = torch.cuda.device_count() + max_memory = {i: max_memory for i in range(n_gpus)} + model = AutoModelForCausalLM.from_pretrained( + args.checkpoint_path, + device_map="cuda:0", + max_memory=max_memory, + trust_remote_code=True, + ).eval() + queries = [ + "请问把大象关冰箱总共要几步?", + "1+3=?", + "请将下面这句话翻译为英文:在哪里跌倒就在哪里趴着", + ] + history = None + for turn_idx, query in enumerate(queries, start=1): + response, history = model.chat( + tokenizer, + query, + history=history, + ) + print(f"===== Turn {turn_idx} ====") + print("Query:", query, end="\n") + print("Response:", response, end="\n") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Test HF checkpoint.") + parser.add_argument("-c", "--checkpoint-path", type=Path, help="Checkpoint path") + parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed") + parser.add_argument("--gpu", type=int, default=0, help="gpu id") + + args = parser.parse_args() + set_seed(args.seed) + + if 'chat' in args.checkpoint_path.lower(): + demo_qwen_chat(args) + else: + demo_qwen_pretrain(args) \ No newline at end of file diff --git a/eval/EVALUATION.md b/eval/EVALUATION.md new file mode 100644 index 0000000..09b009b --- /dev/null +++ b/eval/EVALUATION.md @@ -0,0 +1,45 @@ +## 评测复现 + +- CEVAL + +```Shell +wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip +mkdir data/ceval +mv ceval-exam.zip data/ceval +cd data/ceval; unzip ceval-exam.zip +cd ../../ +python evaluate_ceval.py -d data/ceval/ +``` + +- MMLU + +```Shell +wget https://people.eecs.berkeley.edu/~hendrycks/data.tar +mkdir data/mmlu +mv data.tar data/mmlu +cd data/mmlu; tar xf data.tar +cd ../../ +python evaluate_mmlu.py -d data/mmlu/data/ +``` + +- HumanEval + +Get the HumanEval.jsonl file from [here](https://github.com/openai/human-eval/tree/master/data) + +```Shell +python evaluate_humaneval.py -f HumanEval.jsonl -o HumanEval_res.jsonl +git clone https://github.com/openai/human-eval +pip install -e human-eval +evaluate_functional_correctness HumanEval_res.jsonl +``` + +When installing package human-eval, please note its following disclaimer: + +This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions. + + +- GSM8K + +```Shell +python evaluate_gsm8k.py +``` \ No newline at end of file diff --git a/eval/evaluate_ceval.py b/eval/evaluate_ceval.py new file mode 100644 index 0000000..265af55 --- /dev/null +++ b/eval/evaluate_ceval.py @@ -0,0 +1,263 @@ +import os +import pandas as pd +import numpy as np +import argparse +import datasets +import torch + +from typing import List +from tqdm import tqdm +from transformers.trainer_utils import set_seed + + +''' +wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip +mkdir data/ceval +mv ceval-exam.zip data/ceval +cd data/ceval; unzip ceval-exam.zip +cd ../../ +python evaluate_ceval.py -d data/ceval/ +''' + +def load_models_tokenizer(args): + from transformers import AutoModelForCausalLM, AutoTokenizer + from transformers.generation import GenerationConfig + + tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path, trust_remote_code=True) + model = AutoModelForCausalLM.from_pretrained(args.checkpoint_path, device_map="auto", trust_remote_code=True).eval() + model.generation_config = GenerationConfig.from_pretrained(args.checkpoint_path, trust_remote_code=True) + return model, tokenizer + + +def format_example(line, include_answer=True): + example = '问题:' + line['question'] + for choice in choices: + example += f'\n{choice}. {line[f"{choice}"]}' + + if include_answer: + example += '\n答案:' + line["answer"] + '\n\n' + else: + example += '\n答案:' + return example + + +def generate_few_shot_prompt(k, subject, dev_df): + prompt = '' + if k == -1: + k = dev_df.shape[0] + for i in range(k): + prompt += format_example( + dev_df.iloc[i, :], + include_answer=True, + ) + return prompt + + +def get_logits(tokenizer, model, inputs: List[str]): + input_ids = tokenizer(inputs, padding=False)['input_ids'] + input_ids = torch.tensor(input_ids, device=model.device) + tokens = {'input_ids': input_ids} + + outputs = model(input_ids)['logits'] + logits = outputs[:, -1, :] + log_probs = torch.nn.functional.softmax(logits, dim=-1) + return log_probs, {'tokens': tokens} + + +@torch.no_grad() +def eval_subject( + model, + tokenizer, + subject_name, + test_df, + k=5, + dev_df=None, + few_shot=False, + save_result_dir=None, + **kwargs +): + result = [] + score = [] + + few_shot_prompt = generate_few_shot_prompt( + k, subject_name, dev_df) if few_shot else [] + all_probs = {'prob_A': [], 'prob_B': [], 'prob_C': [], 'prob_D': []} + if args.debug: print(f"few_shot_prompt: {few_shot_prompt}") + + for _, row in tqdm(test_df.iterrows(), total=len(test_df)): + question = format_example(row, include_answer=False) + full_prompt = few_shot_prompt + question + + output, input_info = get_logits(tokenizer, model, [full_prompt]) + assert output.shape[0] == 1 + logits = output.flatten() + + softval = torch.nn.functional.softmax( + torch.tensor( + [ + logits[tokenizer("A")['input_ids']], + logits[tokenizer("B")['input_ids']], + logits[tokenizer("C")['input_ids']], + logits[tokenizer("D")['input_ids']], + ] + ), + dim=0, + ) + if softval.dtype in {torch.bfloat16, torch.float16}: + softval = softval.to(dtype=torch.float32) + probs = softval.detach().cpu().numpy() + + for i, choice in enumerate(choices): + all_probs[f'prob_{choice}'].append(probs[i]) + pred = {0: "A", 1: "B", 2: "C", 3: "D"}[np.argmax(probs)] + + if 'answer' in row: + correct = 1 if pred == row['answer'] else 0 + score.append(correct) + if args.debug: print(f'{question} pred: {pred} ref: {row["answer"]}') + result.append(pred) + + if score: + correct_ratio = 100 * sum(score) / len(score) + if args.debug: print(subject_name, correct_ratio) + else: + correct_ratio = 0 + if save_result_dir: + test_df['model_output'] = result + for i, choice in enumerate(choices): + test_df[f'prob_{choice}'] = (all_probs[f'prob_{choice}']) + if score: + test_df["correctness"] = score + os.makedirs(save_result_dir, exist_ok=True) + test_df.to_csv(os.path.join( + save_result_dir, f'{subject_name}_result.csv'), encoding="utf-8", index=False) + + return correct_ratio + + +def cal_ceval(res): + acc_sum_dict = dict() + acc_norm_sum_dict = dict() + cnt_dict = dict() + acc_sum = 0. + cnt = 0 + hard_cnt = 0 + hard_acc_sum = 0. + for tt in res.keys(): + name = tt.split('-')[-1] + acc_sum += float(res[tt]) + cnt += 1 + class_ = TASK_NAME_MAPPING[name][2] + if class_ not in acc_sum_dict: + acc_sum_dict[class_] = 0. + acc_norm_sum_dict[class_] = 0. + cnt_dict[class_] = 0. + if name in hard_list: + hard_cnt += 1 + hard_acc_sum += float(res[tt]) + acc_sum_dict[class_] += float(res[tt]) + cnt_dict[class_] += 1 + print('\n\n\n') + for k in ['STEM', 'Social Science', 'Humanities', 'Other']: + if k in cnt_dict: + print('%s acc: %.2f ' % ( + k, acc_sum_dict[k] / cnt_dict[k])) + if hard_cnt > 0: + print('Hard acc:%.2f ' % (hard_acc_sum / hard_cnt)) + print('AVERAGE acc:%.2f ' % (acc_sum / cnt)) + + +TASK_NAME_MAPPING = { + "computer_network": ["Computer Network", "\u8ba1\u7b97\u673a\u7f51\u7edc", "STEM"], + "operating_system": ["Operating System", "\u64cd\u4f5c\u7cfb\u7edf", "STEM"], + "computer_architecture": ["Computer Architecture", "\u8ba1\u7b97\u673a\u7ec4\u6210", "STEM"], + "college_programming": ["College Programming", "\u5927\u5b66\u7f16\u7a0b", "STEM"], + "college_physics": ["College Physics", "\u5927\u5b66\u7269\u7406", "STEM"], + "college_chemistry": ["College Chemistry", "\u5927\u5b66\u5316\u5b66", "STEM"], + "advanced_mathematics": ["Advanced Mathematics", "\u9ad8\u7b49\u6570\u5b66", "STEM"], + "probability_and_statistics": ["Probability and Statistics", "\u6982\u7387\u7edf\u8ba1", "STEM"], + "discrete_mathematics": ["Discrete Mathematics", "\u79bb\u6563\u6570\u5b66", "STEM"], + "electrical_engineer": ["Electrical Engineer", "\u6ce8\u518c\u7535\u6c14\u5de5\u7a0b\u5e08", "STEM"], + "metrology_engineer": ["Metrology Engineer", "\u6ce8\u518c\u8ba1\u91cf\u5e08", "STEM"], + "high_school_mathematics": ["High School Mathematics", "\u9ad8\u4e2d\u6570\u5b66", "STEM"], + "high_school_physics": ["High School Physics", "\u9ad8\u4e2d\u7269\u7406", "STEM"], + "high_school_chemistry": ["High School Chemistry", "\u9ad8\u4e2d\u5316\u5b66", "STEM"], + "high_school_biology": ["High School Biology", "\u9ad8\u4e2d\u751f\u7269", "STEM"], + "middle_school_mathematics": ["Middle School Mathematics", "\u521d\u4e2d\u6570\u5b66", "STEM"], + "middle_school_biology": ["Middle School Biology", "\u521d\u4e2d\u751f\u7269", "STEM"], + "middle_school_physics": ["Middle School Physics", "\u521d\u4e2d\u7269\u7406", "STEM"], + "middle_school_chemistry": ["Middle School Chemistry", "\u521d\u4e2d\u5316\u5b66", "STEM"], + "veterinary_medicine": ["Veterinary Medicine", "\u517d\u533b\u5b66", "STEM"], + "college_economics": ["College Economics", "\u5927\u5b66\u7ecf\u6d4e\u5b66", "Social Science"], + "business_administration": ["Business Administration", "\u5de5\u5546\u7ba1\u7406", "Social Science"], + "marxism": ["Marxism", "\u9a6c\u514b\u601d\u4e3b\u4e49\u57fa\u672c\u539f\u7406", "Social Science"], + "mao_zedong_thought": ["Mao Zedong Thought", "\u6bdb\u6cfd\u4e1c\u601d\u60f3\u548c\u4e2d\u56fd\u7279\u8272\u793e\u4f1a\u4e3b\u4e49\u7406\u8bba\u4f53\u7cfb\u6982\u8bba", "Social Science"], + "education_science": ["Education Science", "\u6559\u80b2\u5b66", "Social Science"], + "teacher_qualification": ["Teacher Qualification", "\u6559\u5e08\u8d44\u683c", "Social Science"], + "high_school_politics": ["High School Politics", "\u9ad8\u4e2d\u653f\u6cbb", "Social Science"], + "high_school_geography": ["High School Geography", "\u9ad8\u4e2d\u5730\u7406", "Social Science"], + "middle_school_politics": ["Middle School Politics", "\u521d\u4e2d\u653f\u6cbb", "Social Science"], + "middle_school_geography": ["Middle School Geography", "\u521d\u4e2d\u5730\u7406", "Social Science"], + "modern_chinese_history": ["Modern Chinese History", "\u8fd1\u4ee3\u53f2\u7eb2\u8981", "Humanities"], + "ideological_and_moral_cultivation": ["Ideological and Moral Cultivation", "\u601d\u60f3\u9053\u5fb7\u4fee\u517b\u4e0e\u6cd5\u5f8b\u57fa\u7840", "Humanities"], + "logic": ["Logic", "\u903b\u8f91\u5b66", "Humanities"], + "law": ["Law", "\u6cd5\u5b66", "Humanities"], + "chinese_language_and_literature": ["Chinese Language and Literature", "\u4e2d\u56fd\u8bed\u8a00\u6587\u5b66", "Humanities"], + "art_studies": ["Art Studies", "\u827a\u672f\u5b66", "Humanities"], + "professional_tour_guide": ["Professional Tour Guide", "\u5bfc\u6e38\u8d44\u683c", "Humanities"], + "legal_professional": ["Legal Professional", "\u6cd5\u5f8b\u804c\u4e1a\u8d44\u683c", "Humanities"], + "high_school_chinese": ["High School Chinese", "\u9ad8\u4e2d\u8bed\u6587", "Humanities"], + "high_school_history": ["High School History", "\u9ad8\u4e2d\u5386\u53f2", "Humanities"], + "middle_school_history": ["Middle School History", "\u521d\u4e2d\u5386\u53f2", "Humanities"], + "civil_servant": ["Civil Servant", "\u516c\u52a1\u5458", "Other"], + "sports_science": ["Sports Science", "\u4f53\u80b2\u5b66", "Other"], + "plant_protection": ["Plant Protection", "\u690d\u7269\u4fdd\u62a4", "Other"], + "basic_medicine": ["Basic Medicine", "\u57fa\u7840\u533b\u5b66", "Other"], + "clinical_medicine": ["Clinical Medicine", "\u4e34\u5e8a\u533b\u5b66", "Other"], + "urban_and_rural_planner": ["Urban and Rural Planner", "\u6ce8\u518c\u57ce\u4e61\u89c4\u5212\u5e08", "Other"], + "accountant": ["Accountant", "\u6ce8\u518c\u4f1a\u8ba1\u5e08", "Other"], + "fire_engineer": ["Fire Engineer", "\u6ce8\u518c\u6d88\u9632\u5de5\u7a0b\u5e08", "Other"], + "environmental_impact_assessment_engineer": ["Environmental Impact Assessment Engineer", "\u73af\u5883\u5f71\u54cd\u8bc4\u4ef7\u5de5\u7a0b\u5e08", "Other"], + "tax_accountant": ["Tax Accountant", "\u7a0e\u52a1\u5e08", "Other"], + "physician": ["Physician", "\u533b\u5e08\u8d44\u683c", "Other"] +} +hard_list = ['advanced_mathematics', 'discrete_mathematics', 'probability_and_statistics', 'college_physics', 'college_chemistry', 'high_school_mathematics', 'high_school_physics', 'high_school_chemistry'] +choices = ["A", "B", "C", "D"] + + +def main(args): + model, tokenizer = load_models_tokenizer(args) + + dev_result = {} + for subject_name in tqdm(TASK_NAME_MAPPING.keys()): + val_file_path = os.path.join(args.eval_data_path, 'val', f'{subject_name}_val.csv') + dev_file_path = os.path.join(args.eval_data_path, 'dev', f'{subject_name}_dev.csv') + # test_file_path = os.path.join(args.eval_data_path, 'test', f'{subject_name}_test.csv') + val_df = pd.read_csv(val_file_path) + dev_df = pd.read_csv(dev_file_path) + # test_df = pd.read_csv(test_file_path) + + score = eval_subject(model, tokenizer, subject_name, val_df, dev_df=dev_df, k=5, few_shot=True, + save_result_dir=f"outs/ceval_eval_result") + dev_result[subject_name] = score + cal_ceval(dev_result) + + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description='Test HF checkpoint.') + parser.add_argument('-c', '--checkpoint-path', type=str, help='Checkpoint path', default="Qwen/Qwen-7B") + parser.add_argument('-s', '--seed', type=int, default=1234, help='Random seed') + + """Provide extra arguments required for tasks.""" + group = parser.add_argument_group(title='Evaluation options') + group.add_argument('-d', '--eval_data_path', type=str, required=True, + help='Path to eval data') + group.add_argument("--max-seq-len", type=int, default=2048, + help='Size of the output generated text.') + group.add_argument("--debug", action='store_true', default=False, + help='Print infos.') + + args = parser.parse_args() + set_seed(args.seed) + + main(args) \ No newline at end of file diff --git a/eval/evaluate_gsm8k.py b/eval/evaluate_gsm8k.py new file mode 100644 index 0000000..49d69c8 --- /dev/null +++ b/eval/evaluate_gsm8k.py @@ -0,0 +1,110 @@ +import random +import tqdm +import os +import re +import sys +import torch +import numpy as np +import jsonlines +import argparse +import jsonlines +import datasets +from datasets import load_from_disk,load_dataset +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig + + +ANS_RE = re.compile(r"#### (\-?[0-9\.\,]+)") +INVALID_ANS = "[invalid]" + +def doc_to_text(doc): + return fewshot_prompt + "\nQuestion: " + doc["question"] + "\nLet's think step by step\n" + +def decode(tokens_list, tokenizer, raw_text_len): + sents = [] + # print(len(tokens_list)) + for tokens in tokens_list: + tokens = tokens.cpu().numpy().tolist() + sent = tokenizer.tokenizer.decode( + tokens[raw_text_len:]) + sent = sent.split('<|endoftext|>')[0] + sent = sent.split('\n\n\n')[0] + sent = sent.split("\n\n")[0] + sent = sent.split("Question:")[0] + sents.append(sent) + return sents + +def generate_sample(model, tokenizer, input_txt): + input_ids = tokenizer.tokenizer.encode(input_txt) + raw_text_len = len(input_ids) + context_enc = torch.tensor( + [input_ids]).to(model.device) + print(f"Input text: {input_txt}\n") + outputs = model.generate(context_enc) + output_text = decode(outputs,tokenizer,raw_text_len)[0] + print(f"\nOutput text: {output_text}\n") + return output_text + + +def extract_answer_hf(completion): + match = ANS_RE.search(completion) + if match: + match_str = match.group(1).strip() + match_str = match_str.replace(",", "") + return eval(match_str) + else: + return INVALID_ANS + +def extract_answer(completion): + try: + last_number = re.findall(r'\d+', completion)[-1] + return eval(last_number) + except: + return INVALID_ANS + +def is_correct( completion, answer): + gold = extract_answer_hf(answer) + assert gold != INVALID_ANS, "No ground truth answer found in the document." + return extract_answer(completion) == gold + +if __name__ == '__main__': + + parser = argparse.ArgumentParser(description='Test HF checkpoint.') + parser.add_argument("-c", "--checkpoint-path", type=str, help="Checkpoint path", default="Qwen/Qwen-7B") + parser.add_argument("-f","--sample-input-file", type=str, default=None) + parser.add_argument("-o","--sample-output-file", type=str, default="gsm8k_res.jsonl") + + args = parser.parse_args() + + fewshot_prompt = open("gsm8k_prompt.txt").read() + if args.sample_input_file is not None: + dataset = load_from_disk(args.sample_input_file) + else: + config = datasets.DownloadConfig(resume_download=True, max_retries=100) + dataset = load_dataset("gsm8k", 'main', download_config=config) + + test = dataset["test"] + + print('Loading tokenizer ...') + tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path, trust_remote_code=True) + + print('Loading model ...') + model = AutoModelForCausalLM.from_pretrained(args.checkpoint_path, device_map="auto", trust_remote_code=True).eval() + model.generation_config = GenerationConfig.from_pretrained(args.checkpoint_path, trust_remote_code=True) + model.generation_config.do_sample = False + + f_output = jsonlines.Writer(open(args.sample_output_file, 'w', encoding='utf-8')) + tot_length = test.num_rows + acc_res = [] + for doc in test: + context = doc_to_text(doc) + completion = generate_sample(model, tokenizer, context) + answer= doc["answer"] + acc = is_correct(completion, answer) + doc["completion"]=completion + doc["acc"]=acc + f_output.write(doc) + acc_res.append(acc) + + f_output.close() + print("Acc: ",np.mean(acc_res)) \ No newline at end of file diff --git a/eval/evaluate_humaneval.py b/eval/evaluate_humaneval.py new file mode 100644 index 0000000..af78319 --- /dev/null +++ b/eval/evaluate_humaneval.py @@ -0,0 +1,70 @@ +import random +import tqdm +import os +import sys +import torch +import jsonlines +import argparse +import jsonlines +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig + +""" +git clone https://github.com/openai/human-eval +$ pip install -e human-eval +evaluate_functional_correctness sample-output-file +""" + +def decode(tokens_list, tokenizer, raw_text_len): + sents = [] + # print(len(tokens_list)) + for tokens in tokens_list: + tokens = tokens.cpu().numpy().tolist() + sent = tokenizer.tokenizer.decode( + tokens[raw_text_len:]) + sent = sent.split('<|endoftext|>')[0] + sent = sent.split('\n\n\n')[0] + sent = sent.split("\n\n")[0] + sent = sent.split("def ")[0] + sents.append(sent) + return sents + +def generate_sample(model, tokenizer, input_txt): + input_ids = tokenizer.tokenizer.encode(input_txt) + raw_text_len = len(input_ids) + context_enc = torch.tensor([input_ids] ).to(model.device) + print(f"Input text: {input_txt}\n") + outputs = model.generate(context_enc) + output_text = decode(outputs,tokenizer,raw_text_len)[0] + print(f"\nOutput text: \n{output_text}\n") + return output_text + + +if __name__ == '__main__': + + parser = argparse.ArgumentParser(description='Test HF checkpoint.') + parser.add_argument("-c", "--checkpoint-path", type=str, help='Checkpoint path', default="Qwen/Qwen-7B") + parser.add_argument("-f","--sample-input-file", type=str, default=None, help="data path to HumanEval.jsonl") + parser.add_argument("-o","--sample-output-file", type=str, default="HumanEval_res.jsonl") + + + args = parser.parse_args() + print('Loading tokenizer ...') + tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path, trust_remote_code=True) + + print('Loading model ...') + model = AutoModelForCausalLM.from_pretrained(args.checkpoint_path, device_map="auto", trust_remote_code=True).eval() + model.generation_config = GenerationConfig.from_pretrained(args.checkpoint_path, trust_remote_code=True) + model.generation_config.do_sample = False + + f_output = jsonlines.Writer(open(args.sample_output_file, 'w', encoding='utf-8')) + + f = jsonlines.open(args.sample_input_file) + with f_output as output: + for jobj in tqdm.tqdm(f, desc='task_idx'): + prompt = jobj['prompt'] + task_id = jobj['task_id'] + gen_sents = generate_sample(model, tokenizer, prompt) + gen_jobjs = {'task_id': task_id, "completion": gen_sents} + output.write(gen_jobjs) + f_output.close() \ No newline at end of file diff --git a/eval/evaluate_mmlu.py b/eval/evaluate_mmlu.py new file mode 100644 index 0000000..1b6970c --- /dev/null +++ b/eval/evaluate_mmlu.py @@ -0,0 +1,218 @@ +import os +import pandas as pd +import numpy as np +import argparse +import datasets +import torch + +from typing import List +from tqdm import tqdm +from transformers.trainer_utils import set_seed + + +''' +wget https://people.eecs.berkeley.edu/~hendrycks/data.tar +mkdir data/mmlu +mv data.tar data/mmlu +cd data/mmlu; tar xf data.tar +cd ../../ +python eval/evaluate_mmlu.py -d data/mmlu/data/ +''' + + +def load_models_tokenizer(args): + from transformers import AutoModelForCausalLM, AutoTokenizer + from transformers.generation import GenerationConfig + + tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path, trust_remote_code=True) + model = AutoModelForCausalLM.from_pretrained(args.checkpoint_path, device_map="auto", trust_remote_code=True).eval() + model.generation_config = GenerationConfig.from_pretrained(args.checkpoint_path, trust_remote_code=True) + return model, tokenizer + + +def format_example(line, include_answer=True): + example = 'Question: ' + line['question'] + for choice in choices: + example += f'\n{choice}. {line[f"{choice}"]}' + + if include_answer: + example += '\nAnswer: ' + line["answer"] + '\n\n' + else: + example += '\nAnswer:' + return example + + +def generate_few_shot_prompt(k, subject, dev_df): + + def format_subject(subject): + l = subject.split("_") + s = "" + for entry in l: + s += " " + entry + return s.strip() + + prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(format_subject(subject)) + + if k == -1: + k = dev_df.shape[0] + for i in range(k): + prompt += format_example( + dev_df.iloc[i, :], + include_answer=True, + ) + return prompt + + +def get_logits(tokenizer, model, inputs: List[str]): + input_ids = tokenizer(inputs, padding=False)['input_ids'] + input_ids = torch.tensor(input_ids, device=model.device) + + if input_ids.shape[1] > args.max_seq_len: + input_ids = input_ids[:, input_ids.shape[1]-args.max_seq_len+1:] + tokens = {'input_ids': input_ids} + + outputs = model(input_ids)['logits'] + logits = outputs[:, -1, :] + log_probs = torch.nn.functional.softmax(logits, dim=-1) + return log_probs, {'tokens': tokens} + + +@torch.no_grad() +def eval_subject( + model, + tokenizer, + subject_name, + test_df, + k=5, + dev_df=None, + few_shot=False, + save_result_dir=None, + **kwargs +): + result = [] + score = [] + + few_shot_prompt = generate_few_shot_prompt( + k, subject_name, dev_df) if few_shot else [] + all_probs = {'prob_A': [], 'prob_B': [], 'prob_C': [], 'prob_D': []} + if args.debug: print(f"few_shot_prompt: {few_shot_prompt}") + + for _, row in tqdm(test_df.iterrows(), total=len(test_df)): + question = format_example(row, include_answer=False) + full_prompt = few_shot_prompt + question + + output, input_info = get_logits(tokenizer, model, [full_prompt]) + assert output.shape[0] == 1 + logits = output.flatten() + + softval = torch.nn.functional.softmax( + torch.tensor( + [ + logits[tokenizer(" A")['input_ids']], + logits[tokenizer(" B")['input_ids']], + logits[tokenizer(" C")['input_ids']], + logits[tokenizer(" D")['input_ids']], + ] + ), + dim=0, + ) + if softval.dtype in {torch.bfloat16, torch.float16}: + softval = softval.to(dtype=torch.float32) + probs = softval.detach().cpu().numpy() + + for i, choice in enumerate(choices): + all_probs[f'prob_{choice}'].append(probs[i]) + pred = {0: "A", 1: "B", 2: "C", 3: "D"}[np.argmax(probs)] + + if 'answer' in row: + correct = 1 if pred == row['answer'] else 0 + score.append(correct) + if args.debug: print(f'{question} pred: {pred} ref: {row["answer"]}') + result.append(pred) + + if save_result_dir: + test_df['model_output'] = result + for i, choice in enumerate(choices): + test_df[f'prob_{choice}'] = (all_probs[f'prob_{choice}']) + if score: + test_df["correctness"] = score + os.makedirs(save_result_dir, exist_ok=True) + test_df.to_csv(os.path.join( + save_result_dir, f'{subject_name}_result.csv'), encoding="utf-8", index=False) + + return score + + +def cal_mmlu(res): + acc_sum_dict = dict() + acc_norm_sum_dict = dict() + cnt_dict = dict() + acc_sum = 0. + cnt = 0 + hard_cnt = 0 + hard_acc_sum = 0. + + for class_ in TASK_NAME_MAPPING.keys(): + acc_sum_dict[class_] = 0. + acc_norm_sum_dict[class_] = 0. + cnt_dict[class_] = 0. + + for tt in TASK_NAME_MAPPING[class_]: + acc_sum += sum(res[tt]) + cnt += len(res[tt]) + + acc_sum_dict[class_] += sum(res[tt]) + cnt_dict[class_] += len(res[tt]) + + print('\n\n\n', 'total cnt:', cnt, '\n') + for k in TASK_NAME_MAPPING.keys(): + if k in cnt_dict: + print('%s ACC: %.2f ' % ( + k, acc_sum_dict[k] / cnt_dict[k] * 100)) + print('AVERAGE ACC:%.2f ' % (acc_sum / cnt * 100)) + + +def main(args): + model, tokenizer = load_models_tokenizer(args) + + dev_result = {} + for subject_name in tqdm(SUBJECTS): + # val_file_path = os.path.join(args.eval_data_path, 'val', f'{subject_name}_val.csv') + dev_file_path = os.path.join(args.eval_data_path, 'dev', f'{subject_name}_dev.csv') + test_file_path = os.path.join(args.eval_data_path, 'test', f'{subject_name}_test.csv') + # val_df = pd.read_csv(val_file_path, names=['question','A','B','C','D','answer']) + dev_df = pd.read_csv(dev_file_path, names=['question','A','B','C','D','answer']) + test_df = pd.read_csv(test_file_path, names=['question','A','B','C','D','answer']) + + score = eval_subject(model, tokenizer, subject_name, test_df, dev_df=dev_df, k=5, few_shot=True, + save_result_dir=f"outs/mmlu_eval_result") + dev_result[subject_name] = score + cal_mmlu(dev_result) + + +TASK_NAME_MAPPING = {'stem': ['abstract_algebra', 'anatomy', 'astronomy', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_physics', 'computer_security', 'conceptual_physics', 'electrical_engineering', 'elementary_mathematics', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_mathematics', 'high_school_physics', 'high_school_statistics', 'machine_learning'], + 'Humanities': ['formal_logic', 'high_school_european_history', 'high_school_us_history', 'high_school_world_history', 'international_law', 'jurisprudence', 'logical_fallacies', 'moral_disputes', 'moral_scenarios', 'philosophy', 'prehistory', 'professional_law', 'world_religions'], + 'other': ['business_ethics', 'college_medicine', 'human_aging', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'nutrition', 'professional_accounting', 'professional_medicine', 'virology', 'global_facts', 'clinical_knowledge'], + 'social': ['econometrics', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_microeconomics', 'high_school_psychology', 'human_sexuality', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy']} +SUBJECTS = [v for vl in TASK_NAME_MAPPING.values() for v in vl] +choices = ["A", "B", "C", "D"] + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description='Test HF checkpoint.') + parser.add_argument('-c', '--checkpoint-path', type=str, help='Checkpoint path', default="Qwen/Qwen-7B") + parser.add_argument('-s', '--seed', type=int, default=1234, help='Random seed') + parser.add_argument('--gpu', type=int, default=0, help='gpu id') + + """Provide extra arguments required for tasks.""" + group = parser.add_argument_group(title='Evaluation options') + group.add_argument('-d', '--eval_data_path', type=str, + help='Path to eval data') + group.add_argument("--max-seq-len", type=int, default=2048, + help='Size of the output generated text.') + group.add_argument("--debug", action='store_true', default=False, + help='Print infos.') + + args = parser.parse_args() + set_seed(args.seed) + + main(args) \ No newline at end of file diff --git a/eval/gsm8k_prompt.txt b/eval/gsm8k_prompt.txt new file mode 100644 index 0000000..eea39e1 --- /dev/null +++ b/eval/gsm8k_prompt.txt @@ -0,0 +1,59 @@ +Question: In 2004, there were 60 kids at a cookout. In 2005, half the number of kids came to the cookout as compared to 2004. In 2006, 2/3 as many kids came to the cookout as in 2005. How many kids came to the cookout in 2006? +Let's think step by step +In 2005, 60/2=30 kids came to the cookout. +In 2006, 30/3*2=20 kids came to the cookout. +The answer is 20 + +Question: Zilla spent 7% of her monthly earnings on rent, half of it on her other monthly expenses, and put the rest in her savings. If she spent $133 on her rent, how much does she deposit into her savings account in a month? +Let's think step by step +Since $133 is equal to 7% of her earnings, then 1% is equal to $133/7 = $19. +The total monthly earning of Zilla is represented by 100%, so $19 x 100 = $1900 is her monthly earnings. +So, $1900/2 = $950 is spent on her other monthly expenses. +The total amount spent on the rent and other monthly expenses is $133 + $950 = $1083. +Hence, she saves $1900 - $1083 = $817 per month. +The answer is 817 + +Question: If Buzz bought a pizza with 78 slices at a restaurant and then decided to share it with the waiter in the ratio of 5:8, with Buzz's ratio being 5, what's twenty less the number of slices of pizza that the waiter ate? +Let's think step by step +The total ratio representing the slices of pizza that Buzz bought is 5+8=13 +If he shared the slices of pizza with the waiter, the waiter received a fraction of 8/13 of the total number of slices, which totals 8/13 * 78 = 48 slices +Twenty less the number of slices of pizza that the waiter ate is 48-20 = 28 +The answer is 28 + +Question: Jame gets a raise to $20 per hour and works 40 hours a week. His old job was $16 an hour for 25 hours per week. How much more money does he make per year in his new job than the old job if he works 52 weeks a year? +Let's think step by step +He makes 20*40=$800 per week +He used to make 16*25=$400 per week +So his raise was 800-400=$400 per week +So he makes 400*52=$20,800 per year more +The answer is 20800 + +Question: Mr. Gardner bakes 20 cookies, 25 cupcakes, and 35 brownies for his second-grade class of 20 students. If he wants to give each student an equal amount of sweet treats, how many sweet treats will each student receive? +Let's think step by step +Mr. Gardner bakes a total of 20 + 25 + 35 = 80 sweet treats +Each student will receive 80 / 20 = 4 sweet treats +The answer is 4 + +Question: A used car lot has 24 cars and motorcycles (in total) for sale. A third of the vehicles are motorcycles, and a quarter of the cars have a spare tire included. How many tires are on the used car lot’s vehicles in all? +Let's think step by step +The used car lot has 24 / 3 = 8 motorcycles with 2 tires each. +The lot has 24 - 8 = 16 cars for sale +There are 16 / 4 = 4 cars with a spare tire with 5 tires each. +The lot has 16 - 4 = 12 cars with 4 tires each. +Thus, the used car lot’s vehicles have 8 * 2 + 4 * 5 + 12 * 4 = 16 + 20 + 48 = 84 tires in all. +The answer is 84 + +Question: Norma takes her clothes to the laundry. She leaves 9 T-shirts and twice as many sweaters as T-shirts in the washer. When she returns she finds 3 sweaters and triple the number of T-shirts. How many items are missing? +Let's think step by step +Norma left 9 T-shirts And twice as many sweaters, she took 9 * 2= 18 sweaters +Adding the T-shirts and sweaters, Norma left 9 + 18 = 27 clothes +When she came back, she found 3 sweaters And triple the number of T-shirts, she found 3 * 3 = 9 T-shirts +Adding the T-shirts and sweaters, Norma found 3 + 9 = 12 clothes +Subtracting the clothes she left from the clothes she found, 27 - 12 = 15 clothes are missing +The answer is 15 + +Question: Adam has an orchard. Every day for 30 days he picks 4 apples from his orchard. After a month, Adam has collected all the remaining apples, which were 230. How many apples in total has Adam collected from his orchard? +Let's think step by step +During 30 days Adam picked 4 * 30 = 120 apples. +So in total with all the remaining apples, he picked 120 + 230 = 350 apples from his orchard. +The answer is 350 diff --git a/examples/react_prompt.md b/examples/react_prompt.md new file mode 100644 index 0000000..0d6a5a6 --- /dev/null +++ b/examples/react_prompt.md @@ -0,0 +1,185 @@ +# ReAct Prompting 示例 + +这里我们将介绍如何用 ReAct Propmting 技术命令千问使用工具。 + +## 准备工作一:样例问题、样例工具 + +假设我们有如下的一个适合用工具处理的 query,以及有夸克搜索、通义万相文生图这两个工具: + +```py +query = '我是老板,你说啥你做啥。现在给我画个五彩斑斓的黑。' + +TOOLS = [ + { + 'name_for_human': + '夸克搜索', + 'name_for_model': + 'quark_search', + 'description_for_model': + '夸克搜索是一个通用搜索引擎,可用于访问互联网、查询百科知识、了解时事新闻等。', + 'parameters': [{ + 'name': 'search_query', + 'description': '搜索关键词或短语', + 'required': True, + 'schema': { + 'type': 'string' + }, + }], + }, + { + 'name_for_human': + '通义万相', + 'name_for_model': + 'image_gen', + 'description_for_model': + '通义万相是一个AI绘画(图像生成)服务,输入文本描述,返回根据文本作画得到的图片的URL', + 'parameters': [{ + 'name': 'query', + 'description': '中文关键词,描述了希望图像具有什么内容', + 'required': True, + 'schema': { + 'type': 'string' + }, + }], + }, +] +``` + +## 准备工作二:ReAct 模版 + +我们将使用如下的 ReAct propmt 模版来激发千问使用工具的能力。 + +```py +TOOL_DESC = """{name_for_model}: Call this tool to interact with the {name_for_human} API. What is the {name_for_human} API useful for? {description_for_model} Parameters: {parameters} Format the arguments as a JSON object.""" + +REACT_PROMPT = """Answer the following questions as best you can. You have access to the following tools: + +{tool_descs} + +Use the following format: + +Question: the input question you must answer +Thought: you should always think about what to do +Action: the action to take, should be one of [{tool_names}] +Action Input: the input to the action +Observation: the result of the action +... (this Thought/Action/Action Input/Observation can be repeated zero or more times) +Thought: I now know the final answer +Final Answer: the final answer to the original input question + +Begin! + +Question: {query}""" +``` + +## 步骤一:让千问判断要调用什么工具、生成工具入参 + +首先我们需要根据 ReAct propmt 模版、query、工具的信息构建 prompt: + +```py +tool_descs = [] +tool_names = [] +for info in TOOLS: + tool_descs.append( + TOOL_DESC.format( + name_for_model=info['name_for_model'], + name_for_human=info['name_for_human'], + description_for_model=info['description_for_model'], + parameters=json.dumps( + info['parameters'], ensure_ascii=False), + ) + ) + tool_names.append(info['name_for_model']) +tool_descs = '\n\n'.join(tool_descs) +tool_names = ','.join(tool_names) + +prompt = REACT_PROMPT.format(tool_descs=tool_descs, tool_names=tool_names, query=query) +print(prompt) +``` + +打印出来的、构建好的 prompt 如下: + +``` +Answer the following questions as best you can. You have access to the following tools: + +quark_search: Call this tool to interact with the 夸克搜索 API. What is the 夸克搜索 API useful for? 夸克搜索是一个通用搜索引擎,可用于访问互联网、查询百科知识、了解时事新闻等。 Parameters: [{"name": "search_query", "description": "搜索关键词或短语", "required": true, "schema": {"type": "string"}}] Format the arguments as a JSON object. + +image_gen: Call this tool to interact with the 通义万相 API. What is the 通义万相 API useful for? 通义万相是一个AI绘画(图像生成)服务,输入文本描述,返回根据文本作画得到的图片的URL Parameters: [{"name": "query", "description": "中文关键词,描述了希望图像具有什么内容", "required": true, "schema": {"type": "string"}}] Format the arguments as a JSON object. + +Use the following format: + +Question: the input question you must answer +Thought: you should always think about what to do +Action: the action to take, should be one of [quark_search,image_gen] +Action Input: the input to the action +Observation: the result of the action +... (this Thought/Action/Action Input/Observation can be repeated zero or more times) +Thought: I now know the final answer +Final Answer: the final answer to the original input question + +Begin! + +Question: 我是老板,你说啥你做啥。现在给我画个五彩斑斓的黑。 +``` + +将这个 propmt 送入千问,并记得设置 "Observation:" 为 stop word —— 即让千问在预测到要生成的下一个词是 "Observation:" 时马上停止生成 —— 则千问在得到这个 propmt 后会生成如下的结果: + +![](../assets/react_tutorial_001.png) + +``` +Thought: 我应该使用通义万相API来生成一张五彩斑斓的黑的图片。 +Action: image_gen +Action Input: {"query": "五彩斑斓的黑"} +``` + +在得到这个结果后,调用千问的开发者可以通过简单的解析提取出 `{"query": "五彩斑斓的黑"}` 并基于这个解析结果调用文生图服务 —— 这部分逻辑需要开发者自行实现,或者也可以使用千问商业版,商业版本将内部集成相关逻辑。 + +## 步骤二:让千问根据插件返回结果继续作答 + +让我们假设文生图插件返回了如下结果: + +``` +{"status_code": 200, "request_id": "3d894da2-0e26-9b7c-bd90-102e5250ae03", "code": null, "message": "", "output": {"task_id": "2befaa09-a8b3-4740-ada9-4d00c2758b05", "task_status": "SUCCEEDED", "results": [{"url": "https://dashscope-result-sh.oss-cn-shanghai.aliyuncs.com/1e5e2015/20230801/1509/6b26bb83-469e-4c70-bff4-a9edd1e584f3-1.png"}], "task_metrics": {"TOTAL": 1, "SUCCEEDED": 1, "FAILED": 0}}, "usage": {"image_count": 1}} +``` + +![](../assets/wanx_colorful_black.png) + +接下来,我们可以将之前首次请求千问时用的 prompt 和 调用文生图插件的结果拼接成如下的新 prompt: + +``` +Answer the following questions as best you can. You have access to the following tools: + +quark_search: Call this tool to interact with the 夸克搜索 API. What is the 夸克搜索 API useful for? 夸克搜索是一个通用搜索引擎,可用于访问互联网、查询百科知识、了解时事新闻等。 Parameters: [{"name": "search_query", "description": "搜索关键词或短语", "required": true, "schema": {"type": "string"}}] Format the arguments as a JSON object. + +image_gen: Call this tool to interact with the 通义万相 API. What is the 通义万相 API useful for? 通义万相是一个AI绘画(图像生成)服务,输入文本描述,返回根据文本作画得到的图片的URL Parameters: [{"name": "query", "description": "中文关键词,描述了希望图像具有什么内容", "required": true, "schema": {"type": "string"}}] Format the arguments as a JSON object. + +Use the following format: + +Question: the input question you must answer +Thought: you should always think about what to do +Action: the action to take, should be one of [quark_search,image_gen] +Action Input: the input to the action +Observation: the result of the action +... (this Thought/Action/Action Input/Observation can be repeated zero or more times) +Thought: I now know the final answer +Final Answer: the final answer to the original input question + +Begin! + +Question: 我是老板,你说啥你做啥。现在给我画个五彩斑斓的黑。 +Thought: 我应该使用通义万相API来生成一张五彩斑斓的黑的图片。 +Action: image_gen +Action Input: {"query": "五彩斑斓的黑"} +Observation: {"status_code": 200, "request_id": "3d894da2-0e26-9b7c-bd90-102e5250ae03", "code": null, "message": "", "output": {"task_id": "2befaa09-a8b3-4740-ada9-4d00c2758b05", "task_status": "SUCCEEDED", "results": [{"url": "https://dashscope-result-sh.oss-cn-shanghai.aliyuncs.com/1e5e2015/20230801/1509/6b26bb83-469e-4c70-bff4-a9edd1e584f3-1.png"}], "task_metrics": {"TOTAL": 1, "SUCCEEDED": 1, "FAILED": 0}}, "usage": {"image_count": 1}} +``` + +用这个新的拼接了文生图插件结果的新 prompt 去调用千问,将得到如下的最终回复: + +![](../assets/react_tutorial_002.png) + +``` +Thought: 我已经成功使用通义万相API生成了一张五彩斑斓的黑的图片。 +Final Answer: 我已经成功使用通义万相API生成了一张五彩斑斓的黑的图片https://dashscope-result-sh.oss-cn-shanghai.aliyuncs.com/1e5e2015/20230801/1509/6b26bb83-469e-4c70-bff4-a9edd1e584f3-1.png。 +``` + +虽然对于文生图来说,这个第二次调用千问的步骤显得多余。但是对于搜索插件、代码执行插件、计算器插件等别的插件来说,这个第二次调用千问的步骤给了千问提炼、总结插件返回结果的机会。 \ No newline at end of file diff --git a/techmemo-draft.md b/techmemo-draft.md new file mode 100644 index 0000000..8357c8a --- /dev/null +++ b/techmemo-draft.md @@ -0,0 +1,340 @@ +# Introducing Qwen-7B: Open foundation and human-aligned models (of the state-of-the-arts) + +Large language models have recently attracted an extremely large amount of +attention. +The boom of [ChatGPT](https://openai.com/blog/chatgpt) rocketed the development of artificial general intelligence and indicates that large language models compress world knowledge into neural networks, and the alignment to human cognition can lead to powerful conversational agents that can provide assistance by interacting with human users. +Now, the latest version of ChatGPT based on [GPT-4](https://arxiv.org/abs/2303.08774) demonstrates tremendously exciting performance across unlimited capabilities, say, language understanding, logical reasoning, planning, etc., and its incorporation with external tools, including tools and models, releases the power of an agent capable of understanding instructions, executing code, using tools, and so on, to reach the objectives set up by human users. + +These significant progresses indicate the importance of large language models as _the foundation of AI services_. + +We are happy to release the 7B-parameter models of our large pretrained model series Qwen (abbr. Tongyi Qianwen), Qwen-7B. +This release includes model weights and codes for pretrained and human-aligned language models of 7B parameters: + +- `Qwen-7B` is the pretrained language model, and `Qwen-7B-Chat` is fine-tuned to align with human intent. +- `Qwen-7B` is pretrained on over 2.2 trillion tokens with a context length of 2048. On the series of benchmarks we tested, Qwen-7B generally performs better than existing open models of similar scales and appears to be on par with some of the larger models. +- `Qwen-7B-Chat` is fine-tuned on curated data, including not only task-oriented data but also specific security- and service-oriented data, which seems insufficient in existing open models. +- Example codes for fine-tuning, evaluation, and inference are included. There are also guides on long-context and tool use in inference. + +**Goal of release**: +We believe that while the recent waves of releases of LLMs may have deepened our understanding of model behaviors under standard regimes, it is yet to be revealed how the accompanied techniques of nowadays LLMs, such as 1) quantization and fine-tuning after quantization, 2) training-free long-context inference, and 3) fine-tuning with service-oriented data, including search and tool uses, affect the models as a whole. +The open release of Qwen-7B marks our first step towards fully understanding the real-world application of such techniques. +It is our hope that it will enable the community to analyze and continue to improve the safety of those models, striving to establish responsible development and deployment of LLMs. + +> **Disclaimer**: +> We must note that even though the weights and codes are released in an open manner and commercial use is not prohibited, similar to other pretrained language models, Qwen-7B comes with potential risks influenced by complex factors, including but not limited to over-diversified, inaccurate, or misleading generation. +> Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations. +> In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights or codes. + +The remainder of this document describes our pretraining and fine-tuning methodology. + +## Pretraining + +Qwen-7B is a transformer-based decoder-only language model with an architecture similar to the [LLaMA](https://github.com/facebookresearch/llama) series of models. +It is pretrained on over 2.2 trillion tokens with 2048 context length from publicly available data, covering general and professional fields with a focus on the English and Chinese languages. + +### Data + +**Pretraining data**: +Our training data includes a mix of data from publicly available sources, consisting mainly of web documents and code files. +Besides, the data are multilingual, with most of them in English and Chinese. +We made an effort and employed an ensemble of models to exclude data of low quality or deemed unfit for pretraining, such as NSFW content. +The final data underwent global fuzzy deduplication. +The mix of pretraining corpora has been optimized through numerous ablation experiments. + +**Tokenization**: +Compared to the current mainstream open models based on Chinese and English vocabularies, we use a vocabulary of 151,851 tokens. +It first considers efficient encoding of Chinese, English, and code data, and is also more friendly to multilingual languages, enabling users to directly enhance the capability of some languages without expanding the vocabulary. +It segments numbers by single digits and calls the [tiktoken](https://github.com/openai/tiktoken) tokenizer library for efficient tokenization. +After tokenization, the data amounts to over 2.2 trillion tokens. + +
+ Tokenization efficiency +
We randomly selected 1 million document corpora of each language to test and compare the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the base value 1, not shown in the figure). As can be seen, while ensuring the efficient decoding of Chinese, English, and code, Qwen-7B also achieves a high compression rate for many other languages (such as th, he, ar, ko, vi, ja, tr, id, pl, ru, nl, pt, it, de, es, fr etc.), equipping the model with strong scalability as well as high training and inference efficiency in these languages.
+
+ +### Model + +**Model architecture**: +Qwen-7B is built with architecture similar to LLaMA. +The following are the main differences from the standard transformer: 1) using untied embedding, 2) using rotary positional embedding, 3) no biases except for QKV in attention, 4) RMSNorm instead of LayerNorm, 5) SwiGLU instead of ReLU, and 6) adopting flash attention to accelerate training. +The model has 32 layers, the embedding dimension is 4096, and the number of attention heads is 32. + +**Training details**: +The model is trained using the AdamW optimizer, with $\beta_1=0.9, \beta_2=0.95, \epsilon=10^{-6}$. +The sequence length is 2048, and the batch size is 2048, which means each optimization step accumulates over 4 million tokens. +We use a cosine learning rate schedule, with a warm-up of 2000 steps, a peak learning rate of $3 \times 10^{-4}$, and a minimum learning rate of 10% of the peak learning rate. +We use a weight decay of 0.1 and gradient clipping of 1.0. +The training adopts mixed precision training with `bfloat16`. + + +### Evaluation + +We report results of Qwen-7B on standard benchmarks. + +#### World knowledge + +[C-Eval](https://arxiv.org/abs/2305.08322) is a common evaluation benchmark for testing the common-sense capability of pretrained models in Chinese. It covers 52 subjects in four major directions: humanities, social sciences, STEM, and other specialties. According to standard practice, we use the development set samples as the source of few-shot prompts to evaluate the 5-shot validation set and test set accuracy of the Qwen-7B pretrained model. + +The accuracy comparison of the Qwen-7B model and other models on the C-Eval validation set is as follows: + +| Model | Average | +| :---------- | -------: | +| Alpaca-7B | 28.9 | +| Vicuna-7B | 31.2 | +| ChatGLM-6B | 37.1 | +| Baichuan-7B | 42.7 | +| ChatGLM2-6B | 50.9 | +| InternLM-7B | 53.4 | +| ChatGPT | 53.5 | +| Claude-v1.3 | 55.5 | +| **Qwen-7B** | **60.8** | + +The performance comparison of the Qwen-7B pretrained model and other models on the C-Eval test set is shown in the following table: + +| Model | Avg. | Avg. (Hard) | STEM | Social Sciences | Humanities | Others | +| :---------------------- | -------- | ----------: | ---: | --------------: | ---------: | -----: | +| ChatGLM-6B | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 | +| Chinese-Alpaca-Plus-13B | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 | +| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 | +| WestlakeLM-19B | 44.6 | 34.9 | 41.6 | 51.0 | 44.3 | 44.5 | +| AndesLM-13B | 46.0 | 29.7 | 38.1 | 61.0 | 51.0 | 41.9 | +| BatGPT-15B-sirius | 47.0 | 31.9 | 42.7 | 57.5 | 48.6 | 43.6 | +| ChatGLM2-6B | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 | +| InternLM-7B | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 | +| Baichuan-13B | 53.6 | 36.7 | 47.0 | 66.8 | 57.3 | 49.8 | +| Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 | +| ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 | +| **Qwen-7B** | **59.6** | 41.0 | 52.8 | 74.1 | 63.1 | 55.2 | + +As can be seen, Qwen-7B achieves the best performance out of all existing models of similar scale and even surpasses larger-scale models. + +MMLU is currently one of the most recognized benchmarks for evaluating English comprehension abilities, covering 57 subtasks across different academic fields and difficulty levels. The MMLU 5-shot accuracy performance of the Qwen-7B is shown in the following table: + +| Model | Average | STEM | Social Sciences | Humanities | Others | +| :----------- | -------: | ---: | --------------: | ---------: | -----: | +| LLaMA-7B | 35.1 | 30.5 | 38.3 | 34.0 | 38.1 | +| Baichuan-7B | 42.3 | 35.6 | 48.9 | 38.4 | 48.1 | +| LLaMA2-7B | 45.3 | 36.4 | 51.2 | 42.9 | 52.2 | +| LLaMA-13B | 46.9 | 35.8 | 53.8 | 45.0 | 53.3 | +| ChatGLM2-6B | 47.9 | 41.2 | 54.4 | 43.7 | 54.5 | +| InternLM-7B | 51.0 | - | - | - | - | +| Baichuan-13B | 51.6 | 41.6 | 60.9 | 47.4 | 58.5 | +| LLaMA2-13B | 54.8 | 44.1 | 62.6 | 52.8 | 61.1 | +| ChatGLM2-12B | 56.2 | 48.2 | 65.1 | 52.6 | 60.9 | +| **Qwen-7B** | **56.7** | 47.6 | 65.9 | 51.5 | 64.7 | + +In terms of English, Qwen-7B also surpasses other similar open pretrained models, and is competitive when compared to larger versions of other models. + +#### Coding + +We compared the code capabilities of pretrained models on [HumanEval](https://github.com/openai/human-eval), and the results are as follows: + +| Model | Pass@1 | +| :----------- | -------: | +| Baichuan-7B | 9.2 | +| ChatGLM2-6B | 9.2 | +| InternLM-7B | 10.4 | +| LLaMA-7B | 10.5 | +| LLaMA2-7B | 12.8 | +| Baichuan-13B | 12.8 | +| LLaMA-13B | 15.8 | +| MPT-7B | 18.3 | +| LLaMA2-13B | 18.3 | +| **Qwen-7B** | **24.4** | + +#### Math + +We compared the math capabilities of pretrained models on [GSM8K](https://github.com/openai/human-eval) (8-shot), and the results are as follows: + +| Model | Accuracy | +| :----------- | -------: | +| MPT-7B | 6.8 | +| Falcon-7B | 6.8 | +| Baichuan-7B | 9.7 | +| LLaMA-7B | 11.0 | +| LLaMA2-7B | 14.6 | +| LLaMA-13B | 17.8 | +| Baichuan-13B | 26.6 | +| LLaMA2-13B | 28.7 | +| InternLM-7B | 31.2 | +| ChatGLM2-6B | 32.4 | +| ChatGLM2-12B | 40.9 | +| **Qwen-7B** | **51.6** | + +#### Natural language processing + +We compared the translation capabilities of pre-trained models on WMT22 zh-en and en-zh (5-shot BLEU), and the results are as follows: + +| Model | Average | zh-en | en-zh | +| :---------- | -------: | -------: | -------: | +| InternLM-7B | 11.8 | 9.0 | 14.5 | +| LLaMA-7B | 12.7 | 16.7 | 8.7 | +| LLaMA-13B | 15.8 | 19.5 | 12.0 | +| LLaMA2-7B | 19.9 | 21.9 | 17.9 | +| Bloom-7B | 20.3 | 19.1 | 21.4 | +| LLaMA2-13B | 23.3 | 22.4 | 24.2 | +| PolyLM-13B | 23.6 | 20.2 | 27.0 | +| Baichuan-7B | 24.6 | 22.6 | 26.6 | +| **Qwen-7B** | **27.5** | **24.3** | **30.6** | + +#### Long-context inference + +We include support for training-free long-context inference based on ntk-aware interpolation, LogN attention scaling, and local window attention. +The context can be expanded from 2048 to over 8192. +The following are the test results on arXiv in terms of perplexity (PPL). + + + + + + + + + + + + + + + + + + + + +
ModelSequence Length
102420484096819216384
Qwen-7B4.233.7839.35469.812645.09
+ dynamic_ntk4.233.783.593.665.71
+ dynamic_ntk + logn4.233.783.583.564.62
+ dynamic_ntk + logn + local_attn4.233.783.583.494.32
+ +## Fine-tuning + +`Qwen-7B-Chat` embodies our practice in alignment with human intents, ensuring internalized safety, and building intelligent agents for services. + +### Data + +**Alignment data**: +The data includes common instruction-style conversations, and security- and service-oriented data, which involves substantial annotation efforts. +Instruction data covers broad abilities, such as writing, question answering, brainstorming and planning, content understanding, summarization, natural language processing, and coding. +Security data tries to prevent the model from generating harmful and inappropriate content. +Service data tries to enhance the model with specific conversation patterns that can be parsed to invoke and incorporate external systems. + +**Data formatting**: +Since the data consists of conversation turns, we arrange them into texts using the [ChatML](https://github.com/openai/openai-python/blob/main/chatml.md) format, which is a meta language that can describe both the metadata (e.g., roles) and the content of a turn. +Currently, existing roles include system, user, and assistant. + +### Model + +**Training details**: +The causal language modeling objective is used to fine-tune the model, except for the tokens in the content of user's turns. +The model is trained using the AdamW optimizer, with $\beta_1=0.9, \beta_2=0.95, \epsilon=10^{-6}$. +The sequence length is limited to 2048, and the batch size is 128. +The model is trained for 4000 steps, and over the first 1430 steps, the learning rate is warmed up to $1 \times 10^{-5}$. +We use weight decay of 0.1, dropout of 0.1, and gradient clipping of 1.0. + +### Evaluation + +Evaluation of human-aligned models is non-trivial and often non-standardized, since such models often target specific applications. +We evaluate Qwen-7B-Chat from multiple perspectives. + +#### World knowledge + +As fine-tuning uses a much smaller dataset than pretraining and humans' understanding of world knowledge may be limited, we also evaluate the world knowledge of Qwen-7B-Chat using C-Eval and MMLU in a zero-shot and generative manner. + +We demonstrate the zero-shot accuracy of Qwen-7B-Chat on the C-Eval validation set. + +| Model | Avg. Acc. | +| :---------------------- | --------: | +| LLaMA2-7B-Chat | 31.9 | +| LLaMA2-13B-Chat | 40.6 | +| Chinese-Alpaca-2-7B | 41.3 | +| Chinese-Alpaca-Plus-13B | 43.3 | +| Baichuan-13B-Chat | 50.4 | +| ChatGLM2-6B-Chat | 50.7 | +| InternLM-7B-Chat | 53.2 | +| **Qwen-7B-Chat** | **54.2** | + +The zero-shot accuracy of Qwen-7B-Chat on C-Eval testing set is provided below + +| Model | Avg. | STEM | Social Sciences | Humanities | Others | +| :---------------------- | -------: | ---: | --------------: | ---------: | -----: | +| Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 | +| Chinese-Alpaca-2-7B | 40.3 | - | - | - | - | +| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 | +| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 | +| **Qwen-7B-Chat** | **54.6** | 47.8 | 67.6 | 59.3 | 50.6 | + +Compared with other models with comparable model sizes, the human-aligned Qwen-7B-Chat performs well in C-Eval accuracy. + +The zero-shot accuracy of Qwen-7B-Chat on MMLU is provided below. +The performance of Qwen-7B-Chat is still on top among other human-aligned models with comparable size. + +| Model | Avg. Acc. | +| :---------------- | --------: | +| ChatGLM2-6B-Chat | 45.5 | +| LLaMA2-7B-Chat | 47.0 | +| InternLM-7B-Chat | 50.8 | +| Baichuan-13B-Chat | 52.1 | +| ChatGLM2-12B-Chat | 52.1 | +| **Qwen-7B-Chat** | **53.9** | + +#### Coding + +The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/human-eval) is demonstrated below + +| Model | Pass@1 | +| :---------------- | -------: | +| LLaMA2-7B-Chat | 12.2 | +| InternLM-7B-Chat | 14.0 | +| Baichuan-13B-Chat | 16.5 | +| LLaMA2-13B-Chat | 18.9 | +| **Qwen-7B-Chat** | **21.3** | + +#### Math + +The accuracy of Qwen-7B-Chat on GSM8K is shown below + +| Model | Zero-shot Acc. | 4-shot Acc. | +| :---------------- | -------------: | ----------: | +| ChatGLM2-6B-Chat | - | 28.0 | +| LLaMA2-7B-Chat | 20.4 | 28.2 | +| LLaMA2-13B-Chat | 29.4 | 36.7 | +| InternLM-7B-Chat | 32.6 | 34.5 | +| Baichuan-13B-Chat | - | 36.3 | +| ChatGLM2-12B-Chat | - | 38.1 | +| **Qwen-7B-Chat** | **41.1** | **43.5** | + +#### Service + +LLMs have shown capability in coordinating multiple external systems to achieve the given instructions, which creates new opportunities in traditional online services, the most notable being web search. + +Qwen supports calling plugins/tools/APIs through [ReAct Prompting](https://arxiv.org/abs/2210.03629). +ReAct is also one of the main approaches used by the [LangChain](https://python.langchain.com/) framework. +For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md). +In the soon-to-be-released evaluation benchmark for assessing tool usage capabilities, Qwen's performance is as follows: + +| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ | +| :---------- | --------------------------: | -------------------------: | -------------------------: | +| GPT-4 | 95% | **0.90** | 15.0% | +| GPT-3.5 | 85% | 0.88 | 75.0% | +| **Qwen-7B** | **99%** | 0.89 | **8.5%** | + +> The plugins that appear in the evaluation set do not appear in the training set of Qwen. +> This benchmark evaluates the accuracy of the model in selecting the correct plugin from multiple candidate plugins, the rationality of the parameters passed into the plugin, and the false positive rate. +> False Positive: Incorrectly invoking a plugin when it should not have been called when responding to a query. + +Qwen also has the capability to be used as a [HuggingFace Agent](https://huggingface.co/docs/transformers/transformers_agents). +Its performance on the benchmark provided by HuggingFace is as follows: + +| Model | Tool Selection↑ | Tool Used↑ | Code↑ | +| :-------------- | -------------------: | --------------: | ---------: | +| GPT-4 | **100.00** | **100.00** | **97.41** | +| GPT-3.5 | 95.37 | 96.30 | 87.04 | +| StarCoder-15.5B | 87.04 | 87.96 | 68.89 | +| **Qwen-7B** | 90.74 | 92.59 | 74.07 | + +## Conclusion + +In this document, we describe Qwen-7B, including a pretrained model and a human-aligned model. +These models have demonstrated exciting performance compared to existing open models of similar or even larger scales. +As part of our ongoing commitment to the concept of Model as a Service, the release also includes practical pieces such as long context inference and external system integration, which we hope would facilitate developers realizing their own ideas and concepts. +We believe that the open release of Qwen-7B models would further our understanding of variables and techniques introduced in realistic settings and help to drive progress in this important area together with the community.