release latest models

main
yangapku 1 year ago
parent fb3180d8f0
commit fc57dea277

@ -4,7 +4,7 @@
#### Failure in installing flash attention
Flash attention is an option for accelerating training and inference. Only NVIDIA GPUs of Turing, Ampere, Ada, and Hopper architecture, e.g., H100, A100, RTX 3090, T4, RTX 2080, can support flash attention. **You can use our models without installing it.**
#### Which version of transformers should I use?
@ -20,7 +20,7 @@ This is the merge file of the tokenizer. You have to download it. Note that if y
#### transformers_stream_generator/tiktoken/accelerate not found
Run the command `pip install -r requirements.txt`. You can find the file at [https://github.com/QwenLM/Qwen/blob/main/requirements.txt](https://github.com/QwenLM/Qwen/blob/main/requirements.txt).
<br><br>
@ -32,7 +32,6 @@ Run the command `pip install -r requirements.txt`. You can find the file at [htt
Yes, see `web_demo.py` for the web demo and `cli_demo.py` for the CLI demo. See the README for more information.
#### Can I use CPU only?
Yes, running `python cli_demo.py --cpu-only` will load the model and run inference on CPU only.
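If you are writing your own script instead of using `cli_demo.py`, the same effect can be obtained by loading the model on CPU (a minimal sketch reusing the `device_map="cpu"` loading option that appears in the README; expect much slower generation than on GPU):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# Keep all weights on CPU; generation will be noticeably slower than on GPU.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True
).eval()
response, _ = model.chat(tokenizer, "Hi", history=None)
print(response)
```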
@ -47,19 +46,16 @@ This is because tokens represent bytes and a single token may be a meaningless s
#### It seems that the generation is not related to the instruction...
Please check if you are loading Qwen-Chat instead of Qwen. Qwen is the base model without alignment, which behaves differently from the SFT/Chat model.
#### Is quantization supported?
Yes, quantization is supported via AutoGPTQ.
#### Slow when processing long sequences
Updating the code to the latest version can help.
#### Unsatisfactory performance in processing long sequences
@ -72,7 +68,9 @@ Please ensure that NTK is applied. `use_dynamc_ntk` and `use_logn_attn` in `conf
#### Can Qwen support SFT or even RLHF?
Yes, we now support SFT, including full-parameter finetuning, LoRA, and Q-LoRA. You can also check out other projects such as [FastChat](https://github.com/lm-sys/FastChat), [Firefly](https://github.com/yangjianxin1/Firefly), [LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning), etc. A generic LoRA sketch is shown below.
However, we do not support RLHF for now. We will provide the code in the near future.
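If you want to experiment before reading the finetuning tutorial, a generic LoRA setup with the PEFT library looks roughly like this. This is a sketch, not the repo's own finetuning script, and the `c_attn` target module name is an assumption about Qwen's attention-layer naming:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", device_map="auto", trust_remote_code=True
)
# Generic LoRA configuration; adjust rank/alpha/dropout for your data.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # assumed Qwen attention projection name
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # then train with your usual Trainer / training loop
```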
<br><br>

@ -20,7 +20,7 @@ Flash attention は、トレーニングと推論を加速するオプション
#### transformers_stream_generator/tiktoken/accelerate が見つかりません。
コマンド `pip install -r requirements.txt` を実行してください。このファイルは [https://github.com/QwenLM/Qwen/blob/main/requirements.txt](https://github.com/QwenLM/Qwen/blob/main/requirements.txt) にあります。
<br><br>
@ -47,19 +47,16 @@ Flash attention は、トレーニングと推論を加速するオプション
#### インストラクションとは関係ないようですが...
Qwen ではなく Qwen-Chat を読み込んでいるかご確認ください。Qwen はアライメントなしのベースモデルで、SFT/Chat モデルとは挙動が異なります。
#### 量子化はサポートされていますか?
はい、量子化は AutoGPTQ でサポートされています。
#### 長いシーケンスの処理に時間がかかる
コードを最新版に更新することで解決します。
#### 長いシーケンスの処理で不満足なパフォーマンス
@ -72,7 +69,7 @@ NTK が適用されていることを確認してください。`config.json`
#### Qwen は SFT、あるいは RLHF に対応できますか?
SFT のコードは提供しています。また、[FastChat](https://github.com/lm-sys/FastChat)、[Firefly](https://github.com/yangjianxin1/Firefly)、[LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning) など、いくつかのプロジェクトでもファインチューニングがサポートされています。近日中に関連コードを更新する予定です。
<br><br>

@ -20,7 +20,7 @@ flash attention是一个用于加速模型训练推理的可选项且仅适
#### transformers_stream_generator/tiktoken/accelerate这几个库提示找不到怎么办
运行如下命令:`pip install -r requirements.txt`。相关依赖库在[https://github.com/QwenLM/Qwen/blob/main/requirements.txt](https://github.com/QwenLM/Qwen/blob/main/requirements.txt)可以找到。
<br><br>
@ -44,19 +44,15 @@ Qwen当前支持流式推理。见位于`modeling_qwen.py`的`chat_stream`函数
#### 模型的输出看起来与输入无关/没有遵循指令/看起来呆呆的
请检查是否加载的是Qwen-Chat模型进行推理Qwen模型是未经align的预训练基模型不期望具备响应用户指令的能力。我们在模型最新版本中已经在`chat`及`chat_stream`接口内进行了检查避免您误将预训练模型作为SFT/Chat模型使用。
#### 是否有量化版本模型
目前Qwen支持基于AutoGPTQ的4-bit量化推理。
#### 生成序列较长后速度显著变慢
请更新到最新代码。
#### 处理长序列时效果有问题
@ -68,7 +64,9 @@ Qwen当前支持流式推理。见位于`modeling_qwen.py`的`chat_stream`函数
#### 当前是否支持SFT和RLHF
我们目前提供了SFT的代码支持全参数微调、LoRA和Q-LoRA。此外当前有多个外部项目也已实现支持如[FastChat](https://github.com/lm-sys/FastChat)、[Firefly](https://github.com/yangjianxin1/Firefly)、[LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)等。我们会尽快更新这部分代码和说明。
我们还没提供对RLHF训练的支持敬请期待。
<br><br>

@ -9,7 +9,7 @@ By clicking to agree or by using or distributing any portion or element of the T
b. "We"(or "Us") shall mean Alibaba Cloud. b. "We"(or "Us") shall mean Alibaba Cloud.
c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use. c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You. d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
e. "Tongyi Qianwen" shall mean the large language models (including Qwen-7B model and Qwen-7B-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us. e. "Tongyi Qianwen" shall mean the large language models (including Qwen model and Qwen-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement. f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files. g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,

@ -4,36 +4,47 @@
<br><br>
<p align="center">
<img src="assets/logoqwen.jpg" width="400"/>
<p>
<br>
<p align="center">
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/models/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 Paper&nbsp&nbsp | &nbsp&nbsp🖥 <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
<br>
<a href="assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp DingTalk (钉钉) &nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp
</p>
<br><br>
| | Qwen-Chat | Qwen-Chat (Int4) | Qwen |
|----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
| 7B | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
| 14B | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-14B">🤗</a> |
---
We opensource our **Qwen** series, now including **Qwen**, the base language models, namely **Qwen-7B** and **Qwen-14B**, as well as **Qwen-Chat**, the chat models, namely **Qwen-7B-Chat** and **Qwen-14B-Chat**. Links are in the table above; click them to check the model cards.
In brief, we have strong base language models, which have been stably pretrained on up to 3 trillion tokens of multilingual data with a wide coverage of domains and languages (with a focus on Chinese and English). They achieve competitive performance on benchmark datasets. Additionally, we have chat models aligned with human preference based on SFT and RLHF (not released yet), which are able to chat, create content, extract information, summarize, translate, code, solve math problems, and so on, and can use tools, play as agents, or even play as code interpreters, etc.
In this repo, you can find:
* Quickstart with Qwen, and enjoy simple inference.
* Details about the quantized models, including usage, memory costs, and inference speed. For comparison, we also provide the statistics of the BF16 models.
* Tutorials on finetuning, including full-parameter tuning, LoRA, and Q-LoRA.
* Instructions on building demos, including WebUI, CLI demo, etc.
* Information about Qwen for tool use, agents, and the code interpreter.
* Statistics of the long-context understanding evaluation.
* License agreement
* ...
Also, if you run into problems, check the [FAQ](FAQ.md) first. Still stuck? Feel free to open issues (preferably in English so that more people can understand you)! If you would like to help us, send us pull requests without hesitation! We are always excited about PRs!
Would you like to chat with us or meet us over coffee? Welcome to our Discord or WeChat!
<br><br>
## News and Updates
* 2023.9.25 🔥 We release both **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face. At the same time, we update **Qwen-7B** and **Qwen-7B-Chat**. Compared to **Qwen-7B** (original), **Qwen-7B** uses more training tokens, increasing from 2.2T tokens to 2.4T tokens, while the context length extends from 2048 to 8192. The Chinese knowledge and coding ability of **Qwen-7B** have been further improved. **PLEASE MAKE SURE YOU ARE USING THE LATEST CODES AND CHECKPOINTS!**
* 2023.9.12 We now support finetuning on the Qwen-7B models, including full-parameter finetuning, LoRA and Q-LoRA.
* 2023.8.21 We release the Int4 quantized model for Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, which requires low memory costs but achieves improved inference speed. Besides, there is no significant performance degradation on the benchmark evaluation.
* 2023.8.3 We release both **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
@ -41,29 +52,31 @@ The following sections include information that you might find it helpful. Speci
## Performance
Qwen-14B and Qwen-7B (the new version trained with more tokens and a context length extended from 2048 to 8192) outperform baseline models of similar sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models' capabilities on natural language understanding, mathematical problem solving, coding, etc. However, even Qwen-14B still falls significantly behind GPT-3.5, let alone GPT-4. See the results below.
<p align="left">
<img src="assets/radar_14b.jpg" width="600"/>
<p>
<br>
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|:--------:|:--------:|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 52.8 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.2 | 54.0 | 24.5 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.2 | 58.1 | 52.8 | 10.1 | 17.1 | 30.2 | 48.8 | 62.0 |
| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 |
| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** |
For all compared models, we report the best scores between their official reported results and [OpenCompass](https://opencompass.org.cn/leaderboard-llm).
For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical report by clicking [here](TODO).
<br><br>
## Requirements
@ -76,7 +89,7 @@ For more experimental results (detailed model performance on more benchmark data
## Quickstart
Below, we provide simple examples to show how to use Qwen-Chat with 🤖 ModelScope and 🤗 Transformers.
Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.
@ -98,13 +111,13 @@ Now you can start with ModelScope or Transformers.
#### 🤗 Transformers
To use Qwen-Chat for inference, all you need to do is input a few lines of code as demonstrated below. Remember to pass in the correct model names or paths, such as "Qwen/Qwen-7B-Chat" and "Qwen/Qwen-14B-Chat". However, **please make sure that you are using the latest code.**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# use bf16
@ -144,15 +157,16 @@ print(response)
# 《奋斗创业:一个年轻人的成功之路》
```
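For reference, a complete minimal chat session assembled from the fragments above looks like the following sketch (the `bf16=True` flag and the generation config call mirror options shown elsewhere in this README):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# Load in bf16; switch to fp16=True or full precision depending on your hardware.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True
).eval()
# Optionally override generation hyperparameters (max length, top_p, etc.).
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# Multi-turn chat: pass the returned history back in on the next call.
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
```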
Running the Qwen pretrained base model is also simple.
<details>
<summary>Running Qwen</summary>
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
@ -187,6 +201,7 @@ ModelScope is an opensource platform for Model-as-a-Service (MaaS), which provid
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig
# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) # You can specify different generation hyperparameters such as max length, top_p, etc.
@ -200,16 +215,11 @@ print(response)
```
<br>
## Quantization
### Usage
We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release Int4 quantized models for Qwen-7B-Chat ([click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)) and Qwen-14B-Chat ([click here](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4)), which achieve nearly lossless model quality with lower memory costs and faster inference speed.
Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:
@ -222,6 +232,7 @@ If you meet problems installing `auto-gptq`, we advise you to check out the offi
Then you can load the quantized model easily and run inference the same as usual:
```python
# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
@ -229,23 +240,28 @@ model = AutoModelForCausalLM.from_pretrained(
).eval()
response, history = model.chat(tokenizer, "Hi", history=None)
```
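The snippet above assumes `tokenizer` was already created as in the Quickstart; a self-contained version of the same flow would look like this sketch (assuming the Int4 checkpoints expose the same tokenizer and `chat` interface as the BF16 models):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
).eval()
response, history = model.chat(tokenizer, "Hi", history=None)
print(response)
```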
### Performance
We illustrate the model performance of both BF16 and Int4 models on the benchmark, and we find that the quantized models do not suffer from significant performance degradation. Results are shown below:
| Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
|----------------------|:----:|:-----------:|:-----:|:---------:|
| Qwen-7B-Chat (BF16) | 53.9 | 54.2 | 41.1 | 24.4 |
| Qwen-7B-Chat (Int4) | 52.6 | 52.9 | 38.1 | 23.8 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 61.0 | 43.9 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
### Inference Speed
We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization, respectively.
| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
|----------------------|:-------------------:|:-------------------:|
| Qwen-7B-Chat (BF16) | 30.34 | 29.32 |
| Qwen-7B-Chat (Int4) | 43.56 | 33.92 |
| Qwen-14B-Chat (BF16) | 30.70 | 21.73 |
| Qwen-14B-Chat (Int4) | 37.11 | 26.11 |
In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.
@ -253,10 +269,12 @@ In detail, the setting of profiling is generating 8192 new tokens with 1 context
We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and generating 8192 tokens (with a single token as context) under BF16 or Int4 quantization level, respectively. The results are shown below.
| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
|----------------------|:-----------------------------------:|:-------------------------------------:|
| Qwen-7B-Chat (BF16) | 17.66GB | 22.58GB |
| Qwen-7B-Chat (Int4) | 8.21GB | 13.62GB |
| Qwen-14B-Chat (BF16) | 30.15GB | 38.94GB |
| Qwen-14B-Chat (Int4) | 13.00GB | 21.79GB |
The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
<br><br>
@ -438,7 +456,7 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cp
If you suffer from lack of GPU memory and you would like to run the model on more than 1 GPU, you can use our provided script `utils.py`:
```python
from utils import load_model_on_gpus
model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2)
```
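After loading, the sharded model keeps the same `chat` interface as the single-GPU model; a short sketch of end-to-end usage (with the tokenizer loaded as usual) is:

```python
from transformers import AutoTokenizer
from utils import load_model_on_gpus  # helper script provided in this repo

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# Split the model weights across 2 GPUs.
model = load_model_on_gpus("Qwen/Qwen-7B-Chat", num_gpus=2)
response, _ = model.chat(tokenizer, "Hi", history=None)
print(response)
```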
@ -448,52 +466,270 @@ Then you can run the 7B chat model on 2 GPUs using the above scripts.
## Tool Usage
Qwen-Chat has been optimized for tool usage and function calling capabilities. Users can develop agents, LangChain applications, and even augment Qwen with a Python Code Interpreter.
We provide documentation on how to implement tool calls based on the principle of ReAct prompting; please refer to [the ReAct example](examples/react_prompt.md). Based on this principle, we provide support for function calling in [openai_api.py](openai_api.py).
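As a rough illustration, a client could exercise that function-calling support through the OpenAI-compatible server started from `openai_api.py`. The endpoint URL, served model name, and function schema below are placeholders (the sketch uses the pre-1.0 `openai` Python client), not values fixed by this repository:

```python
import openai

# Point the client at the locally served openai_api.py endpoint (assumed host/port).
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# A hypothetical function schema in the standard OpenAI function-calling format.
functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="Qwen",  # whatever model name the local server registers; assumed here
    messages=[{"role": "user", "content": "What is the weather like in Beijing?"}],
    functions=functions,
)
print(response.choices[0].message)
```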
We have tested the model's tool calling capabilities on our open-source Chinese evaluation benchmark and found that Qwen-Chat consistently performs well:
<table>
<tr>
<th colspan="4" align="center">Chinese Tool-Use Benchmark</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection (Acc.↑)</th><th align="center">Tool Input (Rouge-L↑)</th><th align="center">False Positive Error↓</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">95%</td><td align="center">0.90</td><td align="center">15.0%</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">85%</td><td align="center">0.88</td><td align="center">75.0%</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td><td align="center">98%</td><td align="center">0.91</td><td align="center">7.3%</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">98%</td><td align="center">0.93</td><td align="center">2.4%</td>
</tr>
</table>
To assess Qwen's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and other general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed for evaluating these capabilities. You can find the benchmark at this [link](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark).
We have observed that Qwen performs well in terms of code executability and result accuracy when generating code:
<table>
<tr>
<th colspan="4" align="center">Executable Rate of Generated Code (%)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization↑</th><th align="center">General↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">91.9</td><td align="center">85.9</td><td align="center">82.8</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">89.2</td><td align="center">65.0</td><td align="center">74.1</td>
</tr>
<tr>
<td>LLaMA2-7B-Chat</td>
<td align="center">41.9</td>
<td align="center">33.1</td>
<td align="center">24.1 </td>
</tr>
<tr>
<td>LLaMA2-13B-Chat</td>
<td align="center">50.0</td>
<td align="center">40.5</td>
<td align="center">48.3 </td>
</tr>
<tr>
<td>CodeLLaMA-7B-Instruct</td>
<td align="center">85.1</td>
<td align="center">54.0</td>
<td align="center">70.7 </td>
</tr>
<tr>
<td>CodeLLaMA-13B-Instruct</td>
<td align="center">93.2</td>
<td align="center">55.8</td>
<td align="center">74.1 </td>
</tr>
<tr>
<td>InternLM-7B-Chat-v1.1</td>
<td align="center">78.4</td>
<td align="center">44.2</td>
<td align="center">62.1 </td>
</tr>
<tr>
<td>InternLM-20B-Chat</td>
<td align="center">70.3</td>
<td align="center">44.2</td>
<td align="center">65.5 </td>
</tr>
<tr>
<td>Qwen-7B-Chat</td>
<td align="center">82.4</td>
<td align="center">64.4</td>
<td align="center">67.2 </td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td align="center">89.2</td>
<td align="center">84.1</td>
<td align="center">65.5</td>
</tr>
</table>
<table>
<tr>
<th colspan="4" align="center">Accuracy of Code Execution Results (%)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization-Hard↑</th><th align="center">Visualization-Easy↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">82.8</td><td align="center">66.7</td><td align="center">60.8</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">47.3</td><td align="center">33.3</td><td align="center">55.7</td>
</tr>
<tr>
<td>LLaMA2-7B-Chat</td>
<td align="center">3.9</td>
<td align="center">14.3</td>
<td align="center">39.2 </td>
</tr>
<tr>
<td>LLaMA2-13B-Chat</td>
<td align="center">8.3</td>
<td align="center">8.3</td>
<td align="center">40.5 </td>
</tr>
<tr>
<td>CodeLLaMA-7B-Instruct</td>
<td align="center">14.3</td>
<td align="center">26.2</td>
<td align="center">60.8 </td>
</tr>
<tr>
<td>CodeLLaMA-13B-Instruct</td>
<td align="center">28.2</td>
<td align="center">27.4</td>
<td align="center">62.0 </td>
</tr>
<tr>
<td>InternLM-7B-Chat-v1.1</td>
<td align="center">28.5</td>
<td align="center">4.8</td>
<td align="center">40.5 </td>
</tr>
<tr>
<td>InternLM-20B-Chat</td>
<td align="center">34.6</td>
<td align="center">21.4</td>
<td align="center">45.6 </td>
</tr>
<tr>
<td>Qwen-7B-Chat</td>
<td align="center">41.9</td>
<td align="center">40.5</td>
<td align="center">54.4 </td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td align="center">58.4</td>
<td align="center">53.6</td>
<td align="center">59.5</td>
</tr>
</table>
<p align="center">
<br>
<img src="assets/code_interpreter_showcase_001.jpg" />
<br>
<p>
In addition, we provide experimental results demonstrating that our model is capable of acting as a HuggingFace Agent. For more information, please refer to the [example documentation](examples/transformers_agent.md). The model's performance on the evaluation dataset provided by Hugging Face is as follows:
<table>
<tr>
<th colspan="4" align="center">HuggingFace Agent Benchmark- Run Mode</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">100</td><td align="center">100</td><td align="center">97.4</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">95.4</td><td align="center">96.3</td><td align="center">87.0</td>
</tr>
<tr>
<td>StarCoder-Base-15B</td><td align="center">86.1</td><td align="center">87.0</td><td align="center">68.9</td>
</tr>
<tr>
<td>StarCoder-15B</td><td align="center">87.0</td><td align="center">88.0</td><td align="center">68.9</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td><td align="center">87.0</td><td align="center">87.0</td><td align="center">71.5</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">93.5</td><td align="center">94.4</td><td align="center">87.0</td>
</tr>
</table>
<table>
<tr>
<th colspan="4" align="center">HuggingFace Agent Benchmark - Chat Mode</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">98.5</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">97.3</td><td align="center">96.8</td><td align="center">89.6</td>
</tr>
<tr>
<td>StarCoder-Base-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">91.1</td>
</tr>
<tr>
<td>StarCoder-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">89.6</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td><td align="center">94.7</td><td align="center">94.7</td><td align="center">85.1</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">95.5</td>
</tr>
</table>
<br>
## Long-Context Understanding
To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length of Qwen-7B (original) and Qwen-14B from 2K to over 8K tokens, and Qwen-7B from 8K to 32K tokens. We conduct language modeling experiments on the arXiv dataset with the PPL evaluation and find that Qwen can reach outstanding performance in the scenario of long context. Results are demonstrated below:
<table>
<tr>
<th rowspan="2">Model</th><th colspan="6" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th><th align="center">32768</th>
</tr>
<tr>
<td>Qwen-7B (original)</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td><td align="center">-</td>
</tr> </tr>
<tr>
<td>+ dynamic_ntk</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.56</td><td align="center">4.62</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.49</td><td align="center">4.32</td><td align="center">-</td>
</tr>
<tr>
<td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.33</b></td><td align="center"><b>3.22</b></td><td align="center"><b>3.17</b></td>
</tr>
<tr>
<td>Qwen-14B</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center">22.79</td><td align="center">334.65</td><td align="center">3168.35</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center"><b>3.29</b></td><td align="center"><b>3.18</b></td><td align="center">3.42</td><td align="center">-</td>
</tr>
</table>
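The dynamic NTK and LogN switches referenced in the table are exposed as flags in the model's `config.json` (see also the FAQ); a sketch of toggling them at load time, assuming the flag names `use_dynamic_ntk` and `use_logn_attn`, is:

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
config.use_dynamic_ntk = True  # NTK-aware interpolation for longer contexts
config.use_logn_attn = True    # LogN attention scaling

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", config=config, device_map="auto", trust_remote_code=True
).eval()
```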
<br>
## Tokenizer
Our tokenizer, based on tiktoken, is different from other tokenizers, e.g., the sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and its use in finetuning, please refer to the [documentation](tokenization_note.md).
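As a quick sanity check, the tokenizer can be exercised like any other Hugging Face tokenizer; the sketch below simply round-trips a mixed Chinese/English string (exact token counts depend on the released vocabulary, and special-token handling is covered in the linked note):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

text = "通义千问 is a large language model pretrained on multilingual data."
ids = tokenizer.encode(text)
print(len(ids), ids[:10])     # number of tokens and a peek at the ids
print(tokenizer.decode(ids))  # round-trip the ids back to text
```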
<br><br>
## Reproduction
@ -507,10 +743,10 @@ If you meet problems, please refer to [FAQ](FAQ.md) and the issues first to sear
## License Agreement
Researchers and developers are free to use the code and model weights of both Qwen and Qwen-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
<br><br>
## Contact Us
If you are interested in leaving a message for either our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to qianwen_opensource@alibabacloud.com.

@ -4,32 +4,45 @@
<br><br>
<p align="center">
<img src="assets/logoqwen.jpg" width="400"/>
<p>
<br>
<p align="center">
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/models/qwen">魔搭社区</a>&nbsp&nbsp | &nbsp&nbsp 📑 论文&nbsp&nbsp | &nbsp&nbsp🖥 <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
<br>
<a href="assets/wechat.png">微信</a>&nbsp&nbsp | &nbsp&nbsp 钉钉 &nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp
</p>
<br><br>
| | Qwen-Chat | Qwen-Chat (Int4) | Qwen |
|-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
| 7B | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
| 14B | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B/summary">🤖 <a> <a href="https://huggingface.co/Qwen/Qwen-14B">🤗</a> |
我们开源了**Qwen**通义千问系列工作当前开源模型的参数规模为70亿7B和140亿14B。本次开源包括基础模型**Qwen**,即**Qwen-7B**和**Qwen-14B**,以及对话模型**Qwen-Chat**,即**Qwen-7B-Chat**和**Qwen-14B-Chat**。模型链接在表格中,请点击了解详情。
当前基础模型已经稳定训练了大规模高质量且多样化的数据覆盖多语言当前以中文和英文为主总量高达3万亿token。在相关基准评测中Qwen系列模型拿出非常有竞争力的表现显著超出同规模模型并紧追一系列最强的闭源模型。此外我们利用SFT和RLHF技术实现对齐从基座模型训练得到对话模型。Qwen-Chat具备聊天、文字创作、摘要、信息抽取、翻译等能力同时还具备一定的代码生成和简单数学推理的能力。在此基础上我们针对LLM对接外部系统等方面针对性地做了优化当前具备较强的工具调用能力以及最近备受关注的Code Interpreter的能力和扮演Agent的能力。
在这个项目中,你可以了解到以下内容
* 快速上手Qwen-Chat教程玩转大模型推理.
* 量化模型相关细节,包括用法、显存占用、推理性能等。这部分还提供了和非量化模型的对比。
* 微调的教程帮你实现全参数微调、LoRA以及Q-LoRA。
* 搭建Demo的方法包括WebUI和CLI Demo
* 更多关于Qwen在工具调用、Code Interpreter、Agent方面的内容
* 长序列理解能力及评测
* 使用协议
* ...
如果遇到问题,请优先考虑查询[FAQ](FAQ.md)。如仍未解决随时提出issue但建议使用英语或提供翻译有助于帮助更多用户。如果想帮助我们提升欢迎提交Pull Requests
想和我们一起讨论和聊天的话赶紧加入我们的微信群和Discord server入口见文档开头部分
<br><br>
## 新闻
* 2023年9月25日 在魔搭社区ModelScope和Hugging Face同步推出Qwen-14B和Qwen-14B-Chat模型。
* 2023年9月12日 支持Qwen-7B和Qwen-7B-Chat的微调其中包括全参数微调、LoRA以及Q-LoRA。
* 2023年8月21日 发布Qwen-7B-Chat的Int4量化模型Qwen-7B-Chat-Int4。该模型显存占用低推理速度相比半精度模型显著提升在基准评测上效果损失较小。
* 2023年8月3日 在魔搭社区ModelScope和Hugging Face同步推出Qwen-7B和Qwen-7B-Chat模型。同时我们发布了技术备忘录介绍了相关的训练细节和模型表现。
@ -37,29 +50,32 @@
## 评测表现
Qwen-14B及Qwen-7B最新版本使用更大量的token进行预训练相比同规模模型均实现了效果的显著提升。我们评测的数据集包括MMLU、C-Eval、GSM8K、MATH、HumanEval、MBPP、BBH等数据集考察的能力包括自然语言理解、知识、数学计算和推理、代码生成、逻辑推理等。当然即便Qwen-14B相比GPT-3.5和GPT-4仍有差距。
<p align="left">
<img src="assets/radar_14b.jpg" width="600"/>
<p>
<br>
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|:-----------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|:--------:|:--------:|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 52.8 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.2 | 54.0 | 24.5 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.2 | 58.1 | 52.8 | 10.1 | 17.1 | 30.2 | 48.8 | 62.0 |
| **Qwen-7B (original)** | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 |
| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** |
对于以上所有对比模型,我们列出了其官方汇报结果与[OpenCompass](https://opencompass.org.cn/leaderboard-llm)结果之间的最佳分数。
更多的实验结果和细节请查看我们的技术备忘录。点击[这里](TODO)。
<br><br>
## 要求
@ -93,13 +109,13 @@ cd flash-attention && pip install .
#### 🤗 Transformers #### 🤗 Transformers
如希望使用Qwen-7B-chat进行推理所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码** 如希望使用Qwen-chat进行推理所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码,并指定正确的模型名称和路径,如`Qwen/Qwen-7B-Chat`和`Qwen/Qwen-14B-Chat`**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Available model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# Use bf16 precision; recommended on A100, H100, RTX3060, RTX3070, etc. to save GPU memory.
@ -135,15 +151,16 @@ print(response)
# 《奋斗创业:一个年轻人的成功之路》
```
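The diff above elides the middle of this example. Purely as a minimal sketch (an assumption mirroring the `model.chat` interface that appears later in this README for the quantized model), the elided conversation turns follow this pattern:

```python
# Minimal sketch (assumption): multi-turn chat with the model loaded above.
response, history = model.chat(tokenizer, "Hello, who are you?", history=None)
print(response)

# Pass the returned history back in to continue the same conversation.
response, history = model.chat(tokenizer, "Summarize your last answer in one sentence.", history=history)
print(response)
```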
Running the base Qwen model is just as simple.

<details>
<summary>Run Qwen</summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Available model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# Use bf16 precision; recommended on A100, H100, RTX3060, RTX3070, etc. to save GPU memory.
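# The rest of this example is elided by the diff. A minimal sketch (assumption) of the
# generation step, assuming `model` was loaded in the elided lines and mirroring the
# decode call visible in the hunk header below:
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```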
@ -175,6 +192,7 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

# Available model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)  # you can adjust generation hyperparameters such as max length and top_p here
@ -188,18 +206,11 @@ print(response)
```
<br>
## Quantization

### Usage

**Note: we have updated our quantization scheme to one based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and we provide Int4 quantized models, including Qwen-7B-Chat-Int4 ([click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)) and Qwen-14B-Chat-Int4 ([click here](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4)). This scheme shows almost no loss in benchmark performance while requiring less storage and offering faster inference.**

Below we show how to use the Int4 quantized models. Before you start, please make sure the requirements are met (e.g., torch 2.0 or above, transformers 4.32.0 or above, etc.) and the required packages are installed:
@ -212,6 +223,7 @@ pip install auto-gptq optimum
Then you can call the quantized model in exactly the same way as above:

```python
# Available model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
@ -223,19 +235,23 @@ response, history = model.chat(tokenizer, "Hi", history=None)
We benchmarked the BF16 and Int4 models and found that the quantized model shows only a small performance drop, as shown below:

| Quantization         | MMLU | CEval (val) | GSM8K | Humaneval |
|----------------------|:----:|:-----------:|:-----:|:---------:|
| Qwen-7B-Chat (BF16)  | 53.9 |    54.2     | 41.1  |   24.4    |
| Qwen-7B-Chat (Int4)  | 52.6 |    52.9     | 38.1  |   23.8    |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 61.0 | 43.9 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
### Inference Speed

We measured the average inference speed (tokens/s) of the BF16 and Int4 models when generating 2048 and 8192 tokens, as shown below:

| Quantization         | Speed (2048 tokens) | Speed (8192 tokens) |
|----------------------|:-------------------:|:-------------------:|
| Qwen-7B-Chat (BF16)  |        30.34        |        29.32        |
| Qwen-7B-Chat (Int4)  |        43.56        |        33.92        |
| Qwen-14B-Chat (BF16) | 30.70 | 21.73 |
| Qwen-14B-Chat (Int4) | 37.11 | 26.11 |
Specifically, we record the performance of generating 8192 tokens with a context of length 1. The evaluation runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The reported inference speed is the average over generating 8192 tokens.
@ -243,10 +259,12 @@ response, history = model.chat(tokenizer, "Hi", history=None)
We also profiled the peak GPU memory usage of the BF16 and Int4 models when encoding 2048 tokens and when generating 8192 tokens. The results are shown below:

| Quantization         | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
|----------------------|:-----------------------------------:|:-------------------------------------:|
| Qwen-7B-Chat (BF16)  |               17.66GB               |                22.58GB                |
| Qwen-7B-Chat (Int4)  |               8.21GB                |                13.62GB                |
| Qwen-14B-Chat (BF16) | 30.15GB | 38.94GB |
| Qwen-14B-Chat (Int4) | 13.00GB | 21.79GB |
The profiling above was done with [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
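For illustration only, a rough way to measure generation speed and peak memory in the same spirit is sketched below. This is a simplified sketch, not the exact methodology of the linked script, and it assumes a model and tokenizer loaded as in the quickstart above:

```python
import time
import torch

# Context of length 1, following the benchmark setting described above.
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)

torch.cuda.reset_peak_memory_stats()
start = time.time()
out = model.generate(**inputs, max_new_tokens=8192)
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"speed: {new_tokens / elapsed:.2f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```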
<br><br>
@ -439,54 +457,272 @@ model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2)
## Tool Usage

Qwen-Chat is optimized for tool use and function calling. Users can build Qwen-based agents, LangChain applications, and even a code interpreter.

We provide documentation explaining how to implement tool calls based on the principle of ReAct prompting; see the [ReAct example](examples/react_prompt.md). Building on the same principle, [openai_api.py](openai_api.py) provides function calling support (an illustrative client request is sketched after the table below).

We evaluated the model's tool-calling ability on our open-sourced Chinese [evaluation benchmark](eval/EVALUATION.md) and found that Qwen-Chat achieves consistently strong performance:

<table>
<tr>
<th colspan="4" align="center">Chinese Tool-Use Benchmark</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection (Acc.↑)</th><th align="center">Tool Input (Rouge-L↑)</th><th align="center">False Positive Error↓</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">95%</td><td align="center">0.90</td><td align="center">15.0%</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">85%</td><td align="center">0.88</td><td align="center">75.0%</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td><td align="center">98%</td><td align="center">0.91</td><td align="center">7.3%</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">98%</td><td align="center">0.93</td><td align="center">2.4%</td>
</tr>
</table>
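As referenced above, function calling is served through [openai_api.py](openai_api.py). Purely as an illustrative sketch (the endpoint address, model name, and function schema below are assumptions, not fixed values from this repository), a client request could look like this:

```python
import openai

openai.api_base = "http://localhost:8000/v1"  # assumption: openai_api.py running locally
openai.api_key = "none"

# Illustrative (hypothetical) function schema.
functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}]

response = openai.ChatCompletion.create(
    model="Qwen",  # assumption: the model name exposed by the server
    messages=[{"role": "user", "content": "What is the weather like in Beijing?"}],
    functions=functions,
)
print(response.choices[0].message)
```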
To assess Qwen's ability to use the Python code interpreter for tasks such as mathematical problem solving, data visualization, file handling, and web crawling, we have built and open-sourced a dedicated [benchmark](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark) for these capabilities.

We find that Qwen performs well in terms of both the executable rate of its generated code and the correctness of the execution results:

<table>
<tr>
<th colspan="4" align="center">Executable Rate of Generated Code (%)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization↑</th><th align="center">General↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">91.9</td><td align="center">85.9</td><td align="center">82.8</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">89.2</td><td align="center">65.0</td><td align="center">74.1</td>
</tr>
<tr>
<td>LLaMA2-7B-Chat</td>
<td align="center">41.9</td>
<td align="center">33.1</td>
<td align="center">24.1 </td>
</tr>
<tr>
<td>LLaMA2-13B-Chat</td>
<td align="center">50.0</td>
<td align="center">40.5</td>
<td align="center">48.3 </td>
</tr>
<tr>
<td>CodeLLaMA-7B-Instruct</td>
<td align="center">85.1</td>
<td align="center">54.0</td>
<td align="center">70.7 </td>
</tr>
<tr>
<td>CodeLLaMA-13B-Instruct</td>
<td align="center">93.2</td>
<td align="center">55.8</td>
<td align="center">74.1 </td>
</tr>
<tr>
<td>InternLM-7B-Chat-v1.1</td>
<td align="center">78.4</td>
<td align="center">44.2</td>
<td align="center">62.1 </td>
</tr>
<tr>
<td>InternLM-20B-Chat</td>
<td align="center">70.3</td>
<td align="center">44.2</td>
<td align="center">65.5 </td>
</tr>
<tr>
<td>Qwen-7B-Chat</td>
<td align="center">82.4</td>
<td align="center">64.4</td>
<td align="center">67.2 </td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td align="center">89.2</td>
<td align="center">84.1</td>
<td align="center">65.5</td>
</tr>
</table>
<table>
<tr>
<th colspan="4" align="center">Accuracy of Code Execution Results (%)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization-Hard↑</th><th align="center">Visualization-Easy↑</th>
</tr>
</tr>
<tr>
<td>GPT-4</td><td align="center">82.8</td><td align="center">66.7</td><td align="center">60.8</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">47.3</td><td align="center">33.3</td><td align="center">55.7</td>
</tr>
<tr>
<td>LLaMA2-7B-Chat</td>
<td align="center">3.9</td>
<td align="center">14.3</td>
<td align="center">39.2 </td>
</tr>
<tr>
<td>LLaMA2-13B-Chat</td>
<td align="center">8.3</td>
<td align="center">8.3</td>
<td align="center">40.5 </td>
</tr>
<tr>
<td>CodeLLaMA-7B-Instruct</td>
<td align="center">14.3</td>
<td align="center">26.2</td>
<td align="center">60.8 </td>
</tr>
<tr>
<td>CodeLLaMA-13B-Instruct</td>
<td align="center">28.2</td>
<td align="center">27.4</td>
<td align="center">62.0 </td>
</tr>
<tr>
<td>InternLM-7B-Chat-v1.1</td>
<td align="center">28.5</td>
<td align="center">4.8</td>
<td align="center">40.5 </td>
</tr>
<tr>
<td>InternLM-20B-Chat</td>
<td align="center">34.6</td>
<td align="center">21.4</td>
<td align="center">45.6 </td>
</tr>
<tr>
<td>Qwen-7B-Chat</td>
<td align="center">41.9</td>
<td align="center">40.5</td>
<td align="center">54.4 </td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td align="center">58.4</td>
<td align="center">53.6</td>
<td align="center">59.5</td>
</tr>
</table>
<p align="center">
<br>
<img src="assets/code_interpreter_showcase_001.jpg" />
<br>
<p>
此外我们还提供了实验结果表明我们的模型具备扮演HuggingFace Agent的能力详见[示例文档](examples/transformers_agent.md)了解更多信息。模型在Hugging Face提供的评测数据集上表现如下
<table>
<tr>
<th colspan="4" align="center">HuggingFace Agent Benchmark - Run Mode</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">100</td><td align="center">100</td><td align="center">97.4</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">95.4</td><td align="center">96.3</td><td align="center">87.0</td>
</tr>
<tr>
<td>StarCoder-Base-15B</td><td align="center">86.1</td><td align="center">87.0</td><td align="center">68.9</td>
</tr>
<tr>
<td>StarCoder-15B</td><td align="center">87.0</td><td align="center">88.0</td><td align="center">68.9</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td><td align="center">87.0</td><td align="center">87.0</td><td align="center">71.5</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">93.5</td><td align="center">94.4</td><td align="center">87.0</td>
</tr>
</table>
<table>
<tr>
<th colspan="4" align="center">HuggingFace Agent Benchmark - Chat Mode</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">98.5</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">97.3</td><td align="center">96.8</td><td align="center">89.6</td>
</tr>
<tr>
<td>StarCoder-Base-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">91.1</td>
</tr>
<tr>
<td>StarCoder-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">89.6</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td><td align="center">94.7</td><td align="center">94.7</td><td align="center">85.1</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">95.5</td>
</tr>
</table>
<br>

## Long-Context Understanding

We use techniques such as NTK-aware interpolation, window attention, and LogN attention scaling to extend the context length beyond the training sequence length. Language-modeling experiments on the arXiv dataset show that Qwen-7B/14B, with a native context length of 2K, still performs well at a sequence length of 8K, while Qwen-7B with its native length extended to 8K performs well even at 32K (a sketch of enabling the relevant config flags follows the table below).
<table>
<tr>
<th rowspan="2">Model</th><th colspan="6" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th><th align="center">32768</th>
</tr>
<tr>
<td>Qwen-7B (original)</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td><td align="center">-</td>
</tr> </tr>
<tr>
<td>+ dynamic_ntk</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.56</td><td align="center">4.62</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.49</td><td align="center">4.32</td><td align="center">-</td>
</tr>
<tr>
<td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
</tr>
<tr>
<td>+ dynamic_ntk</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center"><b>3.23</b></td><td align="center">3.33</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.33</b></td><td align="center"><b>3.22</b></td><td align="center"><b>3.17</b></td>
</tr>
<tr>
<td>Qwen-14B</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center">22.79</td><td align="center">334.65</td><td align="center">3168.35</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center"><b>3.29</b></td><td align="center"><b>3.18</b></td><td align="center">3.42</td><td align="center">-</td>
</tr>
</table>
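The flags controlling these techniques live in the model's `config.json`; `use_dynamic_ntk` and `use_logn_attn` are referenced elsewhere in this repository's docs, while the way window attention is switched on may differ by model version. Treat the following as a minimal sketch under those assumptions:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Sketch (assumption): enable NTK interpolation and LogN attention scaling at load time.
config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
config.use_dynamic_ntk = True
config.use_logn_attn = True

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", config=config, device_map="auto", trust_remote_code=True
).eval()
```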
<br>
## Tokenization
> Note: as a term, "tokenization" has no widely agreed-upon Chinese equivalent, so this document uses the English word for clarity.

Our tokenizer, based on tiktoken, differs from other tokenizers such as the sentencepiece tokenizer. In particular, you need to pay attention to special tokens during fine-tuning. For more information on the tokenizer and its use in fine-tuning, please refer to the [documentation](tokenization_note_zh.md).
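As a quick illustration (a minimal sketch using the same `AutoTokenizer` interface as the quickstart above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

ids = tokenizer.encode("Hello, Qwen!")   # token ids from the tiktoken-based BPE
print(ids)
print(tokenizer.decode(ids))             # round-trips back to the original text
```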
<br><br>
## Reproduction
@ -500,10 +736,10 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct
## License Agreement

Researchers and developers are free to use Qwen and Qwen-Chat or to build on top of them. Commercial use is also allowed; please check [LICENSE](LICENSE) for details. If you wish to use the models commercially, please fill out the [questionnaire](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.

<br><br>

## Contact Us

If you would like to leave a message for our research or product teams, feel free to join our WeChat group and Discord server. You can also reach us by email at qianwen_opensource@alibabacloud.com.

@ -35,6 +35,7 @@ Qwen-7B は、アリババクラウドが提唱する大規模言語モデルシ
## News and Updates

* 2023.9.25 Released **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face.
* 2023.9.12 Added support for fine-tuning the Qwen-7B models, including full-parameter fine-tuning, LoRA, and Q-LoRA.
* 2023.8.21 Released the Int4 quantized model **Qwen-7B-Chat-Int4** for Qwen-7B-Chat, with no significant performance degradation on benchmark evaluations.
* 2023.8.3 Released **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face, together with a technical memo covering training details and model performance.
@ -42,29 +43,27 @@ Qwen-7B は、アリババクラウドが提唱する大規模言語モデルシ
## Performance

Qwen-14B outperforms baseline models of comparable size on a series of benchmark datasets that evaluate natural language understanding, mathematical problem solving, coding, and more, such as MMLU, C-Eval, GSM8K, HumanEval, and CMMLU. See the results below.
| Model             |   MMLU   |  C-Eval  |  GSM8K   |   MATH   | HumanEval |   MBPP    |   BBH    |  CMMLU   |
|:------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|:--------:|:--------:|
|                   |  5-shot  |  5-shot  |  8-shot  |  4-shot  |  0-shot   |  3-shot   |  3-shot  |  5-shot  |
| LLaMA2-7B         |   46.8   |   32.5   |   16.7   |   3.3    |   12.8    |   20.8    |   38.2   |   31.8   |
| LLaMA2-13B        |   55.0   |   41.4   |   29.6   |   5.0    |   18.9    |   30.3    |   45.6   |   38.4   |
| LLaMA2-34B        |   62.6   |    -     |   42.2   |   6.2    |   22.6    |   33.0    |   44.1   |    -     |
| ChatGLM2-6B       |   47.9   |   51.7   |   32.4   |   6.5    |     -     |     -     |   33.7   |    -     |
| InternLM-7B       |   51.0   |   52.8   |   31.2   |   6.3    |   10.4    |   14.0    |   37.0   |   51.8   |
| InternLM-20B      |   62.1   |   58.8   |   52.6   |   7.9    |   25.6    |   35.6    |   52.5   |   59.0   |
| Baichuan2-7B      |   54.2   |   54.0   |   24.5   |   5.6    |   18.3    |   24.2    |   41.6   |   57.1   |
| Baichuan2-13B     |   59.2   |   58.1   |   52.8   |   10.1   |   17.1    |   30.2    |   48.8   |   62.0   |
| **Qwen-7B**       |   56.7   |   59.6   |   51.6   |    -     |   24.4    |   31.2    |   40.6   |   58.8   |
| **Qwen-7B v1.1** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** |
For all compared models, we report the best score between their officially reported results and the results from [OpenCompass](https://opencompass.org.cn/leaderboard-llm).

For more experimental results (detailed model performance on more benchmark datasets) and details, please click [here](TODO) to check the technical memo.

<br><br>

## Requirements
@ -442,22 +441,217 @@ model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2)
Qwen-7B-Chat is optimized specifically for tool use, including APIs, databases, and models, so that users can build their own Qwen-7B-based LangChain applications, agents, and code interpreters. On our evaluation [benchmark](eval/EVALUATION.md) for tool-use ability, Qwen-7B achieves stable performance.

<table>
<tr>
<th colspan="4" align="center">Chinese Tool-Use Benchmark</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection (Acc.↑)</th><th align="center">Tool Input (Rouge-L↑)</th><th align="center">False Positive Error↓</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">95%</td><td align="center">0.90</td><td align="center">15.0%</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">85%</td><td align="center">0.88</td><td align="center">75.0%</td>
</tr>
<tr>
<td>Qwen-7B-Chat v1.1</td><td align="center">98%</td><td align="center">0.91</td><td align="center">7.3%</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">98%</td><td align="center">0.93</td><td align="center">2.4%</td>
</tr>
</table>
<table>
<tr>
<th colspan="4" align="center">Using Code Interpreter - Executable Rate of Generated Code (%)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization↑</th><th align="center">General↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">91.9</td><td align="center">85.9</td><td align="center">82.8</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">89.2</td><td align="center">65.0</td><td align="center">74.1</td>
</tr>
<tr>
<td>LLaMA2-7B-Chat</td>
<td align="center">41.9</td>
<td align="center">33.1</td>
<td align="center">24.1 </td>
</tr>
<tr>
<td>LLaMA2-13B-Chat</td>
<td align="center">50.0</td>
<td align="center">40.5</td>
<td align="center">48.3 </td>
</tr>
<tr>
<td>CodeLLaMA-7B-Instruct</td>
<td align="center">85.1</td>
<td align="center">54.0</td>
<td align="center">70.7 </td>
</tr>
<tr>
<td>CodeLLaMA-13B-Instruct</td>
<td align="center">93.2</td>
<td align="center">55.8</td>
<td align="center">74.1 </td>
</tr>
<tr>
<td>InternLM-7B-Chat-v1.1</td>
<td align="center">78.4</td>
<td align="center">44.2</td>
<td align="center">62.1 </td>
</tr>
<tr>
<td>InternLM-20B-Chat</td>
<td align="center">70.3</td>
<td align="center">44.2</td>
<td align="center">65.5 </td>
</tr>
<tr>
<td>Qwen-7B-Chat v1.1</td>
<td align="center">82.4</td>
<td align="center">64.4</td>
<td align="center">67.2 </td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td align="center">89.2</td>
<td align="center">84.1</td>
<td align="center">65.5</td>
</tr>
</table>
<table>
<tr>
<th colspan="4" align="center">Using Code Interpreter - Accuracy of Code Execution Results (%)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization-Hard↑</th><th align="center">Visualization-Easy↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">82.8</td><td align="center">66.7</td><td align="center">60.8</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">47.3</td><td align="center">33.3</td><td align="center">55.7</td>
</tr>
<tr>
<td>LLaMA2-7B-Chat</td>
<td align="center">3.9</td>
<td align="center">14.3</td>
<td align="center">39.2 </td>
</tr>
<tr>
<td>LLaMA2-13B-Chat</td>
<td align="center">8.3</td>
<td align="center">8.3</td>
<td align="center">40.5 </td>
</tr>
<tr>
<td>CodeLLaMA-7B-Instruct</td>
<td align="center">14.3</td>
<td align="center">26.2</td>
<td align="center">60.8 </td>
</tr>
<tr>
<td>CodeLLaMA-13B-Instruct</td>
<td align="center">28.2</td>
<td align="center">27.4</td>
<td align="center">62.0 </td>
</tr>
<tr>
<td>InternLM-7B-Chat-v1.1</td>
<td align="center">28.5</td>
<td align="center">4.8</td>
<td align="center">40.5 </td>
</tr>
<tr>
<td>InternLM-20B-Chat</td>
<td align="center">34.6</td>
<td align="center">21.4</td>
<td align="center">45.6 </td>
</tr>
<tr>
<td>Qwen-7B-Chat v1.1</td>
<td align="center">41.9</td>
<td align="center">40.5</td>
<td align="center">54.4 </td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td align="center">58.4</td>
<td align="center">53.6</td>
<td align="center">59.5</td>
</tr>
</table>
For how to write and use prompts for ReAct prompting, please refer to [the ReAct examples](examples/react_prompt.md). Using tools enables the model to perform tasks better.

<p align="center">
<br>
<img src="assets/code_interpreter_showcase_001.jpg" />
<br>
<p>
In addition, we provide experimental results demonstrating the model's capability as an agent; see [Hugging Face Agent](examples/transformers_agent.md) for details. Its performance on the run-mode benchmark provided by Hugging Face is as follows:
<table>
<tr>
<th colspan="4" align="center">HuggingFace Agent Benchmark- Run Mode</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">100</td><td align="center">100</td><td align="center">97.4</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">95.4</td><td align="center">96.3</td><td align="center">87.0</td>
</tr>
<tr>
<td>StarCoder-Base-15B</td><td align="center">86.1</td><td align="center">87.0</td><td align="center">68.9</td>
</tr>
<tr>
<td>StarCoder-15B</td><td align="center">87.0</td><td align="center">88.0</td><td align="center">68.9</td>
</tr>
<tr>
<td>Qwen-7B-Chat v1.1</td><td align="center">87.0</td><td align="center">87.0</td><td align="center">71.5</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">93.5</td><td align="center">94.4</td><td align="center">87.0</td>
</tr>
</table>
<table>
<tr>
<th colspan="4" align="center">HuggingFace Agent Benchmark - Chat Mode</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">98.5</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">97.3</td><td align="center">96.8</td><td align="center">89.6</td>
</tr>
<tr>
<td>StarCoder-Base-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">91.1</td>
</tr>
<tr>
<td>StarCoder-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">89.6</td>
</tr>
<tr>
<td>Qwen-7B-Chat v1.1</td><td align="center">94.7</td><td align="center">94.7</td><td align="center">85.1</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">95.5</td>
</tr>
</table>
<br>
@ -467,25 +661,40 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex
<table>
<tr>
<th rowspan="2">Model</th><th colspan="6" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th><th align="center">32768</th>
</tr>
<tr>
<td>Qwen-7B (original)</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td><td align="center">-</td>
</tr> </tr>
<tr>
<td>+ dynamic_ntk + logn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.56</td><td align="center">4.62</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.49</td><td align="center">4.32</td><td align="center">-</td>
</tr>
<tr>
<td>Qwen-7B v1.1</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
</tr>
<tr>
<td>+ dynamic_ntk</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center"><b>3.23</b></td><td align="center">3.33</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.33</b></td><td align="center"><b>3.22</b></td><td align="center"><b>3.17</b></td>
</tr>
<tr>
<td>Qwen-14B</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center">22.79</td><td align="center">334.65</td><td align="center">3168.35</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center"><b>3.29</b></td><td align="center"><b>3.18</b></td><td align="center">3.42</td><td align="center">-</td>
</tr>
</table>
<br>
## Reproduction

(Binary image assets changed; not shown. Sizes: 28 KiB → 79 KiB and 139 KiB → 79 KiB.)

@ -12,7 +12,7 @@ cd ../../
# Qwen-7B
python evaluate_ceval.py -d data/ceval/
# Qwen-7B-Chat (We only provide 0-shot reproduction scripts. 5-shot results are obtained by OpenCompass (https://github.com/InternLM/opencompass).)
pip install thefuzz
python evaluate_chat_ceval.py -d data/ceval/
```
@ -29,7 +29,7 @@ cd ../../
# Qwen-7B
python evaluate_mmlu.py -d data/mmlu/data/
# Qwen-7B-Chat (We only provide 0-shot reproduction scripts. 5-shot results are obtained by OpenCompass (https://github.com/InternLM/opencompass).)
pip install thefuzz
python evaluate_chat_mmlu.py -d data/mmlu/data/
```
@ -73,9 +73,8 @@ This program exists to run untrusted model-generated code. Users are strongly en
# Qwen-7B
python evaluate_gsm8k.py
# Qwen-7B-Chat (We only provide 0-shot reproduction scripts. 5-shot results are obtained by OpenCompass (https://github.com/InternLM/opencompass).)
python evaluate_chat_gsm8k.py # zeroshot
```
- PLUGIN

@ -1,4 +1,3 @@
import re
import textwrap
import argparse
@ -19,6 +18,7 @@ evaluate_functional_correctness HumanEval_res.jsonl
DEVICE = "cuda:0"

def extract_code(text, entry_point):
    # Regular expression to match code blocks
    code_block_pattern = re.compile(
@ -99,7 +99,26 @@ if __name__ == "__main__":
    f = jsonlines.open(args.sample_input_file)
    with f_output as output:
        for jobj in tqdm.tqdm(f, desc="task_idx"):
            # use humanevalpack prompt
            signature = re.search(
                rf"def\s+({jobj['entry_point']}.*?):\s*\n", jobj["prompt"]
            ).group(1)
            description = "\n".join(
                [
                    line.strip()
                    for line in re.search(
                        rf"(?:\"\"\"|''')(.*?)(?:\"\"\"|''')", jobj["prompt"], re.DOTALL
                    )
                    .group(1)
                    .split("\n")
                ]
            )
            prompt = (
                f"Write a Python function `{signature}` to solve the following problem:\n"
                f"{description}\n"
                f"{jobj['prompt']}"
            )
            task_id = jobj["task_id"]
            answer, response = generate_sample(
                model, tokenizer, prompt, jobj["entry_point"]

@ -321,7 +321,7 @@ def parse_response(response):
# completion mode, not chat mode
def text_complete_last_message(history, stop_words_ids, gen_kwargs):
    im_start = "<|im_start|>"
    im_end = "<|im_end|>"
    prompt = f"{im_start}system\nYou are a helpful assistant.{im_end}"
@ -339,7 +339,7 @@ def text_complete_last_message(history, stop_words_ids):
    stop_words_ids = _stop_words_ids
    input_ids = torch.tensor([tokenizer.encode(prompt)]).to(model.device)
    output = model.generate(input_ids, stop_words_ids=stop_words_ids, **gen_kwargs).tolist()[0]
    output = tokenizer.decode(output, errors="ignore")
    assert output.startswith(prompt)
    output = output[len(prompt) :]
@ -352,6 +352,16 @@ def text_complete_last_message(history, stop_words_ids):
async def create_chat_completion(request: ChatCompletionRequest):
    global model, tokenizer

    gen_kwargs = {}
    if request.temperature is not None:
        if request.temperature < 0.01:
            gen_kwargs['top_k'] = 1  # greedy decoding
        else:
            # Not recommended. Please tune top_p instead.
            gen_kwargs['temperature'] = request.temperature
    if request.top_p is not None:
        gen_kwargs['top_p'] = request.top_p

    stop_words = add_extra_stop_words(request.stop)
    if request.functions:
        stop_words = stop_words or []
@ -366,12 +376,12 @@ async def create_chat_completion(request: ChatCompletionRequest):
                status_code=400,
                detail="Invalid request: Function calling is not yet implemented for stream mode.",
            )
        generate = predict(query, history, request.model, stop_words, gen_kwargs)
        return EventSourceResponse(generate, media_type="text/event-stream")

    stop_words_ids = [tokenizer.encode(s) for s in stop_words] if stop_words else None
    if query is _TEXT_COMPLETION_CMD:
        response = text_complete_last_message(history, stop_words_ids=stop_words_ids, gen_kwargs=gen_kwargs)
    else:
        response, _ = model.chat(
            tokenizer,
@ -379,6 +389,7 @@ async def create_chat_completion(request: ChatCompletionRequest):
            history=history,
            stop_words_ids=stop_words_ids,
            append_history=False,
            **gen_kwargs
        )
    print(f"<chat>\n{history}\n{query}\n<!-- *** -->\n{response}\n</chat>")
    response = trim_stop_words(response, stop_words)
@ -396,7 +407,7 @@ async def create_chat_completion(request: ChatCompletionRequest):
async def predict(
    query: str, history: List[List[str]], model_id: str, stop_words: List[str], gen_kwargs: Dict,
):
    global model, tokenizer
    choice_data = ChatCompletionResponseStreamChoice(
@ -416,7 +427,7 @@ async def predict(
            detail="Invalid request: custom stop words are not yet supported for stream mode.",
        )
    response_generator = model.chat_stream(
        tokenizer, query, history=history, stop_words_ids=stop_words_ids, **gen_kwargs
    )
    for new_response in response_generator:
        if len(new_response) == current_length:
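With the change above, `temperature` and `top_p` from the request are forwarded to generation, and a temperature below 0.01 is mapped to greedy decoding (`top_k=1`). A hedged client-side sketch (the endpoint address and model name are assumptions, as elsewhere in this document):

```python
import openai

openai.api_base = "http://localhost:8000/v1"  # assumption: local openai_api.py server
openai.api_key = "none"

resp = openai.ChatCompletion.create(
    model="Qwen",
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    top_p=0.8,        # forwarded to model.chat via gen_kwargs
    temperature=0.0,  # values below 0.01 trigger greedy decoding on the server
)
print(resp.choices[0].message.content)
```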
