diff --git a/FAQ.md b/FAQ.md index e47fa11..42cab97 100644 --- a/FAQ.md +++ b/FAQ.md @@ -4,7 +4,7 @@ #### Failure in installing flash attention -Flash attention is an option for accelerating training and inference. Only NVIDIA GPUs of Turing, Ampere, Ada, and Hopper architecture, e.g., H100, A100, RTX 3090, T4, RTX 2080, can support flash attention. You can use our models without installing it. +Flash attention is an option for accelerating training and inference. Only NVIDIA GPUs of Turing, Ampere, Ada, and Hopper architecture, e.g., H100, A100, RTX 3090, T4, RTX 2080, can support flash attention. **You can use our models without installing it.** #### Which version of transformers should I use? @@ -20,7 +20,7 @@ This is the merge file of the tokenizer. You have to download it. Note that if y #### transformers_stream_generator/tiktoken/accelerate not found -Run the command `pip install -r requirements.txt`. You can find the file at [https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt). +Run the command `pip install -r requirements.txt`. You can find the file at [https://github.com/QwenLM/Qwen/blob/main/requirements.txt](https://github.com/QwenLM/Qwen/blob/main/requirements.txt).

@@ -32,7 +32,6 @@ Run the command `pip install -r requirements.txt`. You can find the file at [htt Yes, see `web_demo.py` for web demo and `cli_demo.py` for CLI demo. See README for more information. - #### Can I use CPU only? Yes, run `python cli_demo.py --cpu-only` will load the model and inference on CPU only. @@ -47,19 +46,16 @@ This is because tokens represent bytes and a single token may be a meaningless s #### It seems that the generation is not related to the instruction... -Please check if you are loading Qwen-7B-Chat instead of Qwen-7B. Qwen-7B is the base model without alignment, which behaves differently from the SFT/Chat model. +Please check if you are loading Qwen-Chat instead of Qwen. Qwen is the base model without alignment, which behaves differently from the SFT/Chat model. #### Is quantization supported? -Yes, the quantization is supported by `bitsandbytes`. We are working on an improved version and will release the quantized model checkpoints. - -#### Errors in running quantized models: `importlib.metadata.PackageNotFoundError: No package metadata was found for bitsandbytes` +Yes, quantization is supported via AutoGPTQ. -For Linux users,running `pip install bitsandbytes` directly can solve the problem. For Windows users, you can run `python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui`· #### Slow when processing long sequences -We solved this problem. Updating the code to the latest version can help. +Updating the code to the latest version can help. #### Unsatisfactory performance in processing long sequences @@ -72,7 +68,9 @@ Please ensure that NTK is applied. `use_dynamc_ntk` and `use_logn_attn` in `conf #### Can Qwen support SFT or even RLHF? -We do not provide finetuning or RLHF codes for now. However, some projects have supported finetuning, see [FastChat](**[https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat)), [Firefly]([https://github.com/yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)), [**LLaMA Efficient Tuning**]([https://github.com/hiyouga/LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)), etc. We will soon update the relevant codes. +Yes, we now support SFT, including full-parameter finetuning, LoRA, and Q-LoRA. You can also check out other projects such as [FastChat](https://github.com/lm-sys/FastChat), [Firefly](https://github.com/yangjianxin1/Firefly), [LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning), etc. + +However, we do not support RLHF yet. We will provide the code in the near future.
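For the long-sequence FAQ items above: if NTK does not appear to be applied, the switches can also be flipped at load time instead of editing `config.json` by hand. This is a minimal sketch, assuming the released checkpoints expose `use_dynamic_ntk` and `use_logn_attn` as config fields, as the FAQ suggests:

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Assumed config fields from the FAQ above; verify the exact names in the
# checkpoint's config.json before relying on them.
config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
config.use_dynamic_ntk = True   # NTK-aware interpolation for inputs beyond the training length
config.use_logn_attn = True     # LogN attention scaling for long inputs

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", config=config, device_map="auto", trust_remote_code=True
).eval()
```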

diff --git a/FAQ_ja.md b/FAQ_ja.md index 821dd2a..fa083c4 100644 --- a/FAQ_ja.md +++ b/FAQ_ja.md @@ -20,7 +20,7 @@ Flash attention は、トレーニングと推論を加速するオプション #### transformers_stream_generator/tiktoken/accelerate が見つかりません。 -コマンド `pip install -r requirements.txt` を実行してください。このファイルは [https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt) にあります。 +コマンド `pip install -r requirements.txt` を実行してください。このファイルは [https://github.com/QwenLM/Qwen/blob/main/requirements.txt](https://github.com/QwenLM/Qwen/blob/main/requirements.txt) にあります。

@@ -47,19 +47,16 @@ Flash attention は、トレーニングと推論を加速するオプション #### インストラクションとは関係ないようですが... -Qwen-7B ではなく Qwen-7B-Chat を読み込んでいないか確認してください。Qwen-7B はアライメントなしのベースモデルで、SFT/Chat モデルとは挙動が異なります。 +Qwen ではなく Qwen-Chat を読み込んでいないか確認してください。Qwen はアライメントなしのベースモデルで、SFT/Chat モデルとは挙動が異なります。 #### 量子化はサポートされていますか? -はい、量子化は `bitsandbytes` でサポートされています。私たちは改良版の開発に取り組んでおり、量子化されたモデルのチェックポイントをリリースする予定です。 +はい、量子化は AutoGPTQ でサポートされています。 -#### 量子化モデル実行時のエラー: `importlib.metadata.PackageNotFoundError: No package metadata was found for bitsandbytes` - -Linux ユーザの場合は,`pip install bitsandbytes` を直接実行することで解決できます。Windows ユーザの場合は、`python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui` を実行することができます。 #### 長いシーケンスの処理に時間がかかる -この問題は解決しました。コードを最新版に更新することで解決します。 +コードを最新版に更新することで解決します。 #### 長いシーケンスの処理で不満足なパフォーマンス @@ -72,7 +69,7 @@ NTK が適用されていることを確認してください。`config.json` #### Qwen は SFT、あるいは RLHF に対応できますか? -今のところ、ファインチューニングや RLHF のコードは提供していません。しかし、[FastChat](**[https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat))、[Firefly]([https://github.com/yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly))、[**LLaMA Efficient Tuning**]([https://github.com/hiyouga/LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning))など、いくつかのプロジェクトではファインチューニングをサポートしています。近日中に関連コードを更新する予定です。 +SFT のコードは提供しています。また、[FastChat](https://github.com/lm-sys/FastChat)、[Firefly](https://github.com/yangjianxin1/Firefly)、[LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)など、いくつかのプロジェクトでもファインチューニングをサポートしています。近日中に関連コードを更新する予定です。

diff --git a/FAQ_zh.md b/FAQ_zh.md index 5550ce7..1161acc 100644 --- a/FAQ_zh.md +++ b/FAQ_zh.md @@ -20,7 +20,7 @@ flash attention是一个用于加速模型训练推理的可选项,且仅适 #### transformers_stream_generator/tiktoken/accelerate,这几个库提示找不到,怎么办? -运行如下命令:`pip install -r requirements.txt`。相关依赖库在[https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt) 可以找到。 +运行如下命令:`pip install -r requirements.txt`。相关依赖库在[https://github.com/QwenLM/Qwen/blob/main/requirements.txt](https://github.com/QwenLM/Qwen/blob/main/requirements.txt) 可以找到。

@@ -44,19 +44,15 @@ Qwen当前支持流式推理。见位于`modeling_qwen.py`的`chat_stream`函数 #### 模型的输出看起来与输入无关/没有遵循指令/看起来呆呆的 -请检查是否加载的是Qwen-7B-Chat模型进行推理,Qwen-7B模型是未经align的预训练基模型,不期望具备响应用户指令的能力。我们在模型最新版本已经对`chat`及`chat_stream`接口内进行了检查,避免您误将预训练模型作为SFT/Chat模型使用。 +请检查是否加载的是Qwen-Chat模型进行推理,Qwen模型是未经align的预训练基模型,不期望具备响应用户指令的能力。我们在模型最新版本已经对`chat`及`chat_stream`接口内进行了检查,避免您误将预训练模型作为SFT/Chat模型使用。 #### 是否有量化版本模型 -目前Qwen支持基于`bitsandbytes`的8-bit和4-bit的量化推理。后续我们将进一步更新提供更加高效的量化推理实现,并提供对应的量化模型。 - -#### 运行量化推理报错:`importlib.metadata.PackageNotFoundError: No package metadata was found for bitsandbytes` - -对于linux 用户,直接`pip install bitsandbytes`即可。对于windows用户,可以 运行`python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui`。 +目前Qwen支持基于AutoGPTQ的4-bit的量化推理。 #### 生成序列较长后速度显著变慢 -这一问题已经在最新版本中修复。请更新到最新代码。 +请更新到最新代码。 #### 处理长序列时效果有问题 @@ -68,7 +64,9 @@ Qwen当前支持流式推理。见位于`modeling_qwen.py`的`chat_stream`函数 #### 当前是否支持SFT和RLHF? -我们目前未提供SFT和RLHF代码。当前有多个外部项目已实现支持,如[FastChat](**[https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat))、[Firefly]([https://github.com/yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly))、[**LLaMA Efficient Tuning**]([https://github.com/hiyouga/LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning))等。我们会尽快更新这部分代码和说明。 +我们目前提供了SFT的代码,支持全参数微调、LoRA和Q-LoRA。此外,当前有多个外部项目也已实现支持,如[FastChat](https://github.com/lm-sys/FastChat)、[Firefly](https://github.com/yangjianxin1/Firefly)、[LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)等。我们会尽快更新这部分代码和说明。 + +我们还没提供对RLHF训练的支持,敬请期待。

diff --git a/LICENSE b/LICENSE index d69279e..5be3338 100644 --- a/LICENSE +++ b/LICENSE @@ -9,7 +9,7 @@ By clicking to agree or by using or distributing any portion or element of the T b. "We"(or "Us") shall mean Alibaba Cloud. c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use. d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You. - e. "Tongyi Qianwen" shall mean the large language models (including Qwen-7B model and Qwen-7B-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us. + e. "Tongyi Qianwen" shall mean the large language models (including Qwen model and Qwen-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us. f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement. g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files. h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, diff --git a/README.md b/README.md index 7f1ec7e..df2bf0a 100644 --- a/README.md +++ b/README.md @@ -4,36 +4,47 @@

- +


- Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  | Qwen-7B-Chat-Int4 🤗 + 🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper   |   🖥️ Demo
-WeChat   |   Discord   |   Demo  |  Report +WeChat (微信)   |    DingTalk (钉钉)    |   Discord  



-__Will be back soon...__ +| | Qwen-Chat | Qwen-Chat (Int4) | Qwen | +|----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:| +| 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | +| 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | ---- -We opensource **Qwen-7B** and **Qwen-7B-Chat** on both **🤖 ModelScope** and **🤗 Hugging Face** (Click the logos on top to the repos with codes and checkpoints). This repo includes the brief introduction to Qwen-7B, the usage guidance, and also a technical memo [link](tech_memo.md) that provides more information. -Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. The features of the Qwen-7B series include: +We opensource our **Qwen** series, now including **Qwen**, the base language models, namely **Qwen-7B** and **Qwen-14B**, as well as **Qwen-Chat**, the chat models, namely **Qwen-7B-Chat** and **Qwen-14B-Chat**. Links are on the above table. Click them and check the model cards. -1. **Trained with high-quality pretraining data**. We have pretrained Qwen-7B on a self-constructed large-scale high-quality dataset of over 2.2 trillion tokens. The dataset includes plain texts and codes, and it covers a wide range of domains, including general domain data and professional domain data. -2. **Strong performance**. In comparison with the models of the similar model size, we outperform the competitors on a series of benchmark datasets, which evaluates natural language understanding, mathematics, coding, etc. -3. **Better support of languages**. Our tokenizer, based on a large vocabulary of over 150K tokens, is a more efficient one compared with other tokenizers. It is friendly to many languages, and it is helpful for users to further finetune Qwen-7B for the extension of understanding a certain language. -4. **Support of 8K Context Length**. Both Qwen-7B and Qwen-7B-Chat support the context length of 8K, which allows inputs with long contexts. -5. **Support of Plugins**. Qwen-7B-Chat is trained with plugin-related alignment data, and thus it is capable of using tools, including APIs, models, databases, etc., and it is capable of playing as an agent. +In brief, we have strong base language models, which have been stably pretrained for up to 3 trillion tokens of multilingual data with a wide coverage of domains, languages (with a focus on Chinese and English), etc. They are able to achieve competitive performance on benchmark datasets. Additionally, we have chat models that are aligned with human preference based on SFT and RLHF (not released yet), which are able to chat, create content, extract information, summarize, translate, code, solve math problems, and so on, and are able to use tools, play as agents, or even play as code interpreters, etc. -The following sections include information that you might find it helpful. Specifically, we advise you to read the FAQ section before you launch issues. 
+In this repo, you will find: + +* A quickstart with Qwen for simple inference. +* Details about the quantized models, including usage, memory usage, and inference speed. For comparison, we also provide the statistics of the BF16 models. +* Tutorials on finetuning, including full-parameter finetuning, LoRA, and Q-LoRA. +* Instructions on building demos, including WebUI, CLI demo, etc. +* Information about using Qwen for tool use, as an agent, and as a code interpreter. +* Statistics of the long-context understanding evaluation. +* The license agreement. +* ... + +If you run into problems, please consult the [FAQ](FAQ.md) first. Still stuck? Feel free to open an issue (preferably in English so that more people can understand you). If you would like to help us, send us a pull request without hesitation! We are always excited about PRs! + +Would you like to chat with us or grab a coffee with us? Welcome to our Discord or WeChat!

## News and Updates +* 2023.9.25 🔥 We release both **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face. At the same time, we update **Qwen-7B** and **Qwen-7B-Chat**. Compared to **Qwen-7B** (original), **Qwen-7B** uses more training tokens, increasing from 2.2T tokens to 2.4T tokens, while the context length extends from 2048 to 8192. The Chinese knowledge and coding ability of **Qwen-7B** have been further improved. **PLEASE MAKE SURE YOU ARE USING THE LATEST CODES AND CHECKPOINTS!** * 2023.9.12 We now support finetuning on the Qwen-7B models, including full-parameter finetuning, LoRA and Q-LoRA. * 2023.8.21 We release the Int4 quantized model for Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, which requires low memory costs but achieves improved inference speed. Besides, there is no significant performance degradation on the benchmark evaluation. * 2023.8.3 We release both **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance. @@ -41,29 +52,31 @@ The following sections include information that you might find it helpful. Speci ## Performance -In general, Qwen-7B outperforms the baseline models of a similar model size, and even outperforms larger models of around 13B parameters, on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, HumanEval, and WMT22, CMMLU, etc., which evaluate the models' capabilities on natural language understanding, mathematic problem solving, coding, etc. See the results below. - -| Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU | -|:------------------|:--------:|:--------:|:--------:|:---------:|:-------------:|:--------:| -| LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - | -| LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - | -| Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 | -| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 | -| InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - | -| Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 | -| LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - | -| LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - | -| ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - | -| **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** | +Qwen-14B and Qwen-7B (this is the new version trained with more tokens and the context length is extended from 2048 to 8192) outperform the baseline models of similar model sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models' capabilities on natural language understanding, mathematic problem solving, coding, etc. However, even Qwen-14B still significantly fall behind GPT-3.5, let alone GPT-4. See the results below. -

- +

+


-Additionally, according to the third-party evaluation of large language models, conducted by [OpenCompass](https://opencompass.org.cn/leaderboard-llm), Qwen-7B and Qwen-7B-Chat are the top 7B-parameter models. This evaluation consists of a large amount of public benchmarks for the evaluation of language understanding and generation, coding, mathematics, reasoning, etc. - -For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical memo by clicking [here](tech_memo.md). +| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | +|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|:--------:|:--------:| +| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | +| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | +| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | +| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | +| InternLM-7B | 51.0 | 52.8 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | +| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | +| Baichuan2-7B | 54.2 | 54.0 | 24.5 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | +| Baichuan2-13B | 59.2 | 58.1 | 52.8 | 10.1 | 17.1 | 30.2 | 48.8 | 62.0 | +| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 | +| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | +| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** | + +For all compared models, we report the best scores between their official reported results and [OpenCompass](https://opencompass.org.cn/leaderboard-llm). + +For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical report by clicking [here](TODO).

## Requirements @@ -76,7 +89,7 @@ For more experimental results (detailed model performance on more benchmark data ## Quickstart -Below, we provide simple examples to show how to use Qwen-7B with 🤖 ModelScope and 🤗 Transformers. +Below, we provide simple examples to show how to use Qwen-Chat with 🤖 ModelScope and 🤗 Transformers. Before running the code, make sure you have setup the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries. @@ -98,13 +111,13 @@ Now you can start with ModelScope or Transformers. #### 🤗 Transformers -To use Qwen-7B-Chat for the inference, all you need to do is to input a few lines of codes as demonstrated below. However, **please make sure that you are using the latest code.** +To use Qwen-Chat for the inference, all you need to do is to input a few lines of codes as demonstrated below. Remember to pass in the correct model names or paths, such as "Qwen/Qwen-7B-Chat" and "Qwen/Qwen-14B-Chat". However, **please make sure that you are using the latest code.** ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig -# Note: The default behavior now has injection attack prevention off. +# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat" tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # use bf16 @@ -144,15 +157,16 @@ print(response) # 《奋斗创业:一个年轻人的成功之路》 ``` -Running Qwen-7B pretrained base model is also simple. +Running Qwen pretrained base model is also simple.

- Running Qwen-7B + Running Qwen ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig +# Model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B" tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True) # use bf16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval() @@ -187,6 +201,7 @@ ModelScope is an opensource platform for Model-as-a-Service (MaaS), which provid from modelscope import AutoModelForCausalLM, AutoTokenizer from modelscope import GenerationConfig +# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat" tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval() model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参 @@ -200,16 +215,11 @@ print(response) ```
-## Tokenizer - -Our tokenizer based on tiktoken is different from other tokenizers, e.g., sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and related use in fine-tuning, please refer to the [documentation](tokenization_note.md). -

- ## Quantization ### Usage -**Note: we provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release an Int4 quantized model for Qwen-7B-Chat [Click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4), which achieves nearly lossless model effects but improved performance on both memory costs and inference speed, in comparison with the previous solution.** +We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release an Int4 quantized model for Qwen-7B-Chat [Click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4) and Qwen-14B-Chat [Click here](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4), which achieve nearly lossless model effects but improved performance on both memory costs and inference speed. Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages: @@ -222,6 +232,7 @@ If you meet problems installing `auto-gptq`, we advise you to check out the offi Then you can load the quantized model easily and run inference as same as usual: ```python +# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4" model = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen-7B-Chat-Int4", device_map="auto", @@ -229,23 +240,28 @@ model = AutoModelForCausalLM.from_pretrained( ).eval() response, history = model.chat(tokenizer, "Hi", history=None) ``` + ### Performance We illustrate the model performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below: -| Quantization | MMLU | CEval (val) | GSM8K | Humaneval | -|--------------|:----:|:-----------:|:-----:|:---------:| -| BF16 | 53.9 | 54.2 | 41.1 | 24.4 | -| Int4 | 52.6 | 52.9 | 38.1 | 23.8 | +| Quantization | MMLU | CEval (val) | GSM8K | Humaneval | +|----------------------|:----:|:-----------:|:-----:|:---------:| +| Qwen-7B-Chat (BF16) | 53.9 | 54.2 | 41.1 | 24.4 | +| Qwen-7B-Chat (Int4) | 52.6 | 52.9 | 38.1 | 23.8 | +| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 61.0 | 43.9 | +| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 | ### Inference Speed We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization, respectively. -| Quantization | Speed (2048 tokens) | Speed (8192 tokens) | -|--------------|:-------------------:|:-------------------:| -| BF16 | 30.34 | 29.32 | -| Int4 | 43.56 | 33.92 | +| Quantization | Speed (2048 tokens) | Speed (8192 tokens) | +|----------------------|:-------------------:|:-------------------:| +| Qwen-7B-Chat (BF16) | 30.34 | 29.32 | +| Qwen-7B-Chat (Int4) | 43.56 | 33.92 | +| Qwen-14B-Chat (BF16) | 30.70 | 21.73 | +| Qwen-14B-Chat (Int4) | 37.11 | 26.11 | In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens. @@ -253,10 +269,12 @@ In detail, the setting of profiling is generating 8192 new tokens with 1 context We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int4 quantization level, respectively. The results are shown below. 
-| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens | -|--------------|:-----------------------------------:|:-------------------------------------:| -| BF16 | 17.66GB | 22.58GB | -| Int4 | 8.21GB | 13.62GB | +| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens | +|----------------------|:-----------------------------------:|:-------------------------------------:| +| Qwen-7B-Chat (BF16) | 17.66GB | 22.58GB | +| Qwen-7B-Chat (Int4) | 8.21GB | 13.62GB | +| Qwen-14B-Chat (BF16) | 30.15GB | 38.94GB | +| Qwen-14B-Chat (Int4) | 13.00GB | 21.79GB | The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
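The linked `profile.py` is the script actually used for these numbers; as a rough, hedged sketch of how such peak-memory statistics can be collected with plain PyTorch (an illustration, not the referenced script):

```python
import torch

def peak_memory_gb(model, tokenizer, max_new_tokens=8192):
    """Illustrative only: generate from a 1-token context and report peak GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats()
    # A single context token; any valid token id works for this illustration.
    input_ids = torch.tensor([[tokenizer.encode("1")[0]]], device=model.device)
    with torch.no_grad():
        model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return torch.cuda.max_memory_allocated() / 1024 ** 3
```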

@@ -438,7 +456,7 @@ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cp If you suffer from lack of GPU memory and you would like to run the model on more than 1 GPU, you can use our provided script `utils.py`: -```python[](https://) +```python from utils import load_model_on_gpus model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2) ``` @@ -448,52 +466,270 @@ Then you can run the 7B chat model on 2 GPUs using the above scripts. ## Tool Usage -Qwen-7B-Chat is specifically optimized for tool usage, including API, database, models, etc., so that users can build their own Qwen-7B-based LangChain, Agent, and Code Interpreter. In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance. +Qwen-Chat has been optimized for tool usage and function calling capabilities. Users can develop agents, LangChain applications, and even augment Qwen with a Python Code Interpreter. + +We provide documentation on how to implement tool calls based on the principle of ReAct Prompting; please refer to [the ReAct example](examples/react_prompt.md). Based on this principle, we provide support for function calling in [openai_api.py](openai_api.py). -| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ | -|:-----------------|:----------------------:|:---------------------:|:---------------------:| -| GPT-4 | 95% | **0.90** | 15% | -| GPT-3.5 | 85% | 0.88 | 75% | -| **Qwen-7B-Chat** | **99%** | 0.89 | **9.7%** | +We have tested the model's tool calling capabilities on our open-source Chinese evaluation benchmark and found that Qwen-Chat consistently performs well: -For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md). The use of tools can enable the model to better perform tasks.
+**Chinese Tool-Use Benchmark**
+
+| Model         | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
+|:--------------|:----------------------:|:---------------------:|:---------------------:|
+| GPT-4         | 95%                    | 0.90                  | 15.0%                 |
+| GPT-3.5       | 85%                    | 0.88                  | 75.0%                 |
+| Qwen-7B-Chat  | 98%                    | 0.91                  | 7.3%                  |
+| Qwen-14B-Chat | 98%                    | 0.93                  | 2.4%                  |
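Function calling through [openai_api.py](openai_api.py) follows the OpenAI-style schema. Below is a hedged client-side sketch; the local endpoint, the served model name, and the weather tool are assumptions for illustration and should be adapted to your own deployment:

```python
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed address of a locally running openai_api.py
openai.api_key = "none"                        # the local server does not check the key

functions = [{
    "name": "get_current_weather",             # hypothetical tool, for illustration only
    "description": "Get the current weather in a given location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "City name"}},
        "required": ["location"],
    },
}]

response = openai.ChatCompletion.create(
    model="Qwen",                               # assumed served model name
    messages=[{"role": "user", "content": "What is the weather like in Beijing?"}],
    functions=functions,
)
# If the model decides to call the tool, the reply carries a `function_call`
# whose arguments you execute yourself and feed back as a `function` message.
print(response.choices[0].message)
```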
-Additionally, we provide experimental results to show its capabilities of playing as an agent. See [Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) for more information. Its performance on the run-mode benchmark provided by Hugging Face is as follows: +To assess Qwen's ability to use the Python Code Interpreter for tasks such as mathematical problem solving and data visualization, as well as general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed for evaluating these capabilities. You can find the benchmark at this [link](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark). -| Model | Tool Selection↑ | Tool Used↑ | Code↑ | -|:-----------------|:---------------:|:----------:|:---------:| -| GPT-4 | **100** | **100** | **97.41** | -| GPT-3.5 | 95.37 | 96.30 | 87.04 | -| StarCoder-15.5B | 87.04 | 87.96 | 68.89 | -| **Qwen-7B-Chat** | 90.74 | 92.59 | 74.07 | +We have observed that Qwen performs well in terms of code executability and result accuracy when generating code:
+**Executable Rate of Generated Code (%)**
+
+| Model                  | Math↑ | Visualization↑ | General↑ |
+|:-----------------------|:-----:|:--------------:|:--------:|
+| GPT-4                  | 91.9  | 85.9           | 82.8     |
+| GPT-3.5                | 89.2  | 65.0           | 74.1     |
+| LLaMA2-7B-Chat         | 41.9  | 33.1           | 24.1     |
+| LLaMA2-13B-Chat        | 50.0  | 40.5           | 48.3     |
+| CodeLLaMA-7B-Instruct  | 85.1  | 54.0           | 70.7     |
+| CodeLLaMA-13B-Instruct | 93.2  | 55.8           | 74.1     |
+| InternLM-7B-Chat-v1.1  | 78.4  | 44.2           | 62.1     |
+| InternLM-20B-Chat      | 70.3  | 44.2           | 65.5     |
+| Qwen-7B-Chat           | 82.4  | 64.4           | 67.2     |
+| Qwen-14B-Chat          | 89.2  | 84.1           | 65.5     |
+**Accuracy of Code Execution Results (%)**
+
+| Model                  | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
+|:-----------------------|:-----:|:-------------------:|:-------------------:|
+| GPT-4                  | 82.8  | 66.7                | 60.8                |
+| GPT-3.5                | 47.3  | 33.3                | 55.7                |
+| LLaMA2-7B-Chat         | 3.9   | 14.3                | 39.2                |
+| LLaMA2-13B-Chat        | 8.3   | 8.3                 | 40.5                |
+| CodeLLaMA-7B-Instruct  | 14.3  | 26.2                | 60.8                |
+| CodeLLaMA-13B-Instruct | 28.2  | 27.4                | 62.0                |
+| InternLM-7B-Chat-v1.1  | 28.5  | 4.8                 | 40.5                |
+| InternLM-20B-Chat      | 34.6  | 21.4                | 45.6                |
+| Qwen-7B-Chat           | 41.9  | 40.5                | 54.4                |
+| Qwen-14B-Chat          | 58.4  | 53.6                | 59.5                |
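For intuition about the "executable rate" metric in the tables above, here is a simplified, hedged sketch of the idea (not the benchmark's actual harness): a generated snippet counts as executable if it runs to completion without an error.

```python
import subprocess
import sys
import tempfile

def executable_rate(snippets, timeout=10):
    """Toy illustration: percentage of model-generated snippets that exit cleanly."""
    ok = 0
    for code in snippets:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            ok += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # hanging code counts as non-executable
    return 100.0 * ok / max(len(snippets), 1)
```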


+In addition, we provide experimental results demonstrating that our model is capable of acting as a HuggingFace Agent. For more information, please refer to the [example documentation](examples/transformers_agent.md). The model's performance on the evaluation dataset provided by Hugging Face is as follows:
+**HuggingFace Agent Benchmark - Run Mode**
+
+| Model              | Tool Selection↑ | Tool Used↑ | Code↑ |
+|:-------------------|:---------------:|:----------:|:-----:|
+| GPT-4              | 100             | 100        | 97.4  |
+| GPT-3.5            | 95.4            | 96.3       | 87.0  |
+| StarCoder-Base-15B | 86.1            | 87.0       | 68.9  |
+| StarCoder-15B      | 87.0            | 88.0       | 68.9  |
+| Qwen-7B-Chat       | 87.0            | 87.0       | 71.5  |
+| Qwen-14B-Chat      | 93.5            | 94.4       | 87.0  |
+**HuggingFace Agent Benchmark - Chat Mode**
+
+| Model              | Tool Selection↑ | Tool Used↑ | Code↑ |
+|:-------------------|:---------------:|:----------:|:-----:|
+| GPT-4              | 97.9            | 97.9       | 98.5  |
+| GPT-3.5            | 97.3            | 96.8       | 89.6  |
+| StarCoder-Base-15B | 97.9            | 97.9       | 91.1  |
+| StarCoder-15B      | 97.9            | 97.9       | 89.6  |
+| Qwen-7B-Chat       | 94.7            | 94.7       | 85.1  |
+| Qwen-14B-Chat      | 97.9            | 97.9       | 95.5  |
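As a hedged sketch of what "acting as a HuggingFace Agent" looks like in code (the authoritative walkthrough is [examples/transformers_agent.md](examples/transformers_agent.md); `LocalAgent` is part of the experimental transformers agents API, and its availability depends on your transformers version):

```python
# LocalAgent is assumed to be available in your transformers version (experimental API).
from transformers import AutoModelForCausalLM, AutoTokenizer, LocalAgent

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

# The agent lets the model pick a tool from the HF toolbox, write the glue code, and run it.
agent = LocalAgent(model, tokenizer)
agent.run("Draw me a picture of rivers and lakes.")
```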

## Long-Context Understanding -To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length to over 8K tokens. We conduct language modeling experiments on the arXiv dataset with the PPL evaluation and find that Qwen-7B can reach outstanding performance in the scenario of long context. Results are demonstrated below: +To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length of Qwen-14B (and the original Qwen-7B) from 2K to over 8K tokens, and that of the updated Qwen-7B from 8K to 32K tokens. We conduct language modeling experiments on the arXiv dataset with the PPL evaluation and find that Qwen can reach outstanding performance in the scenario of long context. Results are demonstrated below:
-| Model                              | 1024 | 2048 | 4096  | 8192   | 16384   |
-|:-----------------------------------|:----:|:----:|:-----:|:------:|:-------:|
-| Qwen-7B                            | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
-| + dynamic_ntk                      | 4.23 | 3.78 | 3.59  | 3.66   | 5.71    |
-| + dynamic_ntk + logn               | 4.23 | 3.78 | 3.58  | 3.56   | 4.62    |
-| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58  | 3.49   | 4.32    |
+| Model                              | 1024 | 2048 | 4096  | 8192   | 16384   | 32768  |
+|:-----------------------------------|:----:|:----:|:-----:|:------:|:-------:|:------:|
+| Qwen-7B (original)                 | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | -      |
+| + dynamic_ntk                      | 4.23 | 3.78 | 3.59  | 3.66   | 5.71    | -      |
+| + dynamic_ntk + logn               | 4.23 | 3.78 | 3.58  | 3.56   | 4.62    | -      |
+| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58  | 3.49   | 4.32    | -      |
+| Qwen-7B                            | 4.23 | 3.81 | 3.52  | 3.31   | 7.27    | 181.49 |
+| + dynamic_ntk + logn + window_attn | 4.23 | 3.81 | 3.52  | 3.33   | 3.22    | 3.17   |
+| Qwen-14B                           | -    | 3.46 | 22.79 | 334.65 | 3168.35 | -      |
+| + dynamic_ntk + logn + window_attn | -    | 3.46 | 3.29  | 3.18   | 3.42    | -      |
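For intuition, here is a rough, hedged illustration of two of the techniques named above (this is not the repository's implementation): dynamic NTK enlarges the RoPE base once the input exceeds the training length, and LogN scaling rescales queries for positions beyond it.

```python
import math

def dynamic_ntk_base(seq_len, train_len=2048, base=10000.0, rotary_dim=128):
    """Illustrative only: grow the RoPE base when the input exceeds the training length."""
    if seq_len <= train_len:
        return base
    alpha = 2 ** math.ceil(math.log2(seq_len / train_len) + 1) - 1
    return base * alpha ** (rotary_dim / (rotary_dim - 2))

def logn_scale(position, train_len=2048):
    """Illustrative only: queries beyond the training length get a log_n(position) factor."""
    return max(math.log(position) / math.log(train_len), 1.0)
```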
+ + +## Tokenizer + +Our tokenizer based on tiktoken is different from other tokenizers, e.g., sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and related use in fine-tuning, please refer to the [documentation](tokenization_note.md). +
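As a quick sanity check before finetuning, it can help to look at how your data is actually tokenized; a minimal sketch follows (the linked note remains the authoritative reference for special-token handling):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

for sample in ["Hello, world!", "通义千问是一个大规模语言模型。"]:
    ids = tokenizer.encode(sample)
    # Round-trip the ids to confirm nothing is lost or merged unexpectedly.
    print(len(ids), tokenizer.decode(ids))
```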

## Reproduction @@ -507,10 +743,10 @@ If you meet problems, please refer to [FAQ](FAQ.md) and the issues first to sear ## License Agreement -Researchers and developers are free to use the codes and model weights of both Qwen-7B and Qwen-7B-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply. +Researchers and developers are free to use the codes and model weights of both Qwen and Qwen-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.

## Contact Us -If you are interested to leave a message to either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com. +If you are interested in leaving a message to either our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to qianwen_opensource@alibabacloud.com. diff --git a/README_CN.md b/README_CN.md index 501d8ec..39922e3 100644 --- a/README_CN.md +++ b/README_CN.md @@ -4,32 +4,45 @@

- +


- Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  | Qwen-7B-Chat-Int4 🤗 + 🤗 Hugging Face   |   🤖 魔搭社区   |    📑 论文   |   🖥️ Demo
-WeChat   |   Discord   |   Demo  |  Report +微信   |    钉钉    |   Discord  



-我们在🤖 **ModelScope**以及🤗 **Hugging Face**均开源了**Qwen-7B**系列模型。请在本文档顶部点击相关链接查看仓库信息。本仓库主要包括Qwen-7B的简介、使用指南、技术备忘等内容。想了解更多关于模型的信息,请点击[链接](tech_memo.md)查看我们的技术备忘录。 +| | Qwen-Chat | Qwen-Chat (Int4) | Qwen | +|-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:| +| 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | +| 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | -通义千问-7B(Qwen-7B) 是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。同时,在Qwen-7B的基础上,我们使用对齐机制打造了基于大语言模型的AI助手Qwen-7B-Chat。Qwen-7B系列模型的特点包括: +我们开源了**Qwen**(通义千问)系列工作,当前开源模型的参数规模为70亿(7B)和140亿(14B)。本次开源包括基础模型**Qwen**,即**Qwen-7B**和**Qwen-14B**,以及对话模型**Qwen-Chat**,即**Qwen-7B-Chat**和**Qwen-14B-Chat**。模型链接在表格中,请点击了解详情。 -1. **大规模高质量预训练数据**:我们使用了超过2.2万亿token的自建大规模预训练数据集进行语言模型的预训练。数据集包括文本和代码等多种数据类型,覆盖通用领域和专业领域。 -2. **优秀的模型性能**:相比同规模的开源模型,Qwen-7B在多个评测数据集上具有显著优势,甚至超出12-13B等更大规模的模型。评测评估的能力范围包括自然语言理解与生成、数学运算解题、代码生成等。 -3. **更好地支持多语言**:基于更大词表的分词器在分词上更高效,同时它对其他语言表现更加友好。用户可以在Qwen-7B的基础上更方便地训练特定语言的7B语言模型。 -4. **8K的上下文长度**:Qwen-7B及Qwen-7B-Chat均能支持8K的上下文长度, 允许用户输入更长的prompt。 -5. **支持插件调用**:Qwen-7B-Chat针对插件调用相关的对齐数据做了特定优化,当前模型能有效调用插件以及升级为Agent。 +当前基础模型已经稳定训练了大规模高质量且多样化的数据,覆盖多语言(当前绝以中文和英文为主),总量高达3万亿token。在相关基准评测中,Qwen系列模型拿出非常有竞争力的表现,显著超出同规模模型并紧追一系列最强的闭源模型。此外,我们利用SFT和RLHF技术实现对齐,从基座模型训练得到对话模型。Qwen-Chat具备聊天、文字创作、摘要、信息抽取、翻译等能力,同时还具备一定的代码生成和简单数学推理的能力。在此基础上,我们针对LLM对接外部系统等方面针对性地做了优化,当前具备较强的工具调用能力,以及最近备受关注的Code Interpreter的能力和扮演Agent的能力。 -以下章节的信息可能对你有帮助,建议阅读。如果你在使用过程遇到问题,建议先查询FAQ,如仍无法解决再提交issue。 +在这个项目中,你可以了解到以下内容 + +* 快速上手Qwen-Chat教程,玩转大模型推理. +* 量化模型相关细节,包括用法、显存占用、推理性能等。这部分还提供了和非量化模型的对比。 +* 微调的教程,帮你实现全参数微调、LoRA以及Q-LoRA。 +* 搭建Demo的方法,包括WebUI和CLI Demo +* 更多关于Qwen在工具调用、Code Interpreter、Agent方面的内容 +* 长序列理解能力及评测 +* 使用协议 +* ... + +如果遇到问题,请优先考虑查询[FAQ](FAQ.md)。如仍未解决,随时提出issue(但建议使用英语或提供翻译,有助于帮助更多用户)。如果想帮助我们提升,欢迎提交Pull Requests! + +想和我们一起讨论和聊天的话,赶紧加入我们的微信群和Discord server(入口见文档开头部分)!

## 新闻 +* 2023年9月25日 在魔搭社区(ModelScope)和Hugging Face同步推出Qwen-14B和Qwen-14B-Chat模型。 * 2023年9月12日 支持Qwen-7B和Qwen-7B-Chat的微调,其中包括全参数微调、LoRA以及Q-LoRA。 * 2023年8月21日 发布Qwen-7B-Chat的Int4量化模型,Qwen-7B-Chat-Int4。该模型显存占用低,推理速度相比半精度模型显著提升,在基准评测上效果损失较小。 * 2023年8月3日 在魔搭社区(ModelScope)和Hugging Face同步推出Qwen-7B和Qwen-7B-Chat模型。同时,我们发布了技术备忘录,介绍了相关的训练细节和模型表现。 @@ -37,29 +50,32 @@ ## 评测表现 -Qwen-7B在多个全面评估自然语言理解与生成、数学运算解题、代码生成等能力的评测数据集上,包括MMLU、C-Eval、GSM8K、HumanEval、WMT22、CMMLU等,均超出了同规模大语言模型的表现,甚至超出了如12-13B参数等更大规模的语言模型。 - -| Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU | -| :---------------- | :------------: | :------------: | :------------: | :------------: | :------------: |:------------: | -| LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - | -| LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - | -| Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 | -| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 | -| InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - | -| Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 | -| LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - | -| LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - | -| ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - | -| **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** | +Qwen-14B及Qwen-7B (最新版本使用更大量的token进行预训练)相比同规模模型均实现了效果的显著提升。我们评测的数据集包括MMLU、C-Eval、 GSM8K、 MATH、HumanEval、MBPP、BBH等数据集,考察的能力包括自然语言理解、知识、数学计算和推理、代码生成、逻辑推理等。当然,即便Qwen-14B相比GPT-3.5和GPT-4仍有差距。 -

- +

+


-此外,根据[OpenCompass](https://opencompass.org.cn/leaderboard-llm)进行的大型语言模型第三方评估,Qwen-7B 和 Qwen-7B-Chat 是其中表现最优的7B参数模型。该评估由大量公开基准组成,用于评估语言理解和生成、代码生成、数学、推理等。 - -更多的实验结果和细节请查看我们的技术备忘录。点击[这里](tech_memo.md)。 +| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | +|:-----------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|:--------:|:--------:| +| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | +| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | +| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | +| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | +| InternLM-7B | 51.0 | 52.8 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | +| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | +| Baichuan2-7B | 54.2 | 54.0 | 24.5 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | +| Baichuan2-13B | 59.2 | 58.1 | 52.8 | 10.1 | 17.1 | 30.2 | 48.8 | 62.0 | +| **Qwen-7B (original)** | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 | +| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | +| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** | + + +对于以上所有对比模型,我们列出了其官方汇报结果与[OpenCompass](https://opencompass.org.cn/leaderboard-llm)结果之间的最佳分数。 + +更多的实验结果和细节请查看我们的技术备忘录。点击[这里](TODO)。

## 要求 @@ -93,13 +109,13 @@ cd flash-attention && pip install . #### 🤗 Transformers -如希望使用Qwen-7B-chat进行推理,所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码。** +如希望使用Qwen-chat进行推理,所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码,并指定正确的模型名称和路径,如`Qwen/Qwen-7B-Chat`和`Qwen/Qwen-14B-Chat`** ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig -# 请注意:分词器默认行为已更改为默认关闭特殊token攻击防护。 +# 可选的模型包括: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat" tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存 @@ -135,15 +151,16 @@ print(response) # 《奋斗创业:一个年轻人的成功之路》 ``` -运行Qwen-7B同样非常简单。 +运行Qwen同样非常简单。

- 运行Qwen-7B + 运行Qwen ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig +# 可选的模型包括: "Qwen/Qwen-7B", "Qwen/Qwen-14B" tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True) # 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存 @@ -175,6 +192,7 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)) from modelscope import AutoModelForCausalLM, AutoTokenizer from modelscope import GenerationConfig +# 可选的模型包括: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat" tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval() model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参 @@ -188,18 +206,11 @@ print(response) ```
-## Tokenization - -> 注:作为术语的“tokenization”在中文中尚无共识的概念对应,本文档采用英文表达以利说明。 - -基于tiktoken的tokenizer有别于其他分词器,比如sentencepiece tokenizer。尤其在微调阶段,需要特别注意特殊token的使用。关于tokenizer的更多信息,以及微调时涉及的相关使用,请参阅[文档](tokenization_note_zh.md)。 -

- ## 量化 ### 用法 -**请注意:我们更新量化方案为基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化,提供Qwen-7B-Chat的Int4量化模型[点击这里](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)。相比此前方案,该方案在模型评测效果几乎无损,且存储需求更低,推理速度更优。** +**请注意:我们更新量化方案为基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化,提供Int4量化模型,包括Qwen-7B-Chat [Click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)和Qwen-14B-Chat [Click here](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4)。该方案在模型评测效果几乎无损,且存储需求更低,推理速度更优。** 以下我们提供示例说明如何使用Int4量化模型。在开始使用前,请先保证满足要求(如torch 2.0及以上,transformers版本为4.32.0及以上,等等),并安装所需安装包: @@ -212,6 +223,7 @@ pip install auto-gptq optimum 随后即可使用和上述一致的用法调用量化模型: ```python +# 可选模型包括:"Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4" model = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen-7B-Chat-Int4", device_map="auto", @@ -223,19 +235,23 @@ response, history = model.chat(tokenizer, "Hi", history=None) 我们对BF16和Int4模型在基准评测上做了测试,发现量化模型效果损失较小,结果如下所示: -| Quantization | MMLU | CEval (val) | GSM8K | Humaneval | -| ------------- | :--------: | :----------: | :----: | :--------: | -| BF16 | 53.9 | 54.2 | 41.1 | 24.4 | -| Int4 | 52.6 | 52.9 | 38.1 | 23.8 | +| Quantization | MMLU | CEval (val) | GSM8K | Humaneval | +|----------------------|:----:|:-----------:|:-----:|:---------:| +| Qwen-7B-Chat (BF16) | 53.9 | 54.2 | 41.1 | 24.4 | +| Qwen-7B-Chat (Int4) | 52.6 | 52.9 | 38.1 | 23.8 | +| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 61.0 | 43.9 | +| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 | ### 推理速度 我们测算了BF16和Int4模型生成2048和8192个token的平均推理速度(tokens/s)。如图所示: -| Quantization | Speed (2048 tokens) | Speed (8192 tokens) | -| ------------- | :------------------:| :------------------:| -| BF16 | 30.34 | 29.32 | -| Int4 | 43.56 | 33.92 | +| Quantization | Speed (2048 tokens) | Speed (8192 tokens) | +|----------------------|:-------------------:|:-------------------:| +| Qwen-7B-Chat (BF16) | 30.34 | 29.32 | +| Qwen-7B-Chat (Int4) | 43.56 | 33.92 | +| Qwen-14B-Chat (BF16) | 30.70 | 21.73 | +| Qwen-14B-Chat (Int4) | 37.11 | 26.11 | 具体而言,我们记录在长度为1的上下文的条件下生成8192个token的性能。评测运行于单张A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.4。推理速度是生成8192个token的速度均值。 @@ -243,10 +259,12 @@ response, history = model.chat(tokenizer, "Hi", history=None) 我们还测算了BF16和Int4模型编码2048个token及生成8192个token的峰值显存占用情况。结果如下所示: -| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens | -| ------------------ | :---------------------------------: | :-----------------------------------: | -| BF16 | 17.66GB | 22.58GB | -| Int4 | 8.21GB | 13.62GB | +| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens | +|----------------------|:-----------------------------------:|:-------------------------------------:| +| Qwen-7B-Chat (BF16) | 17.66GB | 22.58GB | +| Qwen-7B-Chat (Int4) | 8.21GB | 13.62GB | +| Qwen-14B-Chat (BF16) | 30.15GB | 38.94GB | +| Qwen-14B-Chat (Int4) | 13.00GB | 21.79GB | 上述性能测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)完成。

@@ -439,54 +457,272 @@ model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2) ## 工具调用 -Qwen-7B-Chat针对包括API、数据库、模型等工具在内的调用进行了优化。用户可以开发基于Qwen-7B的LangChain、Agent甚至Code Interpreter。在我们开源的[评测数据集](eval/EVALUATION.md)上测试模型的工具调用能力,并发现Qwen-7B-Chat能够取得稳定的表现。 +Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以开发基于Qwen的Agent、LangChain应用、甚至Code Interpreter。 -| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ | -|:-----------------|:----------------------:|:----------------------:|:----------------------:| -| GPT-4 | 95% | **0.90** | 15% | -| GPT-3.5 | 85% | 0.88 | 75% | -| **Qwen-7B-Chat** | **99%** | 0.89 | **9.7%** | +我们提供了文档说明如何根据ReAct Prompting的原理实现工具调用,请参见[ReAct示例](examples/react_prompt.md)。基于该原理,我们在 [openai_api.py](openai_api.py) 里提供了函数调用(Function Calling)的支持。 +我们在已开源的中文[评测数据集](eval/EVALUATION.md)上测试模型的工具调用能力,并发现Qwen-Chat能够取得稳定的表现: -我们提供了文档说明如何根据ReAct Prompting的原则写作你的prompt。 + + + + + + + + + + + + + + + + + + + +
+**中文工具调用评测基准**
+
+| Model         | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
+|:--------------|:----------------------:|:---------------------:|:---------------------:|
+| GPT-4         | 95%                    | 0.90                  | 15.0%                 |
+| GPT-3.5       | 85%                    | 0.88                  | 75.0%                 |
+| Qwen-7B-Chat  | 98%                    | 0.91                  | 7.3%                  |
+| Qwen-14B-Chat | 98%                    | 0.93                  | 2.4%                  |
-For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md)。 +为了考察Qwen使用Python Code Interpreter完成数学解题、数据可视化、及文件处理与爬虫等任务的能力,我们专门建设并开源了一个评测这方面能力的[评测基准](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark)。 +我们发现Qwen在生成代码的可执行率、结果正确性上均表现较好: -此外,我们还提供了实验结果表明我们的模型扮演Agent的能力。请阅读相关文档[链接](https://huggingface.co/docs/transformers/transformers_agents)了解更多信息。模型在Hugging Face提供的评测数据集上表现如下: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+**生成代码的可执行率 (%)**
+
+| Model                  | Math↑ | Visualization↑ | General↑ |
+|:-----------------------|:-----:|:--------------:|:--------:|
+| GPT-4                  | 91.9  | 85.9           | 82.8     |
+| GPT-3.5                | 89.2  | 65.0           | 74.1     |
+| LLaMA2-7B-Chat         | 41.9  | 33.1           | 24.1     |
+| LLaMA2-13B-Chat        | 50.0  | 40.5           | 48.3     |
+| CodeLLaMA-7B-Instruct  | 85.1  | 54.0           | 70.7     |
+| CodeLLaMA-13B-Instruct | 93.2  | 55.8           | 74.1     |
+| InternLM-7B-Chat-v1.1  | 78.4  | 44.2           | 62.1     |
+| InternLM-20B-Chat      | 70.3  | 44.2           | 65.5     |
+| Qwen-7B-Chat           | 82.4  | 64.4           | 67.2     |
+| Qwen-14B-Chat          | 89.2  | 84.1           | 65.5     |
-| Model | Tool Selection↑ | Tool Used↑ | Code↑ | -|:-----------------|:---------------:|:-----------:|:---------:| -| GPT-4 | **100** | **100** | **97.41** | -| GPT-3.5 | 95.37 | 96.30 | 87.04 | -| StarCoder-15.5B | 87.04 | 87.96 | 68.89 | -| **Qwen-7B-Chat** | 90.74 | 92.59 | 74.07 | + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+**代码执行结果的正确率 (%)**
+
+| Model                  | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
+|:-----------------------|:-----:|:-------------------:|:-------------------:|
+| GPT-4                  | 82.8  | 66.7                | 60.8                |
+| GPT-3.5                | 47.3  | 33.3                | 55.7                |
+| LLaMA2-7B-Chat         | 3.9   | 14.3                | 39.2                |
+| LLaMA2-13B-Chat        | 8.3   | 8.3                 | 40.5                |
+| CodeLLaMA-7B-Instruct  | 14.3  | 26.2                | 60.8                |
+| CodeLLaMA-13B-Instruct | 28.2  | 27.4                | 62.0                |
+| InternLM-7B-Chat-v1.1  | 28.5  | 4.8                 | 40.5                |
+| InternLM-20B-Chat      | 34.6  | 21.4                | 45.6                |
+| Qwen-7B-Chat           | 41.9  | 40.5                | 54.4                |
+| Qwen-14B-Chat          | 58.4  | 53.6                | 59.5                |


+ +此外,我们还提供了实验结果表明我们的模型具备扮演HuggingFace Agent的能力,详见[示例文档](examples/transformers_agent.md)了解更多信息。模型在Hugging Face提供的评测数据集上表现如下: + + + + + + + + + + + + + + + + + + + + + + + + + + +
+**HuggingFace Agent评测基准 - Run模式**
+
+| Model              | Tool Selection↑ | Tool Used↑ | Code↑ |
+|:-------------------|:---------------:|:----------:|:-----:|
+| GPT-4              | 100             | 100        | 97.4  |
+| GPT-3.5            | 95.4            | 96.3       | 87.0  |
+| StarCoder-Base-15B | 86.1            | 87.0       | 68.9  |
+| StarCoder-15B      | 87.0            | 88.0       | 68.9  |
+| Qwen-7B-Chat       | 87.0            | 87.0       | 71.5  |
+| Qwen-14B-Chat      | 93.5            | 94.4       | 87.0  |
+**HuggingFace Agent评测基准 - Chat模式**
+
+| Model              | Tool Selection↑ | Tool Used↑ | Code↑ |
+|:-------------------|:---------------:|:----------:|:-----:|
+| GPT-4              | 97.9            | 97.9       | 98.5  |
+| GPT-3.5            | 97.3            | 96.8       | 89.6  |
+| StarCoder-Base-15B | 97.9            | 97.9       | 91.1  |
+| StarCoder-15B      | 97.9            | 97.9       | 89.6  |
+| Qwen-7B-Chat       | 94.7            | 94.7       | 85.1  |
+| Qwen-14B-Chat      | 97.9            | 97.9       | 95.5  |

## 长文本理解 -我们引入了NTK插值、窗口注意力、LogN注意力缩放等技术来提升模型的上下文长度并突破训练序列长度的限制。我们的模型已经突破8K的序列长度。通过arXiv数据集上的语言模型实验,我们发现Qwen-7B能够在长序列的设置下取得不错的表现。 +我们引入了NTK插值、窗口注意力、LogN注意力缩放等技术来提升模型的上下文长度并突破训练序列长度的限制。通过arXiv数据集上的语言模型实验,我们的原生长度为2K的Qwen-7B/14B在8K的序列长度下依然表现不错,而原生长度扩展到8K的Qwen-7B能够在32K长序列的设置下取得不错的表现。 - + - + + + + - + - + - + - + + + + + + + + + + + + + +
-| Model                              | 1024 | 2048 | 4096  | 8192   | 16384   |
-|:-----------------------------------|:----:|:----:|:-----:|:------:|:-------:|
-| Qwen-7B                            | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
-| + dynamic_ntk                      | 4.23 | 3.78 | 3.59  | 3.66   | 5.71    |
-| + dynamic_ntk + logn               | 4.23 | 3.78 | 3.58  | 3.56   | 4.62    |
-| + dynamic_ntk + logn + local_attn  | 4.23 | 3.78 | 3.58  | 3.49   | 4.32    |
+| Model                              | 1024 | 2048 | 4096  | 8192   | 16384   | 32768  |
+|:-----------------------------------|:----:|:----:|:-----:|:------:|:-------:|:------:|
+| Qwen-7B (original)                 | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | -      |
+| + dynamic_ntk                      | 4.23 | 3.78 | 3.59  | 3.66   | 5.71    | -      |
+| + dynamic_ntk + logn               | 4.23 | 3.78 | 3.58  | 3.56   | 4.62    | -      |
+| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58  | 3.49   | 4.32    | -      |
+| Qwen-7B                            | 4.23 | 3.81 | 3.52  | 3.31   | 7.27    | 181.49 |
+| + dynamic_ntk                      | 4.23 | 3.81 | 3.52  | 3.31   | 3.23    | 3.33   |
+| + dynamic_ntk + logn + window_attn | 4.23 | 3.81 | 3.52  | 3.33   | 3.22    | 3.17   |
+| Qwen-14B                           | -    | 3.46 | 22.79 | 334.65 | 3168.35 | -      |
+| + dynamic_ntk + logn + window_attn | -    | 3.46 | 3.29  | 3.18   | 3.42    | -      |
+ +## Tokenization + +> 注:作为术语的“tokenization”在中文中尚无共识的概念对应,本文档采用英文表达以利说明。 + +基于tiktoken的tokenizer有别于其他分词器,比如sentencepiece tokenizer。尤其在微调阶段,需要特别注意特殊token的使用。关于tokenizer的更多信息,以及微调时涉及的相关使用,请参阅[文档](tokenization_note_zh.md)。 +

## 复现 @@ -500,10 +736,10 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct ## 使用协议 -研究人员与开发者可使用Qwen-7B和Qwen-7B-Chat或进行二次开发。我们同样允许商业使用,具体细节请查看[LICENSE](LICENSE)。如需商用,请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。 +研究人员与开发者可使用Qwen和Qwen-Chat或进行二次开发。我们同样允许商业使用,具体细节请查看[LICENSE](LICENSE)。如需商用,请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。

## 联系我们 -如果你想给我们的研发团队和产品团队留言,请通过邮件(qianwen_opensource@alibabacloud.com)联系我们。 +如果你想给我们的研发团队和产品团队留言,欢迎加入我们的微信群和Discord server。当然也可以通过邮件(qianwen_opensource@alibabacloud.com)联系我们。 diff --git a/README_JA.md b/README_JA.md index 6490fd5..f1108da 100644 --- a/README_JA.md +++ b/README_JA.md @@ -35,6 +35,7 @@ Qwen-7B は、アリババクラウドが提唱する大規模言語モデルシ ## ニュースとアップデート +* 2023.9.25 ModelScope と Hugging Face 上で **Qwen-14B** と **Qwen-14B-Chat** をリリースしました。 * 2023.9.12 Qwen-7Bモデルにおいて、フルパラメーター・ファインチューニング、LoRA、Q-LoRAを含むファインチューニングをサポートしました。 * 2023.8.21 Qwen-7B-Chat 用 Int4 量子化モデル **Qwen-7B-Chat-Int4** をリリースしました。また、ベンチマーク評価においても大きな性能低下は見られませんでした。 * 2023.8.3 ModelScope と Hugging Face 上で **Qwen-7B** と **Qwen-7B-Chat** をリリースしました。また、トレーニングの詳細やモデルの性能など、モデルの詳細については技術メモを提供しています。 @@ -42,29 +43,27 @@ Qwen-7B は、アリババクラウドが提唱する大規模言語モデルシ ## 性能 -Qwen-7B は、MMLU、C-Eval、GSM8K、HumanEval、WMT22、CMMLU など、自然言語理解、数学的問題解決、コーディングなどに関するモデルの能力を評価する一連のベンチマークデータセットにおいて、同程度のモデルサイズのベースラインモデルを凌駕しており、さらには 13B 程度のパラメータを持つより大規模なモデルをも凌駕しています。以下の結果をご覧ください。 - -| Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU | -| :---------------- | :------------: | :------------: | :------------: | :------------: | :------------: |:------------: | -| LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - | -| LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - | -| Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 | -| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 | -| InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - | -| Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 | -| LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - | -| LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - | -| ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - | -| **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** | +Qwen-14B は、MMLU、C-Eval、GSM8K、HumanEval、CMMLU など、自然言語理解、数学的問題解決、コーディングなどに関するモデルの能力を評価する一連のベンチマークデータセットにおいて、同程度のモデルサイズのベースラインモデルを凌駕しており。以下の結果をご覧ください。 + +| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU | +|:------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|:--------:|:--------:| +| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot | +| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 | +| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 | +| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - | +| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - | +| InternLM-7B | 51.0 | 52.8 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 | +| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | +| Baichuan2-7B | 54.2 | 54.0 | 24.5 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 | +| Baichuan2-13B | 59.2 | 58.1 | 52.8 | 10.1 | 17.1 | 30.2 | 48.8 | 62.0 | +| **Qwen-7B** | 56.7 | 59.6 | 51.6 | - | 24.4 | 31.2 | 40.6 | 58.8 | +| **Qwen-7B v1.1** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 | +| **Qwen-14B** | **66.3** | **72.1** | **61.3** | **24.8** | **32.3** | **40.8** | **53.4** | **71.0** | -

- -

-
-さらに、[OpenCompass](https://opencompass.org.cn/leaderboard-llm) が実施した大規模言語モデルの第三者評価によると、Qwen-7B と Qwen-7B-Chat は 7B パラメータモデルのトップになります。この評価は、言語理解・生成、コーディング、数学、推論などの評価のための大量の公開ベンチマークで構成されています。 +比較されたすべてのモデルについて、公式に報告された結果と[OpenCompass](https://opencompass.org.cn/leaderboard-llm) の間の最高スコアを報告します。 -より詳細な実験結果(より多くのベンチマークデータセットでの詳細なモデル性能)や詳細については、[こちら](tech_memo.md)をクリックして技術メモを参照してください。 +より詳細な実験結果(より多くのベンチマークデータセットでの詳細なモデル性能)や詳細については、[こちら](TODO)をクリックして技術メモを参照してください。

## 必要条件 @@ -442,22 +441,217 @@ model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2) Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利用に特化して最適化されており、ユーザは独自の Qwen-7B ベースの LangChain、エージェント、コードインタプリタを構築することができます。ツール利用能力を評価するための評価[ベンチマーク](eval/EVALUATION.md)では、Qwen-7B は安定した性能に達しています。 -| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ | -|:-----------------|:----------------------:|:----------------------:|:----------------------:| -| GPT-4 | 95% | **0.90** | 15% | -| GPT-3.5 | 85% | 0.88 | 75% | -| **Qwen-7B-Chat** | **99%** | 0.89 | **9.7%** | + + + + + + + + + + + + + + Qwen-7B-Chat v1.1 + + + + + +
Chinese Tool-Use Benchmark
ModelTool Selection (Acc.↑)Tool Input (Rouge-L↑)False Positive Error↓
GPT-495%0.9015.0%
GPT-3.585%0.8875.0%
Qwen-7B-Chat v1.198%0.917.3%
Qwen-14B-Chat98%0.932.4%
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Qwen-7B-Chat v1.1 + + + + + + + + + + + +
Using Code Interpreter - Executable Rate of Generated Code (%)
ModelMath↑Visualization↑General↑
GPT-491.985.982.8
GPT-3.589.265.074.1
LLaMA2-7B-Chat41.933.124.1
LLaMA2-13B-Chat50.040.548.3
CodeLLaMA-7B-Instruct85.154.070.7
CodeLLaMA-13B-Instruct93.255.874.1
InternLM-7B-Chat-v1.178.444.262.1
InternLM-20B-Chat70.344.265.5
Qwen-7B-Chat v1.182.464.467.2
Qwen-14B-Chat89.284.165.5
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Qwen-7B-Chat v1.1 + + + + + + + + + + + +
Using Code Interpreter - Accuracy of Code Execution Results (%)
ModelMath↑Visualization-Hard↑Visualization-Easy↑
GPT-482.866.760.8
GPT-3.547.333.355.7
LLaMA2-7B-Chat3.914.339.2
LLaMA2-13B-Chat8.38.340.5
CodeLLaMA-7B-Instruct14.326.260.8
CodeLLaMA-13B-Instruct28.227.462.0
InternLM-7B-Chat-v1.128.54.840.5
InternLM-20B-Chat34.621.445.6
Qwen-7B-Chat v1.141.940.554.4
Qwen-14B-Chat58.453.659.5
+ ReAct プロンプトの書き方や使い方については、[ReAct の例](examples/react_prompt.md)を参照してください。ツールを使用することで、モデルがよりよいタスクを実行できるようになります。 -さらに、エージェントとしての能力を示す実験結果を提供する。詳細は [Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) を参照して下さい。Hugging Face が提供するランモードベンチマークでの性能は以下の通りです: +

+
+ +
+

+ +さらに、エージェントとしての能力を示す実験結果を提供する。詳細は [Hugging Face Agent](examples/transformers_agent.md) を参照して下さい。Hugging Face が提供するランモードベンチマークでの性能は以下の通りです: + + + + + + + + + + + + + + + + + + + + + Qwen-7B-Chat v1.1 + + + + + +
-| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
-|:-----------------|:---------------:|:-----------:|:---------:|
-| GPT-4 | **100** | **100** | **97.41** |
-| GPT-3.5 | 95.37 | 96.30 | 87.04 |
-| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
-| **Qwen-7B-Chat** | 90.74 | 92.59 | 74.07 |
+**HuggingFace Agent Benchmark - Run Mode**
+
+| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
+|:-------------------|:---------------:|:----------:|:-----:|
+| GPT-4 | 100 | 100 | 97.4 |
+| GPT-3.5 | 95.4 | 96.3 | 87.0 |
+| StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
+| StarCoder-15B | 87.0 | 88.0 | 68.9 |
+| Qwen-7B-Chat v1.1 | 87.0 | 87.0 | 71.5 |
+| Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |
+
+**HuggingFace Agent Benchmark - Chat Mode**
+
+| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
+|:-------------------|:---------------:|:----------:|:-----:|
+| GPT-4 | 97.9 | 97.9 | 98.5 |
+| GPT-3.5 | 97.3 | 96.8 | 89.6 |
+| StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
+| StarCoder-15B | 97.9 | 97.9 | 89.6 |
+| Qwen-7B-Chat v1.1 | 94.7 | 94.7 | 85.1 |
+| Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |
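
The ReAct usage referenced a few hunks above (see `examples/react_prompt.md`) comes down to wrapping the user query in a prompt that lists the available tools and asks the model to alternate Thought/Action/Action Input/Observation steps. The sketch below is a generic ReAct-style template rather than the exact prompt shipped in `examples/react_prompt.md`; the tool description, tool name, and the stop word on `Observation:` are illustrative assumptions.

```python
# Generic ReAct-style prompt sketch (illustrative only; the exact template Qwen-Chat
# was tuned for is documented in examples/react_prompt.md).
TOOL_DESC = "search: useful for looking up factual information. Input: a search query."

REACT_TEMPLATE = """Answer the following question as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the final answer to the original question

Begin!

Question: {query}"""

prompt = REACT_TEMPLATE.format(
    tools=TOOL_DESC,
    tool_names="search",
    query="What will the weather be like in Hangzhou tomorrow?",
)
# The model would then be called with "Observation:" as a stop word, the tool executed
# with the generated Action Input, and the loop repeated until "Final Answer:" appears:
# response, _ = model.chat(tokenizer, prompt, history=None,
#                          stop_words_ids=[tokenizer.encode("Observation:")])
```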

@@ -467,25 +661,40 @@ ReAct プロンプトの書き方や使い方については、[ReAct の例](ex
-| Model / Sequence Length | 1024 | 2048 | 4096 | 8192 | 16384 |
-|:------------------------------------|:----:|:----:|:-----:|:------:|:-------:|
-| Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
-| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
-| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
-| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
+| Model / Sequence Length | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
+|:------------------------------------|:----:|:----:|:-----:|:------:|:-------:|:------:|
+| Qwen-7B (original) | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | - |
+| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 | - |
+| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 | - |
+| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 | - |
+| Qwen-7B v1.1 | 4.23 | 3.81 | 3.52 | 3.31 | 7.27 | 181.49 |
+| + dynamic_ntk | 4.23 | 3.81 | 3.52 | 3.31 | 3.23 | 3.33 |
+| + dynamic_ntk + logn + window_attn | 4.23 | 3.81 | 3.52 | 3.33 | 3.22 | 3.17 |
+| Qwen-14B | - | 3.46 | 22.79 | 334.65 | 3168.35 | - |
+| + dynamic_ntk + logn + window_attn | - | 3.46 | 3.29 | 3.18 | 3.42 | - |
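
The "+ dynamic_ntk", "+ logn", and "+ window_attn" rows above correspond to long-context switches exposed through the model configuration. Below is a minimal sketch of turning the first two on before loading the checkpoint; the field names `use_dynamic_ntk` and `use_logn_attn` are assumed from the model's `config.json`, and the window-attention switch is omitted because its exact field name is not given in this document.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Sketch: enable long-context options before loading the weights.
# Assumed config.json field names: use_dynamic_ntk, use_logn_attn.
model_id = "Qwen/Qwen-7B-Chat"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.use_dynamic_ntk = True  # NTK-aware RoPE scaling for inputs beyond the training length
config.use_logn_attn = True    # log-n attention scaling for long sequences

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, device_map="auto", trust_remote_code=True
).eval()
```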
 
 ## 再現
diff --git a/assets/qwen_tokenizer.png b/assets/qwen_tokenizer.png
index a6b0366..f30794e 100644
Binary files a/assets/qwen_tokenizer.png and b/assets/qwen_tokenizer.png differ
diff --git a/assets/tokenizer.pdf b/assets/tokenizer.pdf
index f33e7e5..2cd6363 100644
Binary files a/assets/tokenizer.pdf and b/assets/tokenizer.pdf differ
diff --git a/assets/tokenizer.png b/assets/tokenizer.png
index b16c0cd..f30794e 100644
Binary files a/assets/tokenizer.png and b/assets/tokenizer.png differ
diff --git a/eval/EVALUATION.md b/eval/EVALUATION.md
index 75ea021..5baeb4d 100644
--- a/eval/EVALUATION.md
+++ b/eval/EVALUATION.md
@@ -12,7 +12,7 @@ cd ../../
 # Qwen-7B
 python evaluate_ceval.py -d data/ceval/
 
-# Qwen-7B-Chat
+# Qwen-7B-Chat (We only provide 0-shot reproduction scripts. 5-shot results are obtained by OpenCompass (https://github.com/InternLM/opencompass).)
 pip install thefuzz
 python evaluate_chat_ceval.py -d data/ceval/
 ```
@@ -29,7 +29,7 @@ cd ../../
 # Qwen-7B
 python evaluate_mmlu.py -d data/mmlu/data/
 
-# Qwen-7B-Chat
+# Qwen-7B-Chat (We only provide 0-shot reproduction scripts. 5-shot results are obtained by OpenCompass (https://github.com/InternLM/opencompass).)
 pip install thefuzz
 python evaluate_chat_mmlu.py -d data/mmlu/data/
 ```
@@ -73,9 +73,8 @@ This program exists to run untrusted model-generated code. Users are strongly en
 # Qwen-7B
 python evaluate_gsm8k.py
 
-# Qwen-7B-Chat
+# Qwen-7B-Chat (We only provide 0-shot reproduction scripts. 5-shot results are obtained by OpenCompass (https://github.com/InternLM/opencompass).)
 python evaluate_chat_gsm8k.py # zeroshot
-python evaluate_chat_gsm8k.py --use-fewshot # fewshot
 ```
 
 - PLUGIN
diff --git a/eval/evaluate_chat_humaneval.py b/eval/evaluate_chat_humaneval.py
index 66dcec8..54ceca8 100644
--- a/eval/evaluate_chat_humaneval.py
+++ b/eval/evaluate_chat_humaneval.py
@@ -1,4 +1,3 @@
-
 import re
 import textwrap
 import argparse
@@ -19,6 +18,7 @@ evaluate_functional_correctness HumanEval_res.jsonl
 
 DEVICE = "cuda:0"
 
+
 def extract_code(text, entry_point):
     # 正则表达式匹配代码块
     code_block_pattern = re.compile(
@@ -99,7 +99,26 @@ if __name__ == "__main__":
     f = jsonlines.open(args.sample_input_file)
     with f_output as output:
         for jobj in tqdm.tqdm(f, desc="task_idx"):
-            prompt = "Help me fill the following code.\n" + jobj["prompt"]
+            # use humanevalpack prompt
+            signature = re.search(
+                rf"def\s+({jobj['entry_point']}.*?):\s*\n", jobj["prompt"]
+            ).group(1)
+            description = "\n".join(
+                [
+                    line.strip()
+                    for line in re.search(
+                        rf"(?:\"\"\"|''')(.*?)(?:\"\"\"|''')", jobj["prompt"], re.DOTALL
+                    )
+                    .group(1)
+                    .split("\n")
+                ]
+            )
+            prompt = (
+                f"Write a Python function `{signature}` to solve the following problem:\n"
+                f"{description}\n"
+                f"{jobj['prompt']}"
+            )
+
             task_id = jobj["task_id"]
             answer, response = generate_sample(
                 model, tokenizer, prompt, jobj["entry_point"]
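
To make the new humanevalpack-style prompt concrete, here is a small self-contained sketch that applies the same two regular expressions to a made-up HumanEval-style record (the sample problem is invented purely for illustration):

```python
import re

# Invented HumanEval-style record, used only to show what the constructed prompt looks like.
jobj = {
    "task_id": "Example/0",
    "entry_point": "add",
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
}

signature = re.search(
    rf"def\s+({jobj['entry_point']}.*?):\s*\n", jobj["prompt"]
).group(1)
description = "\n".join(
    line.strip()
    for line in re.search(
        rf"(?:\"\"\"|''')(.*?)(?:\"\"\"|''')", jobj["prompt"], re.DOTALL
    ).group(1).split("\n")
)
prompt = (
    f"Write a Python function `{signature}` to solve the following problem:\n"
    f"{description}\n"
    f"{jobj['prompt']}"
)
print(prompt)
# Write a Python function `add(a: int, b: int) -> int` to solve the following problem:
# Return the sum of a and b.
# def add(a: int, b: int) -> int:
#     """Return the sum of a and b."""
```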
diff --git a/openai_api.py b/openai_api.py
index f6683f8..814cff9 100644
--- a/openai_api.py
+++ b/openai_api.py
@@ -321,7 +321,7 @@ def parse_response(response):
 # completion mode, not chat mode
-def text_complete_last_message(history, stop_words_ids):
+def text_complete_last_message(history, stop_words_ids, gen_kwargs):
     im_start = "<|im_start|>"
     im_end = "<|im_end|>"
     prompt = f"{im_start}system\nYou are a helpful assistant.{im_end}"
@@ -339,7 +339,7 @@ def text_complete_last_message(history, stop_words_ids):
         stop_words_ids = _stop_words_ids
 
     input_ids = torch.tensor([tokenizer.encode(prompt)]).to(model.device)
-    output = model.generate(input_ids, stop_words_ids=stop_words_ids).tolist()[0]
+    output = model.generate(input_ids, stop_words_ids=stop_words_ids, **gen_kwargs).tolist()[0]
     output = tokenizer.decode(output, errors="ignore")
     assert output.startswith(prompt)
     output = output[len(prompt) :]
@@ -352,6 +352,16 @@ def text_complete_last_message(history, stop_words_ids):
 async def create_chat_completion(request: ChatCompletionRequest):
     global model, tokenizer
 
+    gen_kwargs = {}
+    if request.temperature is not None:
+        if request.temperature < 0.01:
+            gen_kwargs['top_k'] = 1  # greedy decoding
+        else:
+            # Not recommended. Please tune top_p instead.
+            gen_kwargs['temperature'] = request.temperature
+    if request.top_p is not None:
+        gen_kwargs['top_p'] = request.top_p
+
     stop_words = add_extra_stop_words(request.stop)
     if request.functions:
         stop_words = stop_words or []
@@ -366,12 +376,12 @@
                 status_code=400,
                 detail="Invalid request: Function calling is not yet implemented for stream mode.",
             )
-        generate = predict(query, history, request.model, stop_words)
+        generate = predict(query, history, request.model, stop_words, gen_kwargs)
         return EventSourceResponse(generate, media_type="text/event-stream")
 
     stop_words_ids = [tokenizer.encode(s) for s in stop_words] if stop_words else None
     if query is _TEXT_COMPLETION_CMD:
-        response = text_complete_last_message(history, stop_words_ids=stop_words_ids)
+        response = text_complete_last_message(history, stop_words_ids=stop_words_ids, gen_kwargs=gen_kwargs)
     else:
         response, _ = model.chat(
             tokenizer,
@@ -379,6 +389,7 @@
             history=history,
             stop_words_ids=stop_words_ids,
             append_history=False,
+            **gen_kwargs
         )
     print(f"\n{history}\n{query}\n\n{response}\n")
     response = trim_stop_words(response, stop_words)
@@ -396,7 +407,7 @@
 async def predict(
-    query: str, history: List[List[str]], model_id: str, stop_words: List[str]
+    query: str, history: List[List[str]], model_id: str, stop_words: List[str], gen_kwargs: Dict,
 ):
     global model, tokenizer
     choice_data = ChatCompletionResponseStreamChoice(
@@ -416,7 +427,7 @@ async def predict(
                 detail="Invalid request: custom stop words are not yet supported for stream mode.",
             )
     response_generator = model.chat_stream(
-        tokenizer, query, history=history, stop_words_ids=stop_words_ids
+        tokenizer, query, history=history, stop_words_ids=stop_words_ids, **gen_kwargs
    )
     for new_response in response_generator:
         if len(new_response) == current_length:
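
Because the hunks above forward `temperature` and `top_p` from the request into `gen_kwargs`, sampling can now be controlled through the standard OpenAI-style request fields. A minimal client-side sketch follows, assuming `openai_api.py` is serving locally on port 8000 and that the model is exposed under the name "Qwen" (both are assumptions; adjust them to your deployment), using the pre-1.0 `openai` Python package:

```python
import openai  # pre-1.0 openai package interface

# Assumptions: the server from openai_api.py is running on localhost:8000
# and registers the chat model under the name "Qwen".
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # the local server does not validate API keys

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    temperature=0.7,  # forwarded via gen_kwargs; values below 0.01 trigger greedy decoding (top_k=1)
    top_p=0.8,        # forwarded via gen_kwargs
    stream=False,
)
print(response.choices[0].message.content)
```

On the server side a temperature below 0.01 is mapped to `top_k=1`, so sending `temperature=0` is enough to get deterministic output.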