update new version of quantization and inference efficiency profiling result

main
yangapku 1 year ago
parent 8310e25513
commit 04f896f7d4

@ -1,4 +1,4 @@
<br>
span
<p align="center">
<img src="assets/logo.jpg" width="400"/>
@ -6,7 +6,7 @@
<br>
<p align="center">
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a>| <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp &nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/9bjvspyu">Discord</a>
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a>| <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp | Qwen-7B-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a>&nbsp &nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/9bjvspyu">Discord</a>
</p>
<br>
@ -27,26 +27,27 @@ Qwen-7B is the 7B-parameter version of the large language model series, Qwen (ab
The following sections include information that you might find it helpful. Specifically, we advise you to read the FAQ section before you launch issues.
## News
## News and Updates
* 2023.8.3 We release both Qwen-7B and Qwen-7B-Chat on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
* 2023.8.21 We release the Int4 quantized model for Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, which requires low memory costs but achieves improved inference speed. Besides, there is no significant performance degradation on the benchmark evaluation.
* 2023.8.3 We release both **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
## Performance
In general, Qwen-7B outperforms the baseline models of a similar model size, and even outperforms larger models of around 13B parameters, on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, HumanEval, and WMT22, CMMLU, etc., which evaluate the models' capabilities on natural language understanding, mathematic problem solving, coding, etc. See the results below.
| Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
| :---------------- | :------------: | :------------: | :------------: | :------------: | :------------: |:------------: |
| LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
| LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
| Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
| InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
| Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
| LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
| LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
| ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - |
| **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** |
| Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
| :------------- | :--------: | :--------: | :--------: | :---------: | :-------------: | :--------: |
| LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
| LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
| Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
| InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
| Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
| LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
| LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
| ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - |
| **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** |
<p align="center">
<img src="assets/performance.png" width="1000"/>
@ -195,93 +196,65 @@ Our tokenizer based on tiktoken is different from other tokenizers, e.g., senten
## Quantization
We provide examples to show how to load models in `NF4` and `Int8`. For starters, make sure you have implemented `bitsandbytes`. Note that the requirements for `bitsandbytes` are:
### Usage
```
**Requirements** Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
```
**Note: we provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release an Int4 quantized model for Qwen-7B-Chat [Click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4), which achieves nearly lossless model effects but improved performance on both memory costs and inference speed, in comparison with the previous solution.**
Then run the following command to install `bitsandbytes`:
Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of AutoGPTQ and install it from source (temporarily the codes for Qwen are not yet released in the latest version of PyPI package):
```bash
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
```
pip install bitsandbytes
```
Windows users should find another option, which might be [bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels).
Then you only need to add your quantization configuration to `AutoModelForCausalLM.from_pretrained`. See the example below:
Then you can load the quantized model easily as shown below:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# quantization configuration for NF4 (4 bits)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.bfloat16
)
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
```
# quantization configuration for Int8 (8 bits)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
To run inference, it is similar to the basic usage demonstrated above, but remember to pass in the generation configuration explicitly:
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint_path,
device_map="cuda:0",
quantization_config=quantization_config,
max_memory=max_memory,
trust_remote_code=True,
).eval()
```python
from transformers import GenerationConfig
config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
```
With this method, it is available to load Qwen-7B in `NF4` and `Int8`, which saves you memory usage. We provide related statistics of model performance below. We find that the quantization downgrades the effectiveness slightly but significantly reduces memory costs.
### Performance
| Precision | MMLU | GPU Memory for Loading Model |
| ----------- | :------: | :---------------------------: |
| BF16 | 56.7 | 16.38G |
| Int8 | 52.8 | 10.44G |
| NF4 | 48.9 | 7.79G |
We illustrate the model performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:
Note: The GPU memory usage profiling in the above table is performed on single A100-SXM4-80G GPU, PyTorch 2.0.1 and CUDA 11.8, with flash attention used.
## Inference Efficiency
| Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
| -------------- | :----: | :-----------: | :-----: | :---------: |
| BF16 | 53.9 | 54.2 | 41.1 | 24.4 |
| Int4 | 52.6 | 52.9 | 38.1 | 23.8 |
### Inference Speed
We measured the average inference speed of generating 2K tokens under BF16 precision and Int8 or NF4 quantization levels, respectively.
We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization, respectively.
| Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
| ---------------------- | :----------------------------------------: | :---------------------------------------: |
| BF16 (no quantization) | 30.06 | 27.55 |
| Int8 (bnb) | 7.94 | 7.86 |
| NF4 (bnb) | 21.43 | 20.37 |
| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| -------------- | :-------------------: | :-------------------: |
| BF16 | 30.53 | 28.51 |
| Int4 | 45.60 | 33.83 |
In detail, the setting of profiling is generating 2048 new tokens with 1 context token. The profiling runs on single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the generated 2048 tokens.
In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.
### GPU Memory Usage
We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int8/NF4 quantization levels, respectively. The results are shown below.
When using flash attention, the memory usage is:
We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int4 quantization level, respectively. The results are shown below.
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 18.11GB | 23.52GB |
| Int8 | 12.17GB | 17.60GB |
| NF4 | 9.52GB | 14.93GB |
When not using flash attention, the memory usage is:
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 18.11GB | 24.40GB |
| Int8 | 12.18GB | 18.47GB |
| NF4 | 9.52GB | 15.81GB |
| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| -------------- | :-----------------------------------: | :-------------------------------------: |
| BF16 | 18.99GB | 24.40GB |
| In4 | 10.20GB | 15.61GB |
The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
## Demo
### Web UI
We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:
@ -371,22 +344,22 @@ print(response.choices[0].message.content)
Qwen-7B-Chat is specifically optimized for tool usage, including API, database, models, etc., so that users can build their own Qwen-7B-based LangChain, Agent, and Code Interpreter. In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance.
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
|:------------|:----------------------:|:----------------------:|:----------------------:|
| GPT-4 | 95% | **0.90** | 15% |
| GPT-3.5 | 85% | 0.88 | 75% |
| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
| :------------ | :-----------------------: | :----------------------: | :----------------------: |
| GPT-4 | 95% | **0.90** | 15% |
| GPT-3.5 | 85% | 0.88 | 75% |
| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md). The use of tools can enable the model to better perform tasks.
Additionally, we provide experimental results to show its capabilities of playing as an agent. See [Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) for more information. Its performance on the run-mode benchmark provided by Hugging Face is as follows:
| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
|:---------------|:---------------:|:-----------:|:---------:|
|GPT-4 | **100** | **100** | **97.41** |
|GPT-3.5 | 95.37 | 96.30 | 87.04 |
|StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| **Qwen-7B** | 90.74 | 92.59 | 74.07 |
| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
| :---------------- | :----------------: | :-----------: | :---------: |
| GPT-4 | **100** | **100** | **97.41** |
| GPT-3.5 | 95.37 | 96.30 | 87.04 |
| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| **Qwen-7B** | 90.74 | 92.59 | 74.07 |
## Long-Context Understanding

@ -6,7 +6,7 @@
<br>
<p align="center">
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a>| <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp &nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/9bjvspyu">Discord</a>
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a>| <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp | Qwen-7B-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a>&nbsp &nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/9bjvspyu">Discord</a>
</p>
<br>
@ -29,6 +29,8 @@
## 新闻
* 2023年8月21日 发布Qwen-7B-Chat的Int4量化模型Qwen-7B-Chat-Int4。该模型显存占用低推理速度相比半精度模型显著提升在基准评测上效果损失较小。
* 2023年8月3日 在魔搭社区ModelScope和Hugging Face同步推出Qwen-7B和Qwen-7B-Chat模型。同时我们发布了技术备忘录介绍了相关的训练细节和模型表现。
## 评测表现
@ -198,89 +200,62 @@ print(f'Response: {response}')
## 量化
如希望使用更低精度的量化模型如4比特和8比特的模型我们提供了简单的示例来说明如何快速使用量化模型。在开始前确保你已经安装了`bitsandbytes`。请注意,`bitsandbytes`的安装要求是:
### 用法
```
**Requirements** Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
```
**请注意:我们更新量化方案为基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化提供Qwen-7B-Chat的Int4量化模型[点击这里](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)。相比此前方案,该方案在模型评测效果几乎无损,且存储需求更低,推理速度更优。**
随后运行如下命令安装`bitsandbytes`:
以下我们提供示例说明如何使用Int4量化模型。在开始使用前请先保证满足AutoGPTQ的要求并使用源代码安装由于最新支持Qwen的代码未发布到PyPI
```bash
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
```
pip install bitsandbytes
```
Windows用户需安装特定版本的`bitsandbytes`,可选项包括[bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels)。
你只需要在`AutoModelForCausalLM.from_pretrained`中添加你的量化配置,即可使用量化模型。如下所示
随后便能轻松读取量化模型:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# quantization configuration for NF4 (4 bits)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.bfloat16
)
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
```
# quantization configuration for Int8 (8 bits)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
推理方法和基础用法类似但注意需要从外部传入generation config
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint_path,
device_map="cuda:0",
quantization_config=quantization_config,
max_memory=max_memory,
trust_remote_code=True,
).eval()
```python
from transformers import GenerationConfig
config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
```
上述方法可以让我们将模型量化成`NF4`和`Int8`精度的模型进行读取,帮助我们节省显存开销。我们也提供了相关性能数据。我们发现尽管模型在效果上存在损失,但模型的显存开销大幅降低。
### 效果评测
| Precision | MMLU | GPU Memory for Loading Model |
| ----------- | :------: | :---------------------------: |
| BF16 | 56.7 | 16.38G |
| Int8 | 52.8 | 10.44G |
| NF4 | 48.9 | 7.79G |
我们对BF16和Int4模型在基准评测上做了测试发现量化模型效果损失较小结果如下所示
表中显存占用的测试环境为A100-SXM4-80G单卡PyTorch 2.0.1CUDA 11.8开启flash attention
## 推理性能
| Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
| ------------- | :--------: | :----------: | :----: | :--------: |
| BF16 | 53.9 | 54.2 | 41.1 | 24.4 |
| Int4 | 52.6 | 52.9 | 38.1 | 23.8 |
### 推理速度
我们分别测试了BF16和量化条件下模型生成2K tokens的平均推理速度结果如下
| 量化等级 | 开flash_attn的推理速度 (字符/秒) | 关flash_attn的推理速度 (字符/秒) |
| ------ | :---------------------------: | :---------------------------: |
| BF16 (无量化) | 30.06 | 27.55 |
| Int8 (bnb) | 7.94 | 7.86 |
| NF4 (bnb) | 21.43 | 20.37 |
我们测算了BF16和Int4模型生成2048和8192个token的平均推理速度tokens/s。如图所示
具体的评测方式为指定输入context长度为1生成长度为2048测试硬件为A100-SXM4-80G单卡软件环境为PyTorch 2.0.1CUDA版本11.8计算生成该2048序列的平均速度
| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------- | :------------------:| :------------------:|
| BF16 | 30.53 | 28.51 |
| Int4 | 45.60 | 33.83 |
### 显存占用
具体而言我们记录在长度为1的上下文的条件下生成8192个token的性能。评测运行于单张A100-SXM4-80G GPU使用PyTorch 2.0.1和CUDA 11.4。推理速度是生成8192个token的速度均值。
在BF16和不同量化条件下我们分别测算了模型编码2048长度序列并生成1个token和生成8192长度序列编码1个token作为context的峰值显存占用。结果如下
### 显存使用
打开flash attention时
我们还测算了BF16和Int4模型编码2048个token及生成8192个token的峰值显存占用情况。结果如下所示
| 量化等级 | 编码 2048 长度的峰值显存 | 生成 8192 长度的峰值显存 |
| --- | :---: | :---: |
| BF16 | 18.11GB | 23.52GB |
| Int8 | 12.17GB | 17.60GB |
| NF4 | 9.52GB | 14.93GB |
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 18.99GB | 24.40GB |
| In4 | 10.20GB | 15.61GB |
关闭flash attention时
| 量化等级 | 编码 2048 长度的峰值显存 | 生成 8192 长度的峰值显存 |
| --- | :---: | :---: |
| BF16 | 18.11GB | 24.40GB |
| Int8 | 12.18GB | 18.47GB |
| NF4 | 9.52GB | 15.81GB |
以上测速和显存占用情况,均可通过该[评测脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)测算得到。
上述性能测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)完成。
## Demo
@ -304,7 +279,6 @@ python web_demo.py
<br>
<p>
### 交互式Demo
我们提供了一个简单的交互式Demo示例请查看`cli_demo.py`。当前模型已经支持流式输出用户可通过输入文字的方式和Qwen-7B-Chat交互模型将流式输出返回结果。运行如下命令

@ -6,7 +6,7 @@
<br>
<p align="center">
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a>| <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp &nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/9bjvspyu">Discord</a>
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a>| <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp | Qwen-7B-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a>&nbsp &nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/9bjvspyu">Discord</a>
</p>
<br>
@ -33,6 +33,8 @@ Qwen-7Bは、アリババクラウドが提唱する大規模言語モデルシ
## ニュース
* 2023.8.21 Qwen-7B-Chat 用 Int4 量子化モデル(**Qwen-7B-Chat-Int4**)をリリースしました。メモリコストは低いが、推論速度は向上している。また、ベンチマーク評価において大きな性能劣化はありません。
* 2023.8.3 Qwen-7B と Qwen-7B-Chat を ModelScope と Hugging Face で公開。また、トレーニングの詳細やモデルの性能など、モデルの詳細についてはテクニカルメモを提供しています。
## パフォーマンス
@ -199,89 +201,62 @@ tiktoken に基づくトークナイザーは、他のトークナイザー、
## 量子化
`NF4``Int8` のモデルをロードする方法を示す例を提供します。手始めに、`bitsandbytes` が実装されていることを確認して下さい。`bitsandbytes` の要件は以下の通りになります:
### 使用方法
```
**必要条件** Python >= 3.8。Linux ディストリビューションUbuntu、MacOS など)+ CUDA > 10.0。
```
**注:[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)に基づく新しい解決策を提供し、Qwen-7B-Chat用のInt4量子化モデル[ここをクリック](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)をリリースしました。このモデルは、従来の解決策と比較して、ほぼ無損失のモデル効果を達成しつつ、メモリコストと推論速度の両方で性能が向上しています**。
そして、以下のコマンドを実行して `bitsandbytes` をインストールする
ここでは、量子化されたモデルを推論に使用する方法を示します。始める前に、AutoGPTQの要件を満たしていることを確認し、ソースからインストールしてください一時的にQwenのコードは最新版のPyPIパッケージではまだリリースされていません
```bash
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
```
pip install bitsandbytes
```
Windows ユーザは、[bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels) という別のオプションを見つける必要があります。
して、量子化の設定を `AutoModelForCausalLM.from_pretrained` に追加するだけとなります。以下の例を参照してください:
そうすれば、以下のように簡単に量子化モデルを読み込むことができる。
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# NF44ビットの量子化設定
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.bfloat16
)
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
```
# Int88ビットの量子化設定
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
推論を実行するには、上で示した基本的な使い方に似ているが、generation configurationを明示的に渡すことを忘れないこと
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint_path,
device_map="cuda:0",
quantization_config=quantization_config,
max_memory=max_memory,
trust_remote_code=True,
).eval()
```python
from transformers import GenerationConfig
config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
```
この方法では、Qwen-7B を `NF4``Int8` でロードすることができ、メモリ使用量を節約できる。以下にモデル性能の関連統計量を示します。量子化により、有効性は若干低下するが、推論効率は大幅に向上し、メモリコストが削減されることがわかります。
### 性能
| Precision | MMLU | GPU Memory for Loading Model |
| ----------- | :------: | :---------------------------: |
| BF16 | 56.7 | 16.38G |
| Int8 | 52.8 | 10.44G |
| NF4 | 48.9 | 7.79G |
ベンチマークにおけるBF16モデルとInt4モデルの性能について説明する。結果を以下に示します
上表のGPUメモリ使用量プロファイリングは、シングルA100-SXM4-80G GPU、PyTorch 2.0.1、CUDA 11.8、フラッシュアテンション使用で実行されています。
## 推論効率
| Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
| ------------- | :--------: | :----------: | :----: | :--------: |
| BF16 | 53.9 | 54.2 | 41.1 | 24.4 |
| Int4 | 52.6 | 52.9 | 38.1 | 23.8 |
### 推論スピード
BF16精度、量子化レベルInt8またはNF4で、それぞれ2Kトークンを生成する平均推論速度を測定した。
BF16の精度とInt4の量子化レベルの下で、それぞれ2048個と8192個のトークンを生成する平均推論速度(tokens/s)を測定した。
| Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
| ------ | :---------------------------: | :---------------------------: |
| BF16 (no quantization) | 30.06 | 27.55 |
| Int8 (bnb) | 7.94 | 7.86 |
| NF4 (bnb) | 21.43 | 20.37 |
| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------- | :------------------:| :------------------:|
| BF16 | 30.53 | 28.51 |
| Int4 | 45.60 | 33.83 |
詳細には、プロファイリングの設定は、1コンテクスト・トークンで2048の新しいトークンを生成している。プロファイリングは、PyTorch 2.0.1とCUDA 11.8を搭載したシングルA100-SXM4-80G GPUで実行される。推論速度は生成された2048個のトークンの平均です。
詳細には、プロファイリングの設定は、1コンテクスト・トークンで8192個の新しいトークンを生成している。プロファイリングは、PyTorch 2.0.1とCUDA 11.4を搭載したシングルA100-SXM4-80G GPUで実行される。推論速度は生成された8192個のトークンの平均です。
### GPUメモリ使用量
また、BF16またはInt8/NF4量子化レベルの下で、2048個のトークンをコンテキストとしてエンコードした場合および単一のトークンを生成した場合と、8192個のトークンを生成した場合単一のトークンをコンテキストとして生成した場合のGPUメモリ使用量のピーク値をそれぞれプロファイリングしました。結果を以下に示す。
Flash attentionを使用した場合のメモリ使用量は以下の通りである
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| --- | :---: | :---: |
| BF16 | 18.11GB | 23.52GB |
| Int8 | 12.17GB | 17.60GB |
| NF4 | 9.52GB | 14.93GB |
Flash attentionを使用しない場合、メモリ使用量は次のようになる
また、BF16またはInt4の量子化レベルで、それぞれ2048トークンをコンテキストとしてエンコードした場合および単一のトークンを生成した場合と、8192トークンを生成した場合単一のトークンをコンテキストとして生成した場合のGPUメモリ使用量のピーク値をプロファイリングしました。その結果を以下に示します。
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| --- | :---: | :---: |
| BF16 | 18.11GB | 24.40GB |
| Int8 | 12.18GB | 18.47GB |
| NF4 | 9.52GB | 15.81GB |
| ------------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 18.99GB | 24.40GB |
| In4 | 10.20GB | 15.61GB |
上記のスピードとメモリーのプロファイリングは、[このスクリプト](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)を使って行われた
上記のスピードとメモリーのプロファイリングは、[このスクリプト](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)を使用しています。
## デモ

Loading…
Cancel
Save