update kvcache

main
yangapku 1 year ago
parent eb6e364fe7
commit 26da1a2f9d

@ -44,8 +44,8 @@ Would like to chat with us or date us coffee time? Welcome to our Discord or WeC
## News and Updates
* 2023.9.25 🔥 We release [qwen.cpp](https://github.com/QwenLM/qwen.cpp), a C++ implementation of Qwen-LM.
* 2023.9.25 🔥 We release both **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face. At the same time, we update **Qwen-7B** and **Qwen-7B-Chat**. Compared to **Qwen-7B** (original), **Qwen-7B** uses more training tokens, increasing from 2.2T tokens to 2.4T tokens, while the context length extends from 2048 to 8192. The Chinese knowledge and coding ability of **Qwen-7B** have been further improved. **PLEASE MAKE SURE YOU ARE USING THE LATEST CODES AND CHECKPOINTS!**
* 2023.9.25 🔥 We release **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face, along with [qwen.cpp](https://github.com/QwenLM/qwen.cpp) and [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent). Codes and checkpoints of **Qwen-7B** and **Qwen-7B-Chat** are also updated. **PLEASE PULL THE LATEST VERSION!**
- Compared to **Qwen-7B** (original), **Qwen-7B** uses more training tokens, increasing from 2.2T tokens to 2.4T tokens, while the context length extends from 2048 to 8192. The Chinese knowledge and coding ability of **Qwen-7B** have been further improved.
* 2023.9.12 We now support finetuning on the Qwen-7B models, including full-parameter finetuning, LoRA and Q-LoRA.
* 2023.8.21 We release the Int4 quantized model for Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, which requires low memory costs but achieves improved inference speed. Besides, there is no significant performance degradation on the benchmark evaluation.
* 2023.8.3 We release both **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
@ -280,6 +280,70 @@ We also profile the peak GPU memory usage for encoding 2048 tokens as context (a
The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
<br><br>
## Quantization of KV cache
The attention KV cache can be quantized and compressed for storage, so that more keys and values fit on the same GPU and sample throughput increases.
### Usage
The parameters `use_cache_quantization` and `use_cache_kernel` control the KV cache quantization behavior.
When both `use_cache_quantization=True` and `use_cache_kernel=True` are set, KV cache quantization is enabled.
Usage is as follows:
```python
from transformers import AutoModelForCausalLM

# Enable KV cache quantization when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False,
)
```
Note: KV cache quantization and flash attention cannot be enabled at the same time.
If you enable both (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), `use_flash_attn` is disabled automatically (`use_flash_attn=False`).
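With the model loaded this way, generation itself is unchanged. Below is a minimal usage sketch; it assumes the tokenizer is loaded from the same checkpoint, as in the rest of this README.
```python
from transformers import AutoTokenizer

# Tokenizer for the same checkpoint as the model loaded above.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# The chat API is used as usual; the KV cache is quantized transparently during generation.
response, history = model.chat(tokenizer, "Hi", history=None)
print(response)
```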
### Comparative Results
#### Results
We have verified that using the Int8-quantized KV cache does not cause significant performance degradation in model evaluation.
#### Memory usage comparison
The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.
We use BF16 models and generate 1024 tokens (seq-length=1024) by default; oom indicates out of memory.
With KV cache quantization enabled, we can run a larger batch size (bs):
| Use KV cache quantization | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
| --- | :---: | :---: | :---: | :---: | :---: | :---: |
| no | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom |
| yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
With KV cache quantization enabled, the model saves more memory when generating a longer sequence length (sl, the number of generated tokens) at inference:
| Use KV cache quantization | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
| --- | :---: | :---: | :---: | :---: | :---: |
| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| yes | 15.0GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |
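For reference, the sketch below shows one way such peak-memory numbers could be measured with PyTorch. It is an illustrative approximation, not the profiling script linked above; the helper name, dummy prompt, and arguments are assumptions.
```python
import torch

def peak_memory_gb(model, tokenizer, batch_size, max_new_tokens):
    """Generate from a dummy prompt and report the peak GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats()
    input_ids = tokenizer("Hello", return_tensors="pt").input_ids.to(model.device)
    input_ids = input_ids.repeat(batch_size, 1)  # simulate a batch of identical prompts
    with torch.no_grad():
        model.generate(input_ids, max_new_tokens=max_new_tokens)
    return torch.cuda.max_memory_allocated() / 1024 ** 3

# Example: measure memory with bs=4 and 1024 generated tokens.
# print(peak_memory_gb(model, tokenizer, batch_size=4, max_new_tokens=1024))
```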
### Difference in layer_past storage
When KV cache quantization is enabled, the model converts the format of `layer_past` from float to int8, and the quantized `layer_past` also stores the quantization parameters of the current values.
The specific steps are as follows:
1. Quantize the key/value tensors:
```python
qv, scale, zero_point = quantize_cache_v(v)
```
2. Store them into `layer_past`. The quantized `layer_past` has the following format:
```python
layer_past = ((q_key, key_scale, key_zero_point),
              (q_value, value_scale, value_zero_point))
```
The original format of `layer_past` is:
```python
layer_past = (key, value)
```
If you want to use the quantized attention KV directly, you can apply the dequantization operation to convert the int8 key/value back to the float format:
```python
v = dequantize_cache_torch(qv, scale, zero_point)
```
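The `quantize_cache_v` and `dequantize_cache_torch` helpers ship with the model code. For intuition only, the sketch below shows a simplified per-tensor int8 round trip and how the result would be packed the way `layer_past` stores it; it is an assumption-laden approximation, not the repository's actual implementation.
```python
import torch

def quantize_cache_v(v: torch.Tensor):
    # Affine quantization: map the float tensor onto uint8 with a (scale, zero_point) pair.
    vmin, vmax = v.min(), v.max()
    scale = (vmax - vmin).clamp(min=1e-8) / 255.0
    zero_point = vmin
    qv = ((v - zero_point) / scale).round().clamp(0, 255).to(torch.uint8)
    return qv, scale, zero_point

def dequantize_cache_torch(qv: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor):
    # Recover an approximation of the original float tensor.
    return qv.to(torch.float32) * scale + zero_point

# Round-trip a fake value cache and pack it the way the quantized layer_past stores it.
value = torch.randn(1, 32, 128, 128)
q_value, value_scale, value_zero_point = quantize_cache_v(value)
layer_past_entry = (q_value, value_scale, value_zero_point)
recovered = dequantize_cache_torch(*layer_past_entry)
```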
## Finetuning
Now we provide the official training script, `finetune.py`, for users to finetune the pretrained model for downstream applications in a simple fashion. Additionally, we provide shell scripts to launch finetuning with no worries. This script supports the training with [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The shell scripts that we provide use DeepSpeed (Note: this may have conflicts with the latest version of pydantic) and Peft. You can install them by:
@ -369,7 +433,7 @@ pip install -r requirements_web_demo.txt
Then run the command below and click on the generated link:
```
```bash
python web_demo.py
```
@ -383,7 +447,7 @@ python web_demo.py
We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns model outputs in the streaming mode. Run the command below:
```
```bash
python cli_demo.py
```

@ -42,9 +42,8 @@
## News
* 2023.9.25 🔥 We open-sourced [qwen.cpp](https://github.com/QwenLM/qwen.cpp), a C++ implementation of Qwen-LM.
* 2023.9.25 🔥 We released **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face, and updated **Qwen-7B** and **Qwen-7B-Chat** at the same time. Compared with the original Qwen-7B, the new version uses more training data (2.4T tokens) and extends the sequence length from 2048 to 8192. Its overall Chinese capability and coding ability are considerably improved. **PLEASE MAKE SURE YOU ARE USING THE LATEST CODE AND MODELS!**
* 2023.9.25 🔥 We released **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face, and open-sourced [qwen.cpp](https://github.com/QwenLM/qwen.cpp) and [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent). The code and checkpoints of **Qwen-7B** and **Qwen-7B-Chat** are also updated. **PLEASE USE THE LATEST CODE AND MODELS!**
- Compared with the original Qwen-7B, the new version uses more training data, increasing from 2.2T to 2.4T tokens, and extends the sequence length from 2048 to 8192. Both overall Chinese capability and coding ability are improved.
* 2023.9.12 We now support finetuning for Qwen-7B and Qwen-7B-Chat, including full-parameter finetuning, LoRA, and Q-LoRA.
* 2023.8.21 We released the Int4 quantized model for Qwen-7B-Chat, Qwen-7B-Chat-Int4. It has low memory usage and significantly faster inference than the half-precision model, with only a small performance loss on benchmark evaluation.
* 2023.8.3 We released Qwen-7B and Qwen-7B-Chat on ModelScope and Hugging Face. We also published a technical memo describing the training details and model performance.
@ -271,6 +270,68 @@ response, history = model.chat(tokenizer, "Hi", history=None)
The above performance profiling is conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
<br><br>
## Quantization of KV cache
At inference time, the intermediate key and value results can be quantized and stored in compressed form, so that more keys and values fit on the same GPU and sample throughput increases.
### Usage
The parameters `use_cache_quantization` and `use_cache_kernel` control the KV cache quantization behavior. When both `use_cache_quantization=True` and `use_cache_kernel=True` are set, KV cache quantization is enabled. Usage is as follows:
```python
from transformers import AutoModelForCausalLM

# Enable KV cache quantization when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False,
)
```
Note: Currently this feature cannot be enabled together with flash attention. If you enable KV cache quantization and flash attention at the same time (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), `use_flash_attn` will be disabled automatically.
### Comparative Results
#### Results
We have verified that using the Int8 KV cache has essentially no impact on the model's overall accuracy metrics.
#### Memory usage comparison
The profiling runs on a single A100-SXM4-80G GPU. The model uses BF16 by default and generates 1024 tokens (seq-length=1024); oom indicates out of memory.
With KV cache quantization enabled, the model can run a larger batch size (bs) at inference:
| Use KV cache quantization | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
| --- | :---: | :---: | :---: | :---: | :---: | :---: |
| no | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom |
| yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
With KV cache quantization enabled, the model saves more memory when generating a longer sequence length (sl, the number of generated tokens) at inference:
| Use KV cache quantization | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
| --- | :---: | :---: | :---: | :---: | :---: |
| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| yes | 15.0GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |
### Difference in storage format
When KV cache quantization is enabled, at inference time the model converts the float-format key/value originally stored in `layer_past` into int8 qkey/qvalue together with the corresponding quantization parameters.
The specific steps are as follows:
1. Quantize the key/value tensors:
```python
qv, scale, zero_point = quantize_cache_v(v)
```
2. Store them into `layer_past`. The quantized `layer_past` format is:
```python
layer_past = ((q_key, key_scale, key_zero_point),
              (q_value, value_scale, value_zero_point))
```
The original `layer_past` format is:
```python
layer_past = (key, value)
```
If you need to take the stored key/value out of `layer_past` and use them directly, you can apply the dequantization operation to convert the int8 key/value back to the float format:
```python
v = dequantize_cache_torch(qv, scale, zero_point)
```
## Finetuning
@ -372,7 +433,7 @@ python web_demo.py
We provide a simple interactive demo example in `cli_demo.py`. The model supports streaming output: users can interact with Qwen-7B-Chat by typing prompts, and the model returns its outputs in streaming mode. Run the following command:
```
```bash
python cli_demo.py
```

@ -4,14 +4,14 @@
<br><br>
<p align="center">
<img src="assets/logoqwen.jpg" width="400"/>
<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg" width="400"/>
<p>
<br>
<p align="center">
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/models/qwen">ModelScope<a>&nbsp&nbsp | &nbsp&nbsp 📑 Paper&nbsp&nbsp &nbsp&nbsp🖥 <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/models/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://qianwen-res.oss-cn-beijing.aliyuncs.com/QWEN_TECHNICAL_REPORT.pdf">Paper</a> &nbsp&nbsp &nbsp&nbsp🖥 <a href="https://modelscope.cn/studios/qwen/Qwen-14B-Chat-Demo/summary">Demo</a>
<br>
<a href="assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp &nbsp&nbsp DingTalk (钉钉) &nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp
<a href="assets/wechat.png">WeChat</a>&nbsp&nbsp &nbsp&nbsp DingTalk &nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp
</p>
<br><br>
@ -49,7 +49,7 @@ Qwen-7B**と**Qwen-14B**の**Qwen**シリーズと、**Qwen-7B-Chat**と**Qwen-1
## News and Updates
* 2023.9.25 🔥 We released Qwen-14B and Qwen-14B-Chat on ModelScope and Hugging Face. At the same time, we updated Qwen-7B and Qwen-7B-Chat. Compared with the original Qwen-7B, Qwen-7B uses more training tokens, increasing from 2.2T to 2.4T tokens, and the context length is extended from 2048 to 8192. The Chinese knowledge and coding ability of Qwen-7B have been further improved. Please use the latest code and checkpoints!
* 2023.9.25 🔥 We released Qwen-14B and Qwen-14B-Chat on ModelScope and Hugging Face, together with [qwen.cpp](https://github.com/QwenLM/qwen.cpp) and [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent). At the same time, we updated Qwen-7B and Qwen-7B-Chat. Compared with the original Qwen-7B, Qwen-7B uses more training tokens, increasing from 2.2T to 2.4T tokens, and the context length is extended from 2048 to 8192. The Chinese knowledge and coding ability of Qwen-7B have been further improved. Please use the latest code and checkpoints!
* 2023.9.12 We now support finetuning on the Qwen-7B models, including full-parameter finetuning, LoRA, and Q-LoRA.
* 2023.8.21 We released the Int4 quantized model **Qwen-7B-Chat-Int4** for Qwen-7B-Chat. No significant performance degradation was observed in benchmark evaluation.
* 2023.8.3 We released **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face. We also provide a technical memo with more details about the model, including training details and model performance.
