model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",                           # model name assumed for illustration; a local checkpoint path also works
    device_map="auto",
    quantization_config=quantization_config,  # the BitsAndBytesConfig (NF4 or Int8) defined beforehand; see the sketch below
    trust_remote_code=True,
).eval()
```
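The `quantization_config` passed above comes from a `BitsAndBytesConfig`. As a minimal sketch (the argument values here are illustrative and would be defined before the `from_pretrained` call), the NF4 and Int8 setups look roughly like this:
```python
import torch
from transformers import BitsAndBytesConfig

# NF4 (4-bit NormalFloat) quantization, with computation carried out in bfloat16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Alternatively, Int8 quantization via bitsandbytes:
# quantization_config = BitsAndBytesConfig(load_in_8bit=True)
```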
With this method, you can load Qwen-7B in `NF4` or `Int8`, which saves GPU memory. We provide the related model performance statistics below. We find that quantization slightly degrades effectiveness but significantly reduces memory costs.
| Precision | MMLU | GPU Memory for Loading Model |
| :---------: | :------: | :------: |
| BF16 | 56.7 | 16.2G |
| Int8 | 52.8 | 10.1G |
| NF4 | 48.9 | 7.4G |
Note: The GPU memory usage in the table above was profiled on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8, with flash attention enabled.
## Inference Efficiency
### Inference Speed
We measured the average inference speed of generating 2K tokens under BF16 precision and under Int8 or NF4 quantization, respectively.
In detail, the profiling setting is generating 2048 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8, and the inference speed is averaged over the 2048 generated tokens. A minimal measurement sketch is shown below.
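As a rough sketch only, not the project's actual profiling code (the script linked at the end of this section is authoritative), the measurement can be approximated as follows, assuming `model` and `tokenizer` are already loaded:
```python
import time
import torch

def measure_speed(model, tokenizer, new_tokens=2048):
    """Approximate the setting above: 1 context token, 2048 new tokens,
    speed averaged over the generated tokens."""
    # A single arbitrary context token (illustrative choice).
    token_id = tokenizer("a")["input_ids"][0]
    input_ids = torch.tensor([[token_id]], device=model.device)

    torch.cuda.synchronize()
    start = time.time()
    output = model.generate(
        input_ids,
        min_new_tokens=new_tokens,
        max_new_tokens=new_tokens,
        do_sample=False,
    )
    torch.cuda.synchronize()
    elapsed = time.time() - start

    generated = output.shape[1] - input_ids.shape[1]
    return generated / elapsed  # tokens per second
```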
### GPU Memory Usage
We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context), under BF16 precision and Int8/NF4 quantization levels, respectively. The results are shown below; a minimal measurement sketch follows the tables.
When using flash attention, the memory usage is:
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| --- | :---: | :---: |
| BF16 | 18.11GB | 23.52GB |
| Int8 | 12.17GB | 17.60GB |
| NF4 | 9.52GB | 14.93GB |
When not using flash attention, the memory usage is:
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| --- | :---: | :---: |
| BF16 | 18.11GB | 24.40GB |
| Int8 | 12.18GB | 18.47GB |
| NF4 | 9.52GB | 15.81GB |
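For reference, here is a minimal sketch of how such a peak-memory figure can be read with PyTorch, assuming `model` and `tokenizer` are already loaded; the linked script below is the authoritative implementation:
```python
import torch

def peak_memory_gib(model, tokenizer, context_len=2048, new_tokens=1):
    """Encode `context_len` context tokens, generate `new_tokens`, and
    report the peak GPU memory allocated by PyTorch (in GiB)."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # Repeat an arbitrary token as context (illustrative choice).
    token_id = tokenizer("a")["input_ids"][0]
    input_ids = torch.tensor([[token_id] * context_len], device=model.device)

    model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    return torch.cuda.max_memory_allocated() / 1024**3
```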
The speed and memory profiling above was conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).