add result of int8 models

2 years ago · 93963f8d1f
parent e3a7c5ecc7
commit 93963f8d1f
3 changed files with 25 additions and 15 deletions
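
The README text changed in this commit states that the quantized models do not suffer significant degradation. One way to check that claim against the updated table is to compute each quantized row's difference from its BF16 baseline. The sketch below is a minimal illustration, not part of the commit: the scores are copied from the new table, while the names `BENCHMARKS`, `SCORES`, and `deltas_vs_bf16` are hypothetical helpers introduced here.

```python
# Scores copied from the updated table in this commit.
# The names below are illustrative and do not come from the repository.
BENCHMARKS = ("MMLU", "CEval (val)", "GSM8K", "Humaneval")

SCORES = {
    "Qwen-7B-Chat": {
        "BF16": (55.8, 59.7, 50.3, 37.2),
        "Int8": (55.4, 59.4, 48.3, 34.8),
        "Int4": (55.1, 59.2, 49.7, 29.9),
    },
    "Qwen-14B-Chat": {
        "BF16": (64.6, 69.8, 60.1, 43.9),
        "Int8": (63.6, 68.6, 60.0, 48.2),
        "Int4": (63.3, 69.0, 59.8, 45.7),
    },
}


def deltas_vs_bf16(scores):
    """Return {(model, precision): {benchmark: quantized - BF16}} in points."""
    out = {}
    for model, by_precision in scores.items():
        baseline = by_precision["BF16"]
        for precision, values in by_precision.items():
            if precision == "BF16":
                continue  # the baseline is not compared with itself
            out[(model, precision)] = {
                name: round(quantized - base, 1)
                for name, quantized, base in zip(BENCHMARKS, values, baseline)
            }
    return out


if __name__ == "__main__":
    for (model, precision), row in deltas_vs_bf16(SCORES).items():
        cells = ", ".join(f"{name}: {delta:+.1f}" for name, delta in row.items())
        print(f"{model} ({precision}) vs BF16 -> {cells}")
```

Negative values mean the quantized model scores below BF16. Most differences in the table are within about two points, but the Humaneval column varies more in both directions, so it is worth reading separately from the other benchmarks.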
--- a/README.md
+++ b/README.md
@@ -333,15 +333,16 @@ model = AutoModelForCausalLM.from_pretrained(
 response, history = model.chat(tokenizer, "Hi", history=None)
 ```

-We illustrate the model performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:
+We illustrate the model performance of BF16, Int8, and Int4 models on the benchmark, and we find that the quantized models do not suffer from significant performance degradation. Results are shown below:

 | Quantization         | MMLU | CEval (val) | GSM8K | Humaneval |
 |----------------------|:----:|:-----------:|:-----:|:---------:|
-| Qwen-7B-Chat (BF16)  | 53.9 |    54.2     | 41.1  |   24.4    |
-| Qwen-7B-Chat (Int4)  | 52.6 |    52.9     | 38.1  |   23.8    |
-| Qwen-14B-Chat (BF16) | 64.6 |    69.8     | 61.0  |   43.9    |
+| Qwen-7B-Chat (BF16)  | 55.8 |    59.7     | 50.3  |   37.2    |
+| Qwen-7B-Chat (Int8)  | 55.4 |    59.4     | 48.3  |   34.8    |
+| Qwen-7B-Chat (Int4)  | 55.1 |    59.2     | 49.7  |   29.9    |
+| Qwen-14B-Chat (BF16) | 64.6 |    69.8     | 60.1  |   43.9    |
+| Qwen-14B-Chat (Int8) | 63.6 |    68.6     | 60.0  |   48.2    |
 | Qwen-14B-Chat (Int4) | 63.3 |    69.0     | 59.8  |   45.7    |
-<br>

 ### Quantization of KV cache
 The attention KV cache can be quantized and compressed for storage to achieve higher sample throughput. The parameters `use_cache_quantization` and `use_cache_kernel` control the KV cache quantization behavior.
@@ -478,7 +479,9 @@ We measured the average inference speed (tokens/s) of generating 2048 and 8192 t
 </table>


-In detail, the profiling setup encodes 2048 tokens and generates 8192 new tokens. Profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the encoded and generated tokens.
+In detail, the profiling setup encodes 2048 tokens and generates 8192 new tokens. Profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the encoded and generated tokens.
+
+Note: The generation speeds of the Int4/Int8 models reported above were measured with the autogptq library. Models loaded using ``AutoModelForCausalLM.from_pretrained`` currently run approximately 20% slower. We have reported this issue to the HuggingFace team and will update this section promptly if a solution becomes available.

 ### GPU Memory Usage

--- a/README_CN.md
+++ b/README_CN.md
@@ -324,14 +324,15 @@ model = AutoModelForCausalLM.from_pretrained(
 response, history = model.chat(tokenizer, "Hi", history=None)
 ```

-
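-We tested the BF16 and Int4 models on the benchmarks and found that the quantized models suffer only minor performance loss. Results are shown below:
+We tested the BF16, Int8, and Int4 models on the benchmarks and found that the quantized models suffer only minor performance loss. Results are shown below: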

 | Quantization         | MMLU | CEval (val) | GSM8K | Humaneval |
 |----------------------|:----:|:-----------:|:-----:|:---------:|
-| Qwen-7B-Chat (BF16)  | 53.9 |    54.2     | 41.1  |   24.4    |
-| Qwen-7B-Chat (Int4)  | 52.6 |    52.9     | 38.1  |   23.8    |
-| Qwen-14B-Chat (BF16) | 64.6 |    69.8     | 61.0  |   43.9    |
+| Qwen-7B-Chat (BF16)  | 55.8 |    59.7     | 50.3  |   37.2    |
+| Qwen-7B-Chat (Int8)  | 55.4 |    59.4     | 48.3  |   34.8    |
+| Qwen-7B-Chat (Int4)  | 55.1 |    59.2     | 49.7  |   29.9    |
+| Qwen-14B-Chat (BF16) | 64.6 |    69.8     | 60.1  |   43.9    |
+| Qwen-14B-Chat (Int8) | 63.6 |    68.6     | 60.0  |   48.2    |
 | Qwen-14B-Chat (Int4) | 63.3 |    69.0     | 59.8  |   45.7    |
 <br>

@@ -467,6 +468,8 @@ model = AutoModelForCausalLM.from_pretrained(

 The evaluation runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over encoding 2048 tokens and generating 8192 tokens.

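+Note: The generation speeds of the Int4/Int8 models above were measured with the autogptq library. Models loaded with ``AutoModelForCausalLM.from_pretrained`` currently generate approximately 20% slower. We have reported this issue to the HuggingFace team and will update promptly if a solution becomes available.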
+
 ### GPU Memory Usage

 We also measured the peak GPU memory usage of the BF16, Int8, and Int4 models when encoding 2048 tokens and generating 8192 tokens. The results (GB) are shown below:
--- a/README_JA.md
+++ b/README_JA.md
@@ -327,13 +327,15 @@ model = AutoModelForCausalLM.from_pretrained(
 response, history = model.chat(tokenizer, "Hi", history=None)
 ```

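-This section describes the benchmark performance of the BF16 and Int4 models. The results are shown below:
+This section describes the benchmark performance of the BF16, Int8, and Int4 models. The results are shown below: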

 | Quantization         | MMLU | CEval (val) | GSM8K | Humaneval |
 |----------------------|:----:|:-----------:|:-----:|:---------:|
-| Qwen-7B-Chat (BF16)  | 53.9 |    54.2     | 41.1  |   24.4    |
-| Qwen-7B-Chat (Int4)  | 52.6 |    52.9     | 38.1  |   23.8    |
-| Qwen-14B-Chat (BF16) | 64.6 |    69.8     | 61.0  |   43.9    |
+| Qwen-7B-Chat (BF16)  | 55.8 |    59.7     | 50.3  |   37.2    |
+| Qwen-7B-Chat (Int8)  | 55.4 |    59.4     | 48.3  |   34.8    |
+| Qwen-7B-Chat (Int4)  | 55.1 |    59.2     | 49.7  |   29.9    |
+| Qwen-14B-Chat (BF16) | 64.6 |    69.8     | 60.1  |   43.9    |
+| Qwen-14B-Chat (Int8) | 63.6 |    68.6     | 60.0  |   48.2    |
 | Qwen-14B-Chat (Int4) | 63.3 |    69.0     | 59.8  |   45.7    |

 ### KV cache quantization
@@ -468,6 +470,8 @@ Using models at BF16, Int8, and Int4 precision, 2048 and 8192 to

 In detail, the profiling setup encodes 2048 tokens and generates 8192 new tokens. Profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the encoded and generated tokens.

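+Note: The inference speeds of the Int4/Int8 models above were measured using autogptq. Currently, models loaded with ``AutoModelForCausalLM.from_pretrained`` run approximately 20% slower. This issue has been reported to the HuggingFace team, and this section will be updated promptly if a solution becomes available.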
+
 ### GPU Memory Usage

 We also profiled the peak GPU memory usage at each of the BF16, Int8, and Int4 quantization levels when encoding 2048 tokens as context (and generating a single token) and when generating 8192 tokens (with a single token as context). The results (GB) are shown below.