From 2167406b724ca961f3ee3f49cc8f8423e26254cc Mon Sep 17 00:00:00 2001
From: Yang An
Date: Mon, 28 Aug 2023 20:35:33 +0800
Subject: [PATCH] update speed profiling result after optimizing memory cost

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index d6e1ef1..bb6020b 100644
--- a/README.md
+++ b/README.md
@@ -237,8 +237,8 @@ We measured the average inference speed (tokens/s) of generating 2048 and 8192 t
 
 | Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
 | -------------- | :-------------------: | :-------------------: |
-| BF16 | 30.53 | 28.51 |
-| Int4 | 45.60 | 33.83 |
+| BF16 | 30.34 | 29.32 |
+| Int4 | 43.56 | 33.92 |
 
 In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.
 
@@ -248,8 +248,8 @@ We also profile the peak GPU memory usage for encoding 2048 tokens as context (a
 
 | Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
 | -------------- | :-----------------------------------: | :-------------------------------------: |
-| BF16 | 18.99GB | 24.40GB |
-| Int4 | 10.20GB | 15.61GB |
+| BF16 | 17.66GB | 22.58GB |
+| Int4 | 8.21GB | 13.62GB |
 
 The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
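The authoritative measurement code is the `profile.py` linked in the README, not this patch. As a non-authoritative sketch of how tokens/s and peak-memory figures like the ones above can be collected with PyTorch and Hugging Face transformers: the model ID, greedy decoding, and the single-character prompt below are illustrative assumptions, not details taken from the patch.

```python
# Minimal sketch only -- the numbers in this patch come from the linked
# profile.py, not from this code.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen-7B-Chat"  # placeholder; substitute the model under test

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = (
    AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    .eval()
    .to("cuda")
)

# The patch's setup generates 8192 new tokens from 1 context token.
# (Depending on the tokenizer, special tokens may add to the context length.)
input_ids = tokenizer("A", return_tensors="pt").input_ids.to("cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=8192, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - input_ids.shape[1]
print(f"speed: {new_tokens / elapsed:.2f} tokens/s")  # averaged over all new tokens
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f}GB")
```

Note that `torch.cuda.max_memory_allocated()` reports the peak of tensors allocated through PyTorch's caching allocator, which is typically a bit lower than the total GPU memory reported by `nvidia-smi`.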