diff --git a/README.md b/README.md
index 262c221..94fbcf6 100644
--- a/README.md
+++ b/README.md
@@ -449,7 +449,7 @@ merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_
Note: For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine. In addition, we advise you to specify your maximum sequence length with the argument `--model_max_length`, based on your data, memory footprint, and training speed; a brief sketch of what this argument controls is shown below.
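Below is a minimal, illustrative sketch (not the repo's actual fine-tuning script) of what a `--model_max_length`-style argument typically controls: the truncation length applied when training samples are tokenized. The tokenizer id and argument wiring here are assumptions for illustration only.
```python
# Illustrative only: shows what a --model_max_length style flag usually governs,
# i.e. the maximum tokenized length of each training sample. Longer limits
# increase the memory footprint and slow down training, as profiled below.
import argparse

from transformers import AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--model_max_length", type=int, default=2048)
args = parser.parse_args()

# Assumed tokenizer id; any Hugging Face tokenizer behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
tokenizer.model_max_length = args.model_max_length

# With truncation enabled, each sample is cut to at most model_max_length tokens.
ids = tokenizer("a very long training sample ...", truncation=True)["input_ids"]
assert len(ids) <= args.model_max_length
```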
### Profiling of Memory and Speed
-We profile the GPU memory and training speed of both LoRA and Q-LoRA in the setup of single-GPU training. In this test, we experiment on a single A100-SXM4-80G GPU, and we use CUDA 11.8 and Pytorch 2.0. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 256, 512, 1024, and 2048. The statistics are listed below:
+We profile the GPU memory and training speed of both LoRA and Q-LoRA in the setup of single-GPU training. Here, LoRA (emb) refers to additionally training the embedding and output layers, while plain LoRA keeps those layers frozen. In this test, we experiment on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 256, 512, 1024, and 2048. The statistics are listed below:
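For readers who want to reproduce this kind of measurement, here is a minimal sketch of the measurement pattern: peak GPU memory via `torch.cuda.max_memory_allocated` and wall-clock time per iteration. It is not the training script used to produce the table; the model id, LoRA target modules, and optimizer settings are assumptions, and the exact numbers depend on the full training configuration (e.g. flash attention, gradient checkpointing).
```python
# A sketch of how per-iteration peak memory (GB) and speed (s/iter) can be
# measured. One "iteration" here means 8 micro-batches of batch size 1 plus an
# optimizer step, matching the accumulation setup described above.
import time

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def profile_iteration(model, input_ids, optimizer, grad_accum=8):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(grad_accum):
        loss = model(input_ids=input_ids, labels=input_ids).loss
        (loss / grad_accum).backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return peak_gb, time.perf_counter() - start

if __name__ == "__main__":
    # Assumed model id and LoRA target modules; adjust to your own setup.
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
    ).cuda()
    lora = LoraConfig(
        r=64, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
        target_modules=["c_attn", "c_proj", "w1", "w2"],
    )
    model = get_peft_model(model, lora)
    model.train()
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5
    )
    for seq_len in (256, 512, 1024, 2048):
        ids = torch.randint(0, 1000, (1, seq_len), device="cuda")
        mem_gb, secs = profile_iteration(model, ids, optimizer)
        print(f"seq_len={seq_len}: {mem_gb:.1f} GB, {secs:.1f} s/iter")
```
Resetting the peak-memory counter and synchronizing before and after each timed iteration keeps the per-length measurements independent of one another.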
diff --git a/README_CN.md b/README_CN.md
index 27430c1..3bdb19d 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -435,7 +435,7 @@ merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_
Note: Distributed training requires you to specify the correct distributed-training hyperparameters according to your needs and your machine. In addition, you should set the maximum data length with `--model_max_length`, based on your data, available GPU memory, and expected training speed.
### GPU Memory Usage and Training Speed
-Below we record the GPU memory usage and training speed of the 7B and 14B models with LoRA and Q-LoRA on a single GPU when processing inputs of different lengths. The evaluation runs on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. We uniformly use a batch size of 1 and gradient accumulation of 8, and record the memory usage (GB) and training speed (s/iter) for input lengths of 256, 512, 1024, and 2048. The specific numbers are shown below:
+Below we record the GPU memory usage and training speed of the 7B and 14B models with LoRA and Q-LoRA on a single GPU when processing inputs of different lengths. Here, LoRA (emb) refers to additionally training the embedding and output layers, while plain LoRA keeps those parameters frozen. The evaluation runs on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. We uniformly use a batch size of 1 and gradient accumulation of 8, and record the memory usage (GB) and training speed (s/iter) for input lengths of 256, 512, 1024, and 2048. The specific numbers are shown below:
@@ -445,13 +445,19 @@ merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_
Model Size | Method | 256 | 512 | 1024 | 2048 |
- 7B | LoRA | 33.5G / 1.6s/it | 34.0G / 1.7s/it | 35.0G / 3.0s/it | 35.0G / 5.7s/it |
+ 7B | LoRA | 19.9G / 1.6s/it | 20.2G / 1.6s/it | 21.5G / 2.9s/it | 23.7G / 5.5s/it |
+ 7B | LoRA (emb) | 33.5G / 1.6s/it | 34.0G / 1.7s/it | 35.0G / 3.0s/it | 35.0G / 5.7s/it |
7B | Q-LoRA | 11.5G / 3.0s/it | 12.2G / 3.6s/it | 12.7G / 4.8s/it | 13.9G / 7.3s/it |
- 14B | LoRA | 51.0G / 2.1s/it | 51.0G / 2.7s/it | 51.5G / 5.0s/it | 53.9G / 9.2s/it |
+ 14B | LoRA | 34.5G / 2.0s/it | 35.0G / 2.5s/it | 35.2G / 4.9s/it | 37.3G / 8.9s/it |
+ 14B | LoRA (emb) | 51.0G / 2.1s/it | 51.0G / 2.7s/it | 51.5G / 5.0s/it | 53.9G / 9.2s/it |
14B | Q-LoRA | 18.3G / 5.4s/it | 18.4G / 6.4s/it | 18.5G / 8.5s/it | 19.9G / 12.4s/it |
diff --git a/README_JA.md b/README_JA.md
index 2622361..b6a976b 100644
--- a/README_JA.md
+++ b/README_JA.md
@@ -443,7 +443,7 @@ merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_
Note: For multi-GPU training, you need to specify the proper hyperparameters for distributed training according to your machine. In addition, we recommend specifying the maximum sequence length with the argument `--model_max_length`, taking your data, memory footprint, and training speed into account.
### Profiling of Memory and Speed
-We profile the GPU memory and training speed of both LoRA and Q-LoRA in the setup of single-GPU training. In this test, we experiment on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. We profile the memory (GB) and speed (s/iter) for inputs of different lengths, namely 256, 512, 1024, and 2048. The statistics are listed below:
+We profile the GPU memory and training speed of both LoRA and Q-LoRA in the setup of single-GPU training. Here, LoRA (emb) refers to additionally training the embedding and output layers, while plain LoRA keeps those layers frozen. In this test, we experiment on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. We profile the memory (GB) and speed (s/iter) for inputs of different lengths, namely 256, 512, 1024, and 2048. The statistics are listed below:
@@ -453,13 +453,19 @@ merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_
Model Size | Method | 256 | 512 | 1024 | 2048 |
- 7B | LoRA | 33.5G / 1.6s/it | 34.0G / 1.7s/it | 35.0G / 3.0s/it | 35.0G / 5.7s/it |
+ 7B | LoRA | 19.9G / 1.6s/it | 20.2G / 1.6s/it | 21.5G / 2.9s/it | 23.7G / 5.5s/it |
+ 7B | LoRA (emb) | 33.5G / 1.6s/it | 34.0G / 1.7s/it | 35.0G / 3.0s/it | 35.0G / 5.7s/it |
7B | Q-LoRA | 11.5G / 3.0s/it | 12.2G / 3.6s/it | 12.7G / 4.8s/it | 13.9G / 7.3s/it |
- 14B | LoRA | 51.0G / 2.1s/it | 51.0G / 2.7s/it | 51.5G / 5.0s/it | 53.9G / 9.2s/it |
+ 14B | LoRA | 34.5G / 2.0s/it | 35.0G / 2.5s/it | 35.2G / 4.9s/it | 37.3G / 8.9s/it |
+ 14B | LoRA (emb) | 51.0G / 2.1s/it | 51.0G / 2.7s/it | 51.5G / 5.0s/it | 53.9G / 9.2s/it |
14B | Q-LoRA | 18.3G / 5.4s/it | 18.4G / 6.4s/it | 18.5G / 8.5s/it | 19.9G / 12.4s/it |