From 29fea23f874e3e021e67312dba1da307a3fb5fee Mon Sep 17 00:00:00 2001
From: yangapku
Date: Tue, 9 Jan 2024 19:28:24 +0800
Subject: [PATCH] update README

---
 README.md    | 4 +---
 README_CN.md | 2 +-
 README_ES.md | 2 +-
 README_FR.md | 2 +-
 README_JA.md | 2 +-
 5 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index ba8c85b..81ae69e 100644
--- a/README.md
+++ b/README.md
@@ -451,7 +451,7 @@ We illustrate the model performance of both BF16, Int8 and Int4 models on the be
 ### Quantization of KV cache
 > NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality
-> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_245.cu`) may be missing. Please manually download
+> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_256.cu`) may be missing. Please manually download
 > them from the Hugging Face Hub and place them into the same folder as the other module files.
 The attention KV cache can be quantized and compressed for storage, to get a higher sample throughput. The arguments `use_cache_quantization` and `use_cache_kernel` in `config.json` are provided to enable KV cache quantization. The specific use method is as follows:
@@ -779,7 +779,6 @@ Our provided scripts support multinode finetuning. You can refer to the comments
 Note: DeepSpeed ZeRO 3 requires much greater inter-node communication rate than ZeRO 2, which will significantly reduce the training speed in the case of multinode finetuning. Therefore, we do not recommend using DeepSpeed ZeRO 3 configurations in multinode finetuning scripts.
 ### Profiling of Memory and Speed
-
 We profile the GPU memory and training speed of both LoRA (LoRA (emb) refers to training the embedding and output layer, while LoRA has no trainable embedding and output layer) and Q-LoRA in the setup of single-GPU training. In this test, we experiment on a single A100-SXM4-80G GPU, and we use CUDA 11.8 and Pytorch 2.0.
 Flash attention 2 is applied. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 256, 512, 1024, 2048, 4096, and 8192. We also report the statistics of full-parameter finetuning with Qwen-7B on 2 A100 GPUs. We only report the statistics of 256, 512, and 1024 tokens due to the limitation of GPU memory. For Qwen-7B, we also test the performance of multinode finetuning. We experiment using two servers, each containing two A100-SXM4-80G GPUs, and the rest of configurations are the same as other Qwen-7B experiments. The results of multinode finetuning are marked as LoRA (multinode) in the table.
@@ -872,7 +871,6 @@ The statistics are listed below:
-
 ## Deployment
 ### vLLM
diff --git a/README_CN.md b/README_CN.md
index 9472748..a814e53 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -448,7 +448,7 @@ response, history = model.chat(tokenizer, "Hi", history=None)
 ### KV cache量化
-> 注意:由于Hugging Face的内部实现,本功能的支持文件`cache_autogptq_cuda_356.cpp`与`cache_autogptq_cuda_kernel_245.cu`可能没被下载。如需开启使用,请手动从相关位置下载,并放置到相应文件中。
+> 注意:由于Hugging Face的内部实现,本功能的支持文件`cache_autogptq_cuda_256.cpp`与`cache_autogptq_cuda_kernel_256.cu`可能没被下载。如需开启使用,请手动从相关位置下载,并放置到相应文件中。
 在模型推理时,我们可以将中间结果key以及value的值量化后压缩存储,这样便可以在相同的卡上存储更多的key以及value,增加样本吞吐。
diff --git a/README_ES.md b/README_ES.md
index 6010f46..1103e22 100644
--- a/README_ES.md
+++ b/README_ES.md
@@ -450,7 +450,7 @@ Ilustramos el rendimiento de los modelos BF16, Int8 e Int4 en la prueba de refer
 ### Cuantización de la caché KV
 > NOTA: Por favor, ten en cuenta que debido al mecanismo interno de Hugging Face, los archivos de soporte para esta funcionalidad
-> (es decir, `cache_autogptq_cuda_256.cpp` y `cache_autogptq_cuda_kernel_245.cu`).
+> (es decir, `cache_autogptq_cuda_256.cpp` y `cache_autogptq_cuda_kernel_256.cu`).
 > Por favor, descárguelos manualmente del Hugging Face Hub y colóquelos en la misma carpeta que los demás archivos del módulo.
 La caché KV de atención puede cuantificarse y comprimirse para su almacenamiento, con el fin de obtener un mayor rendimiento de la muestra. Los argumentos `use_cache_quantization` y `use_cache_kernel` en `config.json` se proporcionan para habilitar la cuantización de la caché KV.
diff --git a/README_FR.md b/README_FR.md
index f59e9e7..e323a76 100644
--- a/README_FR.md
+++ b/README_FR.md
@@ -451,7 +451,7 @@ Nous illustrons les performances des modèles BF16, Int8 et Int4 sur le benchmar
 ### Quantization du cache KV
 > NOTE : Veuillez noter qu'en raison du mécanisme interne de Hugging Face, les fichiers de support pour cette fonctionnalité
-> (i.e., `cache_autogptq_cuda_256.cpp` et `cache_autogptq_cuda_kernel_245.cu`) peuvent être manquants.
+> (i.e., `cache_autogptq_cuda_256.cpp` et `cache_autogptq_cuda_kernel_256.cu`) peuvent être manquants.
 > Veuillez les télécharger manuellement manuellement depuis le Hugging Face Hub et placez-les dans le même dossier que les autres fichiers du module.
 Le cache KV de l'attention peut être quantifié et compressé pour le stockage, afin d'obtenir un débit d'échantillonnage plus élevé. Les arguments `use_cache_quantization` et `use_cache_kernel` dans `config.json` sont fournis pour activer la quantification du cache KV.
diff --git a/README_JA.md b/README_JA.md
index 8961419..c4a939e 100644
--- a/README_JA.md
+++ b/README_JA.md
@@ -447,7 +447,7 @@ response, history = model.chat(tokenizer, "Hi", history=None)
 ### KVキャッシュ量子化
 > 注意: Hugging Faceの内部メカニズムにより、この機能のサポートファイル
-> (すなわち、`cache_autogptq_cuda_256.cpp`と`cache_autogptq_cuda_kernel_245.cu`)が欠落している可能性があります。以下を手動でダウンロードしてください。
+> (すなわち、`cache_autogptq_cuda_256.cpp`と`cache_autogptq_cuda_kernel_256.cu`)が欠落している可能性があります。以下を手動でダウンロードしてください。
 > Hugging Face Hubから手動でダウンロードし、他のモジュールファイルと同じフォルダに入れてください。
 アテンション KV キャッシュを量子化して圧縮して保存すると、サンプルのスループットが向上する。この機能を有効にするには、`config.json` に `use_cache_quantization` と `use_cache_kernel` という引数を指定する。