@@ -32,6 +32,7 @@ In this repo, you can figure out:
* Details about the quantized models, including usage, memory costs, and inference speed. For comparison, we also provide the statistics of the BF16 models.
* Tutorials on finetuning, including full-parameter tuning, LoRA, and Q-LoRA.
* Instructions on building demos, including WebUI, CLI demo, etc.
* Instructions on building an OpenAI-style API for your model.
* Information about using Qwen for tool use, agents, and the code interpreter.
* Statistics from the long-context understanding evaluation.
* License agreement.
@@ -397,7 +398,11 @@ sh finetune/finetune_lora_single_gpu.sh
sh finetune/finetune_lora_ds.sh
```
In comparison with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of adapter layers while keeping the original large language model layers frozen. This greatly reduces memory costs and thus computation costs as well. However, if you still suffer from insufficient memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model and other techniques such as paged attention to reduce memory costs even further.
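To make the adapter idea concrete, below is a minimal sketch of the general LoRA recipe using the Hugging Face `transformers` and `peft` libraries. This is an illustration, not this repo's own finetuning code (the provided scripts handle all of this for you); the target module names and hyperparameters are assumptions that depend on the model architecture.

```python
# Minimal LoRA sketch with Hugging Face transformers + peft (generic recipe;
# module names and hyperparameters below are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "Qwen/Qwen-7B"  # any causal LM checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# LoRA trains small low-rank adapter matrices; the base weights stay frozen.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,              # rank of the adapter matrices
    lora_alpha=16,     # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj", "w1", "w2"],  # assumed projection names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of all params

# Q-LoRA applies the same adapter recipe on top of a quantized base model.
# One common generic approach (an assumption here, not necessarily what the
# provided scripts do) is 4-bit loading via bitsandbytes:
# from transformers import BitsAndBytesConfig
# quant_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16,
# )
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, quantization_config=quant_config, trust_remote_code=True
# )
```

After this setup, any standard causal-LM training loop (or `transformers.Trainer`) updates only the adapter weights.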
Note: To run single-GPU Q-LoRA training, you may need to install `mpi4py` through `pip` or `conda`.
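For example:

```bash
pip install mpi4py
# or, via conda (channel is an assumption):
# conda install -c conda-forge mpi4py
```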