From b7eb73d6ec4dca4dd1433033ef731b1eb88bd053 Mon Sep 17 00:00:00 2001
From: "feihu.hf" <feihu.hf@alibaba-inc.com>
Date: Thu, 14 Dec 2023 16:25:00 +0800
Subject: [PATCH] update readme for vllm-gptq

---
 README.md    | 12 ++++++++++--
 README_CN.md | 11 +++++++++--
 README_JA.md | 12 ++++++++++--
 3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 550b94b..be7781a 100644
--- a/README.md
+++ b/README.md
@@ -791,10 +791,15 @@ For deployment and fast inference, we suggest using vLLM.
 If you use cuda 12.1 and pytorch 2.1, you can directly use the following command to install vLLM.
 
 ```bash
-pip install vllm
+# pip install vllm  # Faster to install, but the official release does not support quantized models.
+
+# The lines below support int4 quantization (int8 support is coming soon). Installation is slower (~10 minutes).
+git clone https://github.com/QwenLM/vllm-gptq
+cd vllm-gptq
+pip install -e .
 ```
 
-Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).
+Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html), or our [vLLM repo for GPTQ quantization](https://github.com/QwenLM/vllm-gptq).
 
 #### vLLM + Transformer-like Wrapper
 
@@ -804,6 +809,7 @@ You can download the [wrapper codes](examples/vllm_wrapper.py) and execute the f
 from vllm_wrapper import vLLMWrapper
 
 model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)
+# model = vLLMWrapper('Qwen/Qwen-7B-Chat-Int4', tensor_parallel_size=1, dtype="float16")
 
 response, history = model.chat(query="你好", history=None)
 print(response)
@@ -829,10 +835,12 @@ python -m fastchat.serve.controller
 Then you can launch the model worker, which means loading your model for inference. For single GPU inference, you can directly run:
 ```bash
 python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16
+# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype float16 # run int4 model
 ```
 However, if you hope to run the model on multiple GPUs for faster inference or larger memory, you can use tensor parallelism supported by vLLM. Suppose you run the model on 4 GPUs, the command is shown below:
 ```bash
 python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16
+# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype float16 # run int4 model
 ```
 
 After launching your model worker, you can launch a:
diff --git a/README_CN.md b/README_CN.md
index eb22248..9a1ecd4 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -781,10 +781,15 @@ tokenizer.save_pretrained(new_model_directory)
 如果你使用cuda12.1和pytorch2.1，可以直接使用以下命令安装vLLM。
 
 ```bash
-pip install vllm
+# pip install vllm  # 该方法安装较快，但官方版本不支持量化模型
+
+# 下面的方法支持int4量化 (int8量化模型支持将于近期更新)，但安装更慢 (约10分钟)。
+git clone https://github.com/QwenLM/vllm-gptq
+cd vllm-gptq
+pip install -e .
 ```
 
-否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。
+否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)，或者安装我们[vLLM分支仓库](https://github.com/QwenLM/vllm-gptq)。
 
 #### vLLM + 类Transformer接口
 
@@ -819,10 +824,12 @@ python -m fastchat.serve.controller
 然后启动model worker读取模型。如使用单卡推理，运行如下命令：
 ```bash
 python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16
+# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype float16 # 运行int4模型
 ```
 然而，如果你希望使用多GPU加速推理或者增大显存，你可以使用vLLM支持的模型并行机制。假设你需要在4张GPU上运行你的模型，命令如下所示：
 ```bash
 python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16
+# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype float16 # 运行int4模型
 ```
 
 启动model worker后，你可以启动一个：
diff --git a/README_JA.md b/README_JA.md
index ddd31fe..f5d8598 100644
--- a/README_JA.md
+++ b/README_JA.md
@@ -784,10 +784,15 @@ Qwen-72B については、2 つの方法で実験します。1) 4 つの A100-S
 
 cuda 12.1 および pytorch 2.1 を使用している場合は、次のコマンドを直接使用して vLLM をインストールできます。
 ```bash
-pip install vllm
+# pip install vllm  # この方法はインストールが速いですが、量子化モデルをサポートしていません。
+
+# 以下の手順はINT4の量子化をサポートします（INT8はまもなくサポートされます）。インストールは遅くなります（約10分）。
+git clone https://github.com/QwenLM/vllm-gptq
+cd vllm-gptq
+pip install -e .
 ```
 
-それ以外の場合は、公式 vLLM [インストール手順](https://docs.vllm.ai/en/latest/getting_started/installation.html) を参照してください。
+それ以外の場合は、公式 vLLM [インストール手順](https://docs.vllm.ai/en/latest/getting_started/installation.html)、または [GPTQ 量子化対応の vLLM リポジトリ](https://github.com/QwenLM/vllm-gptq) を参照してください。
 
 #### vLLM + Transformer Wrapper
 
@@ -797,6 +802,7 @@ pip install vllm
 from vllm_wrapper import vLLMWrapper
 
 model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)
+# model = vLLMWrapper('Qwen/Qwen-7B-Chat-Int4', tensor_parallel_size=1, dtype="float16")
 
 response, history = model.chat(query="你好", history=None)
 print(response)
@@ -819,10 +825,12 @@ python -m fastchat.serve.controller
 それからmodel workerを起動し、推論のためにモデルをロードします。シングルGPU推論の場合は、直接実行できます：
 ```bash
 python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16
+# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype float16 # INT4モデルを実行します
 ```
 しかし、より高速な推論や大容量メモリーのために複数のGPUでモデルを実行したい場合は、vLLMがサポートするテンソル並列を使用することができます。モデルを4GPUで実行するとすると、コマンドは以下のようになります：
 ```bash
 python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16
+# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype float16 # INT4モデルを実行します
 ```
 
 モデルワーカーを起動した後、起動することができます：