diff --git a/README.md b/README.md index d49bc30..2a644fd 100644 --- a/README.md +++ b/README.md @@ -27,7 +27,6 @@ Qwen-7B is the 7B-parameter version of the large language model series, Qwen (ab The following sections include information that you might find it helpful. Specifically, we advise you to read the FAQ section before you launch issues. - ## News * 2023.8.3 We release both Qwen-7B and Qwen-7B-Chat on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance. @@ -250,11 +249,11 @@ Note: The GPU memory usage profiling in the above table is performed on single A We measured the average inference speed of generating 2K tokens under BF16 precision and Int8 or NF4 quantization levels, respectively. -| Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) | -| ------ | :---------------------------: | :---------------------------: | -| BF16 (no quantization) | 30.06 | 27.55 | -| Int8 (bnb) | 7.94 | 7.86 | -| NF4 (bnb) | 21.43 | 20.37 | +| Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) | +| ---------------------- | :----------------------------------------: | :---------------------------------------: | +| BF16 (no quantization) | 30.06 | 27.55 | +| Int8 (bnb) | 7.94 | 7.86 | +| NF4 (bnb) | 21.43 | 20.37 | In detail, the setting of profiling is generating 2048 new tokens with 1 context token. The profiling runs on single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the generated 2048 tokens. @@ -265,30 +264,23 @@ We also profile the peak GPU memory usage for encoding 2048 tokens as context (a When using flash attention, the memory usage is: | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens | -| --- | :---: | :---: | -| BF16 | 18.11GB | 23.52GB | -| Int8 | 12.17GB | 17.60GB | -| NF4 | 9.52GB | 14.93GB | +| ------------------ | :---------------------------------: | :-----------------------------------: | +| BF16 | 18.11GB | 23.52GB | +| Int8 | 12.17GB | 17.60GB | +| NF4 | 9.52GB | 14.93GB | When not using flash attention, the memory usage is: | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens | -| --- | :---: | :---: | -| BF16 | 18.11GB | 24.40GB | -| Int8 | 12.18GB | 18.47GB | -| NF4 | 9.52GB | 15.81GB | +| ------------------ | :---------------------------------: | :-----------------------------------: | +| BF16 | 18.11GB | 24.40GB | +| Int8 | 12.18GB | 18.47GB | +| NF4 | 9.52GB | 15.81GB | The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py). ## Demo -### CLI Demo - -We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns model outputs in the streaming mode. Run the command below: - -``` -python cli_demo.py -``` ### Web UI @@ -304,16 +296,40 @@ Then run the command below and click on the generated link: python web_demo.py ``` +
+<p align="center">
+    <img src="assets/web_demo.gif"/>
+</p>
+ +### CLI Demo + +We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns model outputs in the streaming mode. Run the command below: + +``` +python cli_demo.py +``` + +
+<p align="center">
+    <img src="assets/cli_demo.gif"/>
+</p>
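The streaming behaviour described above comes from the chat interface shipped with the checkpoint's remote code. As a rough illustration only, the sketch below assumes the model exposes the `chat_stream` generator that `cli_demo.py` builds on and that it yields the cumulative response text; the repo id and prompt are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model with remote code enabled (needed for Qwen's chat helpers).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

# Assumption: chat_stream(tokenizer, query, history=...) yields the response
# generated so far, so only the newly produced suffix is printed each step.
history = []
query = "Tell me something about large language models."
printed = 0
for partial in model.chat_stream(tokenizer, query, history=history):
    print(partial[printed:], end="", flush=True)
    printed = len(partial)
print()
```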
+ ## API + We provide methods to deploy local API based on OpenAI API (thanks to @hanpenggit). Before you start, install the required packages: ```bash pip install fastapi uvicorn openai pydantic sse_starlette ``` + Then run the command to deploy your API: + ```bash python openai_api.py ``` + You can change your arguments, e.g., `-c` for checkpoint name or path, `--cpu-only` for CPU deployment, etc. If you meet problems launching your API deployment, updating the packages to the latest version can probably solve them. Using the API is also simple. See the example below: @@ -345,6 +361,11 @@ response = openai.ChatCompletion.create( print(response.choices[0].message.content) ``` +
+<p align="center">
+    <img src="assets/openai_api.gif"/>
+</p>
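To complement the server-side instructions above, here is a minimal client-side sketch using the legacy `openai` (pre-1.0) Python bindings that the README's example relies on. The local address, port, and model name are assumptions; use whatever endpoint `openai_api.py` reports when it starts.

```python
import openai

# Point the client at the locally deployed OpenAI-style API (address/port assumed).
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # placeholder; adjust if your deployment checks API keys

response = openai.ChatCompletion.create(
    model="Qwen-7B",  # placeholder name; match the model served by openai_api.py
    messages=[{"role": "user", "content": "Hello!"}],
    stream=False,
)
print(response.choices[0].message.content)
```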
## Tool Usage diff --git a/README_CN.md b/README_CN.md index aea803b..af4d8f9 100644 --- a/README_CN.md +++ b/README_CN.md @@ -280,19 +280,10 @@ model = AutoModelForCausalLM.from_pretrained( | Int8 | 12.18GB | 18.47GB | | NF4 | 9.52GB | 15.81GB | - 以上测速和显存占用情况,均可通过该[评测脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)测算得到。 ## Demo -### 交互式Demo - -我们提供了一个简单的交互式Demo示例,请查看`cli_demo.py`。当前模型已经支持流式输出,用户可通过输入文字的方式和Qwen-7B-Chat交互,模型将流式输出返回结果。运行如下命令: - -``` -python cli_demo.py -``` - ### Web UI 我们提供了Web UI的demo供用户使用 (感谢 @wysaid 支持)。在开始前,确保已经安装如下代码库: @@ -307,16 +298,41 @@ pip install -r requirements_web_demo.txt python web_demo.py ``` +
+<p align="center">
+    <img src="assets/web_demo.gif"/>
+</p>
+ + +### 交互式Demo + +我们提供了一个简单的交互式Demo示例,请查看`cli_demo.py`。当前模型已经支持流式输出,用户可通过输入文字的方式和Qwen-7B-Chat交互,模型将流式输出返回结果。运行如下命令: + +``` +python cli_demo.py +``` + +
+<p align="center">
+    <img src="assets/cli_demo.gif"/>
+</p>
+ ## API + 我们提供了OpenAI API格式的本地API部署方法(感谢@hanpenggit)。在开始之前先安装必要的代码库: ```bash pip install fastapi uvicorn openai pydantic sse_starlette ``` + 随后即可运行以下命令部署你的本地API: + ```bash python openai_api.py ``` + 你也可以修改参数,比如`-c`来修改模型名称或路径, `--cpu-only`改为CPU部署等等。如果部署出现问题,更新上述代码库往往可以解决大多数问题。 使用API同样非常简单,示例如下: @@ -348,6 +364,11 @@ response = openai.ChatCompletion.create( print(response.choices[0].message.content) ``` +
+<p align="center">
+    <img src="assets/openai_api.gif"/>
+</p>
## 工具调用 @@ -405,7 +426,6 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct 如遇到问题,敬请查阅[FAQ](FAQ_zh.md)以及issue区,如仍无法解决再提交issue。 - ## 使用协议 研究人员与开发者可使用Qwen-7B和Qwen-7B-Chat或进行二次开发。我们同样允许商业使用,具体细节请查看[LICENSE](LICENSE)。如需商用,请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。 diff --git a/README_JA.md b/README_JA.md index e1431d3..f178493 100644 --- a/README_JA.md +++ b/README_JA.md @@ -285,14 +285,6 @@ Flash attentionを使用しない場合、メモリ使用量は次のように ## デモ -### CLI デモ - -`cli_demo.py` に CLI のデモ例を用意しています。ユーザはプロンプトを入力することで Qwen-7B-Chat と対話することができ、モデルはストリーミングモードでモデルの出力を返します。以下のコマンドを実行する: - -``` -python cli_demo.py -``` - ### ウェブ UI ウェブUIデモを構築するためのコードを提供します(@wysaidに感謝)。始める前に、以下のパッケージがインストールされていることを確認してください: @@ -307,7 +299,28 @@ pip install -r requirements_web_demo.txt python web_demo.py ``` +
+<p align="center">
+    <img src="assets/web_demo.gif"/>
+</p>
+ +### CLI デモ + +`cli_demo.py` に CLI のデモ例を用意しています。ユーザはプロンプトを入力することで Qwen-7B-Chat と対話することができ、モデルはストリーミングモードでモデルの出力を返します。以下のコマンドを実行する: + +``` +python cli_demo.py +``` + +
+<p align="center">
+    <img src="assets/cli_demo.gif"/>
+</p>
+ ## API + OpenAI APIをベースにローカルAPIをデプロイする方法を提供する(@hanpenggitに感謝)。始める前に、必要なパッケージをインストールしてください: ```bash @@ -351,6 +364,12 @@ response = openai.ChatCompletion.create( print(response.choices[0].message.content) ``` +
+<p align="center">
+    <img src="assets/openai_api.gif"/>
+</p>
+ ## ツールの使用 Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利用に特化して最適化されており、ユーザは独自の Qwen-7B ベースの LangChain、エージェント、コードインタプリタを構築することができます。ツール利用能力を評価するための評価[ベンチマーク](eval/EVALUATION.md)では、Qwen-7B は安定した性能に達しています。 diff --git a/assets/cli_demo.gif b/assets/cli_demo.gif new file mode 100644 index 0000000..61188ec Binary files /dev/null and b/assets/cli_demo.gif differ diff --git a/assets/openai_api.gif b/assets/openai_api.gif new file mode 100644 index 0000000..65e494c Binary files /dev/null and b/assets/openai_api.gif differ diff --git a/assets/web_demo.gif b/assets/web_demo.gif new file mode 100644 index 0000000..eee2c06 Binary files /dev/null and b/assets/web_demo.gif differ