parent d275e5b91a · commit ee01f36ed9
@@ -0,0 +1,131 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "54d5d255-aa98-4655-8dd1-bc726430d86a",
   "metadata": {},
   "source": [
    "# Qwen-7B-Chat Chat Demo"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31e04af4-eb27-4802-a7b2-6ea0525f1dc8",
   "metadata": {},
   "source": [
    "This notebook uses Qwen-7B-Chat as an example to show how to build a web-based conversational assistant with Gradio."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75e51155-9f8e-40dc-8432-60f4567d93a8",
   "metadata": {},
   "source": [
    "## Preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff6f061c-a033-49f2-8f7d-af3f23ac9125",
   "metadata": {},
   "source": [
    "Download Qwen-7B-Chat\n",
    "\n",
    "First, we need to download the model. You can use the `snapshot_download` function that comes with ModelScope to download the model to a specified directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c469a129-451f-4d01-8bc0-e2cf70a262c8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install modelscope"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69af626e-22b8-49ad-8869-8354f4c72bcc",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from modelscope.hub.snapshot_download import snapshot_download\n",
    "snapshot_download(\"qwen/Qwen-7B-Chat\", cache_dir='/tmp/models')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01d2ff34-4053-4710-a289-e354673be1ca",
   "metadata": {},
   "source": [
    "## Install Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48b51791-4bbc-4d12-9cd6-587c24c8bea7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install -r ../../../requirements.txt\n",
    "!pip install gradio==3.37.0 mdtex2html"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7732037a-246a-4953-af07-dae7a3ae5937",
   "metadata": {},
   "source": [
    "## Run the web UI code to start the Qwen chatbot\n",
    "\n",
    "Run the web_demo.py file to have real-time conversations with Qwen-7B-Chat on the web page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e256f0a-d96d-4fd7-b305-fe43c6959dc8",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "!python ../../../web_demo.py -c /tmp/models/qwen/Qwen-7B-Chat"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
@@ -0,0 +1,142 @@
# Fine-tuning Qwen by Ascend NPU
Below, we provide a simple example to show how to fine-tune Qwen on Ascend NPUs. You can also refer to the official [mindformers](https://gitee.com/mindspore/mindformers/blob/dev/research/qwen/qwen.md) documentation for detailed usage.

## Environment Requirement

- Hardware: Ascend 910A/B

## Quickstart

1. Launch Docker Image

```bash
ImageID=pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:qwen_v23.0.rc3
docker run -it -u root --ipc=host \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /var/log/npu/:/usr/slog \
-v /etc/hccn.conf:/etc/hccn.conf \
${ImageID} /bin/bash
```

2. Download and Convert Model

- download the model with ModelScope

```bash
cd mindformers
python3 -c "from modelscope.hub.snapshot_download import snapshot_download; snapshot_download('Qwen/Qwen-7B-Chat', cache_dir='.', revision='master')"
```

- convert HF model weights to ckpt weights

```bash
python research/qwen/convert_weight.py \
--torch_ckpt_dir Qwen/Qwen-7B-Chat \
--mindspore_ckpt_path qwen-7b-chat.ckpt

mkdir -vp load_checkpoint/rank_0
mv qwen-7b-chat.ckpt load_checkpoint/rank_0/
```

3. Prepare Training Data

- download demo data

```bash
wget -c https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/alpaca_data_min.json
```

- convert the raw data to the specified conversation format

```bash
python research/qwen/alpaca_converter.py \
--data_path alpaca_data_min.json \
--output_path alpaca-data-conversation_min.json
```

- generate MindRecord data

```bash
python research/qwen/qwen_preprocess.py \
--input_glob alpaca-data-conversation_min.json \
--model_file Qwen/Qwen-7B-Chat/qwen.tiktoken \
--seq_length 1024 \
--output_file alpaca_min.mindrecord
```

4. Prepare RANK_TABLE_FILE

```bash
# generate RANK_TABLE_FILE with 8 NPUs
python mindformers/tools/hccl_tools.py --device_num "[0,8)"
```
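
Before launching the fine-tuning job, it can help to sanity-check the generated rank table. The snippet below is a minimal sketch only: it assumes the standard HCCL rank-table JSON schema (`server_list` / `device`) and that `hccl_tools.py` wrote the file into the current directory with an `hccl_*.json` name; adjust the glob pattern if your setup writes it elsewhere.

```python
# Minimal sanity check of the rank table generated in step 4 (illustrative only).
import glob
import json

candidates = sorted(glob.glob("hccl_*.json"))
assert candidates, "No RANK_TABLE_FILE found; re-run hccl_tools.py first."

with open(candidates[-1]) as f:
    rank_table = json.load(f)

# For this 8-NPU example the table should describe one server with 8 devices.
devices = rank_table["server_list"][0]["device"]
print(f"rank table: {candidates[-1]}, devices: {len(devices)}")
```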

5. Fine-tune

You need to replace RANK_TABLE_FILE with the file generated in step 4.

```bash
export MS_ASCEND_CHECK_OVERFLOW_MODE=INFNAN_MODE
bash research/run_singlenode.sh "python3 research/qwen/run_qwen.py \
--config research/qwen/run_qwen_7b.yaml \
--load_checkpoint /mindformers/research/qwen/load_checkpoint \
--vocab_file Qwen/Qwen-7B-Chat/qwen.tiktoken \
--use_parallel True \
--run_mode finetune \
--auto_trans_ckpt True \
--train_data alpaca_min.mindrecord" \
RANK_TABLE_FILE [0,8] 8
```

6. Merge Model Weights

- Rename model weights

```bash
cd output/checkpoint_network
mv rank_0/qwen_rank_0-network.ckpt rank_0/checkpoint_0.ckpt
mv rank_1/qwen_rank_1-network.ckpt rank_1/checkpoint_1.ckpt
mv rank_2/qwen_rank_2-network.ckpt rank_2/checkpoint_2.ckpt
mv rank_3/qwen_rank_3-network.ckpt rank_3/checkpoint_3.ckpt
mv rank_4/qwen_rank_4-network.ckpt rank_4/checkpoint_4.ckpt
mv rank_5/qwen_rank_5-network.ckpt rank_5/checkpoint_5.ckpt
mv rank_6/qwen_rank_6-network.ckpt rank_6/checkpoint_6.ckpt
mv rank_7/qwen_rank_7-network.ckpt rank_7/checkpoint_7.ckpt
cd ../..
```

- Merge model weights

```bash
python mindformers/tools/transform_ckpt.py \
--src_ckpt_strategy output/strategy \
--src_ckpt_dir output/checkpoint_network \
--dst_ckpt_dir output/merged_model
```

7. Run Inference with the Fine-tuned Model

```bash
python research/qwen/run_qwen.py \
--config research/qwen/run_qwen_7b.yaml \
--predict_data '比较适合深度学习入门的书籍有' \
--run_mode predict \
--load_checkpoint output/merged_model/rank_0/checkpoint_0.ckpt \
--auto_trans_ckpt False \
--device_id 0
```
@@ -0,0 +1,2 @@
deepspeed
peft
@@ -0,0 +1,198 @@
## Introduction
[SWIFT](https://github.com/modelscope/swift) (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible framework designed to facilitate lightweight model fine-tuning and inference. It integrates implementations of various efficient fine-tuning methods, embracing approaches that are parameter-efficient, memory-efficient, and time-efficient. SWIFT integrates seamlessly into the ModelScope ecosystem and offers the capability to fine-tune various models, with a primary emphasis on LLMs and vision models. Additionally, SWIFT is fully compatible with PEFT, enabling users to leverage the familiar PEFT interface to fine-tune ModelScope models.

## Installation

```shell
# Set the global pip mirror
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e .[llm]

# If you want to use deepspeed
pip install deepspeed -U

# If you want to use qlora training based on auto_gptq (recommended, performs better than bnb)
# Models supporting auto_gptq: `https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md#模型`
# There's a version correspondence between auto_gptq and cuda; refer to `https://github.com/PanQiWei/AutoGPTQ#quick-installation` for selecting the appropriate version
pip install auto_gptq -U

# If you want to use qlora training based on bnb
pip install bitsandbytes -U

# Environment alignment (run the following commands if you encounter errors; the repository is tested with the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```

## WebUI Usage

Run the following command to start the web UI and conduct model training and inference through the graphical interface:
```shell
swift web-ui
```
A screenshot example can be found at:


## Fine-tuning

```shell
# Experimental environment: A10, 3090, V100, ...
# GPU memory requirement: 20GB
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset blossom-math-zh \
    --output_dir output \

# Use your own dataset
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --custom_train_dataset_path chatml.jsonl \
    --output_dir output \

# Using DDP (Distributed Data Parallel)
# Experimental environment: 2 * 3090
# GPU memory requirement: 2 * 23GB
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=2 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset blossom-math-zh \
    --output_dir output \

# Multi-machine multi-GPU setup
# node0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=2 \
NODE_RANK=0 \
MASTER_ADDR=127.0.0.1 \
NPROC_PER_NODE=4 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset blossom-math-zh \
    --output_dir output \
# node1
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=2 \
NODE_RANK=1 \
MASTER_ADDR=xxx.xxx.xxx.xxx \
NPROC_PER_NODE=4 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset blossom-math-zh \
    --output_dir output \
```
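
If you prefer launching fine-tuning from Python rather than the CLI, ms-swift also exposes a programmatic entry point. The following is a minimal sketch assuming the `SftArguments` / `sft_main` interface exported by `swift.llm` in the release installed above; check the argument names against your installed version.

```python
# Programmatic fine-tuning sketch; assumes swift.llm exports SftArguments and
# sft_main as in the ms-swift release installed above.
import os

from swift.llm import SftArguments, sft_main

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

sft_args = SftArguments(
    model_id_or_path='qwen/Qwen-7B-Chat',  # same model id as the CLI example
    dataset=['blossom-math-zh'],           # or point to your own ChatML-style jsonl
    output_dir='output',
)
result = sft_main(sft_args)
print(result)  # training summary, including checkpoint locations
```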

For more fine-tuning methods, please refer to [here](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E5%BE%AE%E8%B0%83%E6%96%87%E6%A1%A3.md#%E5%BE%AE%E8%B0%83).

### Examples

| Model Name | Training Method |
|:-------------------|:------------------|
| qwen_14b | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b/lora_ddp_ds) |
| qwen_14b | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b/qlora) |
| qwen_14b | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b/qlora_ddp_ds) |
| qwen_14b_chat | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_ds) |
| qwen_14b_chat | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/qlora) |
| qwen_14b_chat | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/qlora_ddp_ds) |
| qwen_14b_chat_int4 | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora) |
| qwen_14b_chat_int4 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int4/qlora_ddp_ds) |
| qwen_14b_chat_int8 | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora) |
| qwen_14b_chat_int8 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat_int8/qlora_ddp_ds) |
| qwen_1_8b_chat | [full](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_1_8b_chat/full) |
| qwen_1_8b_chat | [full_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_1_8b_chat/full_ddp) |
| qwen_72b_chat | [lora_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp) |
| qwen_72b_chat | [lora_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp_ddp) |
| qwen_72b_chat | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/qlora) |
| qwen_72b_chat_int4 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat_int4/qlora_ddp_ds) |
| qwen_72b_chat_int8 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat_int8/qlora_ddp_ds) |
| qwen_7b | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b/lora_ddp_ds) |
| qwen_7b | [qlora_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b/qlora_ddp) |
| qwen_7b_chat | [full](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full) |
| qwen_7b_chat | [full_freeze_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_freeze_ddp) |
| qwen_7b_chat | [full_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp) |
| qwen_7b_chat | [full_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp_ddp) |
| qwen_7b_chat | [lora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora) |
| qwen_7b_chat | [lora_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp) |
| qwen_7b_chat | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora_ddp_ds) |
| qwen_7b_chat | [lora_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/lora_mp_ddp) |
| qwen_7b_chat | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora) |
| qwen_7b_chat | [qlora_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp) |
| qwen_7b_chat | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora_ddp_ds) |
| qwen_7b_chat_int4 | [qalora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qalora) |
| qwen_7b_chat_int4 | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora) |
| qwen_7b_chat_int4 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora_ddp_ds) |
| qwen_7b_chat_int8 | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora) |
| qwen_7b_chat_int8 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int8/qlora_ddp_ds) |
| qwen_audio_chat | [full_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/full_mp) |
| qwen_audio_chat | [full_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/full_mp_ddp) |
| qwen_audio_chat | [lora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/lora) |
| qwen_audio_chat | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_audio_chat/lora_ddp_ds) |
| qwen_vl | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl/lora_ddp_ds) |
| qwen_vl_chat | [full_mp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/full_mp) |
| qwen_vl_chat | [full_mp_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/full_mp_ddp) |
| qwen_vl_chat | [lora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/lora) |
| qwen_vl_chat | [lora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/lora_ddp_ds) |
| qwen_vl_chat | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat/qlora) |
| qwen_vl_chat_int4 | [qlora](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora) |
| qwen_vl_chat_int4 | [qlora_ddp_ds](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_vl_chat_int4/qlora_ddp_ds) |

## Inference

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything

model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: qwen

kwargs = {}
# kwargs['use_flash_attn'] = True  # Use flash_attn if desired

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}, **kwargs)
# Modify max_new_tokens
model.generation_config.max_new_tokens = 128

template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'What is the provincial capital of Zhejiang?'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')

query = 'What delicious food can be found here?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')

"""Output[0]:
query: What is the provincial capital of Zhejiang?
response: The provincial capital of Zhejiang is Hangzhou.
query: What delicious food can be found here?
response: Hangzhou has many famous delicacies, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Spare Ribs, and Maodu. Additionally, there are unique Hangzhou-style pastries like Osmanthus Cake, Lotus Paste Pastry, and Aiwo Steamed Rice Cakes.
history: [('What is the provincial capital of Zhejiang?', 'The provincial capital of Zhejiang is Hangzhou.'), ('What delicious food can be found here?', 'Hangzhou has many famous delicacies, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Spare Ribs, and Maodu. Additionally, there are unique Hangzhou-style pastries like Osmanthus Cake, Lotus Paste Pastry, and Aiwo Steamed Rice Cakes.')]
"""

# Streaming dialogue output with verbose mode
inference(model, template, 'What was the first question?', history, verbose=True, stream=True)
"""Output[1]:
[PROMPT]
You asked your first question, "What is the provincial capital of Zhejiang?"
[OUTPUT] Your first question was "What is the provincial capital of Zhejiang?"
"""

For more on inference usage, please refer to [here](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM_Inference_Guide.md).
@@ -0,0 +1,56 @@
# Inference Qwen Using DashScope

The simplest way to use Qwen through APIs is the DashScope API service on Alibaba Cloud. We give an introduction to its usage below. Additionally, we provide a script for you to deploy an OpenAI-style API on your own servers.

DashScope is the large language model API service provided by Alibaba Cloud, which now supports Qwen. Note that the models behind DashScope are in-house versions temporarily without details provided. The services include `qwen-turbo` and `qwen-plus`, where the former runs faster and the latter achieves better performance. For more information, visit the documentation [here](https://dashscope.aliyun.com).

Please head to the official website [link](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) to create a DashScope account and obtain the API key (AK). We recommend setting the AK as an environment variable:
```bash
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```
Then install the packages; click [here](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk) for the documentation. If you use Python, you can install DashScope with pip:
```bash
pip install dashscope
```
If you use the Java SDK, you can install it in this way:
```xml
<!-- https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>dashscope-sdk-java</artifactId>
    <version>the-latest-version</version>
</dependency>
```
The simplest way to use DashScope is with messages, which is similar to the OpenAI API. The example is demonstrated below:
```python
import random
from http import HTTPStatus
from dashscope import Generation


def call_with_messages():
    messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
                {'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
    gen = Generation()
    response = gen.call(
        Generation.Models.qwen_turbo,
        messages=messages,
        seed=random.randint(1, 10000),  # set the random seed; optional, defaults to 1234 if not set
        result_format='message',  # set the result to be in "message" format
    )
    return response


if __name__ == '__main__':
    response = call_with_messages()
    if response.status_code == HTTPStatus.OK:
        print(response)
    else:
        print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
            response.request_id, response.status_code,
            response.code, response.message
        ))
```
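
DashScope can also stream partial results as they are generated. The following is a small sketch, assuming the installed `dashscope` SDK accepts a `stream=True` flag on `Generation.call` and yields incremental response objects; consult the official SDK documentation for the exact behavior of your version.

```python
from http import HTTPStatus
from dashscope import Generation

# Streaming sketch (assumption: stream=True yields partial response objects).
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
responses = Generation().call(
    Generation.Models.qwen_turbo,
    messages=messages,
    result_format='message',
    stream=True,  # yield partial results instead of a single final response
)
for partial in responses:
    if partial.status_code == HTTPStatus.OK:
        print(partial)  # inspect the partial output as it accumulates
    else:
        print(f'Error {partial.code}: {partial.message}')
```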

For more usage examples, please visit the official website.

<br><br>
@@ -0,0 +1,113 @@
# Quantization

## GPTQ

We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release the Int4 and Int8 quantized models, which achieve nearly lossless model quality with lower memory cost and faster inference speed.

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 or above, transformers 4.32.0 or above, etc.) and install the required packages:

```bash
pip install auto-gptq optimum
```

If you run into problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.

> Note: The pre-compiled `auto-gptq` packages strongly depend on the version of `torch` and its CUDA version. Moreover, due to recent updates,
> you may also encounter unsupported version errors from `transformers`, `optimum`, or `peft`.
> We recommend using the latest versions meeting the following requirements:
> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0

Then you can load the quantized model easily and run inference just as usual:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "Hi", history=None)
```

We illustrate the performance of the BF16, Int8, and Int4 models on the benchmarks, and we find that the quantized models do not suffer from significant performance degradation. Results are shown below:

| Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
|----------------------|:----:|:-----------:|:-----:|:---------:|
| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 |
| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 |
| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 |
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |

## Quantization of the KV Cache

> NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality
> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_256.cu`) may be missing. Please manually download
> them from the Hugging Face Hub and place them into the same folder as the other module files.

The attention KV cache can be quantized and compressed for storage to achieve a higher sample throughput. The arguments `use_cache_quantization` and `use_cache_kernel` in `config.json` are provided to enable KV cache quantization. The specific usage is as follows:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False
)
```
Attention: Currently, KV cache quantization and flash attention cannot be used at the same time.
If you enable both (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), `use_flash_attn` is disabled by default (`use_flash_attn=False`).

We have verified that using the quantized Int8 KV cache does not cause significant performance degradation in downstream evaluation. In the following, we focus on profiling its memory footprint in different conditions.
The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.
We use BF16 models to generate 1024 tokens by default, and "OOM" indicates an out-of-memory error.

With KV cache quantization, the model can infer with a larger batch size (bs).

| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
|--------------|:------:|:------:|:------:|:------:|:------:|:------:|
| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | OOM | OOM |
| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |

With KV cache quantization, the model can also save more memory when generating longer sequences (`sl`, sequence length, referring to the number of tokens generated) at inference time.

| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
|--------------|:------:|:-------:|:-------:|:-------:|:-------:|
| No | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| Yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |

With KV cache quantization, the model converts the format of `layer_past` from float to int8, and the quantized `layer_past` also stores the quantization parameters.

Specific steps are as follows:

1. Quantize key/value
```
qv, scale, zero_point = quantize_cache_v(v)
```
2. Store into layer_past

The following is the format of the quantized `layer_past`:
```
layer_past = ((q_key, key_scale, key_zero_point),
              (q_value, value_scale, value_zero_point))
```

The original format of `layer_past` is shown below:
```
layer_past = (key, value)
```

If you want to use the quantized attention KV, apply the dequantization operation to convert the Int8 key/value back to float format as follows:
```
v = dequantize_cache_torch(qv, scale, zero_point)
```
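
The `quantize_cache_v` / `dequantize_cache_torch` kernels above live in the model's remote code; conceptually they perform an asymmetric int8 round trip over the cached tensors. The standalone PyTorch sketch below (with function names chosen here for illustration, not the actual kernels) shows what such a quantize/dequantize pair does to a cached value tensor:

```python
import torch

def quantize_cache_demo(v: torch.Tensor, eps: float = 1e-6):
    """Asymmetric per-tensor int8 quantization: map [min, max] onto [0, 255]."""
    vmin, vmax = v.min(), v.max()
    scale = (vmax - vmin).clamp(min=eps) / 255.0
    zero_point = vmin
    qv = ((v - zero_point) / scale).round().clamp(0, 255).to(torch.uint8)
    return qv, scale, zero_point

def dequantize_cache_demo(qv: torch.Tensor, scale, zero_point) -> torch.Tensor:
    """Recover an approximate float tensor from the quantized cache."""
    return qv.to(torch.float32) * scale + zero_point

# Round-trip check on a fake value-cache slice: [batch, heads, seq, head_dim]
v = torch.randn(1, 32, 128, 128)
qv, scale, zero_point = quantize_cache_demo(v)
v_restored = dequantize_cache_demo(qv, scale, zero_point)
print("max abs error:", (v - v_restored).abs().max().item())  # on the order of scale/2
```

The stored result corresponds to the `(q_value, value_scale, value_zero_point)` tuple layout of `layer_past` shown above.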

<br>
@@ -0,0 +1,46 @@
# Inference Qwen Using TensorRT-LLM
Below, we provide a simple example to show how to run inference with Qwen using TensorRT-LLM. We recommend GPUs with compute capability of at least SM_80, such as A10 and A800, to run this example, as we have tested it on these GPUs. You can look up your GPU's compute capability at this [link](https://developer.nvidia.com/cuda-gpus).

## Installation
You can use the pre-built Docker image to run this example. You can also refer to the official [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) repository for installation and detailed usage.
```bash
docker run --gpus all -it --ipc=host --network=host pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:tensorrt-llm-0.8.0 bash
```
## Quickstart
1. Download the model with ModelScope

```bash
cd TensorRT-LLM/examples/qwen
python3 -c "from modelscope.hub.snapshot_download import snapshot_download; snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')"
mkdir -p ./tmp/Qwen
mv Qwen/Qwen-1_8B-Chat ./tmp/Qwen/1_8B
```

2. Build a TensorRT engine from the HF checkpoint

```bash
python3 build.py --hf_model_dir ./tmp/Qwen/1_8B/ \
                 --dtype float16 \
                 --remove_input_padding \
                 --use_gpt_attention_plugin float16 \
                 --enable_context_fmha \
                 --use_gemm_plugin float16 \
                 --output_dir ./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu/
```

3. Inference
```bash
python3 ../run.py --input_text "你好，请问你叫什么？" \
                  --max_output_len=512 \
                  --tokenizer_dir ./tmp/Qwen/1_8B/ \
                  --engine_dir=./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu
```
```
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好，请问你叫什么？<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "你好，我是来自阿里云的大规模语言模型，我叫通义千问。"
```
@@ -0,0 +1,184 @@
# Inference Qwen Using vLLM

For deployment and fast inference, we suggest using vLLM.

## Installation

If you use CUDA 12.1 and PyTorch 2.1, you can directly use the following command to install vLLM.
```bash
# Install vLLM with CUDA 12.1.
pip install vllm
```
Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).

If you have trouble building vLLM, we recommend using the Docker image.

```bash
docker run --gpus all -it --rm --ipc=host --network=host qwenllm/qwen:cu121 bash
```

## GPU Requirements

Qwen models use bfloat16 by default, but bfloat16 is only supported on GPUs with compute capability of at least 8.0. For GPUs with compute capability below 8.0, it is recommended to set the dtype to float16. You can look up your GPU's compute capability at this [link](https://developer.nvidia.com/cuda-gpus).
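
You can also check the compute capability programmatically and pick the dtype accordingly; a minimal sketch using the same PyTorch call as the unit tests in this repository:

```python
import torch

# Pick bfloat16 on compute capability 8.0 or newer, otherwise fall back to float16.
major, minor = torch.cuda.get_device_capability()
dtype = "bfloat16" if major >= 8 else "float16"
print(f"compute capability: {major}.{minor} -> use --dtype {dtype}")
```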

We have tested the GPU memory usage on an NVIDIA Tesla V100 32GB by manually adjusting gpu-memory-utilization in eager mode. You can refer to the following table to determine whether your machine is capable of running these models.
| Model | seq_len 2048 | seq_len 8192 | seq_len 16384 | seq_len 32768 |
| :--- | ---: | ---: | ---: | ---: |
| Qwen-1.8B | 6.22G | 7.46G | | |
| Qwen-7B | 17.94G | 20.96G | | |
| Qwen-7B-Int4 | 9.10G | 12.26G | | |
| Qwen-14B | 33.40G | | | |
| Qwen-14B-Int4 | 13.30G | | | |
| Qwen-72B | 166.87G | 185.50G | 210.80G | 253.80G |
| Qwen-72B-Int4 | 55.37G | 73.66G | 97.79G | 158.80G |

We have also listed the models that can run on consumer graphics cards at the default sequence length in the following table. If the GPU memory only exceeds the model's memory usage by a small margin, you can make the model run on your machine by reducing the max-model-len parameter.<br>
(Note: to run Qwen-14B-Int4 on an NVIDIA RTX 3080 Ti, you need to set gpu-memory-utilization to 0.99 and enforce eager mode.)

| GPU Memory | GPU | Supported Models |
| :---: | :---: | :---: |
| 24GB | NVIDIA RTX 4090/3090/A5000 | Qwen-1.8B/Qwen-7B/Qwen-7B-Int4/Qwen-14B-Int4 |
| 16GB | NVIDIA RTX A4000 | Qwen-1.8B/Qwen-7B-Int4/Qwen-14B-Int4 |
| 12GB | NVIDIA RTX 3080Ti/TITAN Xp | Qwen-1.8B/Qwen-14B-Int4 |
| 11GB | NVIDIA RTX 2080Ti/GTX 1080Ti | Qwen-1.8B |
| 10GB | NVIDIA RTX 3080 | Qwen-1.8B |

## Usage

### vLLM + Web Demo / OpenAI-like API

You can use FastChat to launch a web demo or an OpenAI API server. First, install FastChat:

```bash
pip install "fschat[model_worker,webui]==0.2.33" "openai<1.0"
```

To run Qwen with vLLM and FastChat, you first need to launch a controller:
```bash
python -m fastchat.serve.controller
```

Then you can launch the model worker, which loads your model for inference. For single-GPU inference, you can directly run:
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16
# for Int4 models or GPUs with compute capability less than 8.0
# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype float16
```

However, if you want to run the model on multiple GPUs for faster inference or larger memory, you can use tensor parallelism supported by vLLM. Suppose you run the model on 4 GPUs; the command is shown below:
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16
# for Int4 models or GPUs with compute capability less than 8.0
# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype float16
```

After launching your model worker, you can launch:

* Web UI Demo
```bash
python -m fastchat.serve.gradio_web_server
```

* OpenAI API
```bash
python -m fastchat.serve.openai_api_server --host localhost --port 8000
```

For the OpenAI API server, you can invoke it in the following manner.

```python
import openai
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# create a request activating streaming response
for chunk in openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}
    ],
    stream=True
    # Specifying stop words in streaming output format is not yet supported and is under development.
):
    if hasattr(chunk.choices[0].delta, "content"):
        print(chunk.choices[0].delta.content, end="", flush=True)

# create a request not activating streaming response
response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}
    ],
    stream=False,
    stop=[]  # You can add custom stop words here, e.g., stop=["Observation:"] for ReAct prompting.
)
print(response.choices[0].message.content)
```

If you find `"POST /v1/chat/completions HTTP/1.1" 200 OK` in the openai_api_server log, it indicates that the call was successful.

Note that vLLM does not support dynamic-NTK RoPE. Therefore, extending Qwen models to long sequences may lead to quality degradation (even gibberish).

### vLLM + Transformer-like Wrapper

You can download the [wrapper codes](vllm_wrapper.py) and execute the following commands for multiple rounds of dialogue interaction. (Note: it currently only supports the ``model.chat()`` method.)

```python
from vllm_wrapper import vLLMWrapper

# bfloat16 is only supported on GPUs with compute capability of at least 8.0
model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)

# for Int4 models or GPUs with compute capability less than 8.0
# model = vLLMWrapper('Qwen/Qwen-7B-Chat-Int4', tensor_parallel_size=1, dtype="float16")

response, history = model.chat(query="你好", history=None)
print(response)
response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
response, history = model.chat(query="给这个故事起一个标题", history=history)
print(response)
```
### vLLM Standalone OpenAI-like API

You can also deploy an OpenAI API server independently through vLLM. First, you need to download the [chat template file](template_chatml.jinja).

Then, you can launch an OpenAI API server with the following command:

```bash
python -m vllm.entrypoints.openai.api_server --model $model_path --trust-remote-code --chat-template template_chatml.jinja

# for Int4 models or GPUs with compute capability less than 8.0
# python -m vllm.entrypoints.openai.api_server --model $model_path --trust-remote-code --dtype float16 --chat-template template_chatml.jinja
```

For the vLLM standalone OpenAI API server, you need to set the `stop_token_ids` parameter to `[151645]` or the `stop` parameter to `["<|im_end|>"]` when invoking the server.

```python
import openai
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# create a request activating streaming response
for chunk in openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}
    ],
    stream=True,
    stop_token_ids=[151645]
):
    if hasattr(chunk.choices[0].delta, "content"):
        print(chunk.choices[0].delta.content, end="", flush=True)

# create a request not activating streaming response
response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}
    ],
    stream=False,
    stop_token_ids=[151645]
)
print(response.choices[0].message.content)
```
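
The same server can also be called with a plain HTTP request instead of the legacy `openai<1.0` SDK. The following is a minimal sketch that simply posts to the standard OpenAI-compatible `/v1/chat/completions` route, assuming the server launched above is listening on localhost:8000:

```python
import json

import requests

payload = {
    "model": "Qwen",
    "messages": [{"role": "user", "content": "你好"}],
    "stream": False,
    "stop": ["<|im_end|>"],  # stop at the ChatML end-of-turn marker
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```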
@@ -0,0 +1,6 @@
{% for message in messages %}
{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}
{{'<|im_start|>' + message['role'] + '\n' + message['content']}}
{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '\n'}}{% endif %}
{% endfor %}
{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant\n' }}{% endif %}
@@ -0,0 +1,13 @@
# Unit testing
- Run all unit tests
```bash
cd tests && pytest -s
```
- Run the unit tests under a single folder
```bash
cd tests && pytest -s {dir}
```
- Rerun the test cases that failed in the last run
```bash
cd tests && pytest -s --lf
```
@@ -0,0 +1 @@
[{"conversations": [{"from": "user", "value": "你好"}, {"from": "assistant", "value": "你好！很高兴为你提供帮助。"}], "id": "identity_0"}, {"conversations": [{"from": "user", "value": "给我讲一个年轻人奋斗创业最终取得成功的故事。"}, {"from": "assistant", "value": "这是一个关于一个年轻人奋斗创业最终取得成功的故事。"}], "id": "identity_1"}]
@@ -0,0 +1,100 @@
import os
import sys
import shutil
from itertools import product

import pytest
import torch
from modelscope.hub.snapshot_download import snapshot_download

sys.path.append(os.path.dirname(__file__) + "/..")
from utils import run_in_subprocess
from ut_config import (
    MODEL_TYPE,
    DOCKER_VERSION_CU114,
    DOCKER_VERSION_CU117,
    DOCKER_VERSION_CU121,
    DOCKER_MOUNT_DIR,
    DOCKER_TEST_DIR,
    DATA_DIR,
    DS_CONFIG_ZERO2_DIR,
    DS_CONFIG_ZERO3_DIR,
)

is_chat = ["chat", "base"]
docker_version = [DOCKER_VERSION_CU114, DOCKER_VERSION_CU117, DOCKER_VERSION_CU121]
# ZeRO3 is incompatible with LoRA when fine-tuning the base model.
# FSDP and ZeRO3 are incompatible with QLoRA.
parametrize_list_none_ds = list(
    product(*[[1], ["full", "lora"], is_chat, docker_version, [None]])
)
parametrize_list_ds_zero2 = list(
    product(*[[2], ["full", "lora"], is_chat, docker_version, [DS_CONFIG_ZERO2_DIR]])
)
parametrize_list_ds_zero3 = list(
    product(*[[2], ["full"], is_chat, docker_version, [DS_CONFIG_ZERO3_DIR]])
) + list(product(*[[2], ["lora"], ["chat"], docker_version, [DS_CONFIG_ZERO3_DIR]]))
parametrize_list_qlora = list(
    product(*[[1, 2], ["qlora"], ["chat"], docker_version, [None, DS_CONFIG_ZERO2_DIR]])
)
parametrize_list = (
    parametrize_list_none_ds
    + parametrize_list_ds_zero2
    + parametrize_list_ds_zero3
    + parametrize_list_qlora
)


@pytest.mark.parametrize(
    "num_gpus,train_type,is_chat,docker_version,deepspeed", parametrize_list
)
def test_finetune(num_gpus, train_type, is_chat, docker_version, deepspeed):
    cmd_docker = f"docker run --gpus all --ipc=host --network=host --rm -v {os.getcwd()}/../../../Qwen:{DOCKER_MOUNT_DIR} {docker_version} /bin/bash -c "
    cmd = ""
    # for GPUs with SM < 80
    is_ampere = torch.cuda.get_device_capability()[0] >= 8
    if not is_ampere:
        cmd = "pip uninstall -y flash-attn && "

    model_type = f"{MODEL_TYPE}-Chat" if is_chat == "chat" else MODEL_TYPE
    model_type = f"{model_type}-Int4" if train_type == "qlora" else model_type
    cmd += f"""torchrun --nproc_per_node {num_gpus} --nnodes 1 --node_rank 0 --master_addr localhost --master_port 12345 {DOCKER_MOUNT_DIR}/finetune.py \
    --model_name_or_path "{DOCKER_TEST_DIR}/{model_type}/" \
    --data_path {DATA_DIR} \
    --output_dir "{DOCKER_TEST_DIR}/output_qwen" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 512"""
    if deepspeed:
        cmd += f" --deepspeed {deepspeed}"
    if train_type == "lora":
        cmd += " --use_lora"
    elif train_type == "qlora":
        cmd += " --use_lora --q_lora"
    # for SM < 80
    if (
        (not is_ampere)
        and train_type == "lora"
        and (deepspeed and "zero2" in deepspeed)
        and is_chat == "base"
    ):
        cmd += " --fp16 True"
    snapshot_download(model_type, cache_dir=".", revision="master")
    run_in_subprocess(cmd_docker + f'"{cmd}"')
    if train_type == "full":
        assert os.path.exists("output_qwen/config.json")
    else:
        assert os.path.exists("output_qwen/adapter_config.json")
    shutil.rmtree("output_qwen")
@@ -0,0 +1,78 @@
import os
import sys
import time
import subprocess

import pytest
import torch
from modelscope.hub.snapshot_download import snapshot_download

sys.path.append(os.path.dirname(__file__) + "/..")
from utils import run_in_subprocess, simple_openai_api, TelnetPort
from ut_config import (
    MODEL_TYPE,
    DOCKER_VERSION_CU114,
    DOCKER_VERSION_CU117,
    DOCKER_VERSION_CU121,
    DOCKER_MOUNT_DIR,
    DOCKER_TEST_DIR,
)


# use_cpu=True, use_int4=False: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
# use_cpu=True, use_int4=True: ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU. You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
@pytest.mark.parametrize(
    "docker_version,use_cpu,use_int4",
    [
        (DOCKER_VERSION_CU114, False, False),
        (DOCKER_VERSION_CU114, False, True),
        (DOCKER_VERSION_CU117, False, False),
        (DOCKER_VERSION_CU117, False, True),
        (DOCKER_VERSION_CU121, False, False),
        (DOCKER_VERSION_CU121, False, True),
    ],
)
def test_inference_api(docker_version, use_cpu, use_int4):
    container_name = "test_inference_api"
    model_type = f"{MODEL_TYPE}-Chat-Int4" if use_int4 else f"{MODEL_TYPE}-Chat"
    cmd_docker = f'docker run --gpus all --ipc=host --network=host --rm --name="{container_name}" -p 8000:8000 -v {os.getcwd()}/../../../Qwen:{DOCKER_MOUNT_DIR} {docker_version} /bin/bash -c '
    cmd = ""
    # for GPUs with SM < 80
    is_ampere = torch.cuda.get_device_capability()[0] >= 8
    if not is_ampere:
        cmd += "pip uninstall -y flash-attn && "

    cmd += f"""python {DOCKER_MOUNT_DIR}/openai_api.py -c {DOCKER_TEST_DIR}/{model_type}"""

    if use_cpu:
        cmd += " --cpu-only"

    snapshot_download(model_type, cache_dir=".", revision="master")
    # start the model server
    print(cmd_docker + f'"{cmd}"')
    run_in_subprocess(
        f'docker rm -f {container_name} 2>/dev/null || echo "The container does not exist."'
    )
    run_in_subprocess("nohup " + cmd_docker + f'"{cmd}"' + " > tmp.log 2>&1 &")

    while not TelnetPort("localhost", 8000):
        print("Wait for the model service to start.")
        time.sleep(0.5)

        if (
            subprocess.run(
                f"docker inspect {container_name}",
                shell=True,
                stdout=subprocess.DEVNULL,
            ).returncode
            != 0
        ):
            break
    try:
        # when loading an Int4 model such as Qwen-1_8B-Chat-Int4, the served model name is Qwen-1_8B-Chat
        simple_openai_api(f"{MODEL_TYPE}-Chat".split("/")[-1])
    except Exception as e:
        time.sleep(1)
        with open("tmp.log") as f:
            raise Exception(f"{e} \n {f.read()}")

    run_in_subprocess(f"docker rm -f {container_name}")
@@ -0,0 +1,73 @@
import os
import sys
import time
import subprocess

import pytest
import torch
from modelscope.hub.snapshot_download import snapshot_download

sys.path.append(os.path.dirname(__file__) + "/..")
from utils import run_in_subprocess, simple_openai_api, TelnetPort
from ut_config import (
    MODEL_TYPE,
    DOCKER_VERSION_CU121,
    DOCKER_MOUNT_DIR,
    DOCKER_TEST_DIR,
)


@pytest.mark.parametrize(
    "num_gpus,use_int4",
    [
        (1, False),
        (1, True),
        (2, False),
        # ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
        # (2, True)
    ],
)
def test_inference_vllm_fschat(num_gpus, use_int4):
    model_type = f"{MODEL_TYPE}-Chat-Int4" if use_int4 else f"{MODEL_TYPE}-Chat"
    container_name = "test_inference_vllm_fschat"
    cmd_docker = f'docker run --gpus all --ipc=host --network=host --rm --name="{container_name}" -p 8000:8000 -v {os.getcwd()}/../../../Qwen:{DOCKER_MOUNT_DIR} {DOCKER_VERSION_CU121} /bin/bash -c '
    cmd = ""

    cmd += f"""nohup python -m fastchat.serve.controller > /dev/null 2>&1 \
    & python -m fastchat.serve.openai_api_server --host localhost --port 8000 > /dev/null 2>&1 \
    & python -m fastchat.serve.vllm_worker --model-path {DOCKER_TEST_DIR}/{model_type} --tensor-parallel-size {num_gpus} --trust-remote-code"""

    # for GPUs with SM < 80, or when use_int4 == True
    is_ampere = torch.cuda.get_device_capability()[0] >= 8
    if not is_ampere or use_int4:
        cmd += " --dtype half"

    snapshot_download(model_type, cache_dir=".", revision="master")
    # start the model server
    run_in_subprocess(
        f'docker rm -f {container_name} 2>/dev/null || echo "The container does not exist."'
    )
    print(cmd_docker + f'"{cmd}"')
    run_in_subprocess("nohup " + cmd_docker + f'"{cmd}"' + " > tmp.log 2>&1 &")

    while not TelnetPort("localhost", 21002):
        print("Wait for the model service to start.")
        time.sleep(0.5)

        if (
            subprocess.run(
                f"docker inspect {container_name}",
                shell=True,
                stdout=subprocess.DEVNULL,
            ).returncode
            != 0
        ):
            break

    try:
        simple_openai_api(model_type.split("/")[-1])
    except Exception as e:
        time.sleep(1)
        with open("tmp.log") as f:
            raise Exception(f"{e} \n {f.read()}")

    run_in_subprocess(f"docker rm -f {container_name}")
@@ -0,0 +1,18 @@
import os

# common
MODEL_TYPE = "Qwen/Qwen-1_8B"
DOCKER_VERSION_CU114 = "qwenllm/qwen:cu114"
DOCKER_VERSION_CU117 = "qwenllm/qwen:cu117"
DOCKER_VERSION_CU121 = "qwenllm/qwen:cu121"
DOCKER_MOUNT_DIR = "/qwen-recipes"
DOCKER_TEST_DIR = os.path.join(DOCKER_MOUNT_DIR, "recipes/tests")

# finetune
DATA_DIR = os.path.join(DOCKER_MOUNT_DIR, "recipes/tests/assets/test_sampled_qwen.json")
DS_CONFIG_ZERO2_DIR = os.path.join(
    DOCKER_MOUNT_DIR, "finetune/ds_config_zero2.json"
)
DS_CONFIG_ZERO3_DIR = os.path.join(
    DOCKER_MOUNT_DIR, "finetune/ds_config_zero3.json"
)
@@ -0,0 +1,61 @@
import logging
import socket
import subprocess

import openai


def run_in_subprocess(cmd):
    try:
        with subprocess.Popen(
            cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        ) as return_info:
            while True:
                next_line = return_info.stdout.readline()
                return_line = next_line.decode("utf-8", "ignore").strip()
                if return_line == "" and return_info.poll() is not None:
                    break
                if return_line != "":
                    logging.info(return_line)

            err_lines = ""
            while True:
                next_line = return_info.stderr.readline()
                return_line = next_line.decode("utf-8", "ignore").strip()
                if return_line == "" and return_info.poll() is not None:
                    break
                if return_line != "":
                    logging.info(return_line)
                    err_lines += return_line + "\n"

            return_code = return_info.wait()
            if return_code:
                raise RuntimeError(err_lines)
    except Exception as e:
        raise e


def simple_openai_api(model):
    openai.api_base = "http://localhost:8000/v1"
    openai.api_key = "none"

    # create a request without streaming response
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": "你好"}],
        stream=False,
        stop=[],  # You can add custom stop words here, e.g., stop=["Observation:"] for ReAct prompting.
    )
    print(response.choices[0].message.content)


def TelnetPort(server_ip, port):
    sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sk.settimeout(1)
    connect_flag = False
    try:
        sk.connect((server_ip, port))
        connect_flag = True
    except Exception:
        connect_flag = False
    sk.close()
    return connect_flag