diff --git a/recipes/applications/chatbot/qwen_chatbot.ipynb b/recipes/applications/chatbot/qwen_chatbot.ipynb new file mode 100644 index 0000000..3c27437 --- /dev/null +++ b/recipes/applications/chatbot/qwen_chatbot.ipynb @@ -0,0 +1,131 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "54d5d255-aa98-4655-8dd1-bc726430d86a", + "metadata": {}, + "source": [ + "# Qwen-7B-Chat Chatbot Demo" + ] + }, + { + "cell_type": "markdown", + "id": "31e04af4-eb27-4802-a7b2-6ea0525f1dc8", + "metadata": {}, + "source": [ + "This notebook uses Qwen-7B-Chat as an example to show you how to build a web-based conversational assistant using Gradio." + ] + }, + { + "cell_type": "markdown", + "id": "75e51155-9f8e-40dc-8432-60f4567d93a8", + "metadata": {}, + "source": [ + "## Preparation" + ] + }, + { + "cell_type": "markdown", + "id": "ff6f061c-a033-49f2-8f7d-af3f23ac9125", + "metadata": {}, + "source": [ + "Download Qwen-7B-Chat\n", + "\n", + "First, we need to download the model. You can use the `snapshot_download` function that comes with modelscope to download the model to a specified directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c469a129-451f-4d01-8bc0-e2cf70a262c8", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install modelscope" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69af626e-22b8-49ad-8869-8354f4c72bcc", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "snapshot_download(\"qwen/Qwen-7B-Chat\", cache_dir='/tmp/models')" + ] + }, + { + "cell_type": "markdown", + "id": "01d2ff34-4053-4710-a289-e354673be1ca", + "metadata": {}, + "source": [ + "## Install Dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "48b51791-4bbc-4d12-9cd6-587c24c8bea7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install -r ../../../requirements.txt\n", + "!pip install gradio==3.37.0 mdtex2html" + ] + }, + { + "cell_type": "markdown", + "id": "7732037a-246a-4953-af07-dae7a3ae5937", + "metadata": {}, + "source": [ + "## Run the web UI code to start the Qwen chatbot\n", + "\n", + "You can run the web_demo.py file to have real-time conversations with Qwen-7B-Chat in your browser."
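For orientation, web_demo.py essentially loads the checkpoint downloaded above and exposes Qwen's `model.chat` interface through a Gradio app. The following is a deliberately stripped-down sketch of that idea, not the actual web_demo.py; the local path and the use of `gr.ChatInterface` (available in gradio 3.37+) are illustrative assumptions.

```python
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/tmp/models/qwen/Qwen-7B-Chat"  # directory used by the snapshot_download cell above

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, device_map="auto", trust_remote_code=True
).eval()

def predict(message, history):
    # Qwen's chat API takes the history as a list of (query, response) pairs
    response, _ = model.chat(tokenizer, message, history=[tuple(h) for h in history])
    return response

# A minimal chat UI; the real web_demo.py adds streaming output, markdown rendering, etc.
gr.ChatInterface(predict).launch(server_name="0.0.0.0", server_port=7860)
```

In practice you simply run the provided script, as shown in the next cell.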
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e256f0a-d96d-4fd7-b305-fe43c6959dc8", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!python ../../../web_demo.py -c /tmp/models/qwen/Qwen-7B-Chat" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 + } + \ No newline at end of file diff --git a/recipes/applications/domain_finetune/qwen_domain_finetune.ipynb b/recipes/applications/domain_finetune/qwen_domain_finetune.ipynb new file mode 100644 index 0000000..b1be94e --- /dev/null +++ b/recipes/applications/domain_finetune/qwen_domain_finetune.ipynb @@ -0,0 +1,433 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0e7993c3-3999-4ac5-b1dc-77875d80e4c8", + "metadata": {}, + "source": [ + "# Fine-tuning Qwen-7B-Chat on Your Own Domain-Specific Data\n", + "\n", + "This notebook uses Qwen-7B-Chat as an example to introduce how to LoRA fine-tune the Qwen model on a specific domain.\n", + "\n", + "Qwen is a pretrained conversational model that supports English and Chinese. It is suitable for universal scenarios, but may lack some specialized knowledge in certain specific domain. If you would like to fine-tune it for a specific domain, or on your own private dataset, you can refer to this tutorial.\n", + "\n", + "Here is an example showing the differences before and after fine-tuning.\n", + "\n", + "#### Model Input\n", + ">怎样治疗青光眼?\n", + "\n", + "#### Before Fine-tuning\n", + "\n", + ">青光眼的治疗方法取决于病情严重程度。轻度青光眼可以采取药物治疗,比如激素类药物、抗生素等,以减轻炎症和改善血管紧张。\n", + ">\n", + ">中度青光眼可能需要激光手术来降低眼压,并使用药物以保持控制。\n", + ">\n", + ">重度青光眼则需要实施传统的眼部手术,如小梁切除术或巩膜外减压术,以防止视神经损伤。\n", + "\n", + "#### After Fine-tuning\n", + "\n", + ">青光眼是一种眼睛疾病,可以引起视神经的损伤和视力下降。治疗青光眼的主要方法是药物治疗、激光治疗和手术治疗。\n", + ">1. 药物治疗:对于早期青光眼患者,通常采用抗青光眼药物治疗,通过降低眼压来预防视神经进一步损伤。目前常用的抗青光眼药物有拉坦前列素滴眼液、阿托品滴眼液等。\n", + ">2. 激光治疗:对于中晚期青光眼患者,如果药物治疗效果不佳,可能需要进行激光治疗。目前常用的激光治疗方法包括小梁切除术、YAG激光手术等。\n", + ">3. 手术治疗:对于非常严重的青光眼患者,可能需要进行手术治疗。目前常用的手术治疗方法包括前房引流术、虹膜周切术、小梁切除术、巩膜滤过术等。\n", + ">\n", + ">需要注意的是,以上治疗方法的具体选择应根据患者的病情和医生的专业建议来进行。同时,青光眼患者在日常生活中也需要保持良好的生活习惯,避免过度疲劳、保持情绪稳定、定期检查眼睛等情况的发生。" + ] + }, + { + "cell_type": "markdown", + "id": "bdea7e21-fec8-49fe-b7ea-afde3f02738f", + "metadata": {}, + "source": [ + "## Environment Requirements\n", + "\n", + "Please refer to **requirements.txt** to install the required dependencies.\n", + "\n", + "Run the following command line in the main directory of the Qwen repo.\n", + "```bash\n", + "pip install -r requirements.txt\n", + "```\n", + "\n", + "\n", + "## Preparation\n", + "\n", + "### Download Qwen-7B-Chat\n", + "\n", + "First, download the model files. You can choose to download directly from ModelScope." 
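The next cell downloads the model from ModelScope. If you prefer the Hugging Face Hub, a roughly equivalent download is sketched below; the target directory is an arbitrary choice, and `huggingface_hub` needs to be installed separately (`pip install huggingface_hub`).

```python
from huggingface_hub import snapshot_download

# Download the model repository to a local directory of your choice (illustrative path)
snapshot_download(repo_id="Qwen/Qwen-7B-Chat", local_dir="Qwen/Qwen-7B-Chat")
```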
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "model_dir = snapshot_download('Qwen/Qwen-7B-chat', cache_dir='.')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f", + "metadata": {}, + "source": [ + "### Download Medical Training Data\n", + "\n", + "Download the data required for training; here, we provide a medical conversation dataset for training. It is sampled from [MedicalGPT repo](https://github.com/shibing624/MedicalGPT/) and we have converted this dataset into a format that can be used for fine-tuning.\n", + "\n", + "Disclaimer: the dataset can be only used for the research purpose." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/medical_sft.json" + ] + }, + { + "cell_type": "markdown", + "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f", + "metadata": {}, + "source": [ + "You can prepare your dataset in JSON format following the format below, and then modify the `--data_path` parameter in the training command to point to your JSON file.\n", + "\n", + "These data instances can be conversations in the real world or include domain knowledge QA pairs. Besides, fine-tuning allows Qwen-chat to play like some specific roles. As Qwen-chat is a dialogue model for general scenarios, your fine-tuning can customize a chatbot to meet your requirements.\n", + "\n", + "We recommend that you prepare 50~ data instances if you want to fine-tune Qwen-chat as a roleplay model.\n", + "\n", + "You may prepare much more data instances if you want to infuse the domain knowledge of your field into the model.\n", + "\n", + "In this tutorial, we have prepared a medical domain fine-tuning dataset consisting of 1000 data instancess as an example. You can refer to our example to fine-tune on your own domain-specific dataset.\n", + "\n", + "Below is a simple example list with 1 sample:\n", + "\n", + "```python\n", + "[\n", + " {\n", + " \"id\": \"1\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"直肠腺瘤的早期症状?\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"结直肠腺瘤是起源于结直肠黏膜腺上皮的良性肿瘤,包括结肠腺瘤与直肠腺瘤,是常见的肠道良性肿瘤。因与大肠癌的发生关系密切,被认为是一种癌前病变。不同地区、不同年龄的发病率差别很大,40岁以下的发病率低,60岁以上较高,男女无明显差别。此类疾病的病因及发病机制不明确。部分患者有遗传因素存在,另外,高脂肪饮食与食物纤维不足、肠道菌群紊乱也与此病的发生有相关性。多数患者无症状。症状与腺瘤的大小、部位、病理性质等有关。有症状者可有以下表现:直肠腺瘤的临床常见症状有便血,多为无疼性便血,常呈鲜红色,量多,可引起贫血。粪便性状改变,多为粘液便,便意不尽或里急重感,有时只排出粘液,称为假性腹泻。1.腹部不适、腹部疼痛,多为大的腺瘤伴发肠套叠、肠梗阻引起。2.排便习惯改变,包括便秘、腹泻、里急后重等。3.粪便带血,最常见为间歇性便血。4.部分位于直肠的较大的、带蒂腺瘤可在排便时脱落或脱出肛。可以采取内镜下高频电凝、激光、微波凝固等方法切除,也可以选择外科手术切除,并定期随访。有恶变者根据情况选择其他治疗(如放疗、化疗、手术等)管状腺瘤切除术后复发者少见,但绒毛状腺瘤及绒毛管状腺瘤切除术后常可复发,尤其是绒毛状腺瘤,且多发的腺瘤复发率高于单发者。对于经内镜治疗或局部手术切除的结直肠腺瘤患者尤其是绒毛状腺瘤或广基的绒毛管状腺瘤患者,建议腺瘤切除后的第一年内3~6个月进行一次肠镜检查,第二年开始每年一次。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "You can also use multi-turn conversations as the training set. 
Here is a simple example:\n", + "\n", + "```python\n", + "[\n", + " {\n", + " \"id\": \"2\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好,我最近经常感觉胸口疼痛,这是怎么回事?\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"胸痛可能由多种原因引起,包括心脏问题、消化系统疾病、呼吸系统问题等。您能描述一下疼痛的性质和持续时间吗?\"\n", + " },\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"疼痛感觉像是挤压,大概持续了几分钟。\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"挤压感的胸痛可能与心脏问题有关,特别是如果它伴随着呼吸困难、出汗或恶心。我建议您尽快去看医生并进行适当的检查,如心电图和血液检测,以确定具体原因。\"\n", + " },\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"我应该去急诊室吗,还是预约我的家庭医生?\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"如果您的疼痛是突然发生的,并且还有其他症状,建议您立即去急诊室。如果疼痛不是很严重且没有其他严重症状,您可以预约家庭医生进一步评估。但请不要忽视疼痛,尤其是如果这种情况是第一次出现。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "## Fine-Tune the Model\n", + "\n", + "You can directly run the prepared training script to fine-tune the model.\n", + "\n", + "For parameter settings, you can modify `--model_name_or_path` to the location of the model you want to fine-tune, and set `--data_path` to the location of the dataset.\n", + "\n", + "You should remove the `--bf16` parameter if you are using a non-Ampere architecture GPU, such as a V100.\n", + "\n", + "For `--model_max_length` and `--per_device_train_batch_size`, we recommend the following configurations; you can refer to [this document](../../finetune/deepspeed/readme.md) for more details:\n", + "\n", + "| --model_max_length | --per_device_train_batch_size | GPU Memory |\n", + "|-----------------|------------|--------------------|\n", + "| 512 | 4 | 24 GB |\n", + "| 1024 | 3 | 24 GB |\n", + "| 512 | 8 | 32 GB |\n", + "| 1024 | 6 | 32 GB |\n", + "\n", + "You can use our recommended saving parameters, or set `--save_strategy \"epoch\"` if you prefer to save a checkpoint at the end of each epoch. `--save_total_limit` limits the number of saved checkpoints.\n", + "\n", + "For other parameters, such as `--weight_decay` and `--adam_beta2`, we recommend using the values provided below.\n", + "\n", + "The `--gradient_checkpointing` and `--lazy_preprocess` parameters are set to save GPU memory.\n", + "\n", + "The parameters of the trained LoRA adapter will be saved in the **output_qwen** folder."
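Before launching training, it can help to sanity-check that your dataset file actually matches the schema shown above. The following is a minimal sketch; the file name matches the `--data_path` used below, and the field names are taken from the examples above.

```python
import json

with open("medical_sft.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

print(f"{len(samples)} samples loaded")
for sample in samples:
    # every sample needs an id and a list of conversation turns
    assert "id" in sample and "conversations" in sample
    for turn in sample["conversations"]:
        # every turn must name a speaker and carry non-empty text
        assert turn["from"] in ("user", "assistant")
        assert isinstance(turn["value"], str) and turn["value"]
```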
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!python ../../../finetune/finetune.py \\\n", + " --model_name_or_path \"Qwen/Qwen-7B-Chat/\"\\\n", + " --data_path \"medical_sft.json\"\\\n", + " --bf16 \\\n", + " --output_dir \"output_qwen\" \\\n", + " --num_train_epochs 4\\\n", + " --per_device_train_batch_size 4 \\\n", + " --per_device_eval_batch_size 3 \\\n", + " --gradient_accumulation_steps 16 \\\n", + " --evaluation_strategy \"no\" \\\n", + " --save_strategy \"epoch\" \\\n", + " --save_steps 3000 \\\n", + " --save_total_limit 10 \\\n", + " --learning_rate 1e-5 \\\n", + " --weight_decay 0.1 \\\n", + " --adam_beta2 0.95 \\\n", + " --warmup_ratio 0.01 \\\n", + " --lr_scheduler_type \"cosine\" \\\n", + " --logging_steps 10 \\\n", + " --model_max_length 512 \\\n", + " --gradient_checkpointing \\\n", + " --lazy_preprocess \\\n", + " --use_lora" + ] + }, + { + "cell_type": "markdown", + "id": "5e6f28aa-1772-48ce-aa15-8cf29e7d67b5", + "metadata": {}, + "source": [ + "## Merge Weights\n", + "\n", + "The LoRA training only saves the adapter parameters. You can load the fine-tuned model and merge weights as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "4fd5ef2a-34f9-4909-bebe-7b3b086fd16a", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "execution": { + "iopub.execute_input": "2024-01-26T02:46:14.585746Z", + "iopub.status.busy": "2024-01-26T02:46:14.585089Z", + "iopub.status.idle": "2024-01-26T02:47:08.095464Z", + "shell.execute_reply": "2024-01-26T02:47:08.094715Z", + "shell.execute_reply.started": "2024-01-26T02:46:14.585720Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to \"AutoModelForCausalLM.from_pretrained\".\n", + "Try importing flash-attention for faster inference...\n", + "Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm\n", + "Loading checkpoint shards: 100%|██████████| 8/8 [00:06<00:00, 1.14it/s]\n" + ] + } + ], + "source": [ + "from transformers import AutoModelForCausalLM\n", + "from peft import PeftModel\n", + "import torch\n", + "\n", + "model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-7B-chat/\", torch_dtype=torch.float16, device_map=\"auto\", trust_remote_code=True)\n", + "model = PeftModel.from_pretrained(model, \"output_qwen/\")\n", + "merged_model = model.merge_and_unload()\n", + "merged_model.save_pretrained(\"output_qwen_merged\", max_shard_size=\"2048MB\", safe_serialization=True)" + ] + }, + { + "cell_type": "markdown", + "id": "2e3f5b9f-63a1-4599-8d9b-a8d8f764838f", + "metadata": {}, + "source": [ + "The tokenizer files are not saved in the new directory in this step. 
You can copy the tokenizer files or use the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "10fa5ea3-dd55-4901-86af-c045d4c56533", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "execution": { + "iopub.execute_input": "2024-01-26T02:47:08.097051Z", + "iopub.status.busy": "2024-01-26T02:47:08.096744Z", + "iopub.status.idle": "2024-01-26T02:47:08.591289Z", + "shell.execute_reply": "2024-01-26T02:47:08.590665Z", + "shell.execute_reply.started": "2024-01-26T02:47:08.097029Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "('output_qwen_merged/tokenizer_config.json',\n", + " 'output_qwen_merged/special_tokens_map.json',\n", + " 'output_qwen_merged/qwen.tiktoken',\n", + " 'output_qwen_merged/added_tokens.json')" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from transformers import AutoTokenizer\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\n", + " \"Qwen/Qwen-7B-chat/\",\n", + " trust_remote_code=True\n", + ")\n", + "\n", + "tokenizer.save_pretrained(\"output_qwen_merged\")" + ] + }, + { + "cell_type": "markdown", + "id": "804b84d8", + "metadata": {}, + "source": [ + "## Test the Model\n", + "\n", + "After merging the weights, we can test the model as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "dbae310c", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "execution": { + "iopub.execute_input": "2024-01-26T02:48:29.995040Z", + "iopub.status.busy": "2024-01-26T02:48:29.994448Z", + "iopub.status.idle": "2024-01-26T02:48:41.677104Z", + "shell.execute_reply": "2024-01-26T02:48:41.676591Z", + "shell.execute_reply.started": "2024-01-26T02:48:29.995019Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm\n", + "Loading checkpoint shards: 100%|██████████| 8/8 [00:04<00:00, 1.71it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "VDAC1(电压依赖性钙通道)是一种位于细胞膜上的钙离子通道,负责将细胞内的钙离子释放到细胞外。它在神经信号传导、肌肉收缩和血管舒张中发挥着重要作用。\n", + "\n", + "VDAC1通常由4个亚基组成,每个亚基都有不同的功能。其中,一个亚基是内腔部分,它与钙离子的结合有关;另一个亚基是外腔部分,它与离子通道的打开和关闭有关;第三个亚基是一层跨膜蛋白,它负责调节通道的开放程度;最后一个亚基是一个膜骨架连接器,它帮助维持通道的结构稳定性。\n", + "\n", + "除了钙离子外,VDAC1还能够接收钾离子和氯离子等其他离子,并将其从细胞内释放到细胞外。此外,VDAC1还参与了许多细胞代谢反应,例如脂肪酸合成和糖原分解等。\n", + "\n", + "总的来说,VDAC1是细胞膜上的一种重要离子通道,其作用涉及到许多重要的生物学过程。\n" + ] + } + ], + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "from transformers.generation import GenerationConfig\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen_merged\", trust_remote_code=True)\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " \"output_qwen_merged\",\n", + " device_map=\"auto\",\n", + " trust_remote_code=True\n", + ").eval()\n", + "\n", + "response, history = model.chat(tokenizer, \"什么是VDAC1?\", history=None)\n", + "print(response)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "987f524d-6918-48ae-a730-f285cf6f8416", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": 
"text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/applications/retrieval/retrieval.ipynb b/recipes/applications/retrieval/retrieval.ipynb new file mode 100644 index 0000000..ed51b05 --- /dev/null +++ b/recipes/applications/retrieval/retrieval.ipynb @@ -0,0 +1,411 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "245ab07a-fb2f-4cf4-ab9a-5c05a9b44daa", + "metadata": {}, + "source": [ + "# LangChain retrieval knowledge base Q&A based on Qwen-7B-Chat" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "e8df2cb7-a69c-4231-9596-4c871d893633", + "metadata": {}, + "source": [ + "This notebook introduces a question-answering application based on a local knowledge base using Qwen-7B-Chat with langchain. The goal is to establish a knowledge base Q&A solution that is friendly to many scenarios and open-source models, and that can run offline. The implementation process of this project includes loading files -> reading text -> segmenting text -> vectorizing text -> vectorizing questions -> matching the top k most similar text vectors with the question vectors -> incorporating the matched text as context along with the question into the prompt -> submitting to the LLM (Large Language Model) to generate an answer." + ] + }, + { + "cell_type": "markdown", + "id": "92e9c81a-45c7-4c12-91af-3c5dd52f63bb", + "metadata": {}, + "source": [ + "## Preparation" + ] + }, + { + "cell_type": "markdown", + "id": "84cfcf88-3bef-4412-a658-4eaefeb6502a", + "metadata": {}, + "source": [ + "Download Qwen-7B-Chat\n", + "\n", + "Firstly, we need to download the model. You can use the snapshot_download that comes with modelscope to download the model to a specified directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c1f9ded-8035-42c7-82c7-444ce06572bc", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install modelscope" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c26225c-c958-429e-b81d-2de9820670c2", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "snapshot_download(\"Qwen/Qwen-7B-Chat\",cache_dir='/tmp/models') " + ] + }, + { + "cell_type": "markdown", + "id": "e8f51796-49fa-467d-a825-ae9a281eb3fd", + "metadata": {}, + "source": [ + "Download the dependencies for langchain and Qwen." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87fe1023-644f-4610-afaf-0b7cddc30d60", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install langchain==0.0.187 dashscope==1.0.4 sentencepiece==0.1.99 cpm_kernels==1.0.11 nltk==3.8.1 sentence_transformers==2.2.2 unstructured==0.6.5 faiss-cpu==1.7.4 icetk==0.0.7" + ] + }, + { + "cell_type": "markdown", + "id": "853cdfa4-a2ce-4baa-919a-b9e2aecd2706", + "metadata": {}, + "source": [ + "Download the retrieval document." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8ba800dc-311d-4a83-8115-f05b09b39ffd", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/LLM_Survey_Chinese.pdf.txt" + ] + }, + { + "cell_type": "markdown", + "id": "07e923b3-b7ae-4983-abeb-2ce115566f15", + "metadata": {}, + "source": [ + "Download the text2vec model, for Chinese in our case." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a07cd8d-3cec-40f6-8d2b-eb111aaf1164", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/GanymedeNil_text2vec-large-chinese.tar.gz\n", + "!tar -zxvf GanymedeNil_text2vec-large-chinese.tar.gz -C /tmp" + ] + }, + { + "cell_type": "markdown", + "id": "dc483af0-170e-4e61-8d25-a336d1592e34", + "metadata": {}, + "source": [ + "## Try out the model \n", + "\n", + "Load the Qwen-7B-Chat model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c112cf82-0447-46c4-9c32-18f243c0a686", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from abc import ABC\n", + "from langchain.llms.base import LLM\n", + "from typing import Any, List, Mapping, Optional\n", + "from langchain.callbacks.manager import CallbackManagerForLLMRun\n", + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "\n", + "model_path=\"/tmp/models/Qwen/Qwen-7B-Chat\"\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n", + "model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).half().cuda()\n", + "model.eval()\n", + "\n", + "class Qwen(LLM, ABC):\n", + " max_token: int = 10000\n", + " temperature: float = 0.01\n", + " top_p = 0.9\n", + " history_len: int = 3\n", + "\n", + " def __init__(self):\n", + " super().__init__()\n", + "\n", + " @property\n", + " def _llm_type(self) -> str:\n", + " return \"Qwen\"\n", + "\n", + " @property\n", + " def _history_len(self) -> int:\n", + " return self.history_len\n", + "\n", + " def set_history_len(self, history_len: int = 10) -> None:\n", + " self.history_len = history_len\n", + "\n", + " def _call(\n", + " self,\n", + " prompt: str,\n", + " stop: Optional[List[str]] = None,\n", + " run_manager: Optional[CallbackManagerForLLMRun] = None,\n", + " ) -> str:\n", + " response, _ = model.chat(tokenizer, prompt, history=[])\n", + " return response\n", + " \n", + " @property\n", + " def _identifying_params(self) -> Mapping[str, Any]:\n", + " \"\"\"Get the identifying parameters.\"\"\"\n", + " return {\"max_token\": self.max_token,\n", + " \"temperature\": self.temperature,\n", + " \"top_p\": self.top_p,\n", + " \"history_len\": self.history_len}\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "382ed433-870f-424e-b074-210ea6f84b70", + "metadata": {}, + "source": [ + "Specify the txt file that needs retrieval for knowledge-based Q&A." 
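Before running the full retrieval pipeline below, you can optionally smoke-test the wrapper defined above. This is a tiny sketch that simply instantiates the `Qwen` class and sends it one prompt:

```python
llm = Qwen()
print(llm("你好,请用一句话介绍一下你自己。"))
```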
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "14be706b-4a7d-4906-9369-1f03c6c99854", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "import torch\n", + "import argparse\n", + "from langchain.vectorstores import FAISS\n", + "from langchain.embeddings.huggingface import HuggingFaceEmbeddings\n", + "from typing import List, Tuple\n", + "import numpy as np\n", + "from langchain.document_loaders import TextLoader\n", + "from chinese_text_splitter import ChineseTextSplitter\n", + "from langchain.docstore.document import Document\n", + "from langchain.prompts.prompt import PromptTemplate\n", + "from langchain.chains import RetrievalQA\n", + "\n", + "\n", + "def load_file(filepath, sentence_size=100):\n", + " loader = TextLoader(filepath, autodetect_encoding=True)\n", + " textsplitter = ChineseTextSplitter(pdf=False, sentence_size=sentence_size)\n", + " docs = loader.load_and_split(textsplitter)\n", + " write_check_file(filepath, docs)\n", + " return docs\n", + "\n", + "\n", + "def write_check_file(filepath, docs):\n", + " folder_path = os.path.join(os.path.dirname(filepath), \"tmp_files\")\n", + " if not os.path.exists(folder_path):\n", + " os.makedirs(folder_path)\n", + " fp = os.path.join(folder_path, 'load_file.txt')\n", + " with open(fp, 'a+', encoding='utf-8') as fout:\n", + " fout.write(\"filepath=%s,len=%s\" % (filepath, len(docs)))\n", + " fout.write('\\n')\n", + " for i in docs:\n", + " fout.write(str(i))\n", + " fout.write('\\n')\n", + " fout.close()\n", + "\n", + " \n", + "def seperate_list(ls: List[int]) -> List[List[int]]:\n", + " lists = []\n", + " ls1 = [ls[0]]\n", + " for i in range(1, len(ls)):\n", + " if ls[i - 1] + 1 == ls[i]:\n", + " ls1.append(ls[i])\n", + " else:\n", + " lists.append(ls1)\n", + " ls1 = [ls[i]]\n", + " lists.append(ls1)\n", + " return lists\n", + "\n", + "\n", + "class FAISSWrapper(FAISS):\n", + " chunk_size = 250\n", + " chunk_conent = True\n", + " score_threshold = 0\n", + " \n", + " def similarity_search_with_score_by_vector(\n", + " self, embedding: List[float], k: int = 4\n", + " ) -> List[Tuple[Document, float]]:\n", + " scores, indices = self.index.search(np.array([embedding], dtype=np.float32), k)\n", + " docs = []\n", + " id_set = set()\n", + " store_len = len(self.index_to_docstore_id)\n", + " for j, i in enumerate(indices[0]):\n", + " if i == -1 or 0 < self.score_threshold < scores[0][j]:\n", + " # This happens when not enough docs are returned.\n", + " continue\n", + " _id = self.index_to_docstore_id[i]\n", + " doc = self.docstore.search(_id)\n", + " if not self.chunk_conent:\n", + " if not isinstance(doc, Document):\n", + " raise ValueError(f\"Could not find document for id {_id}, got {doc}\")\n", + " doc.metadata[\"score\"] = int(scores[0][j])\n", + " docs.append(doc)\n", + " continue\n", + " id_set.add(i)\n", + " docs_len = len(doc.page_content)\n", + " for k in range(1, max(i, store_len - i)):\n", + " break_flag = False\n", + " for l in [i + k, i - k]:\n", + " if 0 <= l < len(self.index_to_docstore_id):\n", + " _id0 = self.index_to_docstore_id[l]\n", + " doc0 = self.docstore.search(_id0)\n", + " if docs_len + len(doc0.page_content) > self.chunk_size:\n", + " break_flag = True\n", + " break\n", + " elif doc0.metadata[\"source\"] == doc.metadata[\"source\"]:\n", + " docs_len += len(doc0.page_content)\n", + " id_set.add(l)\n", + " if break_flag:\n", + " break\n", + " if not self.chunk_conent:\n", + " return docs\n", + " if len(id_set) == 0 
and self.score_threshold > 0:\n", + " return []\n", + " id_list = sorted(list(id_set))\n", + " id_lists = seperate_list(id_list)\n", + " for id_seq in id_lists:\n", + " for id in id_seq:\n", + " if id == id_seq[0]:\n", + " _id = self.index_to_docstore_id[id]\n", + " doc = self.docstore.search(_id)\n", + " else:\n", + " _id0 = self.index_to_docstore_id[id]\n", + " doc0 = self.docstore.search(_id0)\n", + " doc.page_content += \" \" + doc0.page_content\n", + " if not isinstance(doc, Document):\n", + " raise ValueError(f\"Could not find document for id {_id}, got {doc}\")\n", + " doc_score = min([scores[0][id] for id in [indices[0].tolist().index(i) for i in id_seq if i in indices[0]]])\n", + " doc.metadata[\"score\"] = int(doc_score)\n", + " docs.append((doc, doc_score))\n", + " return docs\n", + "\n", + "\n", + "if __name__ == '__main__':\n", + " # load docs\n", + " filepath = 'LLM_Survey_Chinese.pdf.txt'\n", + " # LLM name\n", + " LLM_TYPE = 'qwen'\n", + " # Embedding model name\n", + " EMBEDDING_MODEL = 'text2vec'\n", + " # Context-based prompt template; be sure to keep \"{question}\" and \"{context_str}\" in it\n", + " PROMPT_TEMPLATE = \"\"\"已知信息:\n", + " {context_str} \n", + " 根据上述已知信息,简洁和专业的来回答用户的问题。如果无法从中得到答案,请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”,不允许在答案中添加编造成分,答案请使用中文。 问题是:{question}\"\"\"\n", + " # Embedding running device\n", + " EMBEDDING_DEVICE = \"cuda\"\n", + " # return top-k text chunk from vector store\n", + " VECTOR_SEARCH_TOP_K = 3\n", + " # sentence size for splitting the text\n", + " SENTENCE_SIZE = 50\n", + " CHAIN_TYPE = 'stuff'\n", + " llm_model_dict = {\n", + " \"qwen\": Qwen,\n", + " }\n", + " embedding_model_dict = {\n", + " \"text2vec\": \"/tmp/GanymedeNil_text2vec-large-chinese\",\n", + " }\n", + " print(\"loading model start\")\n", + " llm = llm_model_dict[LLM_TYPE]()\n", + " embeddings = HuggingFaceEmbeddings(model_name=embedding_model_dict[EMBEDDING_MODEL], model_kwargs={'device': EMBEDDING_DEVICE})\n", + " print(\"loading model done\")\n", + "\n", + " print(\"loading documents start\")\n", + " docs = load_file(filepath, sentence_size=SENTENCE_SIZE)\n", + " print(\"loading documents done\")\n", + "\n", + " print(\"embedding start\")\n", + " docsearch = FAISSWrapper.from_documents(docs, embeddings)\n", + " print(\"embedding done\")\n", + "\n", + " print(\"loading qa start\")\n", + " prompt = PromptTemplate(\n", + " template=PROMPT_TEMPLATE, input_variables=[\"context_str\", \"question\"]\n", + " )\n", + "\n", + " chain_type_kwargs = {\"prompt\": prompt, \"document_variable_name\": \"context_str\"}\n", + " qa = RetrievalQA.from_chain_type(\n", + " llm=llm,\n", + " chain_type=CHAIN_TYPE, \n", + " retriever=docsearch.as_retriever(search_kwargs={\"k\": VECTOR_SEARCH_TOP_K}), \n", + " chain_type_kwargs=chain_type_kwargs)\n", + " print(\"loading qa done\")\n", + "\n", + " query = \"大模型指令微调有好的策略?\" \n", + " print(qa.run(query))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/finetune/ascend/README.md b/recipes/finetune/ascend/README.md new file mode 100644 index 0000000..a3e43b0 --- /dev/null +++ b/recipes/finetune/ascend/README.md @@ -0,0 +1,142 @@ +# Fine-tuning Qwen with Ascend NPUs +Below, we provide a simple example to show how to
fine-tune Qwen with Ascend NPUs. You can also refer to the official [mindformers](https://gitee.com/mindspore/mindformers/blob/dev/research/qwen/qwen.md) documentation for detailed usage. + +## Environment Requirements + +- Hardware: Ascend 910A/B + +## Quickstart + +1. Launch the Docker image + +```bash +ImageID=pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:qwen_v23.0.rc3 +docker run -it -u root --ipc=host \ +--device=/dev/davinci0 \ +--device=/dev/davinci1 \ +--device=/dev/davinci2 \ +--device=/dev/davinci3 \ +--device=/dev/davinci4 \ +--device=/dev/davinci5 \ +--device=/dev/davinci6 \ +--device=/dev/davinci7 \ +--device=/dev/davinci_manager \ +--device=/dev/devmm_svm \ +--device=/dev/hisi_hdc \ +-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ +-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \ +-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ +-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \ +-v /etc/ascend_install.info:/etc/ascend_install.info \ +-v /var/log/npu/:/usr/slog \ +-v /etc/hccn.conf:/etc/hccn.conf \ +${ImageID} /bin/bash +``` + +2. Download and convert the model + +- Download the model with ModelScope + +```bash +cd mindformers +python3 -c "from modelscope.hub.snapshot_download import snapshot_download; snapshot_download('Qwen/Qwen-7B-Chat', cache_dir='.', revision='master')" +``` + +- Convert the Hugging Face weights to MindSpore ckpt weights + +```bash +python research/qwen/convert_weight.py \ + --torch_ckpt_dir Qwen/Qwen-7B-Chat \ + --mindspore_ckpt_path qwen-7b-chat.ckpt + +mkdir -vp load_checkpoint/rank_0 +mv qwen-7b-chat.ckpt load_checkpoint/rank_0/ +``` + +3. Prepare the training data + +- Download the demo data + +```bash +wget -c https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/alpaca_data_min.json +``` + +- Convert the raw data to the specified format + +```bash +python research/qwen/alpaca_converter.py \ + --data_path alpaca_data_min.json \ + --output_path alpaca-data-conversation_min.json +``` + +- Generate MindRecord data + +```bash +python research/qwen/qwen_preprocess.py \ + --input_glob alpaca-data-conversation_min.json \ + --model_file Qwen/Qwen-7B-Chat/qwen.tiktoken \ + --seq_length 1024 \ + --output_file alpaca_min.mindrecord +``` + +4. Prepare RANK_TABLE_FILE + +```bash +# generate RANK_TABLE_FILE for 8 NPUs +python mindformers/tools/hccl_tools.py --device_num "[0,8)" +``` + +5. Fine-tune + +You need to replace RANK_TABLE_FILE with the file generated in step 4. + +```bash +export MS_ASCEND_CHECK_OVERFLOW_MODE=INFNAN_MODE +bash research/run_singlenode.sh "python3 research/qwen/run_qwen.py \ +--config research/qwen/run_qwen_7b.yaml \ +--load_checkpoint /mindformers/research/qwen/load_checkpoint \ +--vocab_file Qwen/Qwen-7B-Chat/qwen.tiktoken \ +--use_parallel True \ +--run_mode finetune \ +--auto_trans_ckpt True \ +--train_data alpaca_min.mindrecord" \ +RANK_TABLE_FILE [0,8] 8 +``` + +6. Merge model weights + +- Rename the model weights + +```bash +cd output/checkpoint_network +mv rank_0/qwen_rank_0-network.ckpt rank_0/checkpoint_0.ckpt +mv rank_1/qwen_rank_1-network.ckpt rank_1/checkpoint_1.ckpt +mv rank_2/qwen_rank_2-network.ckpt rank_2/checkpoint_2.ckpt +mv rank_3/qwen_rank_3-network.ckpt rank_3/checkpoint_3.ckpt +mv rank_4/qwen_rank_4-network.ckpt rank_4/checkpoint_4.ckpt +mv rank_5/qwen_rank_5-network.ckpt rank_5/checkpoint_5.ckpt +mv rank_6/qwen_rank_6-network.ckpt rank_6/checkpoint_6.ckpt +mv rank_7/qwen_rank_7-network.ckpt rank_7/checkpoint_7.ckpt +cd ../..
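# Note: the eight mv commands above can equivalently be written as a loop
# (run from inside output/checkpoint_network, i.e. before the `cd ../..`):
#   for i in $(seq 0 7); do
#     mv "rank_${i}/qwen_rank_${i}-network.ckpt" "rank_${i}/checkpoint_${i}.ckpt"
#   done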
+``` + +- Merge model weights + +```bash +python mindformers/tools/transform_ckpt.py \ + --src_ckpt_strategy output/strategy \ + --src_ckpt_dir output/checkpoint_network \ + --dst_ckpt_dir output/merged_model +``` + +7. Inference fine-tuned model + +```bash +python research/qwen/run_qwen.py \ + --config research/qwen/run_qwen_7b.yaml \ + --predict_data '比较适合深度学习入门的书籍有' \ + --run_mode predict \ + --load_checkpoint output/merged_model/rank_0/checkpoint_0.ckpt \ + --auto_trans_ckpt False \ + --device_id 0 +``` \ No newline at end of file diff --git a/recipes/finetune/deepspeed/finetune_fullparameter_multi_gpu.ipynb b/recipes/finetune/deepspeed/finetune_fullparameter_multi_gpu.ipynb new file mode 100644 index 0000000..c96ed37 --- /dev/null +++ b/recipes/finetune/deepspeed/finetune_fullparameter_multi_gpu.ipynb @@ -0,0 +1,213 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e6981ab-2d9a-4280-923f-235a166855ba", + "metadata": {}, + "source": [ + "# Fine-Tuning Qwen-Chat Large Language Model (Multiple GPUs)\n", + "\n", + "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using alignment mechanism.\n", + "\n", + "This notebook uses Qwen-1.8B-Chat as an example to introduce how to fine-tune the Qianwen model using Deepspeed.\n", + "\n", + "## Environment Requirements\n", + "\n", + "Please refer to **requirements.txt** to install the required dependencies.\n", + "\n", + "## Preparation\n", + "\n", + "### Download Qwen-1.8B-Chat\n", + "\n", + "First, download the model files. You can choose to download directly from ModelScope." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f", + "metadata": {}, + "source": [ + "### Download Example Training Data\n", + "\n", + "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n", + "\n", + "Disclaimer: the dataset can be only used for the research purpose." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json" + ] + }, + { + "cell_type": "markdown", + "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f", + "metadata": {}, + "source": [ + "You can also refer to this format to prepare the dataset. 
Below is a simple example list with 1 sample:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"我是一个语言模型,我叫通义千问。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "You can also use multi-turn conversations as the training set. Here is a simple example:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好,能告诉我遛狗的最佳时间吗?\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"当地最佳遛狗时间因地域差异而异,请问您所在的城市是哪里?\"\n", + " },\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"我在纽约市。\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间,因为这些时间段气温较低,遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "## Fine-Tune the Model\n", + "\n", + "You can directly run the prepared training script to fine-tune the model. **nproc_per_node** refers to the number of GPUs used for training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 ../../finetune.py \\\n", + " --model_name_or_path \"Qwen/Qwen-1_8B-Chat/\" \\\n", + " --data_path \"Belle_sampled_qwen.json\" \\\n", + " --bf16 True \\\n", + " --output_dir \"output_qwen\" \\\n", + " --num_train_epochs 5 \\\n", + " --per_device_train_batch_size 1 \\\n", + " --per_device_eval_batch_size 1 \\\n", + " --gradient_accumulation_steps 16 \\\n", + " --evaluation_strategy \"no\" \\\n", + " --save_strategy \"steps\" \\\n", + " --save_steps 1000 \\\n", + " --save_total_limit 10 \\\n", + " --learning_rate 1e-5 \\\n", + " --weight_decay 0.1 \\\n", + " --adam_beta2 0.95 \\\n", + " --warmup_ratio 0.01 \\\n", + " --lr_scheduler_type \"cosine\" \\\n", + " --logging_steps 1 \\\n", + " --report_to \"none\" \\\n", + " --model_max_length 512 \\\n", + " --gradient_checkpointing True \\\n", + " --lazy_preprocess True \\\n", + " --deepspeed \"../../finetune/ds_config_zero2.json\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test the Model\n", + "\n", + "We can test the model as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "from transformers.generation import GenerationConfig\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen\", trust_remote_code=True)\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " \"output_qwen\",\n", + " device_map=\"auto\",\n", + " trust_remote_code=True\n", + ").eval()\n", + "\n", + "response, history = model.chat(tokenizer, \"你好\", history=None)\n", + "print(response)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version":
"3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/finetune/deepspeed/finetune_fullparameter_single_gpu.ipynb b/recipes/finetune/deepspeed/finetune_fullparameter_single_gpu.ipynb new file mode 100644 index 0000000..a6f236d --- /dev/null +++ b/recipes/finetune/deepspeed/finetune_fullparameter_single_gpu.ipynb @@ -0,0 +1,234 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e6981ab-2d9a-4280-923f-235a166855ba", + "metadata": {}, + "source": [ + "# Fine-Tuning Qwen-Chat Large Language Model (Single GPU)\n", + "\n", + "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using alignment mechanism.\n", + "\n", + "This notebook uses Qwen-1.8B-Chat as an example to introduce how to fine-tune the Qianwen model using Deepspeed.\n", + "\n", + "## Environment Requirements\n", + "\n", + "Please refer to **requirements.txt** to install the required dependencies.\n", + "\n", + "## Preparation\n", + "\n", + "### Download Qwen-1.8B-Chat\n", + "\n", + "First, download the model files. You can choose to download directly from ModelScope." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "execution": { + "iopub.execute_input": "2023-12-31T03:19:11.059814Z", + "iopub.status.busy": "2023-12-31T03:19:11.059177Z", + "iopub.status.idle": "2023-12-31T03:21:54.157827Z", + "shell.execute_reply": "2023-12-31T03:21:54.157333Z", + "shell.execute_reply.started": "2023-12-31T03:19:11.059783Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f", + "metadata": {}, + "source": [ + "### Download Example Training Data\n", + "\n", + "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n", + "\n", + "Disclaimer: the dataset can be only used for the research purpose." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7", + "metadata": { + "execution": { + "iopub.execute_input": "2023-12-31T03:21:57.596577Z", + "iopub.status.busy": "2023-12-31T03:21:57.595847Z", + "iopub.status.idle": "2023-12-31T03:21:57.971112Z", + "shell.execute_reply": "2023-12-31T03:21:57.970576Z", + "shell.execute_reply.started": "2023-12-31T03:21:57.596555Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json" + ] + }, + { + "cell_type": "markdown", + "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f", + "metadata": {}, + "source": [ + "You can also refer to this format to prepare the dataset. 
Below is a simple example list with 1 sample:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"我是一个语言模型,我叫通义千问。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "You can also use multi-turn conversations as the training set. Here is a simple example:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好,能告诉我遛狗的最佳时间吗?\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"当地最佳遛狗时间因地域差异而异,请问您所在的城市是哪里?\"\n", + " },\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"我在纽约市。\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间,因为这些时间段气温较低,遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "\n", + "## Fine-Tune the Model\n", + "\n", + "You can directly run the prepared training script to fine-tune the model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "execution": { + "iopub.execute_input": "2023-12-31T03:23:52.455178Z", + "iopub.status.busy": "2023-12-31T03:23:52.454615Z", + "iopub.status.idle": "2023-12-31T03:24:15.699948Z", + "shell.execute_reply": "2023-12-31T03:24:15.699358Z", + "shell.execute_reply.started": "2023-12-31T03:23:52.455144Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!python ../../finetune.py \\\n", + " --model_name_or_path \"Qwen/Qwen-1_8B-Chat/\"\\\n", + " --data_path \"Belle_sampled_qwen.json\"\\\n", + " --bf16 \\\n", + " --output_dir \"output_qwen\" \\\n", + " --num_train_epochs 5 \\\n", + " --per_device_train_batch_size 1 \\\n", + " --per_device_eval_batch_size 1 \\\n", + " --gradient_accumulation_steps 16 \\\n", + " --evaluation_strategy \"no\" \\\n", + " --save_strategy \"steps\" \\\n", + " --save_steps 1000 \\\n", + " --save_total_limit 10 \\\n", + " --learning_rate 1e-5 \\\n", + " --weight_decay 0.1 \\\n", + " --adam_beta2 0.95 \\\n", + " --warmup_ratio 0.01 \\\n", + " --lr_scheduler_type \"cosine\" \\\n", + " --logging_steps 1 \\\n", + " --report_to \"none\" \\\n", + " --model_max_length 512 \\\n", + " --gradient_checkpointing \\\n", + " --lazy_preprocess" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test the Model\n", + "\n", + "We can test the model as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "from transformers.generation import GenerationConfig\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen\", trust_remote_code=True)\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " \"output_qwen\",\n", + " device_map=\"auto\",\n", + " trust_remote_code=True\n", + ").eval()\n", + "\n", + "response, history = model.chat(tokenizer, \"你好\", history=None)\n", + "print(response)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + 
"nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/finetune/deepspeed/finetune_lora_multi_gpu.ipynb b/recipes/finetune/deepspeed/finetune_lora_multi_gpu.ipynb new file mode 100644 index 0000000..bbceaa4 --- /dev/null +++ b/recipes/finetune/deepspeed/finetune_lora_multi_gpu.ipynb @@ -0,0 +1,267 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e6981ab-2d9a-4280-923f-235a166855ba", + "metadata": {}, + "source": [ + "# LoRA Fine-Tuning Qwen-Chat Large Language Model (Multiple GPUs)\n", + "\n", + "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using alignment mechanism.\n", + "\n", + "This notebook uses Qwen-1.8B-Chat as an example to introduce how to LoRA fine-tune the Qianwen model using Deepspeed.\n", + "\n", + "## Environment Requirements\n", + "\n", + "Please refer to **requirements.txt** to install the required dependencies.\n", + "\n", + "## Preparation\n", + "\n", + "### Download Qwen-1.8B-Chat\n", + "\n", + "First, download the model files. You can choose to download directly from ModelScope." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f", + "metadata": {}, + "source": [ + "### Download Example Training Data\n", + "\n", + "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n", + "\n", + "Disclaimer: the dataset can be only used for the research purpose." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json" + ] + }, + { + "cell_type": "markdown", + "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f", + "metadata": {}, + "source": [ + "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"我是一个语言模型,我叫通义千问。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "You can also use multi-turn conversations as the training set. 
Here is a simple example:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好,能告诉我遛狗的最佳时间吗?\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"当地最佳遛狗时间因地域差异而异,请问您所在的城市是哪里?\"\n", + " },\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"我在纽约市。\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间,因为这些时间段气温较低,遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "## Fine-Tune the Model\n", + "\n", + "You can directly run the prepared training script to fine-tune the model. **nproc_per_node** refers to the number of GPUs used for training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 ../../finetune.py \\\n", + " --model_name_or_path \"Qwen/Qwen-1_8B-Chat/\" \\\n", + " --data_path \"Belle_sampled_qwen.json\" \\\n", + " --bf16 True \\\n", + " --output_dir \"output_qwen\" \\\n", + " --num_train_epochs 5 \\\n", + " --per_device_train_batch_size 1 \\\n", + " --per_device_eval_batch_size 1 \\\n", + " --gradient_accumulation_steps 16 \\\n", + " --evaluation_strategy \"no\" \\\n", + " --save_strategy \"steps\" \\\n", + " --save_steps 1000 \\\n", + " --save_total_limit 10 \\\n", + " --learning_rate 1e-5 \\\n", + " --weight_decay 0.1 \\\n", + " --adam_beta2 0.95 \\\n", + " --warmup_ratio 0.01 \\\n", + " --lr_scheduler_type \"cosine\" \\\n", + " --logging_steps 1 \\\n", + " --report_to \"none\" \\\n", + " --model_max_length 512 \\\n", + " --gradient_checkpointing True \\\n", + " --lazy_preprocess True \\\n", + " --deepspeed \"../../finetune/ds_config_zero2.json\" \\\n", + " --use_lora" + ] + }, + { + "cell_type": "markdown", + "id": "35acf008-1dfe-4d32-8cf5-7022e042aadb", + "metadata": {}, + "source": [ + "## Merge Weights\n", + "\n", + "The training of both LoRA and Q-LoRA only saves the adapter parameters. You can load the fine-tuned model and merge weights as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61021499-4a44-45af-a682-943ed63c2fcb", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM\n", + "from peft import PeftModel\n", + "import torch\n", + "\n", + "model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-1_8B-Chat/\", torch_dtype=torch.float16, device_map=\"auto\", trust_remote_code=True)\n", + "model = PeftModel.from_pretrained(model, \"output_qwen/\")\n", + "merged_model = model.merge_and_unload()\n", + "merged_model.save_pretrained(\"output_qwen_merged\", max_shard_size=\"2048MB\", safe_serialization=True)" + ] + }, + { + "cell_type": "markdown", + "id": "0dfbd261-6451-4532-82e8-3ae19ed93ee1", + "metadata": {}, + "source": [ + "The tokenizer files are not saved in the new directory in this step.
You can copy the tokenizer files or use the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ddcba069-340b-4a93-a145-2028b425dd23", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\n", + " \"Qwen/Qwen-1_8B-Chat/\",\n", + " trust_remote_code=True\n", + ")\n", + "\n", + "tokenizer.save_pretrained(\"output_qwen_merged\")" + ] + }, + { + "cell_type": "markdown", + "id": "fe9f2878-79d3-4b1c-ba95-ac2f73aa6e1b", + "metadata": {}, + "source": [ + "## Test the Model\n", + "\n", + "After merging the weights, we can test the model as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "from transformers.generation import GenerationConfig\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen_merged\", trust_remote_code=True)\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " \"output_qwen_merged\",\n", + " device_map=\"auto\",\n", + " trust_remote_code=True\n", + ").eval()\n", + "\n", + "response, history = model.chat(tokenizer, \"你好\", history=None)\n", + "print(response)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/finetune/deepspeed/finetune_lora_single_gpu.ipynb b/recipes/finetune/deepspeed/finetune_lora_single_gpu.ipynb new file mode 100644 index 0000000..44bfda9 --- /dev/null +++ b/recipes/finetune/deepspeed/finetune_lora_single_gpu.ipynb @@ -0,0 +1,274 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e6981ab-2d9a-4280-923f-235a166855ba", + "metadata": {}, + "source": [ + "# LoRA Fine-Tuning Qwen-Chat Large Language Model (Single GPU)\n", + "\n", + "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using alignment mechanism.\n", + "\n", + "This notebook uses Qwen-1.8B-Chat as an example to introduce how to LoRA fine-tune the Qianwen model using Deepspeed.\n", + "\n", + "## Environment Requirements\n", + "\n", + "Please refer to **requirements.txt** to install the required dependencies.\n", + "\n", + "## Preparation\n", + "\n", + "### Download Qwen-1.8B-Chat\n", + "\n", + "First, download the model files. You can choose to download directly from ModelScope." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f", + "metadata": {}, + "source": [ + "### Download Example Training Data\n", + "\n", + "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n", + "\n", + "Disclaimer: the dataset can be only used for the research purpose." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json" + ] + }, + { + "cell_type": "markdown", + "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f", + "metadata": {}, + "source": [ + "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"我是一个语言模型,我叫通义千问。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "You can also use multi-turn conversations as the training set. Here is a simple example:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好,能告诉我遛狗的最佳时间吗?\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"当地最佳遛狗时间因地域差异而异,请问您所在的城市是哪里?\"\n", + " },\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"我在纽约市。\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间,因为这些时间段气温较低,遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "## Fine-Tune the Model\n", + "\n", + "You can directly run the prepared training script to fine-tune the model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!export CUDA_VISIBLE_DEVICES=0\n", + "!python ../../finetune.py \\\n", + " --model_name_or_path \"Qwen/Qwen-1_8B-Chat/\"\\\n", + " --data_path \"Belle_sampled_qwen.json\"\\\n", + " --bf16 \\\n", + " --output_dir \"output_qwen\" \\\n", + " --num_train_epochs 5 \\\n", + " --per_device_train_batch_size 1 \\\n", + " --per_device_eval_batch_size 1 \\\n", + " --gradient_accumulation_steps 16 \\\n", + " --evaluation_strategy \"no\" \\\n", + " --save_strategy \"steps\" \\\n", + " --save_steps 1000 \\\n", + " --save_total_limit 10 \\\n", + " --learning_rate 1e-5 \\\n", + " --weight_decay 0.1 \\\n", + " --adam_beta2 0.95 \\\n", + " --warmup_ratio 0.01 \\\n", + " --lr_scheduler_type \"cosine\" \\\n", + " --logging_steps 1 \\\n", + " --report_to \"none\" \\\n", + " --model_max_length 512 \\\n", + " --gradient_checkpointing \\\n", + " --lazy_preprocess \\\n", + " --use_lora" + ] + }, + { + "cell_type": "markdown", + "id": "5e6f28aa-1772-48ce-aa15-8cf29e7d67b5", + "metadata": {}, + "source": [ + "## Merge Weights\n", + "\n", + "The training of both LoRA and Q-LoRA only saves the adapter parameters. You can load the fine-tuned model and merge weights as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4fd5ef2a-34f9-4909-bebe-7b3b086fd16a", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM\n", + "from peft import PeftModel\n", + "import torch\n", + "\n", + "model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-1_8B-Chat/\", torch_dtype=torch.float16, device_map=\"auto\", trust_remote_code=True)\n", + "model = PeftModel.from_pretrained(model, \"output_qwen/\")\n", + "merged_model = model.merge_and_unload()\n", + "merged_model.save_pretrained(\"output_qwen_merged\", max_shard_size=\"2048MB\", safe_serialization=True)" + ] + }, + { + "cell_type": "markdown", + "id": "2e3f5b9f-63a1-4599-8d9b-a8d8f764838f", + "metadata": {}, + "source": [ + "The tokenizer files are not saved in the new directory in this step. 
You can copy the tokenizer files or use the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10fa5ea3-dd55-4901-86af-c045d4c56533", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\n", + " \"Qwen/Qwen-1_8B-Chat/\",\n", + " trust_remote_code=True\n", + ")\n", + "\n", + "tokenizer.save_pretrained(\"output_qwen_merged\")" + ] + }, + { + "cell_type": "markdown", + "id": "804b84d8", + "metadata": {}, + "source": [ + "## Test the Model\n", + "\n", + "After merging the weights, we can test the model as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "from transformers.generation import GenerationConfig\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen_merged\", trust_remote_code=True)\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " \"output_qwen_merged\",\n", + " device_map=\"auto\",\n", + " trust_remote_code=True\n", + ").eval()\n", + "\n", + "response, history = model.chat(tokenizer, \"你好\", history=None)\n", + "print(response)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/finetune/deepspeed/finetune_qlora_multi_gpu.ipynb b/recipes/finetune/deepspeed/finetune_qlora_multi_gpu.ipynb new file mode 100644 index 0000000..13808d5 --- /dev/null +++ b/recipes/finetune/deepspeed/finetune_qlora_multi_gpu.ipynb @@ -0,0 +1,282 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e6981ab-2d9a-4280-923f-235a166855ba", + "metadata": {}, + "source": [ + "# QLoRA Fine-Tuning Qwen-Chat Large Language Model (Multiple GPUs)\n", + "\n", + "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using alignment mechanism.\n", + "\n", + "This notebook uses Qwen-1.8B-Chat as an example to introduce how to QLoRA fine-tune the Qianwen model using Deepspeed.\n", + "\n", + "## Environment Requirements\n", + "\n", + "Please refer to **requirements.txt** to install the required dependencies.\n", + "\n", + "## Preparation\n", + "\n", + "### Download Qwen-1.8B-Chat\n", + "\n", + "First, download the model files. You can choose to download directly from ModelScope." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "execution": { + "iopub.execute_input": "2023-12-31T08:42:52.842315Z", + "iopub.status.busy": "2023-12-31T08:42:52.841665Z", + "iopub.status.idle": "2023-12-31T08:44:19.832661Z", + "shell.execute_reply": "2023-12-31T08:44:19.832193Z", + "shell.execute_reply.started": "2023-12-31T08:42:52.842295Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat-Int4', cache_dir='.', revision='master')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f", + "metadata": {}, + "source": [ + "### Download Example Training Data\n", + "\n", + "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n", + "\n", + "Disclaimer: the dataset can be only used for the research purpose." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json" + ] + }, + { + "cell_type": "markdown", + "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f", + "metadata": {}, + "source": [ + "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"我是一个语言模型,我叫通义千问。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "You can also use multi-turn conversations as the training set. Here is a simple example:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好,能告诉我遛狗的最佳时间吗?\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"当地最佳遛狗时间因地域差异而异,请问您所在的城市是哪里?\"\n", + " },\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"我在纽约市。\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间,因为这些时间段气温较低,遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "## Fine-Tune the Model\n", + "\n", + "You can directly run the prepared training script to fine-tune the model. **nproc_per_node** refers to the number of GPUs used fro training." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "execution": { + "iopub.execute_input": "2023-12-31T08:45:37.959631Z", + "iopub.status.busy": "2023-12-31T08:45:37.958961Z", + "iopub.status.idle": "2023-12-31T08:46:19.501657Z", + "shell.execute_reply": "2023-12-31T08:46:19.500854Z", + "shell.execute_reply.started": "2023-12-31T08:45:37.959609Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 ../../finetune.py \\\n", + " --model_name_or_path \"Qwen/Qwen-1_8B-Chat-Int4/\" \\\n", + " --data_path \"Belle_sampled_qwen.json\" \\\n", + " --bf16 True \\\n", + " --output_dir \"output_qwen\" \\\n", + " --num_train_epochs 5 \\\n", + " --per_device_train_batch_size 1 \\\n", + " --per_device_eval_batch_size 1 \\\n", + " --gradient_accumulation_steps 16 \\\n", + " --evaluation_strategy \"no\" \\\n", + " --save_strategy \"steps\" \\\n", + " --save_steps 1000 \\\n", + " --save_total_limit 10 \\\n", + " --learning_rate 1e-5 \\\n", + " --weight_decay 0.1 \\\n", + " --adam_beta2 0.95 \\\n", + " --warmup_ratio 0.01 \\\n", + " --lr_scheduler_type \"cosine\" \\\n", + " --logging_steps 1 \\\n", + " --report_to \"none\" \\\n", + " --model_max_length 512 \\\n", + " --gradient_checkpointing True \\\n", + " --lazy_preprocess True \\\n", + " --deepspeed \"../../finetune/ds_config_zero2.json\" \\\n", + " --use_lora \\\n", + " --q_lora" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Merge Weights\n", + "\n", + "The training of both LoRA and Q-LoRA only saves the adapter parameters. Note that you can not merge weights into quantized models. Instead, we can merge the weights based on the original chat model.\n", + "\n", + "You can load the fine-tuned model and merge weights as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')\n", + "\n", + "from transformers import AutoModelForCausalLM\n", + "from peft import PeftModel\n", + "import torch\n", + "\n", + "model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-1_8B-Chat/\", torch_dtype=torch.float16, device_map=\"auto\", trust_remote_code=True)\n", + "model = PeftModel.from_pretrained(model, \"output_qwen/\")\n", + "merged_model = model.merge_and_unload()\n", + "merged_model.save_pretrained(\"output_qwen_merged\", max_shard_size=\"2048MB\", safe_serialization=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The tokenizer files are not saved in the new directory in this step. 
You can copy the tokenizer files or use the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\n", + " \"Qwen/Qwen-1_8B-Chat-Int4/\",\n", + " trust_remote_code=True\n", + ")\n", + "\n", + "tokenizer.save_pretrained(\"output_qwen_merged\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test the Model\n", + "\n", + "After merging the weights, we can test the model as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "from transformers.generation import GenerationConfig\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen_merged\", trust_remote_code=True)\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " \"output_qwen_merged\",\n", + " device_map=\"auto\",\n", + " trust_remote_code=True\n", + ").eval()\n", + "\n", + "response, history = model.chat(tokenizer, \"你好\", history=None)\n", + "print(response)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/finetune/deepspeed/finetune_qlora_single_gpu.ipynb b/recipes/finetune/deepspeed/finetune_qlora_single_gpu.ipynb new file mode 100644 index 0000000..6a323b1 --- /dev/null +++ b/recipes/finetune/deepspeed/finetune_qlora_single_gpu.ipynb @@ -0,0 +1,283 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e6981ab-2d9a-4280-923f-235a166855ba", + "metadata": {}, + "source": [ + "# QLoRA Fine-Tuning Qwen-Chat Large Language Model (Single GPU)\n", + "\n", + "Tongyi Qianwen is a large language model developed by Alibaba Cloud based on the Transformer architecture, trained on an extensive set of pre-training data. The pre-training data is diverse and covers a wide range, including a large amount of internet text, specialized books, code, etc. In addition, an AI assistant called Qwen-Chat has been created based on the pre-trained model using alignment mechanism.\n", + "\n", + "This notebook uses Qwen-1.8B-Chat as an example to introduce how to QLoRA fine-tune the Qianwen model using Deepspeed.\n", + "\n", + "## Environment Requirements\n", + "\n", + "Please refer to **requirements.txt** to install the required dependencies.\n", + "\n", + "## Preparation\n", + "\n", + "### Download Qwen-1.8B-Chat\n", + "\n", + "First, download the model files. You can choose to download directly from ModelScope.\n", + "\n", + "Note that we use the Int4 version of the models for QLoRA training." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "248488f9-4a86-4f35-9d56-50f8e91a8f11", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat-Int4', cache_dir='.', revision='master')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "7b2a92b1-f08e-4413-9f92-8f23761e6e1f", + "metadata": {}, + "source": [ + "### Download Example Training Data\n", + "\n", + "Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).\n", + "\n", + "Disclaimer: the dataset can be only used for the research purpose." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce195f08-fbb2-470e-b6c0-9a03457458c7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json" + ] + }, + { + "cell_type": "markdown", + "id": "7226bed0-171b-4d45-a3f9-b3d81ec2bb9f", + "metadata": {}, + "source": [ + "You can also refer to this format to prepare the dataset. Below is a simple example list with 1 sample:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"我是一个语言模型,我叫通义千问。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "You can also use multi-turn conversations as the training set. Here is a simple example:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"id\": \"identity_0\",\n", + " \"conversations\": [\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"你好,能告诉我遛狗的最佳时间吗?\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"当地最佳遛狗时间因地域差异而异,请问您所在的城市是哪里?\"\n", + " },\n", + " {\n", + " \"from\": \"user\",\n", + " \"value\": \"我在纽约市。\"\n", + " },\n", + " {\n", + " \"from\": \"assistant\",\n", + " \"value\": \"纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间,因为这些时间段气温较低,遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。\"\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n", + "\n", + "## Fine-Tune the Model\n", + "\n", + "You can directly run the prepared training script to fine-tune the model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ab0581e-be85-45e6-a5b7-af9c42ea697b", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!python ../../finetune.py \\\n", + " --model_name_or_path \"Qwen/Qwen-1_8B-Chat-Int4/\"\\\n", + " --data_path \"Belle_sampled_qwen.json\"\\\n", + " --bf16 \\\n", + " --output_dir \"output_qwen\" \\\n", + " --num_train_epochs 5 \\\n", + " --per_device_train_batch_size 1 \\\n", + " --per_device_eval_batch_size 1 \\\n", + " --gradient_accumulation_steps 16 \\\n", + " --evaluation_strategy \"no\" \\\n", + " --save_strategy \"steps\" \\\n", + " --save_steps 1000 \\\n", + " --save_total_limit 10 \\\n", + " --learning_rate 1e-5 \\\n", + " --weight_decay 0.1 \\\n", + " --adam_beta2 0.95 \\\n", + " --warmup_ratio 0.01 \\\n", + " --lr_scheduler_type \"cosine\" \\\n", + " --logging_steps 1 \\\n", + " --report_to \"none\" \\\n", + " --model_max_length 512 \\\n", + " --gradient_checkpointing \\\n", + " --lazy_preprocess \\\n", + " --use_lora \\\n", + " --q_lora \\\n", + " --deepspeed \"../../finetune/ds_config_zero2.json\"" + ] + }, + { + "cell_type": "markdown", + "id": "0a50941d-3c3c-4ed2-9185-d4fe6172da2f", + "metadata": {}, + "source": [ + "## Merge Weights\n", + "\n", + "The training of both LoRA and Q-LoRA only saves the adapter parameters. Note that you can not merge weights into quantized models. Instead, we can merge the weights based on the original chat model.\n", + "\n", + "You can load the fine-tuned model and merge weights as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "909ff537-f851-488e-b1e8-1046f6852202", + "metadata": { + "ExecutionIndicator": { + "show": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from modelscope.hub.snapshot_download import snapshot_download\n", + "snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')\n", + "\n", + "from transformers import AutoModelForCausalLM\n", + "from peft import PeftModel\n", + "import torch\n", + "\n", + "model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-1_8B-Chat/\", torch_dtype=torch.float16, device_map=\"auto\", trust_remote_code=True)\n", + "model = PeftModel.from_pretrained(model, \"output_qwen/\")\n", + "merged_model = model.merge_and_unload()\n", + "merged_model.save_pretrained(\"output_qwen_merged\", max_shard_size=\"2048MB\", safe_serialization=True)" + ] + }, + { + "cell_type": "markdown", + "id": "7969df6e-ba8a-45f5-8b44-e1cbe74a8ef6", + "metadata": {}, + "source": [ + "The tokenizer files are not saved in the new directory in this step. 
You can copy the tokenizer files or use the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c01b6a3f-036f-4b7c-b5a6-76a7b6894d4e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\n", + " \"Qwen/Qwen-1_8B-Chat-Int4/\",\n", + " trust_remote_code=True\n", + ")\n", + "\n", + "tokenizer.save_pretrained(\"output_qwen_merged\")" + ] + }, + { + "cell_type": "markdown", + "id": "c2944b9b-89c7-4fb5-bd08-941d4706e943", + "metadata": {}, + "source": [ + "## Test the Model\n", + "\n", + "After merging the weights, we can test the model as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b77abbb1-5b29-4eb1-8a6c-e2e146b8d33d", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "from transformers.generation import GenerationConfig\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"output_qwen_merged\", trust_remote_code=True)\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " \"output_qwen_merged\",\n", + " device_map=\"auto\",\n", + " trust_remote_code=True\n", + ").eval()\n", + "\n", + "response, history = model.chat(tokenizer, \"你好\", history=None)\n", + "print(response)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/finetune/deepspeed/readme.md b/recipes/finetune/deepspeed/readme.md new file mode 100644 index 0000000..171c05b --- /dev/null +++ b/recipes/finetune/deepspeed/readme.md @@ -0,0 +1,416 @@ +# Fine-tuning Qwen Using Deepspeed + + +## TL;DR + +We provide the official training script `finetune.py` and serveral notebooks that can be leveraged for users to finetune pre-trained models for downstream applications in a simple fashion. The algorithms that we support include full-parameter fine-tuning, LoRA fine-tuning and Q-LoRA fine-tuning. Here is the matrix of our notebooks used in different settings: + +| Algorithm | Single GPU | Multiple GPUs| +| --- | --- | --- | +| Full-parameter Fine-tuning | [finetune_fullparameter_single_gpu](finetune_fullparameter_single_gpu.ipynb) | [finetune_fullparameter_multi_gpu](finetune_fullparameter_multi_gpu.ipynb) | +| LoRA Fine-tuning | [finetune_lora_single_gpu](finetune_lora_single_gpu.ipynb) | [finetune_lora_multi_gpu](finetune_lora_multi_gpu.ipynb) | +| Q-LoRA Fine-tuning | [finetune_qlora_single_gpu](finetune_qlora_single_gpu.ipynb) | [finetune_qlora_multi_gpu](finetune_qlora_multi_gpu.ipynb) | + +## Requirements + +### Environments + +The basic requirements for running Qwen models include: + +- python 3.8 and above +- pytorch 1.12 and above, 2.0 and above are recommended +- transformers 4.32 and above +- CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.) + +Our notebooks launch fine-tuning with DeepSpeed and Peft. +(Note: this may have conflicts with the latest version of pydantic and you should use make sure `pydantic<2.0`.) 
+You can install them by: +```bash +pip install peft deepspeed +``` + +### Settings and GPU Requirements + +We first provide the support matrix for different learning settings. Full-parameter fine-tuning requires updating all parameters in the whole training process. +In comparison with full-parameter fine-tuning, LoRA only updates the parameters of adapter layers but keeps the original large language model layers frozen. This allows much fewer memory costs and thus fewer computation costs. If you still suffer from insufficient memory, you can consider Q-LoRA, which uses the quantized large language model to allow even fewer memory costs. Generally, the GPU consumption rule for tuning Qwen is as follows: full parameter > full parameter (ZeRO2) > full parameter (ZeRO3) > LoRA > LoRA (ZeRO2) > LoRA (ZeRO3) > Q-LoRA > Q-LoRA (ZeRO2). + +| Setting | Full-parameter | LoRA | Q-LoRA | +| --- | --- | --- | --- | +| Base | Yes (up to ZeRO3) | Yes (up to ZeRO2) | No | +| Chat | Yes (up to ZeRO3) | Yes (up to ZeRO3) | No | +| Chat-Int4/8 | No | No | Yes | + +Here are some useful suggestions for choosing different fine-tuning settings based on GPU memory, espcially for users with GeForce RTX 3090/4090 (24GB) GPUs (or similar), and A100 (80GB) GPUs (or similar). In the experiments, we uniformly use a batch size of 1, gradient accumulation of 16, and max length of 512. Other parameters are set as the same shown in our notebooks. The results are as follows. + +| GPU Memory | Number of GPUs | Qwen-1.8B-Chat | Qwen-7B-Chat | Qwen-14B-Chat | Qwen-72B-Chat | +| --- | --- | --- | --- | --- | --- | +| 24GB | *1 | Full Parameter | LoRA | Q-LoRA | N/A | +| 24GB | *2 | Full Parameter | LoRA | Q-LoRA | N/A | +| 24GB | *4 | Full Parameter | LoRA | LoRA (w/ ZeRO3) | N/A | +| 80GB | *1 | Full Parameter | LoRA | LoRA | Q-LoRA | +| 80GB | *2 | Full Parameter | Full Parameter (w/ ZeRO3) | LoRA (w/ ZeRO2) | TBD | +| 80GB | *4 | Full Parameter | Full Parameter (w/ ZeRO2) | Full Parameter (w/ ZeRO3) | LoRA (w/ ZeRO3) | + +Using other configurations of LoRA/Q-LoRA and ZeRO stages will easily result in failures. + + +## Data Preparation + +To prepare your training data, you need to put all the samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list for conversation. Below is a simple example list with 1 sample: +```json +[ + { + "id": "identity_0", + "conversations": [ + { + "from": "user", + "value": "你好" + }, + { + "from": "assistant", + "value": "我是一个语言模型,我叫通义千问。" + } + ] + } +] +``` + +You can also use multi-turn conversations as the training set. 
Here is a simple example: + +```json +[ + { + "id": "identity_0", + "conversations": [ + { + "from": "user", + "value": "你好" + }, + { + "from": "assistant", + "value": "你好!我是一名AI助手,我叫通义千问,有需要请告诉我。" + }, + { + "from": "user", + "value": "你都能做什么" + }, + { + "from": "assistant", + "value": "我能做很多事情,包括但不限于回答各种领域的问题、提供实用建议和指导、进行多轮对话交流、文本生成等。" + } + ] + } +] +``` + + +## Single-GPU Training + +In the single-GPU training setting, we provide three notebooks: + +- [finetune_fullparameter_single_gpu](finetune_fullparameter_single_gpu.ipynb) +- [finetune_lora_single_gpu](finetune_lora_single_gpu.ipynb) +- [finetune_qlora_single_gpu](finetune_qlora_single_gpu.ipynb) + +### Full-parameter Fine-tuning + +To launch your training, run the following command (with hyper-parameter settings omitted): +```bash +python finetune.py \ + --model_name_or_path $MODEL \ + --data_path $DATA \ + --output_dir $OUTPUT +``` +Remember to specify the correct model name or path, the data path, as well as the output directory. + +### LoRA Fine-tuning + +Similarly, to run LoRA, use another notebook to run the command as shown below. Before you start, make sure that you have installed `peft`. Also, you need to specify your paths to your model, data, and output. We advise you to use absolute path for your pre-trained model. This is because LoRA only saves the adapter and the absolute path in the adapter configuration json file is used for finding out the pre-trained model to load. +```bash +python finetune.py \ + --model_name_or_path $MODEL \ + --data_path $DATA \ + --output_dir $OUTPUT \ + --use_lora +``` +Note that if you use LoRA to fine-tune the base language model, e.g., Qwen-7B, instead of chat models, e.g., Qwen-7B-Chat, the script automatically switches the embedding and output layer as trainable parameters. This is because the base language model has no knowledge of special tokens brought by ChatML format. Thus these layers should be updated for the model to understand and predict the tokens. Or in another word, if your training brings in special tokens in LoRA, you should set the layers to trainable parameters by setting `modules_to_save` inside the code. Check out the following code in the training script `finetune.py`: +```python +is_chat_model = 'chat' in model_args.model_name_or_path.lower() +if training_args.use_lora: + if lora_args.q_lora or is_chat_model: + modules_to_save = None + else: + modules_to_save = ["wte", "lm_head"] + lora_config = LoraConfig( + r=lora_args.lora_r, + lora_alpha=lora_args.lora_alpha, + target_modules=lora_args.lora_target_modules, + lora_dropout=lora_args.lora_dropout, + bias=lora_args.lora_bias, + task_type="CAUSAL_LM", + modules_to_save=modules_to_save # This argument serves for adding new tokens. + ) + ... + model = get_peft_model(model, lora_config) + ... +``` +Pay attention that the script relies on the model path to identify the model type, so please keep `chat` in the chat model paths. + + + +### Q-LoRA Fine-tuning + +To run single-GPU Q-LoRA training, you may need to install `mpi4py`. Directly run the following script: +```bash +python finetune.py \ + --model_name_or_path $MODEL \ + --data_path $DATA \ + --output_dir $OUTPUT \ + --use_lora \ + --q_lora \ + --deepspeed "ds_config_zero2.json" +``` + +For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. You **SHOULD NOT** use the bf16 models. Different from full-parameter fine-tuning and LoRA, only fp16 is supported for Q-LoRA. 
For single-GPU training, we have to use DeepSpeed for mixed-precision training due to our observation of errors caused by torch amp. Besides, for Q-LoRA, the troubles with the special tokens in LoRA still exist. However, as we only provide the Int4 models for chat models, which means the language model has learned the special tokens of ChatML format, you have no worry about the layers. Note that the layers of the Int4 model should not be trainable, and thus if you introduce special tokens in your training, Q-LoRA might not work. + + +In default, our notebooks provide training codes for Qwen-1.8B-Chat. +You can also run the training script to fine-tune other version of the Qwen-series models. We profile the GPU memory usage of all versions based on our notebooks (without changing any hyper-parameter settings) on a single A800 GPU (80GB). The statistics are listed below: + +| Training | Qwen-1.8B-Chat | Qwen-7B-Chat | Qwen-14B-Chat | Qwen-72B-Chat | +| --- | --- | --- | --- | --- | +| Full Parameter | 19.6GB | 76.8GB | OOM | OOM | +| LoRA | 7.4GB | 20.3GB | 34.2GB | OOM | +| Q-LoRA | 6.1GB | 12.5GB | 17.8GB | 61.9GB | + + +### Merging Weights from LoRA and Q-LoRA + + +#### Inference with Adapters + +Different from full-parameter fine-tuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. Suppose your training starts from Qwen-7B, you can load the fine-tuned model for inference as shown below: +```python +from peft import AutoPeftModelForCausalLM +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained( + path_to_adapter, # path to the output directory + trust_remote_code=True +) +model = AutoPeftModelForCausalLM.from_pretrained( + path_to_adapter, # path to the output directory + device_map="auto", + trust_remote_code=True +).eval() + +response, history = model.chat(tokenizer, "你好", history=None) +``` + +#### Inference with Merged Weights + +If you want to merge the adapters and save the fine-tuned model as a standalone model, take LoRA as an example, you can run the following codes: +```python +from peft import AutoPeftModelForCausalLM + +model = AutoPeftModelForCausalLM.from_pretrained( + path_to_adapter, # path to the output directory + device_map="auto", + trust_remote_code=True +).eval() + +merged_model = model.merge_and_unload() +# max_shard_size and safe serialization are not necessary. +# They respectively work for sharding checkpoint and save the model to safetensors. +merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True) +``` + +The `new_model_directory` directory will contain the merged model weights and module files. Please note that `*.cu` and `*.cpp` files may be missing in the saved files. If you wish to use the KV cache functionality, please manually copy them. Besides, the tokenizer files are not saved in the new directory in this step. 
You can copy the tokenizer files or use the following code: +```python +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained( + path_to_adapter, # path to the output directory + trust_remote_code=True +) +tokenizer.save_pretrained(new_model_directory) +``` +Next, the model with merged weights can be loaded by the following code: +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained(new_model_directory, trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained( + new_model_directory, + device_map="auto", + trust_remote_code=True +).eval() +response, history = model.chat(tokenizer, "你好", history=None) +``` + +Note that you can not merge weights into quantized models. Instead, we can merge the weights based on the original chat model. Take Qwen-7B-Chat-In4 as an example. +```python +from transformers import AutoModelForCausalLM +from peft import PeftModel +import torch + +# Here, we load the original Qwen-7B-Chat model, instead of the Qwen-7B-Chat-Int4 model. +model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", torch_dtype=torch.float16, device_map="auto", trust_remote_code=True) +# We merge the learned adapter to the Qwen-7B-Chat. +model = PeftModel.from_pretrained(model, path_to_adapter) +merged_model = model.merge_and_unload() +# We save the model to a new path. +merged_model.save_pretrained(path_to_new_model, max_shard_size="2048MB", safe_serialization=True) +``` + + +## Multi-GPU Training + +In the multi-GPU training setting, we provide three notebooks: + +- [finetune_fullparameter_multi_gpu](finetune_fullparameter_multi_gpu.ipynb) +- [finetune_lora_multi_gpu](finetune_lora_multi_gpu.ipynb) +- [finetune_qlora_multi_gpu](finetune_qlora_multi_gpu.ipynb) + +We use `torchrun` to launch the training job on multiple GPUs: + +```bash +# for full-parameter fine-tuning +torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 finetune.py \ + --model_name_or_path $MODEL \ + --data_path $DATA \ + --output_dir $OUTPUT \ + --deepspeed "ds_config_zero2.json" + +# for LoRA fine-tuning +torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 finetune.py \ + --model_name_or_path $MODEL \ + --data_path $DATA \ + --output_dir $OUTPUT \ + --deepspeed "ds_config_zero2.json" \ + --use_lora + +# for Q-LoRA fine-tuning +torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 finetune.py \ + --model_name_or_path $MODEL \ + --data_path $DATA \ + --output_dir $OUTPUT \ + --deepspeed "ds_config_zero2.json" \ + --use_lora \ + --q_lora +``` + +For multi-GPU training, you also need to specify proper hyperparameters for distributed training based on your machine. Besides, we advise you to specify your maximum sequence length with the argument `--model_max_length`, based on your consideration of data, memory footprint, and training speed. +For the usage of `torchrun` and distrubuted arguments, please refer to [here](https://pytorch.org/docs/stable/elastic/run.html). +Additionally, we find that there is a significant gap between the memory footprint of LoRA with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to LoRA fine-tune the chat models. Check the profile below for more information. + + +### Multi-node Fine-tuning + +Our provided scripts also support multi-node fine-tuning. 
You can refer to the comments in the scripts to set the corresponding arguments correctly and launch the script on each node. For more information about multi-node distributed training, please refer to [torchrun](https://pytorch.org/docs/stable/elastic/run.html).
+
+Note: DeepSpeed ZeRO 3 requires much higher inter-node communication bandwidth than ZeRO 2, which significantly reduces training speed in multi-node fine-tuning. Therefore, we do not recommend using DeepSpeed ZeRO 3 configurations in multi-node fine-tuning scripts.
+
+### Profiling of Memory and Speed
+
+We profile the GPU memory and training speed of both LoRA and Q-LoRA in the single-GPU training setup (LoRA (emb) refers to training the embedding and output layers, while plain LoRA has no trainable embedding or output layers). In this test, we experiment on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0, and flash attention 2 is applied. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 256, 512, 1024, 2048, 4096, and 8192. We also report the statistics of full-parameter fine-tuning with Qwen-7B on 2 A100 GPUs, for which only the 256, 512, and 1024 token lengths are reported due to the limitation of GPU memory.
+
+For Qwen-7B, we also test the performance of multi-node fine-tuning. We experiment using two servers, each with two A100-SXM4-80G GPUs; the rest of the configurations are the same as in the other Qwen-7B experiments. The results of multi-node fine-tuning are marked as LoRA (multinode) in the table.
+
+For Qwen-72B, we experiment in two ways: 1) LoRA fine-tuning + DeepSpeed ZeRO 3 on 4 A100-SXM4-80G GPUs and 2) Q-LoRA (Int4) fine-tuning on a single A100-SXM4-80G GPU. Note that OOM occurs on 4 A100-SXM4-80G GPUs both with LoRA (emb) fine-tuning and with LoRA fine-tuning without DeepSpeed ZeRO 3 (you can pass `--deepspeed ds_config_zero3.json` to `finetune_lora_ds.sh` to enable DeepSpeed ZeRO 3).
+
+The statistics are listed below:
+
+| Model Size | Method | #Nodes | #GPUs per node | Seq. 256 | Seq. 512 | Seq. 1024 | Seq. 2048 | Seq. 4096 | Seq. 8192 |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| 1.8B | LoRA | 1 | 1 | 6.7G / 1.0s/it | 7.4G / 1.0s/it | 8.4G / 1.1s/it | 11.0G / 1.7s/it | 16.2G / 3.3s/it | 21.8G / 6.8s/it |
+| 1.8B | LoRA (emb) | 1 | 1 | 13.7G / 1.0s/it | 14.0G / 1.0s/it | 14.0G / 1.1s/it | 15.1G / 1.8s/it | 19.7G / 3.4s/it | 27.7G / 7.0s/it |
+| 1.8B | Q-LoRA | 1 | 1 | 5.8G / 1.4s/it | 6.0G / 1.4s/it | 6.6G / 1.4s/it | 7.8G / 2.0s/it | 10.2G / 3.4s/it | 15.8G / 6.5s/it |
+| 1.8B | Full-parameter | 1 | 1 | 43.5G / 2.1s/it | 43.5G / 2.2s/it | 43.5G / 2.2s/it | 43.5G / 2.3s/it | 47.1G / 2.8s/it | 48.3G / 5.6s/it |
+| 7B | LoRA | 1 | 1 | 20.1G / 1.2s/it | 20.4G / 1.5s/it | 21.5G / 2.8s/it | 23.8G / 5.2s/it | 29.7G / 10.1s/it | 36.6G / 21.3s/it |
+| 7B | LoRA (emb) | 1 | 1 | 33.7G / 1.4s/it | 34.1G / 1.6s/it | 35.2G / 2.9s/it | 35.1G / 5.3s/it | 39.2G / 10.3s/it | 48.5G / 21.7s/it |
+| 7B | Q-LoRA | 1 | 1 | 11.5G / 3.0s/it | 11.5G / 3.0s/it | 12.3G / 3.5s/it | 13.9G / 7.0s/it | 16.9G / 11.6s/it | 23.5G / 22.3s/it |
+| 7B | Full-parameter | 1 | 2 | 139.2G / 4.0s/it | 148.0G / 4.0s/it | 162.0G / 4.5s/it | - | - | - |
+| 7B | LoRA (multinode) | 2 | 2 | 74.7G / 2.09s/it | 77.6G / 3.16s/it | 84.9G / 5.17s/it | 95.1G / 9.25s/it | 121.1G / 18.1s/it | 155.5G / 37.4s/it |
+| 14B | LoRA | 1 | 1 | 34.6G / 1.6s/it | 35.1G / 2.4s/it | 35.3G / 4.4s/it | 37.4G / 8.4s/it | 42.5G / 17.0s/it | 55.2G / 36.0s/it |
+| 14B | LoRA (emb) | 1 | 1 | 51.2G / 1.7s/it | 51.1G / 2.6s/it | 51.5G / 4.6s/it | 54.1G / 8.6s/it | 56.8G / 17.2s/it | 67.7G / 36.3s/it |
+| 14B | Q-LoRA | 1 | 1 | 18.7G / 5.3s/it | 18.4G / 6.3s/it | 18.9G / 8.2s/it | 19.9G / 11.8s/it | 23.0G / 20.1s/it | 27.9G / 38.3s/it |
+| 72B | LoRA + DeepSpeed ZeRO3 | 1 | 4 | 215.4G / 17.6s/it | 217.7G / 20.5s/it | 222.6G / 29.4s/it | 228.8G / 45.7s/it | 249.0G / 83.4s/it | 289.2G / 161.5s/it |
+| 72B | Q-LoRA | 1 | 1 | 61.4G / 27.4s/it | 61.4G / 31.5s/it | 62.9G / 41.4s/it | 64.1G / 59.5s/it | 68.0G / 97.7s/it | 75.6G / 179.8s/it |
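+
+If you want to sanity-check peak-memory numbers like these on your own hardware, one rough approach (a minimal sketch, not the exact methodology behind the table above) is to query PyTorch's CUDA allocator after a few optimization steps, for example from a callback in your own training loop; the helper name below is purely illustrative.
+
+```python
+import torch
+
+def report_peak_gpu_memory(tag: str = "") -> None:
+    # Print the peak CUDA memory observed so far on each visible GPU, in GiB.
+    for i in range(torch.cuda.device_count()):
+        allocated = torch.cuda.max_memory_allocated(i) / 1024 ** 3
+        reserved = torch.cuda.max_memory_reserved(i) / 1024 ** 3
+        print(f"{tag} cuda:{i}: peak allocated {allocated:.1f} GiB, peak reserved {reserved:.1f} GiB")
+
+# For example, call it once a few training iterations have completed:
+report_peak_gpu_memory("after warm-up")
+```
+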
+In the event of a network issue while attempting to download model checkpoints and codes from HuggingFace, an alternative approach is to initially fetch the checkpoint from ModelScope and then load it from the local directory as outlined below: +
+ +```python +from modelscope import snapshot_download +from transformers import AutoModelForCausalLM, AutoTokenizer + +# Downloading model checkpoint to a local dir model_dir +# model_dir = snapshot_download('qwen/Qwen-7B') +# model_dir = snapshot_download('qwen/Qwen-7B-Chat') +# model_dir = snapshot_download('qwen/Qwen-14B') +model_dir = snapshot_download('qwen/Qwen-14B-Chat') + +# Loading local checkpoints +# trust_remote_code is still set as True since we still load codes from local dir instead of transformers +tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained( + model_dir, + device_map="auto", + trust_remote_code=True +).eval() +``` + +## 🤖 ModelScope + +ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below: + +```python +from modelscope import AutoModelForCausalLM, AutoTokenizer +from modelscope import GenerationConfig + +# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat" +tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval() +model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参 + +response, history = model.chat(tokenizer, "你好", history=None) +print(response) +response, history = model.chat(tokenizer, "浙江的省会在哪里?", history=history) +print(response) +response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history) +print(response) +``` + +## Batch Inference +Qwen supports batch inference. With flash attention enabled, using batch inference can bring a 40% speedup. 
The example code is shown below: +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers import GenerationConfig +from qwen_generation_utils import make_context, decode_tokens, get_stop_words_ids + +tokenizer = AutoTokenizer.from_pretrained( + './', + pad_token='<|extra_0|>', + eos_token='<|endoftext|>', + padding_side='left', + trust_remote_code=True +) +model = AutoModelForCausalLM.from_pretrained( + './', + pad_token_id=tokenizer.pad_token_id, + device_map="auto", + trust_remote_code=True +).eval() +model.generation_config = GenerationConfig.from_pretrained('./', pad_token_id=tokenizer.pad_token_id) + +all_raw_text = ["我想听你说爱我。", "今天我想吃点啥,甜甜的,推荐下", "我马上迟到了,怎么做才能不迟到"] +batch_raw_text = [] +for q in all_raw_text: + raw_text, _ = make_context( + tokenizer, + q, + system="You are a helpful assistant.", + max_window_size=model.generation_config.max_window_size, + chat_format=model.generation_config.chat_format, + ) + batch_raw_text.append(raw_text) + +batch_input_ids = tokenizer(batch_raw_text, padding='longest') +batch_input_ids = torch.LongTensor(batch_input_ids['input_ids']).to(model.device) +batch_out_ids = model.generate( + batch_input_ids, + return_dict_in_generate=False, + generation_config=model.generation_config +) +padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))] + +batch_response = [ + decode_tokens( + batch_out_ids[i][padding_lens[i]:], + tokenizer, + raw_text_len=len(batch_raw_text[i]), + context_length=(batch_input_ids[i].size(0)-padding_lens[i]), + chat_format="chatml", + verbose=False, + errors='replace' + ) for i in range(len(all_raw_text)) +] +print(batch_response) + +response, _ = model.chat(tokenizer, "我想听你说爱我。", history=None) +print(response) + +response, _ = model.chat(tokenizer, "今天我想吃点啥,甜甜的,推荐下", history=None) +print(response) + +response, _ = model.chat(tokenizer, "我马上迟到了,怎么做才能不迟到", history=None) +print(response) +``` + +## CPU + +To deploy our models on CPU, we strongly advise you to use [qwen.cpp](https://github.com/QwenLM/qwen.cpp), which is a pure C++ implementation of Qwen and tiktoken. Check the repo for more details! + +Also, it is also simple to directly run the model on CPU, which requires your specification of device: + +```python +model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval() +``` + +However, it is likely that you suffer from extremely low inference efficiency. + +## Multiple GPUs + +If you suffer from lack of GPU memory and you would like to run the model on more than 1 GPU, you can directly use the default loading method, which is now supported by Transformers. The previous method based on `utils.py` is deprecated. + +However, though this method is simple, the efficiency of the native pipeline parallelism is low. We advise you to use vLLM with FastChat and please read [the section](../vllm/README.md) for deployment. \ No newline at end of file diff --git a/recipes/inference/quantization/README.md b/recipes/inference/quantization/README.md new file mode 100644 index 0000000..f95e44a --- /dev/null +++ b/recipes/inference/quantization/README.md @@ -0,0 +1,113 @@ +# Quantization + +## GPTQ + +We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release the Int4 and Int8 quantized models, which achieve nearly lossless model effects but improved performance on both memory costs and inference speed. 
+ +Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages: + +```bash +pip install auto-gptq optimum +``` + +If you meet problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel. + +> Note: The pre-compiled `auto-gptq` packages strongly depend on the version of `torch` and its CUDA version. Moreover, due to recent update, +> you may also encounter unsupported version errors from `transformers`, `optimum`, or `peft`. +> We recommend using the latest versions meeting the following requirements: +> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1 +> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0 + +Then you can load the quantized model easily and run inference as same as usual: + +```python +# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4" +model = AutoModelForCausalLM.from_pretrained( + "Qwen/Qwen-7B-Chat-Int4", + device_map="auto", + trust_remote_code=True +).eval() +response, history = model.chat(tokenizer, "Hi", history=None) +``` + +We illustrate the model performance of both BF16, Int8 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below: + +| Quantization | MMLU | CEval (val) | GSM8K | Humaneval | +|----------------------|:----:|:-----------:|:-----:|:---------:| +| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 | +| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 | +| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 | +| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 | +| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 | +| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 | +| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 | +| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 | +| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 | +| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 | +| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 | +| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 | + +## Quantization of KV cache + +> NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality +> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_256.cu`) may be missing. Please manually download +> them from the Hugging Face Hub and place them into the same folder as the other module files. + +The attention KV cache can be quantized and compressed for storage, to get a higher sample throughput. The arguments `use_cache_quantization` and `use_cache_kernel` in `config.json` are provided to enable KV cache quantization. The specific use method is as follows: +```python +model = AutoModelForCausalLM.from_pretrained( + "Qwen/Qwen-7B-Chat", + device_map="auto", + trust_remote_code=True, + use_cache_quantization=True, + use_cache_kernel=True, + use_flash_attn=False +) +``` +Attention: Currently, KV cache quantization and flash attention cannot be used at the same time. +If you enable KV cache quantization and flash attention at the same time (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), `use_flash_attn` is disabled by default (`use_flash_attn=false`). 
+ +We have verified that the use of the quantized Int8-KV-Cache model does not suffer from significant performance degradation in downstream evaluation. In the following, we focus on profiling its memory footprint in different conditions. +The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. +We use BF16 models to generate 1024 tokens by default, and "OOM" indicates out-of-memory error. + +With KV cache quantization, the model can infer with a larger batch size (bs). + +| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 | +|--------------|:------:|:------:|:------:|:------:|:------:|:------:| +| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | OOM | OOM | +| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB | + +With KV cache quantization the model can save more memory when generating longer sequence (`sl`, sequence length, referring to the number of tokens generated) at the stage of inference. + +| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 | +|--------------|:------:|:-------:|:-------:|:-------:|:-------:| +| No | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB | +| Yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB | + +The model with KV cache quantization will convert the format of `layer_past` from float to int8, and meanwhile the quantized `layer-past` will also store the quantization parameters. + +Specific steps are as follows: + +1. Quantize key/value +``` + qv,scale,zero_point=quantize_cache_v(v) +``` +2. Store into layer_past + +The following is the format of quantized `layer_past`: +``` + layer_past=((q_key,key_scale,key_zero_point), + (q_value,value_scale,value_zero_point)) +``` + +The original format of `layer_past` is shown below: +``` + layer_past=(key,value) +``` + +If you want to use the attention KV which is quantized, you can use the dequantization operation to convert the Int8 key/value back to the float format as follows: +``` + v=dequantize_cache_torch(qv,scale,zero_point) +``` +
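+
+To make the storage layout above concrete, here is a toy, pure-PyTorch sketch of asymmetric per-tensor 8-bit quantization and dequantization of key/value tensors. The helper names `quantize_cache_toy` and `dequantize_cache_toy` are hypothetical and only illustrate the `(q, scale, zero_point)` format; the real `quantize_cache_v` and `dequantize_cache_torch` ship with the model's remote code and rely on the CUDA kernels mentioned above.
+
+```python
+import torch
+
+def quantize_cache_toy(x: torch.Tensor):
+    # Map a float tensor to uint8 plus a per-tensor scale and zero point.
+    x_min, x_max = x.min(), x.max()
+    scale = (x_max - x_min).clamp(min=1e-8) / 255.0
+    zero_point = torch.round(-x_min / scale)
+    q = torch.clamp(torch.round(x / scale + zero_point), 0, 255).to(torch.uint8)
+    return q, scale, zero_point
+
+def dequantize_cache_toy(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor) -> torch.Tensor:
+    # Convert the quantized cache back to float when attention needs it.
+    return (q.to(torch.float32) - zero_point) * scale
+
+# Toy key/value tensors shaped (batch, heads, seq_len, head_dim).
+key = torch.randn(1, 32, 64, 128)
+value = torch.randn(1, 32, 64, 128)
+
+q_key, key_scale, key_zero_point = quantize_cache_toy(key)
+q_value, value_scale, value_zero_point = quantize_cache_toy(value)
+
+# The quantized layer_past layout described above.
+layer_past = ((q_key, key_scale, key_zero_point),
+              (q_value, value_scale, value_zero_point))
+
+# Recover an approximate float value tensor on demand.
+value_restored = dequantize_cache_toy(*layer_past[1])
+print("max abs reconstruction error:", (value - value_restored).abs().max().item())
+```
+
+Keeping the scale and zero point next to each quantized tensor is what allows the cache to be dequantized on the fly without retaining any float copy.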