diff --git a/README.md b/README.md
index ae8ec89..db5e588 100644
--- a/README.md
+++ b/README.md
@@ -17,8 +17,8 @@
 |     | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
 |-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
-| 7B | 🤖 🤗 | 🤖 🤗 | 🤗 | 🤖 🤗 |
-| 14B | 🤖 🤗 | 🤖 🤗 | 🤗 | 🤖 🤗 |
+| 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
+| 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
@@ -205,10 +205,10 @@ from modelscope import snapshot_download
 from transformers import AutoModelForCausalLM, AutoTokenizer
 # Downloading model checkpoint to a local dir model_dir
-# model_dir = snapshot_download('qwen/Qwen-7B', revision='v1.1.4')
-# model_dir = snapshot_download('qwen/Qwen-7B-Chat', revision='v1.1.4')
-# model_dir = snapshot_download('qwen/Qwen-14B', revision='v1.0.4')
-model_dir = snapshot_download('qwen/Qwen-14B-Chat', revision='v1.0.4')
+# model_dir = snapshot_download('qwen/Qwen-7B')
+# model_dir = snapshot_download('qwen/Qwen-7B-Chat')
+# model_dir = snapshot_download('qwen/Qwen-14B')
+model_dir = snapshot_download('qwen/Qwen-14B-Chat')
 # Loading local checkpoints
 # trust_remote_code is still set as True since we still load codes from local dir instead of transformers
@@ -229,9 +229,9 @@ from modelscope import AutoModelForCausalLM, AutoTokenizer
 from modelscope import GenerationConfig
 # Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
-tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
-model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
+tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
+model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
 response, history = model.chat(tokenizer, "你好", history=None)
 print(response)
@@ -365,6 +365,11 @@ We illustrate the model performance of both BF16, Int8 and Int4 models on the be
 | Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
 ### Quantization of KV cache
+
+> NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality
+> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_245.cu`) may be missing. Please manually download
+> them from the Hugging Face Hub and place them into the same folder as the other module files.
+
 Attention KV cache can be quantized and compressed for storage, to get a higher sample throughput. The parameters of 'use_cache_quantization' and 'use_cache_kernel' are provided to control kv-cache-quantization behavior.
 When use_cache_quantization=True and use_cache_kernel=True, kv-cache-quantization will be enabled.
 The specific use method is as follows:
@@ -610,6 +615,9 @@ sh finetune/finetune_qlora_ds.sh
 For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. You **SHOULD NOT** use the bf16 models. Different from full-parameter finetuning and LoRA, only fp16 is supported for Q-LoRA. For single-GPU training, we have to use deepspeed for mixed-precision training due to our observation of errors caused by torch amp. Besides, for Q-LoRA, the troubles with the special tokens in LoRA still exist. However, as we only provide the Int4 models for chat models, which means the language model has learned the special tokens of ChatML format, you have no worry about the layers. Note that the layers of the Int4 model should not be trainable, and thus if you introduce special tokens in your training, Q-LoRA might not work.
+> NOTE: Please be aware that due to the internal mechanisms of Hugging Face, certain non-Python files (e.g., `*.cpp` and `*.cu`)
+> may be missing from the saved checkpoint. You may need to manually copy them to the directory containing the other files.
+
 Different from full-parameter finetuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. Suppose your training starts from Qwen-7B, you can load the finetuned model for inference as shown below:
 ```python
@@ -639,6 +647,19 @@ merged_model = model.merge_and_unload()
 merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
 ```
+The `new_model_directory` directory will contain the merged model weights and module files. Please note that `*.cu` and `*.cpp` files may be missing from the saved files. If you wish to use the KV cache functionality, please manually copy them. Besides, the tokenizer files are not saved in the new directory in this step. You can copy the tokenizer files or use the following code:
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(
+    path_to_adapter, # path to the output directory
+    trust_remote_code=True
+)
+
+tokenizer.save_pretrained(new_model_directory)
+```
+
+
 Note: For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine. Besides, we advise you to specify your maximum sequence length with the argument `--model_max_length`, based on your consideration of data, memory footprint, and training speed.
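The KV-cache hunk in README.md above stops at "The specific use method is as follows:", so the actual usage snippet falls outside the diff context. For reference, here is a minimal sketch of how the `use_cache_quantization` and `use_cache_kernel` flags described in that section are passed at load time; the exact keyword arguments and the `use_flash_attn=False` setting are assumptions based on that description, not part of this change.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed usage: these keyword arguments are consumed by Qwen's remote modeling code
# (loaded via trust_remote_code=True), as described in the KV-cache section above.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,  # quantize the attention KV cache at inference time
    use_cache_kernel=True,        # use the cache_autogptq_cuda_* kernels mentioned in the NOTE above
    use_flash_attn=False,         # assumption: flash attention is disabled when the quantized cache is used
).eval()

response, history = model.chat(tokenizer, "Hi", history=None)
print(response)
```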
diff --git a/README_CN.md b/README_CN.md
index a38db3a..f41acdf 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -17,8 +17,8 @@
 |     | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
 |-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
-| 7B | 🤖 🤗 | 🤖 🤗 | 🤗 | 🤖 🤗 |
-| 14B | 🤖 🤗 | 🤖 🤗 | 🤗 | 🤖 🤗 |
+| 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
+| 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
 我们开源了**Qwen**(通义千问)系列工作,当前开源模型的参数规模为70亿(7B)和140亿(14B)。本次开源包括基础模型**Qwen**,即**Qwen-7B**和**Qwen-14B**,以及对话模型**Qwen-Chat**,即**Qwen-7B-Chat**和**Qwen-14B-Chat**。模型链接在表格中,请点击了解详情。同时,我们公开了我们的技术报告,请点击上方论文链接查看。
@@ -196,10 +196,10 @@ from modelscope import snapshot_download
 from transformers import AutoModelForCausalLM, AutoTokenizer
 # Downloading model checkpoint to a local dir model_dir
-# model_dir = snapshot_download('qwen/Qwen-7B', revision='v1.1.4')
-# model_dir = snapshot_download('qwen/Qwen-7B-Chat', revision='v1.1.4')
-# model_dir = snapshot_download('qwen/Qwen-14B', revision='v1.0.4')
-model_dir = snapshot_download('qwen/Qwen-14B-Chat', revision='v1.0.4')
+# model_dir = snapshot_download('qwen/Qwen-7B')
+# model_dir = snapshot_download('qwen/Qwen-7B-Chat')
+# model_dir = snapshot_download('qwen/Qwen-14B')
+model_dir = snapshot_download('qwen/Qwen-14B-Chat')
 # Loading local checkpoints
 # trust_remote_code is still set as True since we still load codes from local dir instead of transformers
@@ -220,9 +220,9 @@ from modelscope import AutoModelForCausalLM, AutoTokenizer
 from modelscope import GenerationConfig
 # 可选的模型包括: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
-tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
-model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
+tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
+model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
 response, history = model.chat(tokenizer, "你好", history=None)
 print(response)
@@ -360,6 +360,8 @@ response, history = model.chat(tokenizer, "Hi", history=None)
 ### KV cache量化
+> 注意:由于Hugging Face的内部实现,本功能的支持文件`cache_autogptq_cuda_256.cpp`与`cache_autogptq_cuda_kernel_245.cu`可能未被下载。如需开启使用,请手动从Hugging Face Hub下载,并放置到其他模块文件所在的文件夹中。
+
 在模型infer时,可以将中间结果key以及value的值量化后压缩存储,这样便可以在相同的卡上存储更多的key以及value,增加样本吞吐。
 提供use_cache_quantization以及use_cache_kernel两个参数对模型控制,当use_cache_quantization以及use_cache_kernel均开启时,将启动kv-cache量化的功能。具体使用如下:
@@ -594,6 +596,8 @@ sh finetune/finetune_qlora_ds.sh
 我们建议你使用我们提供的Int4量化模型进行训练,即Qwen-7B-Chat-Int4。请**不要使用**非量化模型!与全参数微调以及LoRA不同,Q-LoRA仅支持fp16。注意,由于我们发现torch amp支持的fp16混合精度训练存在问题,因此当前的单卡训练Q-LoRA必须使用DeepSpeed。此外,上述LoRA关于特殊token的问题在Q-LoRA依然存在。并且,Int4模型的参数无法被设为可训练的参数。所幸的是,我们只提供了Chat模型的Int4模型,因此你不用担心这个问题。但是,如果你执意要在Q-LoRA中引入新的特殊token,很抱歉,我们无法保证你能成功训练。
+> 注意:由于Hugging Face的内部实现,模型在保存时,一些非Python文件(例如`*.cpp`与`*.cu`)可能未被保存,如需使用相关功能,请手动复制有关文件。
+
 与全参数微调不同,LoRA和Q-LoRA的训练只需存储adapter部分的参数。假如你需要使用LoRA训练后的模型,你需要使用如下方法。假设你使用Qwen-7B训练模型,你可以用如下代码读取模型:
 ```python
@@ -623,6 +627,17 @@ merged_model = model.merge_and_unload()
 merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
 ```
+`new_model_directory`目录将包含合并后的模型参数与相关模型代码。请注意`*.cu`和`*.cpp`文件可能未被保存,如需使用KV cache功能,请手动复制。另外,这一步仅保存模型,并未保存tokenizer,如有需要,请复制相关文件或使用以下代码保存:
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(
+    path_to_adapter, # path to the output directory
+    trust_remote_code=True
+)
+tokenizer.save_pretrained(new_model_directory)
+```
+
+
 注意:分布式训练需要根据你的需求和机器指定正确的分布式训练超参数。此外,你需要根据你的数据、显存情况和训练速度预期,使用`--model_max_length`设定你的数据长度。
 ### 显存占用及训练速度
diff --git a/README_FR.md b/README_FR.md
index 45ca203..e5d0d14 100644
--- a/README_FR.md
+++ b/README_FR.md
@@ -17,8 +17,8 @@
 |     | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
 |-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
-| 7B | 🤖 🤗 | 🤖 🤗 | 🤗 | 🤖 🤗 |
-| 14B | 🤖 🤗 | 🤖 🤗 | 🤗 | 🤖 🤗 |
+| 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
+| 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
@@ -205,10 +205,10 @@ from modelscope import snapshot_download
 from transformers import AutoModelForCausalLM, AutoTokenizer
 # Downloading model checkpoint to a local dir model_dir
-# model_dir = snapshot_download('qwen/Qwen-7B', revision='v1.1.4')
-# model_dir = snapshot_download('qwen/Qwen-7B-Chat', revision='v1.1.4')
-# model_dir = snapshot_download('qwen/Qwen-14B', revision='v1.0.4')
-model_dir = snapshot_download('qwen/Qwen-14B-Chat', revision='v1.0.4')
+# model_dir = snapshot_download('qwen/Qwen-7B')
+# model_dir = snapshot_download('qwen/Qwen-7B-Chat')
+# model_dir = snapshot_download('qwen/Qwen-14B')
+model_dir = snapshot_download('qwen/Qwen-14B-Chat')
 # Loading local checkpoints
 # trust_remote_code is still set as True since we still load codes from local dir instead of transformers
@@ -229,9 +229,9 @@ from modelscope import AutoModelForCausalLM, AutoTokenizer
 from modelscope import GenerationConfig
 # Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
-tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
-model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
+tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
+model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
 response, history = model.chat(tokenizer, "你好", history=None)
 print(response)
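In the ModelScope hunks repeated above, the "# Loading local checkpoints" comment is followed, outside the diff context, by loading from the directory returned by `snapshot_download`. A minimal sketch of that step, assuming the standard `transformers` remote-code flow and the model name used in the hunks above:

```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Without an explicit revision, snapshot_download resolves to the latest snapshot,
# which is what the change above switches to.
model_dir = snapshot_download('qwen/Qwen-14B-Chat')

# Loading local checkpoints: trust_remote_code=True is still required because the
# modeling code is read from the downloaded directory rather than from transformers itself.
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()
```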
diff --git a/README_JA.md b/README_JA.md
index 95becc1..6955cf3 100644
--- a/README_JA.md
+++ b/README_JA.md
@@ -22,8 +22,8 @@
 |     | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
 |-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
-| 7B | 🤖 🤗 | 🤖 🤗 | 🤗 | 🤖 🤗 |
-| 14B | 🤖 🤗 | 🤖 🤗 | 🤗 | 🤖 🤗 |
+| 7B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
+| 14B | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 | 🤖 🤗 |
@@ -200,10 +200,10 @@ from modelscope import snapshot_download
 from transformers import AutoModelForCausalLM, AutoTokenizer
 # Downloading model checkpoint to a local dir model_dir
-# model_dir = snapshot_download('qwen/Qwen-7B', revision='v1.1.4')
-# model_dir = snapshot_download('qwen/Qwen-7B-Chat', revision='v1.1.4')
-# model_dir = snapshot_download('qwen/Qwen-14B', revision='v1.0.4')
-model_dir = snapshot_download('qwen/Qwen-14B-Chat', revision='v1.0.4')
+# model_dir = snapshot_download('qwen/Qwen-7B')
+# model_dir = snapshot_download('qwen/Qwen-7B-Chat')
+# model_dir = snapshot_download('qwen/Qwen-14B')
+model_dir = snapshot_download('qwen/Qwen-14B-Chat')
 # Loading local checkpoints
 # trust_remote_code is still set as True since we still load codes from local dir instead of transformers
@@ -224,9 +224,9 @@ from modelscope import AutoModelForCausalLM, AutoTokenizer
 from modelscope import GenerationConfig
 # Model names:"Qwen/Qwen-7B-Chat"、"Qwen/Qwen-14B-Chat"
-tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
-model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
+tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
+model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
 response, history = model.chat(tokenizer, "你好", history=None)
 print(response)
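The finetuning sections above (README.md and README_CN.md) note that LoRA and Q-LoRA save only the adapter parameters and that the finetuned model can then be loaded for inference, but the corresponding snippet lies outside the hunks shown. A minimal sketch, assuming PEFT's `AutoPeftModelForCausalLM` is used and that `path_to_adapter` points at the training output directory referenced in the tokenizer-saving snippets above:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

path_to_adapter = "output_qwen"  # hypothetical path: the output directory of LoRA/Q-LoRA finetuning

# The adapter config records the base model, which is loaded first; the adapter
# weights are then applied on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter,
    device_map="auto",
    trust_remote_code=True,
).eval()

# Assumes the tokenizer files were copied or saved into the adapter directory,
# as described in the merging sections above.
tokenizer = AutoTokenizer.from_pretrained(path_to_adapter, trust_remote_code=True)

response, history = model.chat(tokenizer, "你好", history=None)
print(response)
```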