
<p align="left">
        <a href="README_CN.md">中文</a>&nbsp ｜ &nbspEnglish&nbsp ｜ &nbsp<a href="README_JA.md">日本語</a>
</p>
<br><br>

<p align="center">
    <img src="assets/logoqwen.jpg" width="400"/>
</p>
<br>

<p align="center">
        🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/models/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 Paper&nbsp&nbsp ｜ &nbsp&nbsp🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
<br>
<a href="assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp ｜ &nbsp&nbsp DingTalk (钉钉) &nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp
</p>
<br><br>

|    |                                                              Qwen-Chat                                                               |                                                                Qwen-Chat (Int4)                                                                |                                                            Qwen                                                            |
|----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
| 7B |  <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a>  <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>  |  <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary">🤖</a>  <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a>  |  <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a>  <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>  |
| 14B | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary">🤖</a>  <a href="https://huggingface.co/Qwen/Qwen-14B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary">🤖</a>  <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B/summary">🤖</a>  <a href="https://huggingface.co/Qwen/Qwen-14B">🤗</a> |


We open-source our **Qwen** series, which now includes the base language models **Qwen-7B** and **Qwen-14B** (**Qwen**) and the chat models **Qwen-7B-Chat** and **Qwen-14B-Chat** (**Qwen-Chat**). Links are in the table above; click them to view the model cards.

In brief, we have strong base language models, stably pretrained on up to 3 trillion tokens of multilingual data covering a wide range of domains and languages (with a focus on Chinese and English). They achieve competitive performance on benchmark datasets. Additionally, we have chat models aligned with human preferences through SFT and RLHF (not released yet), which can chat, create content, extract information, summarize, translate, code, and solve math problems, and can also use tools, act as agents, or even act as code interpreters.

In this repo, you can find:

* A quickstart guide to Qwen with simple inference examples.
* Details about the quantized models, including usage, memory, and inference speed. For comparison, we also provide the statistics of the BF16 models.
* Tutorials on finetuning, including full-parameter tuning, LoRA, and Q-LoRA.
* Instructions on building demos, including the web UI, CLI demo, etc.
* Information about Qwen for tool use, agents, and the code interpreter.
* Statistics from the long-context understanding evaluation.
* The license agreement.
* ...

Also, if you run into problems, check the [FAQ](FAQ.md) first. Still stuck? Feel free to open an issue (preferably in English so that more people can understand you)! If you would like to help us, do not hesitate to send us pull requests. We are always excited about PRs!

Would you like to chat with us or have a coffee together? Join us on Discord or WeChat!
<br><br>

## News and Updates

* 2023.9.25 🔥 We release both **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face. At the same time, we update **Qwen-7B** and **Qwen-7B-Chat**. Compared to **Qwen-7B** (original), the new **Qwen-7B** uses more training tokens, increasing from 2.2T to 2.4T tokens, and its context length extends from 2048 to 8192. The Chinese knowledge and coding ability of **Qwen-7B** have been further improved. **PLEASE MAKE SURE YOU ARE USING THE LATEST CODE AND CHECKPOINTS!**
* 2023.9.12 We now support finetuning on the Qwen-7B models, including full-parameter finetuning, LoRA and Q-LoRA.
* 2023.8.21 We release the Int4 quantized model for Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, which has low memory costs and improved inference speed. Moreover, there is no significant performance degradation on the benchmark evaluation.
* 2023.8.3 We release both **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
<br>

## Performance

Qwen-14B and Qwen-7B (the new version, trained with more tokens and with the context length extended from 2048 to 8192) outperform baseline models of similar sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, and BBH, which evaluate the models' capabilities in natural language understanding, mathematical problem solving, coding, etc. However, even Qwen-14B still falls significantly behind GPT-3.5, let alone GPT-4. See the results below.

<p align="left">
    <img src="assets/radar_14b.jpg" width="600"/>
</p>
<br>

| Model              |   MMLU   |  C-Eval  |  GSM8K   |   MATH   | HumanEval |   MBPP    |   BBH    |  CMMLU   |
|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|:--------:|:--------:|
|                    |  5-shot  |  5-shot  |  8-shot  |  4-shot  |  0-shot   |  3-shot   |  3-shot  |  5-shot  |
| LLaMA2-7B          |   46.8   |   32.5   |   16.7   |   3.3    |   12.8    |   20.8    |   38.2   |   31.8   |
| LLaMA2-13B         |   55.0   |   41.4   |   29.6   |   5.0    |   18.9    |   30.3    |   45.6   |   38.4   |
| LLaMA2-34B         |   62.6   |    -     |   42.2   |   6.2    |   22.6    |   33.0    |   44.1   |    -     |
| ChatGLM2-6B        |   47.9   |   51.7   |   32.4   |   6.5    |     -     |     -     |   33.7   |    -     |
| InternLM-7B        |   51.0   |   52.8   |   31.2   |   6.3    |   10.4    |   14.0    |   37.0   |   51.8   |
| InternLM-20B       |   62.1   |   58.8   |   52.6   |   7.9    |   25.6    |   35.6    |   52.5   |   59.0   |
| Baichuan2-7B       |   54.2   |   54.0   |   24.5   |   5.6    |   18.3    |   24.2    |   41.6   |   57.1   |
| Baichuan2-13B      |   59.2   |   58.1   |   52.8   |   10.1   |   17.1    |   30.2    |   48.8   |   62.0   |
| Qwen-7B (original) |   56.7   |   59.6   |   51.6   |   10.4   |   24.4    |   31.2    |   40.6   |   58.8   |
| **Qwen-7B**        |   58.2   |   63.5   |   51.7   |   11.6   |   29.9    |   31.6    |   45.0   |   62.2   |
| **Qwen-14B**       | **66.3** | **72.1** | **61.3** | **24.8** | **32.3**  | **40.8**  | **53.4** | **71.0** |

For all compared models, we report the higher of their officially reported score and their [OpenCompass](https://opencompass.org.cn/leaderboard-llm) score.

For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical report by clicking [here](TODO).
<br><br>

## Requirements

* Python 3.8 and above
* PyTorch 1.12 and above (2.0 and above is recommended)
* Transformers 4.32 and above
* CUDA 11.4 and above is recommended (for GPU users, flash-attention users, etc.)
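
A quick way to compare installed versions against these minimums is sketched below. The helper names (`parse_version`, `meets_minimum`, `check_environment`) and the `MINIMUMS` table are illustrative and not part of this repo. The comparison only looks at the leading numeric components of each version string, so treat it as a rough check rather than a full version parser.

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the list above (illustrative table, not part of the repo).
MINIMUMS = {"torch": "1.12", "transformers": "4.32"}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (numeric components only)."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Print one line per requirement; missing packages are reported, not raised."""
    print("python", sys.version_info >= (3, 8))
    for name, minimum in MINIMUMS.items():
        try:
            print(name, meets_minimum(metadata.version(name), minimum))
        except metadata.PackageNotFoundError:
            print(name, "not installed")


if __name__ == "__main__":
    check_environment()
```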
<br>

## Quickstart

Below, we provide simple examples to show how to use Qwen-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment and meet the requirements above, then install the dependent libraries.

```bash
pip install -r requirements.txt
```

If your device supports fp16 or bf16, we recommend installing [flash-attention](https://github.com/Dao-AILab/flash-attention) for higher efficiency and lower memory usage. (**flash-attention is optional and the project can run normally without installing it**)

```bash
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Below are optional. Installing them might be slow.
# pip install csrc/layer_norm
# pip install csrc/rotary
```

Now you can start with ModelScope or Transformers.

#### 🤗 Transformers

To use Qwen-Chat for inference, all you need is a few lines of code, as demonstrated below. Remember to pass in the correct model name or path, such as "Qwen/Qwen-7B-Chat" or "Qwen/Qwen-14B-Chat". Also, **please make sure that you are using the latest code.**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好！很高兴为你提供帮助。

# 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明，他来自一个普通的家庭，父母都是普通的工人。从小，李明就立下了一个目标：要成为一名成功的企业家。
# 为了实现这个目标，李明勤奋学习，考上了大学。在大学期间，他积极参加各种创业比赛，获得了不少奖项。他还利用课余时间去实习，积累了宝贵的经验。
# 毕业后，李明决定开始自己的创业之路。他开始寻找投资机会，但多次都被拒绝了。然而，他并没有放弃。他继续努力，不断改进自己的创业计划，并寻找新的投资机会。
# 最终，李明成功地获得了一笔投资，开始了自己的创业之路。他成立了一家科技公司，专注于开发新型软件。在他的领导下，公司迅速发展起来，成为了一家成功的科技企业。
# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险，不断学习和改进自己。他的成功也证明了，只要努力奋斗，任何人都有可能取得成功。

# 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# 《奋斗创业：一个年轻人的成功之路》
```
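
The pattern in the example above is that each call to `model.chat` returns the updated `history`, which is then passed to the next call. A small sketch of that loop follows. `run_dialogue` is a hypothetical helper, and it assumes only the `chat(tokenizer, query, history=...)` call shape shown above.

```python
def run_dialogue(model, tokenizer, prompts):
    """Send prompts in order, threading the history returned by each turn.

    Assumes `model.chat(tokenizer, query, history=...)` returns
    `(response, history)` as in the example above.
    """
    history = None  # the first turn starts without any history
    responses = []
    for prompt in prompts:
        response, history = model.chat(tokenizer, prompt, history=history)
        responses.append(response)
    return responses, history
```

With a loaded model and tokenizer, `run_dialogue(model, tokenizer, ["你好", "给这个故事起一个标题"])` would return the list of replies together with the final history.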

Running the Qwen pretrained base model is also simple.

<details>
  <summary>Running Qwen</summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B" 
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True
).eval()

# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

inputs = tokenizer('蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是亚的斯亚贝巴（Addis Ababa）...
```

</details>

#### 🤖 ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model services to AI developers. Similarly, you can run the models with ModelScope as shown below:

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)  # You can specify different generation lengths, top_p, and other hyperparameters

response, history = model.chat(tokenizer, "你好", history=None)
print(response)
response, history = model.chat(tokenizer, "浙江的省会在哪里？", history=history) 
print(response)
response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)
print(response)
```
<br>

## Quantization

### Usage

We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release Int4 quantized models for Qwen-7B-Chat ([click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)) and Qwen-14B-Chat ([click here](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4)), which achieve nearly lossless model quality with lower memory costs and higher inference speed.

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

```bash
pip install auto-gptq optimum
```

If you have problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.

Then you can load the quantized model easily and run inference as usual:

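```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "Hi", history=None)
```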

### Performance

We evaluate both the BF16 and Int4 models on the benchmarks and find that the quantized models do not suffer from significant performance degradation. Results are shown below:

| Quantization         | MMLU | CEval (val) | GSM8K | Humaneval |
|----------------------|:----:|:-----------:|:-----:|:---------:|
| Qwen-7B-Chat (BF16)  | 53.9 |    54.2     | 41.1  |   24.4    |
| Qwen-7B-Chat (Int4)  | 52.6 |    52.9     | 38.1  |   23.8    |
| Qwen-14B-Chat (BF16) | 64.6 |    69.8     | 61.0  |   43.9    |
| Qwen-14B-Chat (Int4) | 63.3 |    69.0     | 59.8  |   45.7    |

### Inference Speed

We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization, respectively.

| Quantization         | Speed (2048 tokens) | Speed (8192 tokens) |
|----------------------|:-------------------:|:-------------------:|
| Qwen-7B-Chat (BF16)  |        30.34        |        29.32        |
| Qwen-7B-Chat (Int4)  |        43.56        |        33.92        |
| Qwen-14B-Chat (BF16) |        30.70        |        21.73        |
| Qwen-14B-Chat (Int4) |        37.11        |        26.11        |

In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.
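
To put the throughput figures in wall-clock terms, the sketch below converts tokens per second into the approximate time needed to generate 8192 tokens and computes the Int4 speed-up over BF16. The numbers are copied from the 8192-token column of the table above; the variable and function names are illustrative.

```python
# Average speeds (tokens/s) from the "8192 tokens" column of the table above.
SPEED_8192 = {
    "Qwen-7B-Chat": {"BF16": 29.32, "Int4": 33.92},
    "Qwen-14B-Chat": {"BF16": 21.73, "Int4": 26.11},
}


def generation_seconds(tokens, tokens_per_second):
    """Approximate wall-clock time to generate `tokens` at a steady rate."""
    return tokens / tokens_per_second


def int4_speedup(speeds):
    """Ratio of Int4 throughput to BF16 throughput."""
    return speeds["Int4"] / speeds["BF16"]


for name, speeds in SPEED_8192.items():
    bf16 = generation_seconds(8192, speeds["BF16"])
    int4 = generation_seconds(8192, speeds["Int4"])
    print(f"{name}: BF16 {bf16:.0f}s, Int4 {int4:.0f}s, speed-up {int4_speedup(speeds):.2f}x")
```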

### GPU Memory Usage

We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under BF16 precision and Int4 quantization, respectively. The results are shown below.

| Quantization         | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
|----------------------|:-----------------------------------:|:-------------------------------------:|
| Qwen-7B-Chat (BF16)  |               17.66GB               |                22.58GB                |
| Qwen-7B-Chat (Int4)  |               8.21GB                |                13.62GB                |
| Qwen-14B-Chat (BF16) |               30.15GB                 |                38.94GB                  |
| Qwen-14B-Chat (Int4) |               13.00GB                 |                21.79GB                  |

The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
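
The memory figures can be summarized as the fraction of BF16 peak memory that the Int4 model needs. The sketch below does that arithmetic on the values from the table above; the names are illustrative.

```python
# Peak GPU memory in GB from the table above:
# (encoding 2048 tokens, generating 8192 tokens).
PEAK_GB = {
    "Qwen-7B-Chat": {"BF16": (17.66, 22.58), "Int4": (8.21, 13.62)},
    "Qwen-14B-Chat": {"BF16": (30.15, 38.94), "Int4": (13.00, 21.79)},
}


def int4_fraction(model_name):
    """Int4 peak memory as a fraction of BF16 peak memory, per scenario."""
    bf16 = PEAK_GB[model_name]["BF16"]
    int4 = PEAK_GB[model_name]["Int4"]
    return tuple(q / b for q, b in zip(int4, bf16))


for name in PEAK_GB:
    encode, generate = int4_fraction(name)
    print(f"{name}: encoding {encode:.0%} of BF16, generating {generate:.0%} of BF16")
```

On these numbers, the relative saving is larger when encoding a 2048-token context than when generating 8192 tokens.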
<br><br>

## Finetuning

We provide the official training script, `finetune.py`, so that users can finetune the pretrained model for downstream applications in a simple fashion. We also provide shell scripts that launch finetuning for you. The training script supports [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The shell scripts that we provide use DeepSpeed, so we advise you to install DeepSpeed before you start.

To prepare your training data, you need to put all the samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list for conversation. Below is a simple example list with 1 sample:
```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是一个语言模型，我叫通义千问。"
      }
    ]
  }
]
```
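
A file in this format can be produced programmatically. The sketch below builds the sample shown above and writes it as UTF-8 JSON. The helper names and the output location are hypothetical, and the sketch accepts only the two speaker labels that appear in the example.

```python
import json
import os
import tempfile


def make_sample(sample_id, turns):
    """Build one training sample from (speaker, text) pairs.

    This sketch accepts only the "user" and "assistant" labels shown above.
    """
    conversations = []
    for speaker, text in turns:
        if speaker not in ("user", "assistant"):
            raise ValueError(f"unexpected speaker: {speaker!r}")
        conversations.append({"from": speaker, "value": text})
    return {"id": sample_id, "conversations": conversations}


def save_samples(samples, path):
    """Write the list of samples as UTF-8 JSON, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)


samples = [
    make_sample("identity_0", [("user", "你好"), ("assistant", "我是一个语言模型，我叫通义千问。")])
]
# Hypothetical location; point $DATA at wherever you save the file.
path = os.path.join(tempfile.gettempdir(), "train_data.json")
save_samples(samples, path)
```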

After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, `$DATA`.

The finetuning scripts allow you to perform:
- Full-parameter finetuning
- LoRA
- Q-LoRA

Full-parameter finetuning requires updating all parameters throughout the training process. To launch your training, run the following script:

```bash
# Distributed training. We do not provide a single-GPU training script because insufficient GPU memory would break the training.
sh finetune/finetune_ds.sh
```

Remember to specify the correct model name or path, the data path, as well as the output directory in the shell scripts. Another thing to notice is that we use DeepSpeed ZeRO 3 in this script. If you want to make changes, just remove the argument `--deepspeed` or make changes in the DeepSpeed configuration json file based on your requirements. Additionally, this script supports mixed-precision training, and thus you can use `--bf16 True` or `--fp16 True`. Empirically we advise you to use bf16 to make your training consistent with our pretraining and alignment if your machine supports bf16, and thus we use it by default.
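
For orientation, a minimal ZeRO stage 3 configuration might look like the sketch below. This is an assumption for illustration, not the configuration file shipped with the shell scripts. The keys are standard DeepSpeed options, the `"auto"` values assume the Hugging Face Trainer integration fills them in from the training arguments, and the output path is hypothetical.

```python
import json
import os
import tempfile

# Illustrative ZeRO-3 settings; not the configuration shipped with this repo.
ds_config = {
    "bf16": {"enabled": "auto"},  # assumed to be resolved from --bf16
    "fp16": {"enabled": "auto"},  # assumed to be resolved from --fp16
    "zero_optimization": {"stage": 3},  # shard optimizer states, gradients, and parameters
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Hypothetical output path for the example configuration.
config_path = os.path.join(tempfile.gettempdir(), "ds_config_zero3_example.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(ds_config, f, indent=2)
```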

Similarly, to run LoRA, use another script as shown below. Before you start, make sure that you have installed `peft`. Also, you need to specify the paths to your model, data, and output. We advise you to use an absolute path for your pretrained model, because LoRA only saves the adapter, and the path in the adapter configuration JSON file is used to find the pretrained model to load. This script also supports both bf16 and fp16.

```bash
# Single GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

In comparison with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of the adapter layers and keeps the original large language model layers frozen. This requires much less memory and therefore less computation. However, if you still run out of memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model and other techniques such as paged optimizers to reduce memory costs even further. To run Q-LoRA, directly run the following script:

```bash
# Single GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. However, different from full-parameter finetuning and LoRA, only fp16 is supported for Q-LoRA.
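
To see why LoRA reduces memory, consider a single weight matrix of shape `d_out × d_in`. Full-parameter finetuning updates all `d_out * d_in` entries, while LoRA trains two low-rank factors with `rank * (d_in + d_out)` entries in total. The sketch below does this arithmetic. The layer size and rank are hypothetical values chosen for illustration, not Qwen's actual configuration.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical square projection layer and LoRA rank.
d_in, d_out, rank = 4096, 4096, 64
full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```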

Unlike full-parameter finetuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. Assuming your training starts from Qwen-7B, you can load the finetuned model for inference as shown below:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

The shell scripts use `torchrun` to run single-GPU or multi-GPU training. For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine.
<br><br>

## Demo

### Web UI

We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:

```bash
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```bash
python web_demo.py
```

<p align="center">
    <br>
    <img src="assets/web_demo.gif" width="600" />
    <br>
</p>

### CLI Demo

We provide a CLI demo example in `cli_demo.py`, which supports streaming output for generation. Users can interact with Qwen-7B-Chat by entering prompts, and the model streams its output back. Run the command below:

```bash
python cli_demo.py
```

<p align="center">
    <br>
    <img src="assets/cli_demo.gif" width="600" />
    <br>
</p>
<br>

## API

We provide a way to deploy a local API that follows the OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:

```bash
pip install fastapi uvicorn openai pydantic sse_starlette
```

Then run the command to deploy your API:

```bash
python openai_api.py
```

You can change the arguments, e.g., `-c` for the checkpoint name or path, `--cpu-only` for CPU deployment, etc. If you have problems launching your API deployment, updating the packages to the latest version will probably solve them.

Using the API is also simple. See the example below:

```python
import openai
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# create a request activating streaming response
for chunk in openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}
    ],
    stream=True 
    # Specifying stop words in streaming output format is not yet supported and is under development.
):
    if hasattr(chunk.choices[0].delta, "content"):
        print(chunk.choices[0].delta.content, end="", flush=True)

# create a request not activating streaming response
response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}
    ],
    stream=False,
    stop=[] # You can add custom stop words here, e.g., stop=["Observation:"] for ReAct prompting.
)
print(response.choices[0].message.content)
```

<p align="center">
    <br>
    <img src="assets/openai_api.gif" width="600" />
    <br>
</p>

Function calling is also supported (but only when `stream=False` for the moment). See the [example usage](examples/function_call_examples.py) here.
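
When using the streaming request from the example above, you may want the full reply as one string rather than printing each piece. The sketch below joins the `content` fields of the streamed chunks. `collect_stream` and `fake_chunk` are hypothetical helpers, and the stand-in chunks only mimic the `choices[0].delta` attribute access used in the example, so the snippet runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the `content` pieces of streamed chat-completion chunks into one string.

    Chunks whose delta has no `content` (for example a role-only chunk)
    are skipped, mirroring the `hasattr` check in the example above.
    """
    pieces = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        content = getattr(delta, "content", None)
        if content:
            pieces.append(content)
    return "".join(pieces)


def fake_chunk(**delta_fields):
    """Build a stand-in chunk for offline testing (hypothetical helper)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**delta_fields))])


demo = [fake_chunk(role="assistant"), fake_chunk(content="Hel"), fake_chunk(content="lo"), fake_chunk()]
print(collect_stream(demo))  # prints "Hello"
```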
<br><br>

## Deployment

Running the model on CPU is simple; you only need to specify the device:

```python
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
```

If you run short of GPU memory and would like to run the model on more than one GPU, you can use our provided script `utils.py`:

```python
from utils import load_model_on_gpus
model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2)
```

Then you can run the 7B chat model on 2 GPUs using the above scripts.
<br><br>

## Tool Usage

Qwen-Chat has been optimized for tool usage and function calling capabilities. Users can develop agents, LangChain applications, and even augment Qwen with a Python Code Interpreter.

We provide documentation on how to implement tool calls based on the principle of ReAct Prompting; please refer to [the ReAct example](examples/react_prompt.md). Based on this principle, we provide support for function calling in [openai_api.py](openai_api.py).
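
In ReAct-style prompting, the model writes out a tool name and its input as plain text, and generation is stopped before the observation so that the caller can run the tool. The sketch below parses such a completion. The `Action:` and `Action Input:` labels are an assumption based on the common ReAct layout (only the `Observation:` stop word appears in this README), so check the ReAct example for the exact prompt format. `parse_react_step` is a hypothetical helper.

```python
import re

# Assumed layout: "Action: <tool>" on one line, followed by "Action Input: <args>".
ACTION_RE = re.compile(
    r"Action:\s*(?P<tool>.+?)\s*\nAction Input:\s*(?P<args>.*)", re.DOTALL
)


def parse_react_step(text):
    """Return (tool_name, tool_input) from a ReAct-style completion, or None.

    Assumes the completion was cut off before "Observation:", e.g. by
    passing stop=["Observation:"] to the API.
    """
    match = ACTION_RE.search(text)
    if match is None:
        return None
    return match.group("tool").strip(), match.group("args").strip()


completion = 'Thought: I should look this up.\nAction: search\nAction Input: {"query": "Qwen"}\n'
print(parse_react_step(completion))
```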

We have tested the model's tool calling capabilities on our open-source Chinese evaluation benchmark and found that Qwen-Chat consistently performs well:

<table>
    <tr>
        <th colspan="4" align="center">Chinese Tool-Use Benchmark</th>
    </tr>
    <tr>
        <th align="center">Model</th><th align="center">Tool Selection (Acc.↑)</th><th align="center">Tool Input (Rouge-L↑)</th><th align="center">False Positive Error↓</th>
    </tr>
    <tr>
        <td>GPT-4</td><td align="center">95%</td><td align="center">0.90</td><td align="center">15.0%</td>
    </tr>
    <tr>
        <td>GPT-3.5</td><td align="center">85%</td><td align="center">0.88</td><td align="center">75.0%</td>
    </tr>
    <tr>
        <td>Qwen-7B-Chat</td><td align="center">98%</td><td align="center">0.91</td><td align="center">7.3%</td>
    </tr>
    <tr>
        <td>Qwen-14B-Chat</td><td align="center">98%</td><td align="center">0.93</td><td align="center">2.4%</td>
    </tr>
</table>

To assess Qwen's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and other general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed for evaluating these capabilities. You can find the benchmark at this [link](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark).

We have observed that Qwen performs well in terms of code executability and result accuracy when generating code:

<table>
    <tr>
        <th colspan="4" align="center">Executable Rate of Generated Code (%)</th>
    </tr>
    <tr>
        <th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization↑</th><th align="center">General↑</th>
    </tr>
    <tr>
        <td>GPT-4</td><td align="center">91.9</td><td align="center">85.9</td><td align="center">82.8</td>
    </tr>
    <tr>
        <td>GPT-3.5</td><td align="center">89.2</td><td align="center">65.0</td><td align="center">74.1</td>
    </tr>
    <tr>
        <td>LLaMA2-7B-Chat</td>
        <td align="center">41.9</td>
        <td align="center">33.1</td>
        <td align="center">24.1 </td>
    </tr>
    <tr>
        <td>LLaMA2-13B-Chat</td>
        <td align="center">50.0</td>
        <td align="center">40.5</td>
        <td align="center">48.3 </td>
    </tr>
    <tr>
        <td>CodeLLaMA-7B-Instruct</td>
        <td align="center">85.1</td>
        <td align="center">54.0</td>
        <td align="center">70.7 </td>
    </tr>
    <tr>
        <td>CodeLLaMA-13B-Instruct</td>
        <td align="center">93.2</td>
        <td align="center">55.8</td>
        <td align="center">74.1 </td>
    </tr>
    <tr>
        <td>InternLM-7B-Chat-v1.1</td>
        <td align="center">78.4</td>
        <td align="center">44.2</td>
        <td align="center">62.1 </td>
    </tr>
    <tr>
        <td>InternLM-20B-Chat</td>
        <td align="center">70.3</td>
        <td align="center">44.2</td>
        <td align="center">65.5 </td>
    </tr>
    <tr>
        <td>Qwen-7B-Chat</td>
        <td align="center">82.4</td>
        <td align="center">64.4</td>
        <td align="center">67.2 </td>
    </tr>
    <tr>
        <td>Qwen-14B-Chat</td>
        <td align="center">89.2</td>
        <td align="center">84.1</td>
        <td align="center">65.5</td>
    </tr>
</table>

<table>
    <tr>
        <th colspan="4" align="center">Accuracy of Code Execution Results (%)</th>
    </tr>
    <tr>
        <th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization-Hard↑</th><th align="center">Visualization-Easy↑</th>
    </tr>
    <tr>
        <td>GPT-4</td><td align="center">82.8</td><td align="center">66.7</td><td align="center">60.8</td>
    </tr>
    <tr>
        <td>GPT-3.5</td><td align="center">47.3</td><td align="center">33.3</td><td align="center">55.7</td>
    </tr>
    <tr>
        <td>LLaMA2-7B-Chat</td>
        <td align="center">3.9</td>
        <td align="center">14.3</td>
        <td align="center">39.2 </td>
    </tr>
    <tr>
        <td>LLaMA2-13B-Chat</td>
        <td align="center">8.3</td>
        <td align="center">8.3</td>
        <td align="center">40.5 </td>
    </tr>
    <tr>
        <td>CodeLLaMA-7B-Instruct</td>
        <td align="center">14.3</td>
        <td align="center">26.2</td>
        <td align="center">60.8 </td>
    </tr>
    <tr>
        <td>CodeLLaMA-13B-Instruct</td>
        <td align="center">28.2</td>
        <td align="center">27.4</td>
        <td align="center">62.0 </td>
    </tr>
    <tr>
        <td>InternLM-7B-Chat-v1.1</td>
        <td align="center">28.5</td>
        <td align="center">4.8</td>
        <td align="center">40.5 </td>
    </tr>
    <tr>
        <td>InternLM-20B-Chat</td>
        <td align="center">34.6</td>
        <td align="center">21.4</td>
        <td align="center">45.6 </td>
    </tr>
    <tr>
        <td>Qwen-7B-Chat</td>
        <td align="center">41.9</td>
        <td align="center">40.5</td>
        <td align="center">54.4 </td>
    </tr>
    <tr>
        <td>Qwen-14B-Chat</td>
        <td align="center">58.4</td>
        <td align="center">53.6</td>
        <td align="center">59.5</td>
    </tr>
</table>

<p align="center">
    <br>
    <img src="assets/code_interpreter_showcase_001.jpg" />
    <br>
</p>

In addition, we also provide experimental results demonstrating that our model is capable of acting as a HuggingFace Agent. For more information, please refer to the [example documentation](examples/transformers_agent.md). The model's performance on the evaluation dataset provided by Hugging Face is as follows:

<table>
    <tr>
        <th colspan="4" align="center">HuggingFace Agent Benchmark - Run Mode</th>
    </tr>
    <tr>
        <th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
    </tr>
    <tr>
        <td>GPT-4</td><td align="center">100</td><td align="center">100</td><td align="center">97.4</td>
    </tr>
    <tr>
        <td>GPT-3.5</td><td align="center">95.4</td><td align="center">96.3</td><td align="center">87.0</td>
    </tr>
    <tr>
        <td>StarCoder-Base-15B</td><td align="center">86.1</td><td align="center">87.0</td><td align="center">68.9</td>
    </tr>
    <tr>
        <td>StarCoder-15B</td><td align="center">87.0</td><td align="center">88.0</td><td align="center">68.9</td>
    </tr>
    <tr>
        <td>Qwen-7B-Chat</td><td align="center">87.0</td><td align="center">87.0</td><td align="center">71.5</td>
    </tr>
    <tr>
        <td>Qwen-14B-Chat</td><td align="center">93.5</td><td align="center">94.4</td><td align="center">87.0</td>
    </tr>
</table>

<table>
    <tr>
        <th colspan="4" align="center">HuggingFace Agent Benchmark - Chat Mode</th>
    </tr>
    <tr>
        <th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
    </tr>
    <tr>
        <td>GPT-4</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">98.5</td>
    </tr>
    <tr>
        <td>GPT-3.5</td><td align="center">97.3</td><td align="center">96.8</td><td align="center">89.6</td>
    </tr>
    <tr>
        <td>StarCoder-Base-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">91.1</td>
    </tr>
    <tr>
        <td>StarCoder-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">89.6</td>
    </tr>
    <tr>
        <td>Qwen-7B-Chat</td><td align="center">94.7</td><td align="center">94.7</td><td align="center">85.1</td>
    </tr>
    <tr>
        <td>Qwen-14B-Chat</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">95.5</td>
    </tr>
</table>

<br>

## Long-Context Understanding

To extend the context length beyond the training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling. They extend the context length of Qwen-7B/14B from 2K to over 8K tokens, and of Qwen-7B from 8K to 32K tokens. We run language modeling experiments on the arXiv dataset, evaluated by perplexity (PPL, lower is better), and find that Qwen performs well in long-context scenarios. Results are shown below:

<table>
    <tr>
        <th rowspan="2">Model</th><th colspan="6" align="center">Sequence Length</th>
    </tr>
    <tr>
        <th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th><th align="center">32768</th>
    </tr>
     <tr>
        <td>Qwen-7B (original)</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td><td align="center">-</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td><td align="center">-</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk + logn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.56</td><td align="center">4.62</td><td align="center">-</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk + logn + window_attn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.49</td><td align="center">4.32</td><td align="center">-</td>
    </tr>
    <tr>
        <td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.33</b></td><td align="center"><b>3.22</b></td><td align="center"><b>3.17</b></td>
    </tr>
    <tr>
        <td>Qwen-14B</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center">22.79</td><td align="center">334.65</td><td align="center">3168.35</td><td align="center">-</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center"><b>3.29</b></td><td align="center"><b>3.18</b></td><td align="center">3.42</td><td align="center">-</td>
    </tr>
</table>
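
As a rough illustration of two of the techniques named above, the sketch below shows a commonly used formulation of dynamic NTK-aware interpolation (enlarging the rotary base once the input exceeds the training length) and LogN attention scaling (scaling queries by a log ratio beyond the training length). The function names, the rotary base of 10000, and the head dimension of 128 are assumptions made for this example and are not taken from this README; the exact schedule used by the released models may differ.

```python
import math


def dynamic_ntk_alpha(seq_len: int, train_len: int) -> float:
    """Scaling factor alpha; stays at 1.0 while the input fits the training length."""
    if seq_len <= train_len:
        return 1.0
    # Assumed schedule: 2 ** ceil(log2(ratio) + 1) - 1, so alpha grows in steps.
    return float(2 ** math.ceil(math.log2(seq_len / train_len) + 1) - 1)


def ntk_scaled_base(base: float, alpha: float, head_dim: int) -> float:
    """NTK-aware interpolation: enlarge the rotary base instead of rescaling positions."""
    return base * alpha ** (head_dim / (head_dim - 2))


def logn_scale(position: int, train_len: int) -> float:
    """LogN scaling: multiply the query by log_{train_len}(position) past train_len."""
    if position <= train_len:
        return 1.0
    return math.log(position) / math.log(train_len)


# Illustrative values only: training length 2048, rotary base 10000, head dim 128.
for length in (2048, 4096, 8192):
    alpha = dynamic_ntk_alpha(length, 2048)
    print(length, alpha, ntk_scaled_base(10000.0, alpha, 128), logn_scale(length, 2048))
```

Within the training length both factors are 1, so short inputs are left unchanged, which matches the identical 1024 and 2048 columns in the table above.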


## Tokenizer

Our tokenizer is based on tiktoken and differs from other tokenizers, e.g., the sentencepiece tokenizer. Pay attention to special tokens, especially during finetuning. For more details on the tokenizer and its use in finetuning, please refer to the [documentation](tokenization_note.md).
<br><br>

## Reproduction

We provide scripts for reproducing the model performance on benchmark datasets. Check [eval/EVALUATION.md](eval/EVALUATION.md) for more information. Note that reproduced results may differ slightly from our reported results.
<br><br>

## FAQ

If you run into problems, please check the [FAQ](FAQ.md) and the existing issues for a solution before opening a new issue.
<br><br>

## License Agreement

Researchers and developers are free to use the code and model weights of both Qwen and Qwen-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
<br><br>

## Contact Us

If you would like to leave a message for either our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to qianwen_opensource@alibabacloud.com.
<br><br>

We open-source our **Qwen** series, now including **Qwen**, the base language models, namely **Qwen-7B** and **Qwen-14B**, as well as **Qwen-Chat**, the chat models, namely **Qwen-7B-Chat** and **Qwen-14B-Chat**. Links are in the table above. Click them to check the model cards.

In brief, we have strong base language models, which have been stably pretrained on up to 3 trillion tokens of multilingual data with a wide coverage of domains and languages (with a focus on Chinese and English). They achieve competitive performance on benchmark datasets. Additionally, we have chat models that are aligned with human preference based on SFT and RLHF (not released yet), which are able to chat, create content, extract information, summarize, translate, code, solve math problems, and so on, and are able to use tools, play as agents, or even play as code interpreters.

In this repo, you can find:

* A quickstart with Qwen for simple inference.
* Details about the quantized models, including usage, memory, and inference speed. For comparison, we also provide the statistics of the BF16 models.
* Tutorials on finetuning, including full-parameter tuning, LoRA, and Q-LoRA.
* Instructions on building demos, including WebUI, CLI demo, etc.
* Information about Qwen for tool use, agent, and code interpreter
* Statistics of long-context understanding evaluation
* License agreement
* ...

Also, if you run into problems, turn to the [FAQ](FAQ.md) for help first. Still struggling? Feel free to open an issue (preferably in English so that more people can understand you)! If you would like to help us, send us pull requests without hesitation! We are always excited about PRs!

Would you like to chat with us or have a coffee chat? Welcome to our Discord or WeChat!
<br><br>

## News and Updates

* 2023.9.25 🔥 We release both **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face. At the same time, we update **Qwen-7B** and **Qwen-7B-Chat**. Compared to **Qwen-7B** (original), **Qwen-7B** uses more training tokens, increasing from 2.2T tokens to 2.4T tokens, while the context length extends from 2048 to 8192. The Chinese knowledge and coding ability of **Qwen-7B** have been further improved. **PLEASE MAKE SURE YOU ARE USING THE LATEST CODE AND CHECKPOINTS!**
* 2023.9.12 We now support finetuning on the Qwen-7B models, including full-parameter finetuning, LoRA and Q-LoRA.
* 2023.8.21 We release the Int4 quantized model for Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, which has low memory costs and improved inference speed. Besides, there is no significant performance degradation on the benchmark evaluation.
* 2023.8.3 We release both **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
<br>

## Performance

Qwen-14B and Qwen-7B (the new version, trained with more tokens and with the context length extended from 2048 to 8192) outperform baseline models of similar sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models' capabilities in natural language understanding, mathematical problem solving, coding, etc. However, even Qwen-14B still falls significantly behind GPT-3.5, let alone GPT-4. See the results below.

<p align="left">
    <img src="assets/radar_14b.jpg" width="600"/>
</p>
<br>

| Model              |   MMLU   |  C-Eval  |  GSM8K   |   MATH   | HumanEval |   MBPP    |   BBH    |  CMMLU   |
|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|:--------:|:--------:|
|                    |  5-shot  |  5-shot  |  8-shot  |  4-shot  |  0-shot   |  3-shot   |  3-shot  |  5-shot  |
| LLaMA2-7B          |   46.8   |   32.5   |   16.7   |   3.3    |   12.8    |   20.8    |   38.2   |   31.8   |
| LLaMA2-13B         |   55.0   |   41.4   |   29.6   |   5.0    |   18.9    |   30.3    |   45.6   |   38.4   |
| LLaMA2-34B         |   62.6   |    -     |   42.2   |   6.2    |   22.6    |   33.0    |   44.1   |    -     |
| ChatGLM2-6B        |   47.9   |   51.7   |   32.4   |   6.5    |     -     |     -     |   33.7   |    -     |
| InternLM-7B        |   51.0   |   52.8   |   31.2   |   6.3    |   10.4    |   14.0    |   37.0   |   51.8   |
| InternLM-20B       |   62.1   |   58.8   |   52.6   |   7.9    |   25.6    |   35.6    |   52.5   |   59.0   |
| Baichuan2-7B       |   54.2   |   54.0   |   24.5   |   5.6    |   18.3    |   24.2    |   41.6   |   57.1   |
| Baichuan2-13B      |   59.2   |   58.1   |   52.8   |   10.1   |   17.1    |   30.2    |   48.8   |   62.0   |
| Qwen-7B (original) |   56.7   |   59.6   |   51.6   |   10.4   |   24.4    |   31.2    |   40.6   |   58.8   |
| **Qwen-7B**        |   58.2   |   63.5   |   51.7   |   11.6   |   29.9    |   31.6    |   45.0   |   62.2   |
| **Qwen-14B**       | **66.3** | **72.1** | **61.3** | **24.8** | **32.3**  | **40.8**  | **53.4** | **71.0** |

For all compared models, we report the best scores between their official reported results and [OpenCompass](https://opencompass.org.cn/leaderboard-llm).

For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical report by clicking [here](TODO).
<br><br>

## Requirements

* python 3.8 and above
* pytorch 1.12 and above (2.0 and above is recommended)
* transformers 4.32 and above
* CUDA 11.4 and above is recommended (this is for GPU users, flash-attention users, etc.)
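
If you want to check an existing environment against the list above before installing anything, a small script along the following lines may help. It is only a sketch: the helper names are made up for this example, the minimum versions are copied from the list above, and pre-release or local version suffixes are handled in a simplified way.

```python
import sys
from importlib import metadata

# Minimum versions copied from the requirements list above.
MINIMUMS = {"torch": "1.12", "transformers": "4.32"}


def version_tuple(version: str) -> tuple:
    """Turn '2.0.1+cu118' into (2, 0, 1); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings by their leading numeric components."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment() -> dict:
    """Map each requirement to True/False; a missing package counts as False."""
    report = {"python": sys.version_info >= (3, 8)}
    for package, minimum in MINIMUMS.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = False
    return report


print(check_environment())
```

The CUDA version is not checked here because it depends on how PyTorch was built and on the installed driver.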
<br>

## Quickstart

Below, we provide simple examples to show how to use Qwen-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.

```bash
pip install -r requirements.txt
```

If your device supports fp16 or bf16, we recommend installing [flash-attention](https://github.com/Dao-AILab/flash-attention) for higher efficiency and lower memory usage. (**flash-attention is optional and the project can run normally without installing it**)

```bash
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Below are optional. Installing them might be slow.
# pip install csrc/layer_norm
# pip install csrc/rotary
```

Now you can start with ModelScope or Transformers.

#### 🤗 Transformers

To use Qwen-Chat for inference, all you need to do is input a few lines of code as demonstrated below. Remember to pass in the correct model names or paths, such as "Qwen/Qwen-7B-Chat" and "Qwen/Qwen-14B-Chat". However, **please make sure that you are using the latest code.**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# 1st dialogue turn (the prompt means "Hello")
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好！很高兴为你提供帮助。

# 2nd dialogue turn (the prompt means "Tell me a story about a young person who works hard to start a business and finally succeeds.")
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明，他来自一个普通的家庭，父母都是普通的工人。从小，李明就立下了一个目标：要成为一名成功的企业家。
# 为了实现这个目标，李明勤奋学习，考上了大学。在大学期间，他积极参加各种创业比赛，获得了不少奖项。他还利用课余时间去实习，积累了宝贵的经验。
# 毕业后，李明决定开始自己的创业之路。他开始寻找投资机会，但多次都被拒绝了。然而，他并没有放弃。他继续努力，不断改进自己的创业计划，并寻找新的投资机会。
# 最终，李明成功地获得了一笔投资，开始了自己的创业之路。他成立了一家科技公司，专注于开发新型软件。在他的领导下，公司迅速发展起来，成为了一家成功的科技企业。
# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险，不断学习和改进自己。他的成功也证明了，只要努力奋斗，任何人都有可能取得成功。

# 3rd dialogue turn (the prompt means "Give this story a title")
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# 《奋斗创业：一个年轻人的成功之路》
```
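
The example above lists four ways to load the model (bf16, fp16, CPU only, and auto mode). If you prefer to choose between them in code, a helper like the one below can assemble the keyword arguments. This is a sketch only: `pick_load_kwargs` is a hypothetical name, and the policy it encodes (prefer bf16 on a GPU that supports it, otherwise fp16, otherwise CPU) is an assumption rather than a recommendation from this repository.

```python
def pick_load_kwargs(has_cuda: bool, supports_bf16: bool) -> dict:
    """Build keyword arguments for `from_pretrained`, mirroring the options above."""
    kwargs = {"trust_remote_code": True}
    if not has_cuda:
        kwargs["device_map"] = "cpu"  # the "use cpu only" variant
        return kwargs
    kwargs["device_map"] = "auto"
    if supports_bf16:
        kwargs["bf16"] = True  # the "use bf16" variant
    else:
        kwargs["fp16"] = True  # the "use fp16" variant
    return kwargs


print(pick_load_kwargs(has_cuda=True, supports_bf16=False))
```

With PyTorch installed, the two flags can be obtained from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`, and the result passed as `AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", **kwargs)`.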

Running the Qwen pretrained base model is also simple.

<details>
  <summary>Running Qwen</summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True
).eval()

# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# The prompt lists "The capital of Mongolia is Ulaanbaatar", "The capital of Iceland is Reykjavik",
# and leaves "The capital of Ethiopia is" open for the model to complete.
inputs = tokenizer('蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是亚的斯亚贝巴（Addis Ababa）...
```

</details>
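
The base model continues text rather than answering a chat turn, so the example above feeds it a few completed lines and leaves the last one open. The sketch below shows one way to assemble such a prompt and to cut the echoed prompt off the decoded output, since the sample output above begins with the prompt itself. Both helper names are hypothetical and the English lines are placeholders.

```python
def build_completion_prompt(solved_lines, open_line):
    """Join the completed lines and leave the last line open for the model to finish."""
    return "\n".join(list(solved_lines) + [open_line])


def strip_prompt(decoded, prompt):
    """Return only the newly generated text when the decoded output echoes the prompt."""
    return decoded[len(prompt):] if decoded.startswith(prompt) else decoded


prompt = build_completion_prompt(
    ["The capital of Mongolia is Ulaanbaatar", "The capital of Iceland is Reykjavik"],
    "The capital of Ethiopia is",
)
print(prompt)
```

The resulting string can be passed to `tokenizer(prompt, return_tensors='pt')` in place of the literal used above.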

#### 🤖 ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below:

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
# Different generation lengths, top_p, and other hyperparameters can be specified here.
model.generation_config = GenerationConfig.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)

# The prompts mean "Hello", "Where is the capital of Zhejiang?", and "What fun attractions does it have?"
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
response, history = model.chat(tokenizer, "浙江的省会在哪里？", history=history)
print(response)
response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)
print(response)
```
<br>

## Quantization

### Usage

We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release Int4 quantized models for Qwen-7B-Chat ([click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)) and Qwen-14B-Chat ([click here](https://huggingface.co/Qwen/Qwen-14B-Chat-Int4)), which achieve nearly lossless model quality with lower memory costs and faster inference.

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

```bash
pip install auto-gptq optimum
```

If you meet problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.

Then you can load the quantized model easily and run inference as usual:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "Hi", history=None)
```
-												release latest models

											
										
										
											1 year ago
-												update new version of quantization and inference efficiency profiling result

											
										
										
											1 year ago
### Performance

We compare the BF16 and Int4 models on several benchmarks and find that the quantized models do not suffer from significant performance degradation. Results are shown below:

| Quantization         | MMLU | C-Eval (val) | GSM8K | HumanEval |
|----------------------|:----:|:------------:|:-----:|:---------:|
| Qwen-7B-Chat (BF16)  | 53.9 |     54.2     | 41.1  |   24.4    |
| Qwen-7B-Chat (Int4)  | 52.6 |     52.9     | 38.1  |   23.8    |
| Qwen-14B-Chat (BF16) | 64.6 |     69.8     | 61.0  |   43.9    |
| Qwen-14B-Chat (Int4) | 63.3 |     69.0     | 59.8  |   45.7    |

### Inference Speed

We measured the average inference speed (tokens/s) when generating 2048 and 8192 tokens under BF16 precision and Int4 quantization.

| Quantization         | Speed (2048 tokens) | Speed (8192 tokens) |
|----------------------|:-------------------:|:-------------------:|
| Qwen-7B-Chat (BF16)  |        30.34        |        29.32        |
| Qwen-7B-Chat (Int4)  |        43.56        |        33.92        |
| Qwen-14B-Chat (BF16) |        30.70        |        21.73        |
| Qwen-14B-Chat (Int4) |        37.11        |        26.11        |

In detail, the profiling setting is to generate 8192 new tokens with 1 context token. Profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the 8192 generated tokens.
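
As a quick reading aid, the relative speedup of Int4 over BF16 can be computed directly from the speed table above. The snippet below is a minimal sketch; the dictionary layout and the helper name `speedup` are illustrative choices and are not part of the profiling script.

```python
# Tokens/s copied from the speed table: (2048 tokens, 8192 tokens).
SPEED = {
    "Qwen-7B-Chat": {"BF16": (30.34, 29.32), "Int4": (43.56, 33.92)},
    "Qwen-14B-Chat": {"BF16": (30.70, 21.73), "Int4": (37.11, 26.11)},
}


def speedup(model: str) -> tuple:
    """Return the Int4/BF16 throughput ratio for each generation length."""
    bf16, int4 = SPEED[model]["BF16"], SPEED[model]["Int4"]
    return tuple(round(q / b, 2) for q, b in zip(int4, bf16))


for name in SPEED:
    short, long_ = speedup(name)
    print(f"{name}: {short}x at 2048 tokens, {long_}x at 8192 tokens")
```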
### GPU Memory Usage

We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under BF16 precision and Int4 quantization. The results are shown below.

| Quantization         | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
|----------------------|:-----------------------------------:|:-------------------------------------:|
| Qwen-7B-Chat (BF16)  |               17.66GB               |                22.58GB                |
| Qwen-7B-Chat (Int4)  |               8.21GB                |                13.62GB                |
| Qwen-14B-Chat (BF16) |               30.15GB               |                38.94GB                |
| Qwen-14B-Chat (Int4) |               13.00GB               |                21.79GB                |

The above speed and memory profiling was conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
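
To put the memory figures in perspective, the share of peak memory saved by Int4 can be derived from the memory table. This is only a sketch over the published numbers; the helper name `memory_saving` and the data layout are assumptions made for illustration.

```python
# Peak GPU memory in GB copied from the memory table:
# (encoding 2048 tokens, generating 8192 tokens).
PEAK_GB = {
    "Qwen-7B-Chat": {"BF16": (17.66, 22.58), "Int4": (8.21, 13.62)},
    "Qwen-14B-Chat": {"BF16": (30.15, 38.94), "Int4": (13.00, 21.79)},
}


def memory_saving(model: str) -> tuple:
    """Percentage of peak memory saved by Int4 relative to BF16."""
    bf16, int4 = PEAK_GB[model]["BF16"], PEAK_GB[model]["Int4"]
    return tuple(round(100 * (1 - q / b), 1) for q, b in zip(int4, bf16))


for name in PEAK_GB:
    enc, gen = memory_saving(name)
    print(f"{name}: {enc}% saved when encoding, {gen}% saved when generating")
```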
<br><br>

## Finetuning

We provide the official training script, `finetune.py`, so that users can finetune the pretrained model for downstream applications in a simple fashion. We also provide shell scripts to launch finetuning. The training script supports [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The provided shell scripts use DeepSpeed, so we advise you to install DeepSpeed before you start.

To prepare your training data, put all the samples into a list and save it to a JSON file. Each sample is a dictionary consisting of an `id` and a list of conversation turns. Below is a simple example list with 1 sample (the user says "Hello" and the assistant answers "I am a language model, and my name is Tongyi Qianwen."):

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是一个语言模型，我叫通义千问。"
      }
    ]
  }
]
```
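
Writing this file by hand is error-prone (a stray trailing comma already makes the JSON invalid), so it can help to generate it from code. The sketch below builds and checks samples in the layout described above; the helper names `make_sample` and `validate` and the output filename `train_data.json` are placeholders, not part of the finetuning scripts.

```python
import json


def make_sample(sample_id: str, turns: list) -> dict:
    """Build one training sample from (user, assistant) text pairs."""
    conversations = []
    for user_text, assistant_text in turns:
        conversations.append({"from": "user", "value": user_text})
        conversations.append({"from": "assistant", "value": assistant_text})
    return {"id": sample_id, "conversations": conversations}


def validate(samples: list) -> None:
    """Raise ValueError if a sample does not match the layout described above."""
    for sample in samples:
        if not isinstance(sample.get("id"), str):
            raise ValueError("each sample needs a string 'id'")
        for turn in sample.get("conversations", []):
            if turn.get("from") not in ("user", "assistant") or "value" not in turn:
                raise ValueError(f"malformed turn in sample {sample['id']}")


samples = [make_sample("identity_0", [("你好", "我是一个语言模型，我叫通义千问。")])]
validate(samples)

# ensure_ascii=False keeps the Chinese text readable in the output file.
with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```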

After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, `$DATA`.

The finetuning scripts allow you to perform:

- Full-parameter finetuning
- LoRA
- Q-LoRA

Full-parameter finetuning updates all parameters throughout training. To launch your training, run the following script:

```bash
# Distributed training. We do not provide a single-GPU training script because insufficient GPU memory would break the training.
sh finetune/finetune_ds.sh
```

Remember to specify the correct model name or path, the data path, and the output directory in the shell scripts. Note that this script uses DeepSpeed ZeRO 3. If you want to change this, remove the `--deepspeed` argument or edit the DeepSpeed configuration JSON file to fit your requirements. The script also supports mixed-precision training, so you can use `--bf16 True` or `--fp16 True`. We advise you to use bf16 if your machine supports it, which keeps training consistent with our pretraining and alignment; it is therefore the default.

Similarly, to run LoRA, use the scripts shown below. Before you start, make sure that you have installed `peft`. You also need to specify the paths to your model, data, and output. We advise you to use an absolute path for your pretrained model, because LoRA saves only the adapter, and the path recorded in the adapter configuration JSON file is used to locate the pretrained model to load. This script supports both bf16 and fp16.

```bash
# Single GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

In comparison with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) updates only the parameters of the adapter layers and keeps the original large language model layers frozen. This greatly reduces memory cost and therefore computation cost. If you still run out of memory, consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model together with other techniques such as paged optimizers to reduce memory cost even further. To run Q-LoRA, run the following script:

```bash
# Single GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. Unlike full-parameter finetuning and LoRA, Q-LoRA supports only fp16.

Unlike full-parameter finetuning, LoRA and Q-LoRA training saves only the adapter parameters. Suppose your training starts from Qwen-7B; you can then load the finetuned model for inference as shown below:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter,  # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```
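
If a standalone checkpoint is preferred over a base model plus adapter, the LoRA weights can usually be folded back into the base model with `peft`'s `merge_and_unload()`. The sketch below assumes an adapter trained with plain LoRA (merging into a quantized Q-LoRA base is not covered here); the function name `merge_lora_adapter` and both path arguments are placeholders, and the tokenizer still has to be saved separately.

```python
def merge_lora_adapter(path_to_adapter: str, output_dir: str) -> None:
    """Merge a LoRA adapter into its base model and save the merged weights."""
    # Imported lazily so this module can be imported without peft installed.
    from peft import AutoPeftModelForCausalLM

    model = AutoPeftModelForCausalLM.from_pretrained(
        path_to_adapter,  # path to the finetuning output directory
        device_map="auto",
        trust_remote_code=True,
    ).eval()
    # merge_and_unload() returns the base model with the LoRA weights folded in.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)
```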

The shell scripts use `torchrun` to run single-GPU or multi-GPU training. For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine.
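
The distributed hyperparameters mentioned here map onto standard `torchrun` options. The sketch below assembles such a command line; it is not a copy of the provided shell scripts, and the default address `localhost` and port `6001` are placeholder values to be replaced with your own cluster settings.

```python
import shlex


def torchrun_command(script, nproc_per_node, nnodes=1, node_rank=0,
                     master_addr="localhost", master_port=6001, script_args=()):
    """Assemble a torchrun launch command as an argument list."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",  # GPUs used on this machine
        f"--nnodes={nnodes}",                  # number of machines
        f"--node_rank={node_rank}",            # index of this machine
        f"--master_addr={master_addr}",        # address of the rank-0 machine
        f"--master_port={master_port}",        # free port on the rank-0 machine
        script,
        *script_args,
    ]


cmd = torchrun_command("finetune.py", nproc_per_node=8)
print(shlex.join(cmd))
```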
<br><br>

## Demo

### Web UI

We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:

```bash
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```bash
python web_demo.py
```

<p align="center">
    <br>
    <img src="assets/web_demo.gif" width="600" />
    <br>
</p>

### CLI Demo

We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by entering prompts, and the model returns its outputs in streaming mode. Run the command below:

```bash
python cli_demo.py
```

<p align="center">
    <br>
    <img src="assets/cli_demo.gif" width="600" />
    <br>
</p>
<br>

## API

We provide methods to deploy a local API based on the OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:

```bash
pip install fastapi uvicorn openai pydantic sse_starlette
```

Then run the command to deploy your API:

```bash
python openai_api.py
```

You can change the arguments, e.g., `-c` for the checkpoint name or path, `--cpu-only` for CPU deployment, etc. If you run into problems launching your API deployment, updating the packages to the latest version can probably solve them.

Using the API is also simple. See the example below:

```python
# Note: `openai.ChatCompletion` is the interface of the pre-1.0 `openai` package.
import openai
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# Create a request with streaming response enabled
for chunk in openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}  # "Hello"
    ],
    stream=True
    # Specifying stop words in streaming output format is not yet supported and is under development.
):
    if hasattr(chunk.choices[0].delta, "content"):
        print(chunk.choices[0].delta.content, end="", flush=True)

# Create a request with streaming response disabled
response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}  # "Hello"
    ],
    stream=False,
    stop=[]  # You can add custom stop words here, e.g., stop=["Observation:"] for ReAct prompting.
)
print(response.choices[0].message.content)
```

<p align="center">
    <br>
    <img src="assets/openai_api.gif" width="600" />
    <br>
</p>

Function calling is also supported (but only when `stream=False` for the moment). See the [example usage](examples/function_call_examples.py).
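
For clients that do not use the `openai` package, a non-streaming function-calling request can be sent as plain JSON. The sketch below only builds the request. It assumes the server follows the OpenAI chat-completions layout (a `POST` to `/v1/chat/completions` with an optional `functions` list), and `get_current_weather` is a made-up placeholder function, not one taken from the example file.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # same base URL as in the API example

# Hypothetical function description in the OpenAI `functions` format.
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    }
]

payload = {
    "model": "Qwen",
    "messages": [{"role": "user", "content": "What is the weather like in Boston?"}],
    "functions": functions,
    "stream": False,  # function calling is only supported without streaming for now
}

request = urllib.request.Request(
    f"{API_BASE}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending requires a running server, so the call is left commented out:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"])
```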
<br><br>

## Deployment

Running the model on CPU is simple; you only need to specify the device:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
```

If you run short of GPU memory and would like to run the model on more than one GPU, you can use our provided script `utils.py`:

```python
from utils import load_model_on_gpus

model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2)
```

This runs the 7B chat model on 2 GPUs.
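
To give a rough idea of what splitting a model across GPUs involves, the sketch below builds an even layer-to-device map of the kind accepted by the `device_map` argument of `from_pretrained`. It is not the implementation of `load_model_on_gpus`. The module names (`transformer.wte`, `transformer.h.<i>`, `transformer.ln_f`, `lm_head`) and the layer count of 32 are assumptions about Qwen-7B, may be incomplete, and should be checked against the actual checkpoint.

```python
def even_device_map(num_layers: int, num_gpus: int) -> dict:
    """Assign transformer blocks to GPUs in contiguous, near-equal chunks."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    device_map = {"transformer.wte": 0}   # embeddings go on the first GPU
    for layer in range(num_layers):
        device_map[f"transformer.h.{layer}"] = layer // per_gpu
    last = (num_layers - 1) // per_gpu
    device_map["transformer.ln_f"] = last  # final norm follows the last block
    device_map["lm_head"] = last           # output head follows the last block
    return device_map


device_map = even_device_map(num_layers=32, num_gpus=2)
print(device_map["transformer.h.15"], device_map["transformer.h.16"])
```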
<br><br>

## Tool Usage

Qwen-Chat has been optimized for tool usage and function calling. Users can develop agents, LangChain applications, and even augment Qwen with a Python Code Interpreter.

We provide documentation on how to implement tool calls based on the principle of ReAct Prompting; please refer to [the ReAct example](examples/react_prompt.md). Based on this principle, we provide support for function calling in [openai_api.py](openai_api.py).
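
A ReAct-style tool call is emitted as plain text and has to be parsed by the caller before the tool can run. The sketch below assumes the model writes `Action:` and `Action Input:` lines and that generation is stopped at `Observation:`. The exact prompt wording lives in the linked example; the `parse_react` helper, the tool name, and the sample text are illustrative only.

```python
import json


def parse_react(text: str):
    """Return (tool_name, tool_args) from the last Action block, or None."""
    action_pos = text.rfind("\nAction:")
    input_pos = text.rfind("\nAction Input:")
    if action_pos == -1 or input_pos < action_pos:
        return None  # no complete tool call in this output
    # Drop anything written after the stop word, if it is present.
    end = text.find("\nObservation:", input_pos)
    if end == -1:
        end = len(text)
    name = text[action_pos + len("\nAction:"):input_pos].strip()
    raw_args = text[input_pos + len("\nAction Input:"):end].strip()
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        args = raw_args  # leave non-JSON input untouched
    return name, args


output = (
    "Thought: I should look up the weather.\n"
    "Action: get_weather\n"
    'Action Input: {"city": "Beijing"}\n'
    "Observation:"
)
print(parse_react(output))
```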

We have tested the model's tool calling capabilities on our open-source Chinese evaluation benchmark and found that Qwen-Chat consistently performs well:

<table>
 								    <tr>
 								        <th colspan="4" align="center">Chinese Tool-Use Benchmark</th>
 								    </tr>
 								    <tr>
 								        <th align="center">Model</th><th align="center">Tool Selection (Acc.↑)</th><th align="center">Tool Input (Rouge-L↑)</th><th align="center">False Positive Error↓</th>
 								    </tr>
 								    <tr>
 								        <td>GPT-4</td><td align="center">95%</td><td align="center">0.90</td><td align="center">15.0%</td>
 								    </tr>
 								    <tr>
 								        <td>GPT-3.5</td><td align="center">85%</td><td align="center">0.88</td><td align="center">75.0%</td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-7B-Chat</td><td align="center">98%</td><td align="center">0.91</td><td align="center">7.3%</td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-14B-Chat</td><td align="center">98%</td><td align="center">0.93</td><td align="center">2.4%</td>
 								    </tr>
 								</table>

To assess Qwen's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and other general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark designed specifically for evaluating these capabilities. You can find the benchmark at this [link](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark).

We have observed that Qwen performs well in terms of code executability and result accuracy when generating code:

 								<table>
 								    <tr>
 								        <th colspan="4" align="center">Executable Rate of Generated Code (%)</th>
 								    </tr>
 								    <tr>
 								        <th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization↑</th><th align="center">General↑</th>
 								    </tr>
 								    <tr>
 								        <td>GPT-4</td><td align="center">91.9</td><td align="center">85.9</td><td align="center">82.8</td>
 								    </tr>
 								    <tr>
 								        <td>GPT-3.5</td><td align="center">89.2</td><td align="center">65.0</td><td align="center">74.1</td>
 								    </tr>
 								    <tr>
 								        <td>LLaMA2-7B-Chat</td>
 								        <td align="center">41.9</td>
 								        <td align="center">33.1</td>
 								        <td align="center">24.1 </td>
 								    </tr>
 								    <tr>
 								        <td>LLaMA2-13B-Chat</td>
 								        <td align="center">50.0</td>
 								        <td align="center">40.5</td>
 								        <td align="center">48.3 </td>
 								    </tr>
 								    <tr>
 								        <td>CodeLLaMA-7B-Instruct</td>
 								        <td align="center">85.1</td>
 								        <td align="center">54.0</td>
 								        <td align="center">70.7 </td>
 								    </tr>
 								    <tr>
 								        <td>CodeLLaMA-13B-Instruct</td>
 								        <td align="center">93.2</td>
 								        <td align="center">55.8</td>
 								        <td align="center">74.1 </td>
 								    </tr>
 								    <tr>
 								        <td>InternLM-7B-Chat-v1.1</td>
 								        <td align="center">78.4</td>
 								        <td align="center">44.2</td>
 								        <td align="center">62.1 </td>
 								    </tr>
 								    <tr>
 								        <td>InternLM-20B-Chat</td>
 								        <td align="center">70.3</td>
 								        <td align="center">44.2</td>
 								        <td align="center">65.5 </td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-7B-Chat</td>
 								        <td align="center">82.4</td>
 								        <td align="center">64.4</td>
 								        <td align="center">67.2 </td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-14B-Chat</td>
 								        <td align="center">89.2</td>
 								        <td align="center">84.1</td>
 								        <td align="center">65.5</td>
 								    </tr>
 								</table>
 								<table>
 								    <tr>
 								        <th colspan="4" align="center">Accuracy of Code Execution Results (%)</th>
 								    </tr>
 								    <tr>
 								        <th align="center">Model</th><th align="center">Math↑</th><th align="center">Visualization-Hard↑</th><th align="center">Visualization-Easy↑</th>
 								    </tr>
 								    <tr>
 								        <td>GPT-4</td><td align="center">82.8</td><td align="center">66.7</td><td align="center">60.8</td>
 								    </tr>
 								    <tr>
 								        <td>GPT-3.5</td><td align="center">47.3</td><td align="center">33.3</td><td align="center">55.7</td>
 								    </tr>
 								    <tr>
 								        <td>LLaMA2-7B-Chat</td>
 								        <td align="center">3.9</td>
 								        <td align="center">14.3</td>
 								        <td align="center">39.2 </td>
 								    </tr>
 								    <tr>
 								        <td>LLaMA2-13B-Chat</td>
 								        <td align="center">8.3</td>
 								        <td align="center">8.3</td>
 								        <td align="center">40.5 </td>
 								    </tr>
 								    <tr>
 								        <td>CodeLLaMA-7B-Instruct</td>
 								        <td align="center">14.3</td>
 								        <td align="center">26.2</td>
 								        <td align="center">60.8 </td>
 								    </tr>
 								    <tr>
 								        <td>CodeLLaMA-13B-Instruct</td>
 								        <td align="center">28.2</td>
 								        <td align="center">27.4</td>
 								        <td align="center">62.0 </td>
 								    </tr>
 								    <tr>
 								        <td>InternLM-7B-Chat-v1.1</td>
 								        <td align="center">28.5</td>
 								        <td align="center">4.8</td>
 								        <td align="center">40.5 </td>
 								    </tr>
 								    <tr>
 								        <td>InternLM-20B-Chat</td>
 								        <td align="center">34.6</td>
 								        <td align="center">21.4</td>
 								        <td align="center">45.6 </td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-7B-Chat</td>
 								        <td align="center">41.9</td>
 								        <td align="center">40.5</td>
 								        <td align="center">54.4 </td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-14B-Chat</td>
 								        <td align="center">58.4</td>
 								        <td align="center">53.6</td>
 								        <td align="center">59.5</td>
 								    </tr>
 								</table>
 								<p align="center">
 								    <br>
 								    <img src="assets/code_interpreter_showcase_001.jpg" />
 								    <br>
</p>
 								In addition, we also provide experimental results demonstrating that our model is capable of acting as a HuggingFace Agent. For more information, please refer to the [example documentation](examples/transformers_agent.md). The model's performance on the evaluation dataset provided by Hugging Face is as follows:
 								<table>
 								    <tr>
        <th colspan="4" align="center">HuggingFace Agent Benchmark - Run Mode</th>
 								    </tr>
 								    <tr>
 								        <th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
 								    </tr>
 								    <tr>
 								        <td>GPT-4</td><td align="center">100</td><td align="center">100</td><td align="center">97.4</td>
 								    </tr>
 								    <tr>
 								        <td>GPT-3.5</td><td align="center">95.4</td><td align="center">96.3</td><td align="center">87.0</td>
 								    </tr>
 								    <tr>
 								        <td>StarCoder-Base-15B</td><td align="center">86.1</td><td align="center">87.0</td><td align="center">68.9</td>
 								    </tr>
 								    <tr>
 								        <td>StarCoder-15B</td><td align="center">87.0</td><td align="center">88.0</td><td align="center">68.9</td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-7B-Chat</td><td align="center">87.0</td><td align="center">87.0</td><td align="center">71.5</td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-14B-Chat</td><td align="center">93.5</td><td align="center">94.4</td><td align="center">87.0</td>
 								    </tr>
 								</table>
 								<table>
 								    <tr>
 								        <th colspan="4" align="center">HuggingFace Agent Benchmark - Chat Mode</th>
 								    </tr>
 								    <tr>
 								        <th align="center">Model</th><th align="center">Tool Selection↑</th><th align="center">Tool Used↑</th><th align="center">Code↑</th>
 								    </tr>
 								    <tr>
 								        <td>GPT-4</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">98.5</td>
 								    </tr>
 								    <tr>
 								        <td>GPT-3.5</td><td align="center">97.3</td><td align="center">96.8</td><td align="center">89.6</td>
 								    </tr>
 								    <tr>
 								        <td>StarCoder-Base-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">91.1</td>
 								    </tr>
 								    <tr>
 								        <td>StarCoder-15B</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">89.6</td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-7B-Chat</td><td align="center">94.7</td><td align="center">94.7</td><td align="center">85.1</td>
 								    </tr>
 								    <tr>
 								        <td>Qwen-14B-Chat</td><td align="center">97.9</td><td align="center">97.9</td><td align="center">95.5</td>
 								    </tr>
 								</table>
<br>

## Long-Context Understanding

To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, which extend the context length of Qwen-7B/14B from 2K to over 8K tokens, and of Qwen-7B from 8K to 32K tokens. We conduct language modeling experiments on the arXiv dataset, evaluated by perplexity (PPL), and find that Qwen performs well in long-context scenarios. Results are shown below:
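
Two of these scaling rules can be summarized in a few lines. The sketch below uses a common formulation of NTK-aware interpolation (enlarging the rotary base by `alpha ** (dim / (dim - 2))`) and of LogN scaling (multiplying queries by `log(n) / log(train_len)` beyond the training length). The constants (`dim=128`, `base=10000`, `train_len=2048`) and the rule for picking `alpha` are assumptions for illustration and may differ from the released modeling code.

```python
import math


def ntk_scaled_base(base: float, dim: int, alpha: float) -> float:
    """Enlarge the rotary base so low frequencies are stretched by about `alpha`."""
    return base * alpha ** (dim / (dim - 2))


def dynamic_alpha(seq_len: int, train_len: int) -> float:
    """Pick alpha from the current length (one possible dynamic rule: 1, 3, 7, ...)."""
    if seq_len <= train_len:
        return 1.0
    return float(2 ** math.ceil(math.log2(seq_len / train_len) + 1) - 1)


def logn_scale(position: int, train_len: int) -> float:
    """Attention scale applied to the query at 1-based `position`."""
    if position <= train_len:
        return 1.0
    return math.log(position) / math.log(train_len)


alpha = dynamic_alpha(seq_len=8192, train_len=2048)
print(alpha, ntk_scaled_base(10000.0, 128, alpha), logn_scale(8192, 2048))
```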

<table>
    <tr>
        <th rowspan="2">Model</th><th colspan="6" align="center">Sequence Length</th>
    </tr>
    <tr>
        <th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th><th align="center">32768</th>
    </tr>
    <tr>
        <td>Qwen-7B (original)</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td><td align="center">-</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td><td align="center">-</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk + logn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.56</td><td align="center">4.62</td><td align="center">-</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk + logn + window_attn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.49</td><td align="center">4.32</td><td align="center">-</td>
    </tr>
    <tr>
        <td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.33</b></td><td align="center"><b>3.22</b></td><td align="center"><b>3.17</b></td>
    </tr>
    <tr>
        <td>Qwen-14B</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center">22.79</td><td align="center">334.65</td><td align="center">3168.35</td><td align="center">-</td>
    </tr>
    <tr>
        <td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center"><b>3.29</b></td><td align="center"><b>3.18</b></td><td align="center">3.42</td><td align="center">-</td>
    </tr>
</table>

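
As a rough illustration of two of the techniques named above, here is a minimal sketch of dynamic NTK-aware scaling of the rotary base and of LogN attention scaling. It follows the commonly described formulations and is not necessarily the exact code in this repository. The function names, the rotary base of 10000, and the head dimension of 128 are assumptions made for the example. Window attention is not shown.

```python
import math


def ntk_alpha(seq_len: int, train_len: int) -> int:
    """Hypothetical helper: dynamic scaling factor for the rotary base.

    Stays at 1 within the training length and grows in steps
    (1, 3, 7, 15, ...) as the input gets longer.
    """
    if seq_len <= train_len:
        return 1
    context_value = math.log2(seq_len / train_len) + 1
    return max(2 ** math.ceil(context_value) - 1, 1)


def ntk_scaled_base(base: float, alpha: float, head_dim: int) -> float:
    """NTK-aware interpolation: enlarge the rotary base instead of
    compressing positions, so high-frequency components change little."""
    return base * alpha ** (head_dim / (head_dim - 2))


def rotary_inv_freq(base: float, head_dim: int) -> list[float]:
    """Inverse frequencies of rotary position embeddings for one head."""
    return [1.0 / base ** (i / head_dim) for i in range(0, head_dim, 2)]


def logn_scale(position: int, train_len: int) -> float:
    """LogN attention scaling: queries beyond the training length are
    multiplied by log_{train_len}(position); earlier ones are unchanged."""
    return math.log(position, train_len) if position > train_len else 1.0


if __name__ == "__main__":
    train_len, head_dim, base = 2048, 128, 10000.0  # assumed values
    for seq_len in (2048, 4096, 8192, 16384):
        alpha = ntk_alpha(seq_len, train_len)
        scaled = ntk_scaled_base(base, alpha, head_dim)
        print(seq_len, alpha, round(scaled, 1), round(logn_scale(seq_len, train_len), 4))
```

Both adjustments are the identity at or below the training length, which is consistent with the unchanged perplexity at 1024 and 2048 tokens in the table.
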
## Tokenizer

Our tokenizer is based on tiktoken and differs from other tokenizers such as the sentencepiece tokenizer. Pay particular attention to special tokens, especially when fine-tuning. For more details on the tokenizer and its use in fine-tuning, please refer to the [documentation](tokenization_note.md).
<br><br>

## Reproduction

We provide scripts for reproducing the model's performance on benchmark datasets. Check [eval/EVALUATION.md](eval/EVALUATION.md) for more information. Note that reproduced results may differ slightly from our reported results.
<br><br>

## FAQ

If you run into problems, please consult the [FAQ](FAQ.md) and the existing issues for a solution before opening a new issue.
<br><br>

## License Agreement

Researchers and developers are free to use the code and model weights of both Qwen and Qwen-Chat. Commercial use is also permitted. See [LICENSE](LICENSE) for details. If you intend to use the models commercially, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
<br><br>

## Contact Us

If you would like to leave a message for our research team or product team, join our Discord or WeChat groups! You can also email qianwen_opensource@alibabacloud.com.