diff --git a/README.md b/README.md
index 2a644fd..33b70f9 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-
+span
@@ -6,7 +6,7 @@
- Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  |  Demo  |  Report   |   Discord
+ Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  | Qwen-7B-Chat-Int4 🤗  |  Demo  |  Report   |   Discord
@@ -195,93 +196,65 @@ Our tokenizer based on tiktoken is different from other tokenizers, e.g., senten
## Quantization
-We provide examples to show how to load models in `NF4` and `Int8`. For starters, make sure you have implemented `bitsandbytes`. Note that the requirements for `bitsandbytes` are:
+### Usage
-```
-**Requirements** Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
-```
+**Note: we provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release an Int4 quantized model for Qwen-7B-Chat ([click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)), which achieves nearly lossless model quality while reducing memory costs and improving inference speed, in comparison with the previous solution.**
-Then run the following command to install `bitsandbytes`:
+Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of AutoGPTQ and install it from source (the code supporting Qwen is temporarily not included in the latest PyPI release):
+```bash
+git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
+pip install .
```
-pip install bitsandbytes
-```
-
-Windows users should find another option, which might be [bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels).
-Then you only need to add your quantization configuration to `AutoModelForCausalLM.from_pretrained`. See the example below:
+Then you can load the quantized model easily as shown below:
```python
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-# quantization configuration for NF4 (4 bits)
-quantization_config = BitsAndBytesConfig(
- load_in_4bit=True,
- bnb_4bit_quant_type='nf4',
- bnb_4bit_compute_dtype=torch.bfloat16
-)
+from auto_gptq import AutoGPTQForCausalLM
+model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
+```
-# quantization configuration for Int8 (8 bits)
-quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+Running inference is similar to the basic usage demonstrated above, but remember to pass in the generation configuration explicitly:
-model = AutoModelForCausalLM.from_pretrained(
- args.checkpoint_path,
- device_map="cuda:0",
- quantization_config=quantization_config,
- max_memory=max_memory,
- trust_remote_code=True,
-).eval()
+```python
+from transformers import GenerationConfig
+config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
+response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
```
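+For reference, a minimal end-to-end sketch combining the steps above might look as follows. Note that loading the tokenizer from the Int4 repository is an assumption here (the repo is expected to ship the tokenizer files as well):
+
+```python
+# Minimal sketch, not the official example: tokenizer + quantized model + chat.
+from auto_gptq import AutoGPTQForCausalLM
+from transformers import AutoTokenizer, GenerationConfig
+
+# Assumes Qwen/Qwen-7B-Chat-Int4 also contains the tokenizer files.
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
+model = AutoGPTQForCausalLM.from_quantized(
+    "Qwen/Qwen-7B-Chat-Int4",
+    device_map="auto",
+    trust_remote_code=True,
+    use_safetensors=True,
+).eval()
+config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
+
+response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
+print(response)
+```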
-With this method, it is available to load Qwen-7B in `NF4` and `Int8`, which saves you memory usage. We provide related statistics of model performance below. We find that the quantization downgrades the effectiveness slightly but significantly reduces memory costs.
+### Performance
-| Precision | MMLU | GPU Memory for Loading Model |
-| ----------- | :------: | :---------------------------: |
-| BF16 | 56.7 | 16.38G |
-| Int8 | 52.8 | 10.44G |
-| NF4 | 48.9 | 7.79G |
+We illustrate the performance of both the BF16 and Int4 models on benchmark evaluations and find that the quantized model does not suffer from significant degradation. Results are shown below:
-Note: The GPU memory usage profiling in the above table is performed on single A100-SXM4-80G GPU, PyTorch 2.0.1 and CUDA 11.8, with flash attention used.
-
-## Inference Efficiency
+| Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
+| -------------- | :----: | :-----------: | :-----: | :---------: |
+| BF16 | 53.9 | 54.2 | 41.1 | 24.4 |
+| Int4 | 52.6 | 52.9 | 38.1 | 23.8 |
### Inference Speed
-We measured the average inference speed of generating 2K tokens under BF16 precision and Int8 or NF4 quantization levels, respectively.
+We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization.
-| Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
-| ---------------------- | :----------------------------------------: | :---------------------------------------: |
-| BF16 (no quantization) | 30.06 | 27.55 |
-| Int8 (bnb) | 7.94 | 7.86 |
-| NF4 (bnb) | 21.43 | 20.37 |
+| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
+| -------------- | :-------------------: | :-------------------: |
+| BF16 | 30.53 | 28.51 |
+| Int4 | 45.60 | 33.83 |
-In detail, the setting of profiling is generating 2048 new tokens with 1 context token. The profiling runs on single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the generated 2048 tokens.
+In detail, the profiling setting is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the 8192 generated tokens.
### GPU Memory Usage
-We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int8/NF4 quantization levels, respectively. The results are shown below.
-
-When using flash attention, the memory usage is:
+We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under BF16 precision and Int4 quantization. The results are shown below.
-| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
-| ------------------ | :---------------------------------: | :-----------------------------------: |
-| BF16 | 18.11GB | 23.52GB |
-| Int8 | 12.17GB | 17.60GB |
-| NF4 | 9.52GB | 14.93GB |
-
-When not using flash attention, the memory usage is:
-
-| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
-| ------------------ | :---------------------------------: | :-----------------------------------: |
-| BF16 | 18.11GB | 24.40GB |
-| Int8 | 12.18GB | 18.47GB |
-| NF4 | 9.52GB | 15.81GB |
+| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
+| -------------- | :-----------------------------------: | :-------------------------------------: |
+| BF16 | 18.99GB | 24.40GB |
+| Int4 | 10.20GB | 15.61GB |
The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
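+The linked script remains the authoritative reference. As a rough, simplified sketch of the same methodology (reusing the `model` and `tokenizer` loaded in the quantization example above, and glossing over warm-up details), the measurement could be approximated like this:
+
+```python
+# Illustrative approximation of the profiling setup described above,
+# not the official profile.py: generate up to 8192 new tokens from a
+# short context and average the speed over the generated tokens.
+import time
+import torch
+
+input_ids = tokenizer("你好", return_tensors="pt").input_ids.to("cuda")
+torch.cuda.reset_peak_memory_stats()
+start = time.time()
+# A real measurement would also prevent early stopping at EOS,
+# which the official script handles.
+output = model.generate(input_ids, max_new_tokens=8192, do_sample=False)
+elapsed = time.time() - start
+
+new_tokens = output.shape[1] - input_ids.shape[1]
+print(f"speed: {new_tokens / elapsed:.2f} tokens/s")
+print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
+```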
## Demo
-
### Web UI
We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:
@@ -371,22 +344,22 @@ print(response.choices[0].message.content)
Qwen-7B-Chat is specifically optimized for tool usage, including API, database, models, etc., so that users can build their own Qwen-7B-based LangChain, Agent, and Code Interpreter. In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance.
-| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
-|:------------|:----------------------:|:----------------------:|:----------------------:|
-| GPT-4 | 95% | **0.90** | 15% |
-| GPT-3.5 | 85% | 0.88 | 75% |
-| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
+| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
+| :------------ | :-----------------------: | :----------------------: | :----------------------: |
+| GPT-4 | 95% | **0.90** | 15% |
+| GPT-3.5 | 85% | 0.88 | 75% |
+| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md). The use of tools can enable the model to better perform tasks.
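+As a rough illustration only (the actual prompt template is the one in the linked examples; the `web_search` tool below is hypothetical), a ReAct-style call could be structured like this:
+
+```python
+# Hypothetical ReAct-style prompt skeleton; see examples/react_prompt.md for
+# the template actually used by Qwen. "web_search" is a made-up tool name.
+react_prompt = """Answer the following question. You have access to this tool:
+
+web_search: Search the web for up-to-date information. Input: a search query.
+
+Use the following format:
+Question: the input question
+Thought: reason about what to do next
+Action: the tool to use
+Action Input: the input to the tool
+Observation: the result returned by the tool
+... (Thought/Action/Action Input/Observation can repeat)
+Final Answer: the final answer to the question
+
+Question: What is the weather like in Beijing today?"""
+
+# In practice, generation is stopped before "Observation:", the tool is executed,
+# and its output is appended before asking the model to continue.
+response, _ = model.chat(tokenizer, react_prompt, history=None)
+print(response)
+```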
Additionally, we provide experimental results to show its capabilities of playing as an agent. See [Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) for more information. Its performance on the run-mode benchmark provided by Hugging Face is as follows:
-| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
-|:---------------|:---------------:|:-----------:|:---------:|
-|GPT-4 | **100** | **100** | **97.41** |
-|GPT-3.5 | 95.37 | 96.30 | 87.04 |
-|StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
-| **Qwen-7B** | 90.74 | 92.59 | 74.07 |
+| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
+| :---------------- | :----------------: | :-----------: | :---------: |
+| GPT-4 | **100** | **100** | **97.41** |
+| GPT-3.5 | 95.37 | 96.30 | 87.04 |
+| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
+| **Qwen-7B** | 90.74 | 92.59 | 74.07 |
## Long-Context Understanding
diff --git a/README_CN.md b/README_CN.md
index af4d8f9..5e00be4 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -6,7 +6,7 @@
- Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  |  Demo  |  Report   |   Discord
+ Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  | Qwen-7B-Chat-Int4 🤗  |  Demo  |  Report   |   Discord
-
### 交互式Demo
我们提供了一个简单的交互式Demo示例,请查看`cli_demo.py`。当前模型已经支持流式输出,用户可通过输入文字的方式和Qwen-7B-Chat交互,模型将流式输出返回结果。运行如下命令:
diff --git a/README_JA.md b/README_JA.md
index f178493..008037c 100644
--- a/README_JA.md
+++ b/README_JA.md
@@ -6,7 +6,7 @@
- Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  |  Demo  |  Report   |   Discord + Qwen-7B 🤖 | 🤗  | Qwen-7B-Chat 🤖 | 🤗  | Qwen-7B-Chat-Int4 🤗  |  Demo  |  Report   |   Discord