
Quantization

GPTQ

We provide a solution based on AutoGPTQ and release the Int4 and Int8 quantized models, which achieve nearly lossless accuracy while reducing memory cost and improving inference speed.

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

pip install auto-gptq optimum

If you run into problems installing auto-gptq, we advise you to check out the official repo to find a wheel.

Note: The pre-compiled auto-gptq packages strongly depend on the version of torch and its CUDA version. Moreover, due to recent updates, you may also encounter unsupported version errors from transformers, optimum, or peft. We recommend using the latest versions that meet the following requirements:

  • torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
  • torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0

Then you can load the quantized model and run inference just as usual:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "Hi", history=None)

We evaluated the BF16, Int8, and Int4 models on the benchmarks and found that the quantized models do not suffer significant performance degradation. Results are shown below:

| Quantization           | MMLU | CEval (val) | GSM8K | HumanEval |
|------------------------|------|-------------|-------|-----------|
| Qwen-1.8B-Chat (BF16)  | 43.3 | 55.6        | 33.7  | 26.2      |
| Qwen-1.8B-Chat (Int8)  | 43.1 | 55.8        | 33.0  | 27.4      |
| Qwen-1.8B-Chat (Int4)  | 42.9 | 52.8        | 31.2  | 25.0      |
| Qwen-7B-Chat (BF16)    | 55.8 | 59.7        | 50.3  | 37.2      |
| Qwen-7B-Chat (Int8)    | 55.4 | 59.4        | 48.3  | 34.8      |
| Qwen-7B-Chat (Int4)    | 55.1 | 59.2        | 49.7  | 29.9      |
| Qwen-14B-Chat (BF16)   | 64.6 | 69.8        | 60.1  | 43.9      |
| Qwen-14B-Chat (Int8)   | 63.6 | 68.6        | 60.0  | 48.2      |
| Qwen-14B-Chat (Int4)   | 63.3 | 69.0        | 59.8  | 45.7      |
| Qwen-72B-Chat (BF16)   | 74.4 | 80.1        | 76.4  | 64.6      |
| Qwen-72B-Chat (Int8)   | 73.5 | 80.1        | 73.5  | 62.2      |
| Qwen-72B-Chat (Int4)   | 73.4 | 80.1        | 75.3  | 61.6      |

Quantization of KV cache

NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality (i.e., cache_autogptq_cuda_256.cpp and cache_autogptq_cuda_kernel_256.cu) may be missing. Please manually download them from the Hugging Face Hub and place them into the same folder as the other module files.

The attention KV cache can be quantized and compressed for storage, to achieve a higher sample throughput. The arguments use_cache_quantization and use_cache_kernel in config.json are provided to enable KV cache quantization. Usage is as follows:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False
)

Attention: Currently, KV cache quantization and flash attention cannot be used at the same time. If you enable both (use_flash_attn=True, use_cache_quantization=True, use_cache_kernel=True), use_flash_attn is disabled by default (use_flash_attn=False).

We have verified that the use of the quantized Int8-KV-Cache model does not suffer from significant performance degradation in downstream evaluation. In the following, we focus on profiling its memory footprint in different conditions. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. We use BF16 models to generate 1024 tokens by default, and "OOM" indicates out-of-memory error.

With KV cache quantization, the model can infer with a larger batch size (bs).

| Use KV Cache | bs=1   | bs=4   | bs=16  | bs=32  | bs=64  | bs=100 |
|--------------|--------|--------|--------|--------|--------|--------|
| No           | 16.3GB | 24.1GB | 31.7GB | 48.7GB | OOM    | OOM    |
| Yes          | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |

With KV cache quantization, the model can also save memory when generating longer sequences (sl, sequence length, referring to the number of tokens generated) at the inference stage.

| Use KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
|--------------|--------|---------|---------|---------|---------|
| No           | 15.2GB | 16.3GB  | 17.6GB  | 19.5GB  | 23.2GB  |
| Yes          | 15.0GB | 15.5GB  | 15.8GB  | 16.6GB  | 17.6GB  |
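For intuition, the savings above can be approximated from the model's shape. Below is a minimal sketch, assuming a Qwen-7B-like configuration (32 layers, hidden size 4096); the helper name and these figures are illustrative assumptions, not part of the repo's API:

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem):
    # Each layer stores a key and a value tensor of shape
    # (batch_size, seq_len, hidden_size); the factor 2 covers both.
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem

# Float16 KV cache vs. Int8 KV cache for 8192 generated tokens:
fp16_cache = kv_cache_bytes(32, 4096, 8192, 1, 2)  # ~4.3 GB
int8_cache = kv_cache_bytes(32, 4096, 8192, 1, 1)  # ~2.1 GB
```

Quantizing the cache to int8 halves its element storage (plus a small overhead for the per-tensor scales and zero points), which is consistent with the growing gap between the two rows as sl increases.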

The model with KV cache quantization converts the format of layer_past from float to int8; the quantized layer_past also stores the quantization parameters.

Specific steps are as follows:

  1. Quantize the key/value:

    qv, scale, zero_point = quantize_cache_v(v)

  2. Store them into layer_past

The following is the format of quantized layer_past:

    layer_past=((q_key,key_scale,key_zero_point),
                (q_value,value_scale,value_zero_point))

The original format of layer_past is shown below:

    layer_past=(key,value)

If you want to use the quantized attention KV, you can apply the dequantization operation to convert the Int8 key/value back to the float format as follows:

    v=dequantize_cache_torch(qv,scale,zero_point)
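The quantize/dequantize pair above can be understood as plain asymmetric int8 quantization. The following is a NumPy sketch of the idea; the function names and the per-row min/max scheme are assumptions for illustration, and the repo's actual quantize_cache_v / dequantize_cache_torch CUDA kernels may differ in detail:

```python
import numpy as np

def quantize_cache(x):
    # Asymmetric int8 quantization (illustrative): map each row's
    # [min, max] range onto the int8 range [-128, 127].
    mn = x.min(axis=-1, keepdims=True)
    mx = x.max(axis=-1, keepdims=True)
    scale = np.maximum((mx - mn) / 255.0, 1e-8)   # guard against constant rows
    zero_point = np.round(-128.0 - mn / scale)
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_cache(q, scale, zero_point):
    # Inverse mapping from int8 back to float.
    return (q.astype(np.float32) - zero_point) * scale
```

Round-tripping a tensor through this pair reproduces it to within about one quantization step, which is why the Int8-KV-Cache model shows little degradation in downstream evaluation.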