Inference Qwen Using TensorRT-LLM

Below, we provide a simple example showing how to run inference on Qwen with TensorRT-LLM. We recommend GPUs with a compute capability of at least SM_80, such as the A10 and A800, since these are the GPUs we have tested this example on. You can look up your GPU's compute capability at this link.
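If you are unsure about your hardware, one quick way to check the compute capability from Python (assuming PyTorch is available, as it is in the Docker image below) is:

python3 -c "import torch; print(torch.cuda.get_device_capability())"  # e.g. (8, 0) for SM_80

For reference, (8, 6) corresponds to SM_86 (A10) and (8, 0) to SM_80 (A800/A100).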

Installation

You can use the pre-built Docker image below to run this example. Alternatively, you can refer to the official TensorRT-LLM repository for installation instructions and detailed usage.

docker run --gpus all -it --ipc=host --network=host pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:tensorrt-llm-0.8.0 bash
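Once inside the container, a quick sanity check confirms the installation (the image tag suggests it ships TensorRT-LLM 0.8.0, so that is the version you should expect to see):

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"  # expect 0.8.0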

Quickstart

  1. Download the model with ModelScope
cd TensorRT-LLM/examples/qwen
# download Qwen-1_8B-Chat from ModelScope into the current directory
python3 -c "from modelscope.hub.snapshot_download import snapshot_download; snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')"
# move the checkpoint to the path the build step expects
mkdir -p ./tmp/Qwen
mv Qwen/Qwen-1_8B-Chat ./tmp/Qwen/1_8B
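Alternatively, if you prefer downloading from the Hugging Face Hub instead of ModelScope, a minimal sketch (assuming huggingface_hub is installed and the Qwen/Qwen-1_8B-Chat repository is reachable from your network) is:

# download directly into the target directory, skipping the mv step above
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen-1_8B-Chat', local_dir='./tmp/Qwen/1_8B')"

Either way, the checkpoint should end up under ./tmp/Qwen/1_8B.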
  2. Build a TensorRT engine from the HF checkpoint
python3 build.py --hf_model_dir ./tmp/Qwen/1_8B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu/
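If the build succeeds, the output directory should contain the serialized engine together with its configuration; a quick check (exact filenames may vary across TensorRT-LLM versions):

ls ./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu/
# expect a config.json and one *.engine file for this single-GPU build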
  3. Inference
python3 ../run.py --input_text "你好,请问你叫什么?" \
                  --max_output_len=512 \
                  --tokenizer_dir ./tmp/Qwen/1_8B/ \
                  --engine_dir=./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "你好,我是来自阿里云的大规模语言模型,我叫通义千问。"
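The input above asks, in Chinese, "Hello, may I ask what your name is?", and the model replies "Hello, I am a large language model from Alibaba Cloud; my name is Qwen (Tongyi Qianwen)." You can pass an English prompt just as well; for example, reusing the same engine and tokenizer (the prompt text here is purely illustrative):

python3 ../run.py --input_text "What is the capital of France?" \
                  --max_output_len=128 \
                  --tokenizer_dir ./tmp/Qwen/1_8B/ \
                  --engine_dir=./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu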