# Qwen Inference Using TensorRT-LLM
Below, we provide a simple example showing how to run inference on Qwen with TensorRT-LLM. We recommend GPUs with a compute capability of at least SM_80, such as the A10 and A800, which are the GPUs we have tested this example on. You can look up your GPU's compute capability at this link.
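If PyTorch is available in your environment (it is a dependency of TensorRT-LLM, so the Docker image below ships with it), you can also check the compute capability programmatically; a minimal sketch:

```python
# Minimal sketch: query the compute capability of GPU 0 via PyTorch.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"SM_{major}{minor}")  # e.g. SM_80 on an A800
```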
## Installation
You can use the pre-built Docker image below to run this example. Alternatively, you can refer to the official TensorRT-LLM repository for installation instructions and detailed usage.
```bash
docker run --gpus all -it --ipc=host --network=host pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:tensorrt-llm-0.8.0 bash
```
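Once inside the container, a quick way to confirm the runtime is usable is to import the package and print its version (a sketch; the expected version is assumed from the image tag above):

```python
# Sanity check inside the container: TensorRT-LLM should import cleanly.
import tensorrt_llm

print(tensorrt_llm.__version__)  # expected to match the image tag, 0.8.0
```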
## Quickstart
- Download the model via ModelScope (a quick sanity check of the download follows the commands):

```bash
cd TensorRT-LLM/examples/qwen
python3 -c "from modelscope.hub.snapshot_download import snapshot_download; snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')"
mkdir -p ./tmp/Qwen
mv Qwen/Qwen-1_8B-Chat ./tmp/Qwen/1_8B
```
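As a quick sanity check of the download, you can load the tokenizer from the local checkpoint. Qwen checkpoints ship custom tokenizer code, so trust_remote_code=True is required; a minimal sketch assuming transformers is installed:

```python
# Sketch: verify the downloaded checkpoint by loading its tokenizer.
# Qwen ships custom tokenizer code, hence trust_remote_code=True.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tmp/Qwen/1_8B/", trust_remote_code=True)
print(tokenizer.encode("你好,请问你叫什么?")[:8])  # first few token ids
```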
- Build the TensorRT engine from the Hugging Face checkpoint (the flags are explained below):

```bash
python3 build.py --hf_model_dir ./tmp/Qwen/1_8B/ \
                 --dtype float16 \
                 --remove_input_padding \
                 --use_gpt_attention_plugin float16 \
                 --enable_context_fmha \
                 --use_gemm_plugin float16 \
                 --output_dir ./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu/
```
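The plugin flags here select FP16 kernels: --use_gpt_attention_plugin and --use_gemm_plugin swap in TensorRT-LLM's fused attention and GEMM plugins, --enable_context_fmha enables fused multi-head attention during the context (prefill) phase, and --remove_input_padding packs variable-length inputs without padding tokens. The serialized engine is written to --output_dir, which is what run.py loads in the next step.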
- Run inference:

```bash
python3 ../run.py --input_text "你好,请问你叫什么?" \
                  --max_output_len=512 \
                  --tokenizer_dir ./tmp/Qwen/1_8B/ \
                  --engine_dir=./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu
```
The expected output looks like:

```text
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "你好,我是来自阿里云的大规模语言模型,我叫通义千问。"
```

(The prompt asks "Hello, may I ask what your name is?"; the model replies "Hello, I am a large language model from Alibaba Cloud. My name is Tongyi Qianwen.")
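run.py is a thin wrapper over the TensorRT-LLM Python runtime. For reference, here is a minimal sketch of the same flow using the ModelRunner API, under the assumption that the tensorrt_llm 0.8.0 runtime exposes it as in the official examples (in particular, im_end_id as Qwen's end-of-turn token attribute is an assumption); the chat template matches the logged input above:

```python
# Sketch: programmatic inference with the engine built above,
# assuming the tensorrt_llm 0.8.0 ModelRunner API.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("./tmp/Qwen/1_8B/", trust_remote_code=True)
runner = ModelRunner.from_dir(engine_dir="./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu")

# Qwen chat template, exactly as shown in the logged input above.
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
          "<|im_start|>user\n你好,请问你叫什么?<|im_end|>\n"
          "<|im_start|>assistant\n")
input_ids = torch.tensor(tokenizer.encode(prompt), dtype=torch.int32)

# im_end_id closes a chat turn; it is used here as both end and pad id
# (an assumption about the Qwen tokenizer's attributes).
output_ids = runner.generate([input_ids],
                             max_new_tokens=512,
                             end_id=tokenizer.im_end_id,
                             pad_id=tokenizer.im_end_id)

# Output shape is [batch, num_beams, seq_len]; drop the prompt and decode.
generated = output_ids[0][0][len(input_ids):]
print(tokenizer.decode(generated.tolist(), skip_special_tokens=True))
```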