Inference Qwen Using TensorRT-LLM

Below, we provide a simple example showing how to run inference on Qwen with TensorRT-LLM. We recommend GPUs with a compute capability of at least SM_80, such as the A10 and A800, since these are the GPUs we have tested this example on. You can look up your GPU's compute capability at this link.
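If you are unsure about your hardware, one quick way to check the compute capability from Python (assuming PyTorch is available, as it is in the Docker image below) is:

python3 -c "import torch; print(torch.cuda.get_device_capability())"  # e.g. (8, 0) for SM_80

For reference, (8, 6) corresponds to SM_86 (A10) and (8, 0) to SM_80 (A800/A100).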

Installation

You can use the pre-built Docker image below to run this example. Alternatively, you can refer to the official TensorRT-LLM repository for installation instructions and detailed usage.

docker run --gpus all -it --ipc=host --network=host pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:tensorrt-llm-0.8.0 bash
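Once inside the container, a quick sanity check confirms the installation (the image tag suggests it ships TensorRT-LLM 0.8.0, so that is the version you should expect to see):

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"  # expect 0.8.0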

Quickstart

  1. Download the model with ModelScope
cd TensorRT-LLM/examples/qwen
# download Qwen-1_8B-Chat from ModelScope into the current directory
python3 -c "from modelscope.hub.snapshot_download import snapshot_download; snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')"
# move the checkpoint to the path the build step expects
mkdir -p ./tmp/Qwen
mv Qwen/Qwen-1_8B-Chat ./tmp/Qwen/1_8B
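Alternatively, if you prefer downloading from the Hugging Face Hub instead of ModelScope, a minimal sketch (assuming huggingface_hub is installed and the Qwen/Qwen-1_8B-Chat repository is reachable from your network) is:

# download directly into the target directory, skipping the mv step above
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen-1_8B-Chat', local_dir='./tmp/Qwen/1_8B')"

Either way, the checkpoint should end up under ./tmp/Qwen/1_8B.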
  2. Build a TensorRT engine from the HF checkpoint
python3 build.py --hf_model_dir ./tmp/Qwen/1_8B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu/
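If the build succeeds, the output directory should contain the serialized engine together with its configuration; a quick check (exact filenames may vary across TensorRT-LLM versions):

ls ./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu/
# expect a config.json and one *.engine file for this single-GPU build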
  3. Inference
python3 ../run.py --input_text "你好,请问你叫什么?" \
                  --max_output_len=512 \
                  --tokenizer_dir ./tmp/Qwen/1_8B/ \
                  --engine_dir=./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "你好,我是来自阿里云的大规模语言模型,我叫通义千问。"
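The input above asks, in Chinese, "Hello, may I ask what your name is?", and the model replies "Hello, I am a large language model from Alibaba Cloud; my name is Qwen (Tongyi Qianwen)." You can pass an English prompt just as well; for example, reusing the same engine and tokenizer (the prompt text here is purely illustrative):

python3 ../run.py --input_text "What is the capital of France?" \
                  --max_output_len=128 \
                  --tokenizer_dir ./tmp/Qwen/1_8B/ \
                  --engine_dir=./tmp/Qwen/1_8B/trt_engines/fp16/1-gpu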