Qwen-7B 🤖 | 🤗  |  Qwen-7B-Chat 🤖 | 🤗  |  Demo  |  Report  |  Discord
中文  |  English  |  日本語
Japanese document maintainer: Ikko Eltociear Ashimine
We open-source **Qwen-7B** and **Qwen-7B-Chat** on both **🤖 ModelScope** and **🤗 Hugging Face** (click the logos at the top to go to the repositories with the code and checkpoints). This repo includes a brief introduction to Qwen-7B, usage guidance, and a technical memo [link](tech_memo.md) that provides more detailed information.
Qwen-7B is the 7B-parameter version of Qwen (abbr. Tongyi Qianwen), the large language model series proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web text, books, and code. In addition, we release Qwen-7B-Chat, a large-model-based AI assistant built on the pretrained Qwen-7B and trained with alignment techniques. The features of the Qwen-7B series are as follows:
1. **Trained on high-quality pretraining data**. Qwen-7B is pretrained on a large-scale, high-quality dataset of over 2.2 trillion tokens. The dataset includes plain text and code, and covers a wide range of domains, including general-domain and professional-domain data.
2. **Strong performance**. On a series of benchmark datasets that evaluate natural language understanding, mathematics, coding, and more, Qwen-7B outperforms competitors of similar model size.
3. **Better language support**. The Qwen-7B tokenizer is based on a vocabulary of over 150K tokens and is more efficient than other tokenizers. It covers many languages, which helps users further fine-tune Qwen-7B for a specific language.
4. **Support for 8K context length**. Both Qwen-7B and Qwen-7B-Chat support a context length of 8K, which allows inputs with long contexts.
5. **Plugin support**. Qwen-7B-Chat is trained with plugin-related alignment data, so it can use tools such as APIs, models, and databases, and can act as an agent.
The following sections contain information you may find helpful. In particular, we recommend reading the FAQ section before opening an issue.
## News
* 2023.8.3 We released Qwen-7B and Qwen-7B-Chat on both ModelScope and Hugging Face. We also provide a technical memo with more details about the model, including training details and model performance.
## Performance
In general, Qwen-7B outperforms baseline models of similar size, and even larger models of around 13B parameters, on a series of benchmark datasets such as MMLU, C-Eval, GSM8K, HumanEval, WMT22, and CMMLU, which evaluate natural language understanding, mathematical problem solving, coding, and more. The one exception in the table below is C-Eval, where ChatGLM2-12B scores higher. See the results below.
| Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
| :---------------- | :------------: | :------------: | :------------: | :------------: | :------------: |:------------: |
| LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
| LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
| Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
| InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
| Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
| LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
| LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
| ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - |
| **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** |
In addition, according to the third-party evaluation of large language models conducted by [OpenCompass](https://opencompass.org.cn/leaderboard-llm), Qwen-7B and Qwen-7B-Chat rank at the top among 7B-parameter models. This evaluation consists of a large number of public benchmarks covering language understanding and generation, coding, mathematics, reasoning, and more.
For more experimental results (detailed model performance on more benchmark datasets) and further details, please refer to the technical memo by clicking [here](tech_memo.md).
## Requirements
* python 3.8 and above
* pytorch 1.12 and above; 2.0 and above is recommended
* CUDA 11.4 and above is recommended (for GPU users, flash-attention users, etc.)
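The version floors above can be checked before installing anything heavy. The following is a minimal sketch and not part of the repository: `parse_version` and `meets_minimum` are hypothetical helper names, and the comparison only looks at the leading numeric components of a version string, so a local suffix such as `+cu118` is ignored.

```python
import re
import sys

def parse_version(version: str) -> tuple:
    """Extract the leading numeric components, e.g. '2.0.1+cu118' -> (2, 0, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version found in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))

def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that '2.0' and '2.0.0' compare as equal
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need

python_version = ".".join(str(part) for part in sys.version_info[:3])
print("python >= 3.8:", meets_minimum(python_version, "3.8"))
```

Once PyTorch is installed, the same helper could compare `torch.__version__` against `1.12`, or `torch.version.cuda` against `11.4`.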
## Quickstart
Below, we provide simple examples showing how to use Qwen-7B with 🤖 ModelScope and 🤗 Transformers.
Before running the code, make sure you have set up the environment and installed the required packages. Check that you meet the requirements above, then install the dependent libraries.
```bash
pip install -r requirements.txt
```
If your device supports fp16 or bf16, we recommend installing [flash-attention](https://github.com/Dao-AILab/flash-attention) for higher efficiency and lower memory usage. (**flash-attention is optional, and the project runs normally without it**)
```bash
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# The following are optional. Installing them may take a while.
# pip install csrc/layer_norm
# pip install csrc/rotary
```
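Because flash-attention is optional, it can be handy to check whether it is importable in the current environment. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and `flash_attn` is assumed to be the import name of the package installed above.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("flash_attn"):
    print("flash-attention found")
else:
    print("flash-attention not found; the project still runs without it")
```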
Now you can start with ModelScope or Transformers.
#### 🤗 Transformers
To use Qwen-7B-Chat for inference, all you need is a few lines of code, as shown below. **Please make sure you are using the latest code.**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Note: by default, injection attack prevention is turned off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use CPU only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, which selects the precision automatically based on the device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 1st dialogue turn (prompt: "Hello")
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# Hello! I'm glad to be of help.
# 2nd dialogue turn (prompt: "Tell me a story about a young person who struggles to start a business and finally succeeds.")
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# This is the story of a young man who struggled to start his own business and eventually succeeded.
# The protagonist is Li Ming, who was born into an ordinary family with parents who are ordinary workers. Since childhood, Li Ming has aimed to succeed as an entrepreneur.
# To achieve this goal, Li Ming studied hard and got into university. During his university years he actively took part in various entrepreneurship competitions and won many awards. He also used his spare time for internships and gained valuable experience.
# After graduating, Li Ming decided to start a business. He began looking for investors but was rejected many times. He did not give up. He kept working hard, improved his business plan, and looked for new investment opportunities.
# Eventually Li Ming secured investment and started his own business. He founded a technology company focused on developing a new type of software. Under his leadership the company grew rapidly and became a successful technology company.
# Li Ming's success was no accident. He is diligent, resilient, and adventurous, and he keeps learning and improving himself. His success also shows that anyone can succeed with enough effort.
# 3rd dialogue turn (prompt: "Give this story a title")
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# "Striving to Start a Business: One Young Man's Road to Success"
```
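The calls above thread `history` from one turn into the next. The sketch below illustrates that pattern with a stand-in for the model, so it runs without downloading weights. `fake_chat` and `run_dialogue` are hypothetical, and representing `history` as a list of `(query, response)` pairs is an assumption about the data shape rather than something this README specifies.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]

def fake_chat(query: str, history: Optional[History] = None) -> Tuple[str, History]:
    """Hypothetical stand-in for model.chat: echoes the query and extends the history."""
    history = list(history or [])  # copy so the caller's list is not mutated
    response = f"(reply {len(history) + 1}) {query}"
    history.append((query, response))
    return response, history

def run_dialogue(chat_fn: Callable, queries: List[str]) -> History:
    """Feed each query in order, passing the accumulated history to the next turn."""
    history: Optional[History] = None
    for query in queries:
        response, history = chat_fn(query, history=history)
        print(response)
    return history or []

turns = run_dialogue(fake_chat, ["Hello", "Tell me a story", "Give it a title"])
print(len(turns))  # number of completed turns
```

With a real model, `chat_fn` could be something like `lambda q, history: model.chat(tokenizer, q, history=history)`.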
Running the pretrained Qwen-7B base model is just as simple.
Running Qwen-7B
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use CPU only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, which selects the precision automatically based on the device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# Few-shot prompt: two completed lines, then an unfinished one for the model to continue
inputs = tokenizer('The capital of Mongolia is Ulaanbaatar\nThe capital of Iceland is Reykjavik\nThe capital of Ethiopia is', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# The capital of Mongolia is Ulaanbaatar\nThe capital of Iceland is Reykjavik\nThe capital of Ethiopia is Addis Ababa...
```
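The example above prompts the pretrained base model with a few-shot pattern: a couple of completed lines followed by an unfinished one. Here is a minimal sketch of assembling such a prompt; `build_few_shot_prompt` is a hypothetical helper, and the sentence template is only one possible choice.

```python
from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str,
                          template: str = "The capital of {} is") -> str:
    """Join completed examples and end with an unfinished line for the model to continue."""
    lines = [f"{template.format(country)} {capital}" for country, capital in examples]
    lines.append(template.format(query))
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [("Mongolia", "Ulaanbaatar"), ("Iceland", "Reykjavik")],
    "Ethiopia",
)
print(prompt)
```

The resulting string can be passed to `tokenizer(...)` in place of the literal prompt.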
#### 🤖 ModelScope
ModelScope is an open-source platform for Model-as-a-Service (MaaS) that provides flexible and cost-effective model services to AI developers. Similarly, you can run the models with ModelScope as shown below:
```python
import os
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope import snapshot_download
model_id = 'QWen/qwen-7b-chat'
revision = 'v1.0.0'
model_dir = snapshot_download(model_id, revision)
# Build a chat pipeline from the downloaded snapshot
pipe = pipeline(
    task=Tasks.chat, model=model_dir, device_map='auto')
history = None
# 1st turn
text = 'What is the capital of Zhejiang Province?'
results = pipe(text, history=history)
response, history = results['response'], results['history']
print(f'Response: {response}')
# 2nd turn, reusing the history returned by the 1st turn
text = 'What makes it so interesting?'
results = pipe(text, history=history)
response, history = results['response'], results['history']
print(f'Response: {response}')
```
## Tokenizer
Our tokenizer, which is based on tiktoken, differs from other tokenizers such as the sentencepiece tokenizer. You need to pay attention to special tokens, especially when fine-tuning. For more detailed information about the tokenizer and its use in fine-tuning, please refer to the [documentation](tokenization_note.md).
## Quantization
We provide examples showing how to load models in `NF4` and `Int8`. To begin with, make sure `bitsandbytes` is installed. The requirements for `bitsandbytes` are:
```
**Requirements** Python >= 3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
```
Then run the following command to install `bitsandbytes`:
```
pip install bitsandbytes
```
Windows users need to find an alternative option, such as [bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels).
After that, you only need to add your quantization configuration to `AutoModelForCausalLM.from_pretrained`. See the example below:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Checkpoint to load: a Hub model ID or a local checkpoint directory
checkpoint_path = "Qwen/Qwen-7B-Chat"
# Quantization configuration for NF4 (4 bits)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Quantization configuration for Int8 (8 bits); uncomment to use it instead of NF4
# quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# Optionally pass max_memory=... to cap per-device memory
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path,
    device_map="cuda:0",
    quantization_config=quantization_config,
    trust_remote_code=True,
).eval()
```
With this method, you can load Qwen-7B in `NF4` or `Int8`, which saves memory. Related statistics on model performance are shown below. Quantization lowers MMLU accuracy somewhat but substantially reduces the GPU memory needed to load the model.
| Precision | MMLU | GPU Memory for Loading Model |
| ----------- | :------: | :---------------------------: |
| BF16 | 56.7 | 16.38G |
| Int8 | 52.8 | 10.44G |
| NF4 | 48.9 | 7.79G |
Note: The GPU memory usage profiling in the table above was performed on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8, with flash attention enabled.
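The trade-off in the table can be made concrete with a little arithmetic. The sketch below recomputes, from the numbers reported above, how much loading memory each quantization level saves relative to BF16 and how many MMLU points it costs. The variable and function names are illustrative.

```python
# Values copied from the table above: precision -> (MMLU, GPU memory to load the model)
RESULTS = {
    "BF16": (56.7, 16.38),
    "Int8": (52.8, 10.44),
    "NF4": (48.9, 7.79),
}

def compare_to_bf16(precision: str) -> tuple:
    """Return (memory saved in percent, MMLU points lost) relative to BF16."""
    base_mmlu, base_mem = RESULTS["BF16"]
    mmlu, mem = RESULTS[precision]
    saved_pct = 100.0 * (base_mem - mem) / base_mem
    return round(saved_pct, 1), round(base_mmlu - mmlu, 1)

for name in ("Int8", "NF4"):
    saved, lost = compare_to_bf16(name)
    print(f"{name}: {saved}% less memory, {lost} MMLU points lower")
```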
## Inference Efficiency
### Inference Speed
We measured the average inference speed when generating 2K tokens under BF16 precision and under Int8 or NF4 quantization, respectively.
| Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
| ------ | :---------------------------: | :---------------------------: |
| BF16 (no quantization) | 30.06 | 27.55 |
| Int8 (bnb) | 7.94 | 7.86 |
| NF4 (bnb) | 21.43 | 20.37 |
In detail, the profiling setup generates 2048 new tokens from 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the 2048 generated tokens.
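Tokens per second translate directly into wall-clock time. As a rough illustration using the flash_attn column above, the sketch below estimates how long generating 2048 tokens takes at each reported speed. It assumes the average speed holds for the whole generation, and the names are illustrative.

```python
# Speeds with flash_attn, copied from the table above (tokens per second)
SPEED_TOKENS_PER_S = {"BF16": 30.06, "Int8": 7.94, "NF4": 21.43}

def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Estimated generation time, assuming a constant average speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second

for level, speed in SPEED_TOKENS_PER_S.items():
    print(f"{level}: {seconds_to_generate(2048, speed):.1f} s for 2048 tokens")
```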
### GPU Memory Usage
We also profiled the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context), under BF16 precision or Int8/NF4 quantization. The results are shown below.
With flash attention, the memory usage is:
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| --- | :---: | :---: |
| BF16 | 18.11GB | 23.52GB |
| Int8 | 12.17GB | 17.60GB |
| NF4 | 9.52GB | 14.93GB |
Without flash attention, the memory usage is:
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| --- | :---: | :---: |
| BF16 | 18.11GB | 24.40GB |
| Int8 | 12.18GB | 18.47GB |
| NF4 | 9.52GB | 15.81GB |
The speed and memory profiling above was conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
## Demo
### CLI Demo
We provide a CLI demo in `cli_demo.py`. Users can interact with Qwen-7B-Chat by entering prompts, and the model returns its output in streaming mode. Run the command below:
```
python cli_demo.py
```
### Web UI
We provide code for building a web UI demo (thanks to @wysaid). Before you start, make sure the following packages are installed:
```
pip install -r requirements_web_demo.txt
```
Then run the command below and click on the generated link:
```
python web_demo.py
```
## API
We provide a way to deploy a local API based on the OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:
```bash
pip install fastapi uvicorn openai pydantic sse_starlette
```
Then run the command to deploy your API:
```bash
python openai_api.py
```
You can change the arguments, e.g. `-c` for the checkpoint name or path, `--cpu-only` for CPU deployment, and so on. If you run into problems when launching the API deployment, updating the packages to the latest version may solve them.
Using the API is also simple. See the example below:
```python
import openai
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"
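# Stream a chat completion from the locally deployed server.
# openai.ChatCompletion is the interface of the legacy (pre-1.0) openai Python client.
for chunk in openai.ChatCompletion.create(
    model="Qwen-7B",
    messages=[
        {"role": "user", "content": "你好"}
    ],
    stream=True
):
    if hasattr(chunk.choices[0].delta, "content"):
        print(chunk.choices[0].delta.content, end="", flush=True)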
```
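When streaming, each chunk carries only a fragment of the reply. The sketch below shows one way to assemble the fragments into the full text. To stay runnable without a server it operates on plain dictionaries; the chunk layout (`choices[0].delta.content`) mirrors the attribute access in the example above, and `collect_stream` and the sample chunks are hypothetical.

```python
from typing import Dict, Iterable

def collect_stream(chunks: Iterable[Dict]) -> str:
    """Concatenate the content fragments of streamed chat-completion chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        content = delta.get("content")
        if content:  # role-only or empty deltas carry no text
            parts.append(content)
    return "".join(parts)

# Hand-written chunks standing in for a server response
sample_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
    {"choices": [{"delta": {}}]},
]
print(collect_stream(sample_chunks))
```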
## Tool Usage
Qwen-7B-Chat is specifically optimized for tool usage, including APIs, databases, and models, so users can build their own Qwen-7B-based LangChain applications, agents, and code interpreters. On our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, Qwen-7B achieves stable performance.
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
|:------------|:----------------------:|:----------------------:|:----------------------:|
| GPT-4 | 95% | **0.90** | 15% |
| GPT-3.5 | 85% | 0.88 | 75% |
| **Qwen-7B** | **99%** | 0.89 | **9.7%** |
For how to write and use ReAct prompts, please refer to [the ReAct examples](examples/react_prompt.md). Using tools enables the model to perform tasks better.
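ReAct-style prompting has the model write out a thought, name a tool, and give that tool's input, after which the caller runs the tool and feeds back an observation. The sketch below parses one such model output. The `Thought:` / `Action:` / `Action Input:` / `Observation:` labels follow the commonly used ReAct layout and are an assumption here; the exact format used with Qwen-7B-Chat is documented in the linked examples. `parse_react_step` and the `image_gen` tool name are hypothetical.

```python
import re
from typing import Optional, Tuple

def parse_react_step(text: str) -> Optional[Tuple[str, str]]:
    """Return (tool_name, tool_input) from a ReAct-style output, or None if no action is present."""
    match = re.search(r"Action:\s*(.+?)\s*\nAction Input:\s*(.*)", text, flags=re.DOTALL)
    if match is None:
        return None
    tool_name = match.group(1).strip()
    # Keep only the text before a following Observation line, if the model wrote one
    tool_input = match.group(2).split("\nObservation:")[0].strip()
    return tool_name, tool_input

output = (
    "Thought: I should call the image generation tool.\n"
    "Action: image_gen\n"
    'Action Input: {"prompt": "a cat"}\n'
    "Observation:"
)
print(parse_react_step(output))
```

A caller would typically look up `tool_name` in its own tool registry, run the tool with `tool_input`, and append the result after `Observation:` before asking the model to continue.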
In addition, we provide experimental results that show the model's capabilities as an agent. See [Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) for details. Its performance on the run-mode benchmark provided by Hugging Face is as follows:
| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
|:---------------|:---------------:|:-----------:|:---------:|
|GPT-4 | **100** | **100** | **97.41** |
|GPT-3.5 | 95.37 | 96.30 | 87.04 |
|StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| **Qwen-7B** | 90.74 | 92.59 | 74.07 |
## Long-Context Understanding
To extend the context length and break the bottleneck of the training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, which extend the context length to over 8K tokens. We conducted language modeling experiments on the arXiv dataset using PPL evaluation and found that Qwen-7B achieves outstanding performance in long-context scenarios. The results are shown below:
| Model (PPL by sequence length) | 1024 | 2048 | 4096 | 8192 | 16384 |
|:-------------------------------|:----:|:----:|:----:|:----:|:-----:|
| Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
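To give a feel for two of the techniques named above, the sketch below computes a dynamic NTK scaling factor and a LogN attention scale as functions of sequence length. These formulas are one common formulation and are an assumption for illustration; this README does not spell them out. The training length of 2048 is likewise assumed (the table shows perplexity rising sharply beyond 2048 without these techniques), as are the rotary dimension of 128 and base of 10000. All function names are hypothetical.

```python
import math

TRAIN_LEN = 2048  # assumed training sequence length

def dynamic_ntk_alpha(seq_len: int, train_len: int = TRAIN_LEN) -> int:
    """Scaling factor that grows in steps once seq_len exceeds train_len."""
    context_value = math.log2(seq_len / train_len) + 1
    return max(2 ** math.ceil(context_value) - 1, 1)

def scaled_rotary_base(alpha: float, dim: int = 128, base: float = 10000.0) -> float:
    """NTK-aware interpolation: enlarge the rotary base rather than compress positions."""
    return base * alpha ** (dim / (dim - 2))

def logn_scale(position: int, train_len: int = TRAIN_LEN) -> float:
    """LogN scaling: a factor of log_{train_len}(position) beyond the training length, else 1."""
    return math.log(position, train_len) if position > train_len else 1.0

for n in (1024, 2048, 4096, 8192, 16384):
    alpha = dynamic_ntk_alpha(n)
    print(n, alpha, round(scaled_rotary_base(alpha)), round(logn_scale(n), 4))
```

Under these assumptions both factors stay at 1 up to the training length, so short inputs are unaffected, which is consistent with the identical 1024 and 2048 columns in the table.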
## Reproduction
To help you reproduce the model performance on benchmark datasets, we provide scripts that reproduce the results. See [eval/EVALUATION.md](eval/EVALUATION.md) for details. Note that reproduced results may differ slightly from our reported results.
## FAQ
If you run into problems, please check the [FAQ](FAQ.md) and the existing issues for a solution before opening a new issue.
## License Agreement
Researchers and developers are free to use the code and model weights of both Qwen-7B and Qwen-7B-Chat. Commercial use is also allowed. See [LICENSE](LICENSE) for details. If you would like to use the models commercially, please fill out the [request form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
## Contact Us
If you would like to leave a message for our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.